Python pandas tensor access is very slow

2016-05-19

I have created a huge tensor of millions of word triples and their counts. A word triple is, for example, (word0, link, word1). These triples are collected in a single dictionary whose values are their respective counts, e.g. (word0, link, word1): 15. Imagine I have several million such triples. After counting the occurrences I try to do further calculations, and this is where my Python script gets stuck. Below is the part of the code that takes forever:

import pandas as pd
from math import log

big_tuple = covert_to_tuple(big_dict)  # flatten {(word0, link, word1): count} into records
pdf = pd.DataFrame.from_records(big_tuple) 
pdf.columns = ['word0', 'link', 'word1', 'counts'] 
total_cnts = pdf.counts.sum() 

for _, row in pdf.iterrows(): 
    w0, link, w1 = row['word0'], row['link'], row['word1'] 
    w0w1_link = row.counts 

    # very slow 
    w0_link = pdf[(pdf.word0 == w0) & (pdf.link == link)]['counts'].sum() 
    w1_link = pdf[(pdf.word1 == w1) & (pdf.link == link)]['counts'].sum() 

    p_w0w1_link = w0w1_link/total_cnts 
    p_w0_link = w0_link/total_cnts 
    p_w1_link = w1_link/total_cnts 
    new_score = log(p_w0w1_link/(p_w0_link * p_w1_link)) 
    big_dict[(w0, link, w1)] = new_score 
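
Note that covert_to_tuple is not defined in the question. Presumably it just flattens the dictionary into (word0, link, word1, counts) records; a minimal sketch of such a helper (the body is a guess, only the name comes from the code above):

def covert_to_tuple(big_dict):
    # flatten {(word0, link, word1): count} into 4-tuple records
    return [(w0, link, w1, cnt) for (w0, link, w1), cnt in big_dict.items()]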

I profiled my script, and it seems that the following two lines

w0_link = pdf[(pdf.word0 == w0) & (pdf.link == link)]['counts'].sum() 
w1_link = pdf[(pdf.word1 == w1) & (pdf.link == link)]['counts'].sum() 

take 49% of the computation time each. These lines look up the counts for (word0, link) and (word1, link). So it seems that accessing pdf this way is what takes so much time? Is there anything I can do to optimize it?
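
For scale: each boolean mask such as pdf.word0 == w0 scans the whole frame, and the masks are rebuilt on every iteration, so the loop does O(n^2) work for n rows. A sketch of one way to avoid the per-row scans with groupby(...).transform (the new column names w0_link and w1_link are made up here):

import numpy as np

# one pass per grouping instead of one full scan per row
pdf['w0_link'] = pdf.groupby(['word0', 'link'])['counts'].transform('sum')
pdf['w1_link'] = pdf.groupby(['word1', 'link'])['counts'].transform('sum')

# the same score as in the loop, computed for all rows at once
pdf['new_score'] = np.log(pdf['counts'] * total_cnts / (pdf['w0_link'] * pdf['w1_link']))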


Please check my updated answer - I wanted to clear up why the expression for new_score only looked incorrect. – knagaev


Ah, yes, you're right. Math... :) – minerals


Exactly :) It eliminates the computational overhead. – knagaev

Answer


Please check my solution - I optimized the computations a little (hopefully without mistakes :))

import numpy as np
import pandas as pd

# sample of data
df = pd.DataFrame({'word0': list('aabb'), 'link': list('llll'),
                   'word1': list('cdcd'), 'counts': [10, 20, 30, 40]})

# caching total count
total_cnt = df['counts'].sum()

# two series with raw count sums for all combinations of ('word0', 'link')
# and ('word1', 'link'); the total_cnt factors cancel out and reappear as a
# single multiplication in the transformed new_score expression below
grouped_w0_l = df.groupby(['word0', 'link'])['counts'].sum()
grouped_w1_l = df.groupby(['word1', 'link'])['counts'].sum()

# join sums for grouped ('word0', 'link') to the original df as 'counts_w0'
merged_w0 = df.set_index(['word0', 'link']).join(grouped_w0_l, how='left', rsuffix='_w0').reset_index()

# join sums for grouped ('word1', 'link') to the merged df as 'counts_w1'
merged_w0_w1 = merged_w0.set_index(['word1', 'link']).join(grouped_w1_l, how='left', rsuffix='_w1').reset_index()

# merged_w0_w1 has enough data to calculate new_score
# check here - I transform the expression (see the derivation below)
merged_w0_w1['new_score'] = np.log(merged_w0_w1['counts'] * total_cnt / (merged_w0_w1['counts_w0'] * merged_w0_w1['counts_w1']))

# export results to a dict (only if you really need a dict - you can also
# keep manipulating the data as DataFrames)
big_dict = merged_w0_w1.set_index(['word0', 'link', 'word1'])['new_score'].to_dict()

The expression for new_score transforms as

new_score = log(p_w0w1_link / (p_w0_link * p_w1_link))
          = log((w0w1_link / total_cnts) / (w0_link / total_cnts * w1_link / total_cnts))
          = log((w0w1_link / total_cnts) * (total_cnts * total_cnts) / (w0_link * w1_link))
          = log(w0w1_link * total_cnts / (w0_link * w1_link))
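
A quick sanity check on the four-row sample, recomputing one row's score with the original per-row logic and comparing it against the vectorized column (a sketch):

from math import log, isclose

# slow per-row computation for the first sample row
w0, link, w1, cnt = df.loc[0, ['word0', 'link', 'word1', 'counts']]
w0_link = df[(df.word0 == w0) & (df.link == link)]['counts'].sum()
w1_link = df[(df.word1 == w1) & (df.link == link)]['counts'].sum()
slow = log((cnt / total_cnt) / ((w0_link / total_cnt) * (w1_link / total_cnt)))

# vectorized score for the same triple
fast = merged_w0_w1.set_index(['word0', 'link', 'word1']).loc[(w0, link, w1), 'new_score']
assert isclose(slow, fast)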

Thank you very much for your effort, I will check it as soon as I can – minerals


OK, I went through your example, and joining the tables is indeed much faster. I think your math differs here: 'merged_w0_w1['counts'] * total_cnt' - this should be a division, not a multiplication, because we are computing the probability of the 'triple'. The rest is perfect! – minerals