Python pandas tensor access is very slow

2016-05-19

I have created a huge tensor of millions of word triples and their counts. A word triple is, for example, (word0, link, word1). These triples are collected in a single dictionary whose values are their respective counts, e.g. (word0, link, word1): 15. Imagine I have several million such triples. After counting the occurrences I try to do further calculations, and this is where my Python script gets stuck. Below is the part of the code that takes forever:

import pandas as pd
from math import log

big_tuple = covert_to_tuple(big_dict)  # flatten {(word0, link, word1): count} into records
pdf = pd.DataFrame.from_records(big_tuple) 
pdf.columns = ['word0', 'link', 'word1', 'counts'] 
total_cnts = pdf.counts.sum() 

for _, row in pdf.iterrows(): 
    w0, link, w1 = row['word0'], row['link'], row['word1'] 
    w0w1_link = row.counts 

    # very slow 
    w0_link = pdf[(pdf.word0 == w0) & (pdf.link == link)]['counts'].sum() 
    w1_link = pdf[(pdf.word1 == w1) & (pdf.link == link)]['counts'].sum() 

    p_w0w1_link = w0w1_link/total_cnts 
    p_w0_link = w0_link/total_cnts 
    p_w1_link = w1_link/total_cnts 
    new_score = log(p_w0w1_link/(p_w0_link * p_w1_link)) 
    big_dict[(w0, link, w1)] = new_score 
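
Note that covert_to_tuple is not defined in the question. Presumably it just flattens the dictionary into (word0, link, word1, counts) records; a minimal sketch of such a helper (the body is a guess, only the name comes from the code above):

def covert_to_tuple(big_dict):
    # flatten {(word0, link, word1): count} into 4-tuple records
    return [(w0, link, w1, cnt) for (w0, link, w1), cnt in big_dict.items()]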

I profiled my script, and it seems that the following two lines

w0_link = pdf[(pdf.word0 == w0) & (pdf.link == link)]['counts'].sum() 
w1_link = pdf[(pdf.word1 == w1) & (pdf.link == link)]['counts'].sum() 

take 49% of the computation time each. These lines look up the counts for (word0, link) and (word1, link). So it seems that accessing pdf this way is what takes so much time? Is there anything I can do to optimize it?
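
For scale: each boolean mask such as pdf.word0 == w0 scans the whole frame, and the masks are rebuilt on every iteration, so the loop does O(n^2) work for n rows. A sketch of one way to avoid the per-row scans with groupby(...).transform (the new column names w0_link and w1_link are made up here):

import numpy as np

# one pass per grouping instead of one full scan per row
pdf['w0_link'] = pdf.groupby(['word0', 'link'])['counts'].transform('sum')
pdf['w1_link'] = pdf.groupby(['word1', 'link'])['counts'].transform('sum')

# the same score as in the loop, computed for all rows at once
pdf['new_score'] = np.log(pdf['counts'] * total_cnts / (pdf['w0_link'] * pdf['w1_link']))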


Please check my updated answer - I wanted to clear up why the expression for new_score only looked incorrect. – knagaev


Ah, yes, you're right. Math... :) – minerals


Exactly :) It eliminates the computational overhead. – knagaev

Answer


Please check my solution - I optimized the computations a little (hopefully without mistakes :))

import numpy as np
import pandas as pd

# sample of data
df = pd.DataFrame({'word0': list('aabb'), 'link': list('llll'),
                   'word1': list('cdcd'), 'counts': [10, 20, 30, 40]})

# caching total count
total_cnt = df['counts'].sum()

# two series with raw count sums for all combinations of ('word0', 'link')
# and ('word1', 'link'); the total_cnt factors cancel out and reappear as a
# single multiplication in the transformed new_score expression below
grouped_w0_l = df.groupby(['word0', 'link'])['counts'].sum()
grouped_w1_l = df.groupby(['word1', 'link'])['counts'].sum()

# join sums for grouped ('word0', 'link') to the original df as 'counts_w0'
merged_w0 = df.set_index(['word0', 'link']).join(grouped_w0_l, how='left', rsuffix='_w0').reset_index()

# join sums for grouped ('word1', 'link') to the merged df as 'counts_w1'
merged_w0_w1 = merged_w0.set_index(['word1', 'link']).join(grouped_w1_l, how='left', rsuffix='_w1').reset_index()

# merged_w0_w1 has enough data to calculate new_score
# check here - I transform the expression (see the derivation below)
merged_w0_w1['new_score'] = np.log(merged_w0_w1['counts'] * total_cnt / (merged_w0_w1['counts_w0'] * merged_w0_w1['counts_w1']))

# export results to a dict (only if you really need a dict - you can also
# keep manipulating the data as DataFrames)
big_dict = merged_w0_w1.set_index(['word0', 'link', 'word1'])['new_score'].to_dict()

The expression for new_score transforms as

new_score = log(p_w0w1_link / (p_w0_link * p_w1_link))
          = log((w0w1_link / total_cnts) / (w0_link / total_cnts * w1_link / total_cnts))
          = log((w0w1_link / total_cnts) * (total_cnts * total_cnts) / (w0_link * w1_link))
          = log(w0w1_link * total_cnts / (w0_link * w1_link))
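
A quick sanity check on the four-row sample, recomputing one row's score with the original per-row logic and comparing it against the vectorized column (a sketch):

from math import log, isclose

# slow per-row computation for the first sample row
w0, link, w1, cnt = df.loc[0, ['word0', 'link', 'word1', 'counts']]
w0_link = df[(df.word0 == w0) & (df.link == link)]['counts'].sum()
w1_link = df[(df.word1 == w1) & (df.link == link)]['counts'].sum()
slow = log((cnt / total_cnt) / ((w0_link / total_cnt) * (w1_link / total_cnt)))

# vectorized score for the same triple
fast = merged_w0_w1.set_index(['word0', 'link', 'word1']).loc[(w0, link, w1), 'new_score']
assert isclose(slow, fast)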

Thank you very much for your effort, I will check it as soon as I can – minerals


OK, I went through your example, and joining the tables is indeed much faster. I think your math differs here: 'merged_w0_w1['counts'] * total_cnt' - this should be a division, not a multiplication, because we are computing the probability of the 'triple'. The rest is perfect! – minerals