2015-11-04 77 views
0

Iterating over multiple values in a dictionary?

I have a list of words:

word_list = ["it's","they're","there's","he's"] 

and a dictionary containing information on how often the words in word_list appear in several documents:

dict = [('document1',{"it's": 0,"they're": 2,"there's": 5,"he's": 1}), 
('document2',{"it's": 4,"they're": 2,"there's": 3,"he's": 0}), 
('document3',{"it's": 7,"they're": 0,"there's": 4,"he's": 1})] 

I would like to build a data structure (a DataFrame, perhaps?) that looks like the following:

file  word  count 
document1 it's  0 
document1 they're  2 
document1 there's  5 
document1 he's  1 
document2 it's  4 
document2 they're  2 
document2 there's  3 
document2 he's  0 
document3 it's  7 
document3 they're  0 
document3 there's  4 
document3 he's  1 

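I am trying to find which of these words are used most often in the documents. I have more than 900 documents.

I was thinking of something like the following: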

res = {} 
for i in words_list: 
    count = 0 
    for j in dict.items(): 
     if i == j: 
       count = count + 1 
       res[i,j] = count 

Where do I go from here?

+0

That is not a dict by any stretch. – user2357112

+0

You should use the Python pandas library to create the kind of DataFrame you show in your post. –

+0

Where do I start? Are there any methods I should look at? – blacksite

Answers

2

Well, first things first: your dictionary is not a dictionary (it is a list of tuples), and it should be built like this

d = {'document1':{"it's": 0,"they're": 2,"there's": 5,"he's": 1}, 
    'document2':{"it's": 4,"they're": 2,"there's": 3,"he's": 0}, 
    'document3':{"it's": 7,"they're": 0,"there's": 4,"he's": 1}} 
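If the data already exists as a list of (name, counts) tuples, as in the question, it does not have to be retyped. A minimal sketch, assuming every tuple has exactly two elements; the name `frequencies` is borrowed from the second answer and is otherwise arbitrary:

```python
# The question's original structure: a list of (name, counts) tuples.
frequencies = [
    ('document1', {"it's": 0, "they're": 2, "there's": 5, "he's": 1}),
    ('document2', {"it's": 4, "they're": 2, "there's": 3, "he's": 0}),
    ('document3', {"it's": 7, "they're": 0, "there's": 4, "he's": 1}),
]

# dict() accepts any iterable of (key, value) pairs.
d = dict(frequencies)

print(d['document2']["it's"])  # 4
```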

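With a dictionary we can actually build a DataFrame using pandas, but to get it in the shape you want we will have to build a list of lists from the dictionary. Then we create the DataFrame, label the columns, and sort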

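import pandas as pd

d = {'document1':{"it's": 0,"they're": 2,"there's": 5,"he's": 1},
    'document2':{"it's": 4,"they're": 2,"there's": 3,"he's": 0},
    'document3':{"it's": 7,"they're": 0,"there's": 4,"he's": 1}}

# Flatten the nested dict into [file, word, count] rows, then label the columns.
d = pd.DataFrame([[k, k1, v1] for k, v in d.items() for k1, v1 in v.items()], columns=['File', 'Words', 'Count'])
# Originally d.sort(...); DataFrame.sort was removed in pandas 0.20 and sort_values replaces it.
# The row labels in the printed output depend on the dict's iteration order.
print(d.sort_values(['File', 'Count'], ascending=[True, True]))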

         File    Words  Count
1   document1     it's      0
0   document1     he's      1
3   document1  they're      2
2   document1  there's      5
4   document2     he's      0
7   document2  they're      2
6   document2  there's      3
5   document2     it's      4
11  document3  they're      0
8   document3     he's      1
10  document3  there's      4
9   document3     it's      7

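If you want the top n occurrences, you can use groupby() and then either head() or tail() after sorting. With the ascending sort used here, head(2) returns the two lowest counts per file; tail(2), or a descending sort, returns the two highest.

d = d.sort_values(['File', 'Count'], ascending=[True, True]).groupby('File').head(2)

         File    Words  Count
1   document1     it's      0
0   document1     he's      1
4   document2     he's      0
7   document2  they're      2
11  document3  they're      0
8   document3     he's      1

The list comprehension returns a list of lists that looks like this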

d = [['document1', "he's", 1], ['document1', "it's", 0], ['document1', "there's", 5], ['document1', "they're", 2], ['document2', "he's", 0], ['document2', "it's", 4], ['document2', "there's", 3], ['document2', "they're", 2], ['document3', "he's", 1], ['document3', "it's", 7], ['document3', "there's", 4], ['document3', "they're", 0]] 
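Starting from a list of lists like the one above, the question's actual goal (finding the most frequently used words) could be sketched roughly as follows. The variable names `rows`, `top2`, and `totals` are illustrative, and the descending sort is a swapped-in variation on the ascending sort used earlier in this answer:

```python
import pandas as pd

# The list of lists produced by the comprehension.
rows = [['document1', "he's", 1], ['document1', "it's", 0],
        ['document1', "there's", 5], ['document1', "they're", 2],
        ['document2', "he's", 0], ['document2', "it's", 4],
        ['document2', "there's", 3], ['document2', "they're", 2],
        ['document3', "he's", 1], ['document3', "it's", 7],
        ['document3', "there's", 4], ['document3', "they're", 0]]

df = pd.DataFrame(rows, columns=['File', 'Words', 'Count'])

# Two highest counts per file: sort counts descending within each file,
# then keep the first two rows of every group.
top2 = (df.sort_values(['File', 'Count'], ascending=[True, False])
          .groupby('File')
          .head(2))
print(top2)

# Total count per word across all files, largest first.
totals = df.groupby('Words')['Count'].sum().sort_values(ascending=False)
print(totals)
```

With 900+ documents the same two calls apply unchanged, since both operate on the long (file, word, count) layout.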

To build the dictionary properly, you would just use something like

d['document1']['it\'s'] = 1 

If for some reason you are stuck with a list of tuples of type (str, dict), you can use this list comprehension instead

[[i[0],k1,v1] for i in d for k1,v1 in i[1].items()] 
+0

Nice answer. One question: `d.sort(['File','Count'], ascending=[1,1])` also changes the order of the index. Is there any particular reason you are doing that? –

+0

@JoeR我只是改变了它,所以文件从低到高的顺序,然后设置相同的计数。这不是必要的,但我认为它看起来好一点。 – SirParselot

1

How about something like this?

word_list = ["it's","they're","there's","he's"] 

frequencies = [('document1',{"it's": 0,"they're": 2,"there's": 5,"he's": 1}), 
('document2',{"it's": 4,"they're": 2,"there's": 3,"he's": 0}), 
('document3',{"it's": 7,"they're": 0,"there's": 4,"he's": 1})] 

result = []
for document in frequencies:
    # document is a (name, counts) tuple
    for word in word_list:
        result.append({"file": document[0], "word": word, "count": document[1][word]})

print(result)
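A list of dicts like `result` can be handed straight to pandas to get the table shown in the question. A minimal sketch with a shortened, hand-written copy of `result` so that it runs on its own; the name `df` is arbitrary:

```python
import pandas as pd

# Same shape as `result` above: one dict per (file, word) pair.
result = [
    {"file": "document1", "word": "it's", "count": 0},
    {"file": "document1", "word": "they're", "count": 2},
    {"file": "document2", "word": "it's", "count": 4},
    {"file": "document2", "word": "they're", "count": 2},
]

# Passing `columns` fixes the column order explicitly.
df = pd.DataFrame(result, columns=["file", "word", "count"])
print(df)
```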
+0

I get the following error: `TypeError: string indices must be integers, not str`. I can't use the word itself as an index – blacksite

+0

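Are you running the code with the same data as mine? The only place it could fail is `document[1][word]`, and all keys in `document[1]` are strings in the data provided. It shouldn't fail. Edit: on second thought, the error means you are trying to index a string with another string. Does your frequencies list contain any raw strings? – Jephron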

+0

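I don't think so. Admittedly, this is much simpler than the actual data I'm using. It follows exactly the same syntactic structure, but "frequencies" is just way easier to talk about – blacksite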
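One way to test the hypothesis from the comments (a raw string somewhere in the real data) is to scan for entries that are not (str, dict) tuples. This is only a sketch: `find_bad_entries` and `sample` are hypothetical names, and the sample data is made up to trigger the check.

```python
def find_bad_entries(frequencies):
    """Return (position, entry) pairs that are not (str, dict) tuples."""
    bad = []
    for position, entry in enumerate(frequencies):
        ok = (
            isinstance(entry, tuple)
            and len(entry) == 2
            and isinstance(entry[0], str)
            and isinstance(entry[1], dict)
        )
        if not ok:
            bad.append((position, entry))
    return bad


# Made-up data containing two malformed entries.
sample = [
    ('document1', {"it's": 0}),
    ('document2', "it's"),  # second element is a string, not a dict
    'document3',            # a bare string instead of a tuple
]

print(find_bad_entries(sample))
```

Any entry reported here would make `document[1][word]` index into a string, which is what raises the TypeError.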