Extracting common elements from several lists


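In general, what I want to do is extract the common elements from the shared "word" column of several csv files (2008.csv, 2009.csv, 2010.csv ... 2015.csv).

All files have the same format: 'word', 'count'

'word' contains all the frequent words found in that year's file.

Here is a snapshot of one file: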

file 2008.csv

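Whenever any two of the 8 files have elements in common, I want to know those shared elements and which files they are in (this is a lot like a tf-idf calculation, btw).

Anyway, my goal is to find the frequent words that appear across these files. (As far as I know, an element can appear in at most five files.)

I also want to know when these words first appear, i.e. a word that is in file C but not in files B and A.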

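I know that `+` and `if` could probably solve the problem here, but that is very tedious: I would need to compare 2 of the 8, 3 of the 8, or 4 of the 8 columns and look for shared elements in each case.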

This is the code I have worked out so far, which is far from what I need... it only compares two of the 8 files: code

Can anyone help?

Source

2016-02-16 ShirleyWang

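You forgot to post the code you have so far. –

Please include the relevant information in the question itself. Links can rot, and we are here to help *you*. We would appreciate it if you made that easy for us. – zondo

How is this like TFxIDF? You have obtained the DF, but that is where it ends. – tripleee

Using set intersection can help: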

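import pandas as pd

# year_list is not defined in the original; assumed here to be the eight years as strings
year_list = [str(y) for y in range(2008, 2016)]

for i in range(len(year_list)):
    # Words in the 'word' column of the i-th file
    datai = set(pd.read_csv('filename_' + year_list[i] + '.csv')['word'])
    tocompare = []
    for j in range(i + 1, len(year_list)):
        dataj = set(pd.read_csv('filename_' + year_list[j] + '.csv')['word'])
        print("Two files:", i, j)
        print(datai.intersection(dataj))
        tocompare.append(dataj)
    # Words shared by file i and every later file
    print("All compare:")
    print(datai.intersection(*tocompare))
    break  # stops after the first file; remove to repeat for every year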

Source

2016-02-16 02:59:10 platinhom

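Thanks! But this approach is still limited to comparing the keywords of two years (or files) at a time. Is there any way to compare across all eight files? – ShirleyWang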

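The 'intersection' method accepts multiple arguments! So you only need to read the other files you want to include and pass them all to the method, like: 'datai.intersection(dataj, datak, datam, ...)' – platinhom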
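
A minimal sketch of that multi-argument call, using made-up word sets in place of the CSV 'word' columns. The names `words_by_year` and `common_to_all` are hypothetical and not part of the answer; with real files each set would come from `pd.read_csv(...)['word']` as in the answer's code.

```python
# Toy data standing in for the 'word' column of three files (invented for illustration).
words_by_year = {
    "2008": {"economy", "election", "war", "oil"},
    "2009": {"economy", "health", "war", "bank"},
    "2010": {"economy", "health", "oil", "war"},
}


def common_to_all(word_sets):
    """Return the words present in every set in word_sets."""
    sets = list(word_sets)
    if not sets:
        # No sets to compare, so nothing can be shared.
        return set()
    # set.intersection takes any number of sets; * unpacks the list into arguments.
    return set.intersection(*sets)


shared = common_to_all(words_by_year.values())
print(sorted(shared))  # ['economy', 'war']
```

Note that this only returns words present in every file; a word missing from a single year drops out of the result.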

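There is also a problem with the code: "All compare" only looks forward, which means 2012 is compared with the combined data of 2013 through 2015 but not with 2011. This causes a problem when I try to find the words unique to a particular year. For example, a word that appears in 2011 but not in 2013 would be treated as unique to 2012. – ShirleyWang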
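
One possible way around the forward-only comparison is to subtract the union of every other year, earlier and later alike. This is only a sketch on invented data; `words_by_year` and `unique_to_year` are hypothetical names, not something from the answer.

```python
# Toy data (invented): "protest" appears in 2011 and 2012 but not in 2013.
words_by_year = {
    "2011": {"protest", "economy"},
    "2012": {"protest", "election", "olympics"},
    "2013": {"economy", "election"},
}


def unique_to_year(words_by_year, year):
    """Return the words of `year` that appear in no other year, before or after."""
    # Union of the word sets of all years except the one being examined.
    others = set().union(
        *(words for y, words in words_by_year.items() if y != year)
    )
    return words_by_year[year] - others


# Forward-only comparison wrongly keeps "protest" as unique to 2012.
forward_only = words_by_year["2012"] - words_by_year["2013"]
print(sorted(forward_only))  # ['olympics', 'protest']

# Comparing against every other year leaves only "olympics".
print(sorted(unique_to_year(words_by_year, "2012")))  # ['olympics']
```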

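The first answer works well in general. But for some reason the intersection function did not return exactly the result I expected, so I modified the provided code to make the printed output more accurate and better formatted.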

import pandas as pd

# year_list is not defined in the original; assumed here to be the eight years as strings
year_list = [str(y) for y in range(2008, 2016)]

def load_words(year):
    # Read the 'word' column of one year's file as a set
    return set(pd.read_csv("most_50_common_words_" + year + ".csv")["word"])

for i in range(len(year_list)):
    # Collect the words of every earlier year first
    otheryears = set()
    for y in range(i):
        otheryears |= load_words(year_list[y])

    datai = load_words(year_list[i])
    print("\nCompare year %d with:\n" % int(year_list[i]))
    for j in range(i + 1, len(year_list)):
        dataj = load_words(year_list[j])
        shared = list(datai.intersection(dataj))
        print(year_list[j], ':')
        print(shared)
        print("%d common words with year %d\n" % (len(shared), int(year_list[j])))
        # Add the words of every later year as well
        otheryears |= dataj

    # Words of year i that also occur in at least one other year
    common = [x for x in datai if x in otheryears]
    print("\nAll compare:")
    print("%d year has %d words in common with other years. They are as follows:\n%s\n"
          % (int(year_list[i]), len(common), common))

    # Words of year i that occur in no other year
    uniquei = [x for x in datai if x not in otheryears]
    print("%d Frequent words unique in year %d:\n%s \n"
          % (len(uniquei), int(year_list[i]), uniquei))
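
The question also asks which files each shared word is in and when it first appears. Rather than comparing 2 of 8, 3 of 8, and so on, one option is to turn the data around and map each word to the years containing it. The sketch below runs on invented data, and `years_by_word`, `shared_words` and `first_appearance` are hypothetical helper names; it assumes four-digit year strings, so plain string sorting matches chronological order.

```python
from collections import defaultdict

# Toy data (invented) standing in for the 'word' columns of three files.
words_by_year = {
    "2008": {"economy", "war"},
    "2009": {"economy", "health"},
    "2010": {"health", "oil", "war"},
}


def years_by_word(words_by_year):
    """Map each word to the chronologically ordered list of years containing it."""
    index = defaultdict(list)
    for year in sorted(words_by_year):
        for word in words_by_year[year]:
            index[word].append(year)
    return dict(index)


def shared_words(index, min_files=2):
    """Keep only the words that occur in at least min_files files."""
    return {word: years for word, years in index.items() if len(years) >= min_files}


def first_appearance(index):
    """Map each word to the first year in which it occurs."""
    return {word: years[0] for word, years in index.items()}


index = years_by_word(words_by_year)
print(shared_words(index))
print(first_appearance(index))
```

With this layout, "shared by at least k files" is just a change to `min_files`, and no pairwise or triple-wise loops are needed.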

Source

2016-02-16 18:37:39 ShirleyWang
