How to count occurrences of items from one list in another list


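I have a file loaded with a column of text. I want to check for occurrences of country names in the loaded text. I have loaded the Wikipedia country-codes CSV file, and I am using the code below to count how many times each country name occurs in the loaded text.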

My code does not work properly.

Here is my code:

text = pd.read_sql(select_string, con)
text['tokenized_text'] = mail_text.apply(lambda col: nltk.word_tokenize(col['SomeText']), axis=1)
country_codes = pd.read_csv('wikipedia-iso-country-codes.csv')
ccs = set(country_codes['English short name lower case'])
count_occurrences = Counter(country for country in text['tokenized_text'] if country in ccs)

Source

2016-09-20 JayDoe

Is 'country_codes' a 'dictionary'? –

Your current code has an indentation error; you should look at that first. –

No, the indentation is just a result of my cutting and pasting here – JayDoe

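In your original code, the line

dic[country]= dic[country]+1

should raise a KeyError, because the key is not yet in the dictionary the first time a country is encountered. Instead, you should check whether the key exists and, if it does not, initialize the value to 1.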

On the other hand, it won't raise, because the check

if country in country_codes['English short name lower case']:

yields False for all values: a Series object's __contains__ works with indices instead of values. You should, for example, check

if country in country_codes['English short name lower case'].values:

if your list of values is short.
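A small sketch of this membership behaviour, using a hypothetical three-name Series in place of the CSV column (the names and the variable `names` are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical stand-in for country_codes['English short name lower case'].
names = pd.Series(["Finland", "Japan", "Russian Federation"])

# `in` on a Series tests the index labels (0, 1, 2 here), not the values.
print("Finland" in names)         # False
print(0 in names)                 # True

# Testing against the underlying array checks the values (a linear scan).
print("Finland" in names.values)  # True

# A set makes each lookup cheap when many membership tests are needed.
name_set = set(names)
print("Finland" in name_set)      # True
```

The linear scan is why `.values` is only reasonable for a short list; for longer lists a set is probably the better choice.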

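For general counting tasks, Python provides collections.Counter, which behaves a bit like defaultdict(int) but brings additional benefits. It removes the need for manual checking of keys and so on.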
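To make the comparison concrete, here is a minimal sketch of the three counting styles side by side. The token list and the country set are made-up sample data, not taken from the question:

```python
from collections import Counter, defaultdict

# Hypothetical sample data for illustration.
tokens = ["Finland", "Japan", "Finland", "Cairo", "Finland"]
countries = {"Finland", "Japan"}

# 1. Plain dict: the key must be checked (or initialized) by hand,
#    otherwise the first increment raises KeyError.
manual = {}
for tok in tokens:
    if tok in countries:
        if tok in manual:
            manual[tok] += 1
        else:
            manual[tok] = 1

# 2. defaultdict(int): missing keys start at 0 automatically.
auto = defaultdict(int)
for tok in tokens:
    if tok in countries:
        auto[tok] += 1

# 3. Counter: counts an iterable directly and adds helpers
#    such as most_common().
counted = Counter(tok for tok in tokens if tok in countries)

print(manual)                  # {'Finland': 3, 'Japan': 1}
print(dict(auto))              # {'Finland': 3, 'Japan': 1}
print(counted.most_common(1))  # [('Finland', 3)]
```

A Counter also returns 0 for keys it has never seen, so `counted["Cairo"]` does not raise.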

As you already have DataFrame objects, you could use the tools pandas provides:

In [12]: country_codes = pd.read_csv('wikipedia-iso-country-codes.csv') 

In [13]: text = pd.DataFrame({'SomeText': """Finland , Finland , Finland 
    ...: The country where I want to be 
    ...: Pony trekking or camping or just watch T.V. 
    ...: Finland , Finland , Finland 
    ...: It's the country for me 
    ...: 
    ...: You're so near to Russia 
    ...: so far away from Japan 
    ...: Quite a long way from Cairo 
    ...: lots of miles from Vietnam 
    ...: 
    ...: Finland , Finland , Finland 
    ...: The country where I want to be 
    ...: Eating breakfast or dinner 
    ...: or snack lunch in the hall 
    ...: Finland , Finland , Finland 
    ...: Finland has it all 
    ...: 
    ...: Read more: Monty Python - Finland Lyrics | MetroLyrics 
    ...: """.split()}) 

In [14]: text[text['SomeText'].isin(
    ...:  country_codes['English short name lower case'] 
    ...:)]['SomeText'].value_counts().to_dict() 
    ...: 
Out[14]: {'Finland': 14, 'Japan': 1}

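This finds the rows of text where the value of the SomeText column is in the English short name column of country_codes, counts the unique values of SomeText, and converts the result to a dictionary. The same with descriptive intermediate variables: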

In [49]: where_sometext_isin_country_codes = text['SomeText'].isin(
    ...:  country_codes['English short name lower case']) 

In [50]: filtered_text = text[where_sometext_isin_country_codes] 

In [51]: value_counts = filtered_text['SomeText'].value_counts() 

In [52]: value_counts.to_dict() 
Out[52]: {'Finland': 14, 'Japan': 1}

The same with Counter:

In [23]: from collections import Counter 

In [24]: dic = Counter() 
    ...: ccs = set(country_codes['English short name lower case']) 
    ...: for country in text['SomeText']: 
    ...:  if country in ccs: 
    ...:   dic[country] += 1 
    ...: 

In [25]: dic 
Out[25]: Counter({'Finland': 14, 'Japan': 1})

Or simply:

In [30]: ccs = set(country_codes['English short name lower case']) 

In [31]: Counter(country for country in text['SomeText'] if country in ccs) 
Out[31]: Counter({'Finland': 14, 'Japan': 1})
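The examples above iterate over a column holding one word per row. The question's `tokenized_text` column appears instead to hold a list of tokens per row (an assumption based on the `nltk.word_tokenize` call), in which case iterating over it yields lists, and testing a list against a set raises a TypeError because lists are unhashable. A possible sketch for that layout, with hypothetical data standing in for the column and for `ccs`:

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for text['tokenized_text']: one token list per row.
tokenized_rows = [
    ["Finland", ",", "Finland", ",", "Finland"],
    ["so", "far", "away", "from", "Japan"],
]
# Hypothetical subset of the country names from the CSV.
ccs = {"Finland", "Japan"}

# chain.from_iterable flattens the per-row lists into one stream of tokens,
# so each individual token can be tested against the set.
all_tokens = chain.from_iterable(tokenized_rows)
counts = Counter(tok for tok in all_tokens if tok in ccs)

print(counts)  # Counter({'Finland': 3, 'Japan': 1})
```

With a real DataFrame the same generator expression should work by passing the column itself to `chain.from_iterable`.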

Source

2016-09-20 08:45:57

So what happened to Russia and Vietnam? Are they no longer countries? I think the source data could be better... – Frangipanes

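Russia is there, but not as just "Russia"; it is listed as "Russian Federation". Vietnam, on the other hand, is not. The OP's data and approach could use some improvement. –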

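Good point about Russia, since it is never referred to as "Russian Federation" but only as "Russia", so maybe I need to find another source file for country codes? – JayDoe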
