2015-11-19 42 views
-1

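Convert a list with duplicate values into a frequency-count dictionary in Python

I have a list of strings that contains duplicate values. I want to build a dictionary of words in which each key is a word and its value is that word's frequency count, and then write these words and their counts to a CSV.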

Here is how I am doing it:

#!/usr/bin/env python 
# encoding: utf-8 

# -*- coding: utf8 -*- 
import csv 
from nltk.tokenize import TweetTokenizer 
import numpy as np 

tknzr = TweetTokenizer() 

#print tknzr.tokenize(s0) 

with open("dispn.csv","r") as file1,\ 
    open("dispn_tokenized.csv","w") as file2,\ 
    open("dispn_tokenized_count.csv","w") as file3: 

    mycsv = list(csv.reader(file1)) 

    words = [] 
    words_set = [] 
    tokenize_count = {} 
    for row in mycsv: 

     lst = tknzr.tokenize(row[2]) 
     for l in lst: 
      file2.write("\""+str(row[2])+"\""+","+"\""+str(l.encode('utf-8'))+"\""+"\n") 
      l = l.lower() 
      words.append(l) 
    words_set = list(set(words)) 
    print "len of words_set : " + str(len(words_set)) 
    for word in words_set: 
     tokenize_count[word] = 1 

    for word in words: 
     tokenize_count[word] = tokenize_count[word]+1 




    print "len of tokenized words_set : " + str(len(tokenize_count)) 

    #print "Tokenized_words count : " 
    #print tokenize_count 
    #print "=================================================================" 

    i = 0 
    for wrd in words_set: 
     #i = i+1 
     print "i : " +str(i) 
     file3.write("\""+str(i)+"\""+","+"\""+str(wrd.encode('utf-8'))+"\""+","+"\""+str(tokenize_count[wrd])+"\""+"\n") 

But in the CSV I still find some repeated values, such as 1, 5, 4, 7, 9.

Some information about the approach:

- dispn.csv contains the usernames of the users, which I am tokenizing with the help of the nltk module.
- After tokenizing them, I store the words in the list 'words' and write the words corresponding to each username to a CSV.
- I create a set from it to get the unique values out of the list 'words' and store it in 'words_set'.
- Then I create the dictionary 'tokenize_count', with each word as key and its frequency count as value, and write it to a CSV.

Why are only some of the values repeated? Is there a better way to do the same thing? Please help.

+1

[`from collections import Counter`](https://docs.python.org/2/library/collections.html#collections.Counter) –

+0

[How to count the frequency of the elements in a list?](http://stackoverflow.com/questions/2161752/how-to-count-the-frequency-of-the-elements-in-a-list) – alfasin

+0

@RNar: Could you post your comment as an answer so that I can accept it? Thanks, it solved my problem. –

Answers
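
The `collections.Counter` suggestion from the comments can be sketched roughly as follows. This is only a minimal illustration under some assumptions: it is written for Python 3 (the question's code is Python 2), the `tokens` list is hypothetical sample data standing in for the output of `TweetTokenizer`, and the helper names `count_words` and `write_counts` are made up for this example. It writes to an in-memory buffer instead of `dispn_tokenized_count.csv` so that it runs without any input files.

```python
import csv
import io
from collections import Counter


def count_words(tokens):
    """Return a Counter mapping each lower-cased token to its frequency."""
    return Counter(token.lower() for token in tokens)


def write_counts(counts, out_file):
    """Write one (index, word, count) row per distinct word, most frequent first."""
    # QUOTE_ALL quotes every field, similar to the hand-built quoting in the question.
    writer = csv.writer(out_file, quoting=csv.QUOTE_ALL)
    for i, (word, count) in enumerate(counts.most_common(), start=1):
        writer.writerow([i, word, count])


# Hypothetical sample tokens standing in for the tokenizer output.
tokens = ["John", "doe", "john", "Smith", "doe", "john"]
counts = count_words(tokens)

# In-memory stand-in for open("dispn_tokenized_count.csv", "w", newline="").
buffer = io.StringIO()
write_counts(counts, buffer)
print(buffer.getvalue())
```

Compared with the loop in the question, `Counter` starts every word at zero, whereas the question's code sets each count to 1 and then adds 1 per occurrence, so its counts come out one too high. Using `enumerate` also gives each row a distinct index; in the question `i` stays at 0 because `i = i+1` is commented out. Letting `csv.writer` do the quoting avoids assembling the quoted fields by hand.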