Printing the frequency of each phrase/word with certain words?

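I have a file containing a list of bands and their albums, with the year each album was produced. I need to write a function that goes through this file, finds the distinct band names, and counts how many times each band appears in the file.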

The file looks like this:

Beatles - Revolver (1966) 
Nirvana - Nevermind (1991) 
Beatles - Sgt Pepper's Lonely Hearts Club Band (1967) 
U2 - The Joshua Tree (1987) 
Beatles - The Beatles (1968) 
Beatles - Abbey Road (1969) 
Guns N' Roses - Appetite For Destruction (1987) 
Radiohead - Ok Computer (1997) 
Led Zeppelin - Led Zeppelin 4 (1971) 
U2 - Achtung Baby (1991) 
Pink Floyd - Dark Side Of The Moon (1973) 
Michael Jackson -Thriller (1982) 
Rolling Stones - Exile On Main Street (1972) 
Clash - London Calling (1979) 
U2 - All That You Can't Leave Behind (2000) 
Weezer - Pinkerton (1996) 
Radiohead - The Bends (1995) 
Smashing Pumpkins - Mellon Collie And The Infinite Sadness (1995) 
. 
. 
.

The output must be in descending order of frequency and look like this:

band1: number1 
band2: number2 
band3: number3

Here is my code so far:

def read_albums(filename) :

    file = open("albums.txt", "r")
    bands = {}
    for line in file :
        words = line.split()
        for word in words:
            if word in '-' :
                del(words[words.index(word):])
        string1 = ""
        for i in words :
            list1 = []

            string1 = string1 + i + " "
            list1.append(string1)
        for k in list1 :
            if (k in bands) :
                bands[k] = bands[k] +1
            else :
                bands[k] = 1


    for word in bands :
        frequency = bands[word]
        print(word + ":", len(bands))

I think there is a simpler way to do this, but I'm not sure. Also, I'm not sure how to sort the dictionary by frequency. Do I need to convert it to a list?

2013-08-07 Preston May

Take a look at 'collections.Counter' (http://docs.python.org/2/library/collections.html#collections.Counter)

You're right, there is a simpler way, using Counter:

from collections import Counter

with open('bandfile.txt') as f:
    # line.strip() is empty for blank lines, so they are skipped
    counts = Counter(line.split('-')[0].strip() for line in f if line.strip())

for band, count in counts.most_common():
    print("{0}: {1}".format(band, count))

What exactly does this part do: if line?

That line is a compact equivalent of the following loop:

temp_list = []
for line in f:
    if line.strip():  # this makes sure to skip blank lines
        bits = line.split('-')
        temp_list.append(bits[0].strip())  # lists have append(), not add()

counts = Counter(temp_list)

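Unlike the loop above, however, it does not create an intermediate list. Instead, it creates a generator expression, a more memory-efficient way to step through the data, which is passed as the argument to Counter.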
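
To check that the generator-expression form and the explicit loop give the same counts, here is a minimal sketch that runs on an in-memory list instead of a file. The names sample_lines and band_name are made up for illustration and are not part of the answer above; the sample lines are taken from the question.

```python
from collections import Counter

# Hypothetical in-memory stand-in for the file; a few lines from the question.
sample_lines = [
    "Beatles - Revolver (1966) \n",
    "Nirvana - Nevermind (1991) \n",
    "\n",  # a blank line, as it appears when iterating over a file
    "Beatles - Abbey Road (1969) \n",
    "U2 - The Joshua Tree (1987) \n",
]

def band_name(line):
    """Return the text before the first '-' with surrounding whitespace removed."""
    return line.split('-')[0].strip()

# Generator expression: band names are produced one at a time, no temporary list.
counts_gen = Counter(band_name(line) for line in sample_lines if line.strip())

# Explicit loop: builds the intermediate list first, then counts it.
temp_list = []
for line in sample_lines:
    if line.strip():
        temp_list.append(band_name(line))
counts_loop = Counter(temp_list)

print(counts_gen == counts_loop)
print(counts_gen.most_common(1))
```

A line read from a file still carries its trailing newline, so a blank line is the non-empty string "\n". That is why the sketch tests line.strip() rather than line.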

2013-08-07 16:39:01

Note that 'Counter' is only available in Python 2.7 and later. If you are using something older than that, see the accepted answer here: http://stackoverflow.com/questions/613183/python-sort-a-dictionary-by-value

I'm still pretty new to Python, so what does the with statement do? Not in this code specifically, but in general.

http://docs.python.org/2/reference/compound_stmts.html
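
As a rough illustration of what with does for files: the file is closed automatically when the block ends, even if an exception is raised inside it. The sketch below compares it with a hand-written try/finally. The temporary file is created only so that the example is self-contained, and the variable names are invented for the example.

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on albums.txt existing.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as tmp:
    tmp.write("Beatles - Revolver (1966)\n")

# Using with: the file is closed as soon as the block is left.
with open(path) as f:
    first_line = f.readline()
closed_after_with = f.closed

# Roughly equivalent code written without with.
g = open(path)
try:
    same_line = g.readline()
finally:
    g.close()  # runs whether or not readline() raised

os.remove(path)
print(closed_after_with, first_line == same_line)
```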

If you are looking for conciseness, use 'defaultdict' and 'sorted':

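from collections import defaultdict

bands = defaultdict(int)  # missing keys start at 0
with open('tmp.txt') as f:
    for line in f:  # iterate over the file directly; xreadlines() is obsolete
        band = line.split('-')[0].strip()  # text before the first '-', without whitespace
        if band:  # skip blank lines
            bands[band] += 1

# sort the (band, count) pairs by count, highest first
for band, count in sorted(bands.items(), key=lambda t: t[1], reverse=True):
    print('%s: %d' % (band, count))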

2013-08-07 16:42:59 thierrybm

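Why sort? The question does not ask for sorted output. Note that 'collections.Counter().most_common()' would be more concise, since it sorts in descending order of frequency for you.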

Correct; I hadn't seen the Counter solution when I wrote mine, and that one is better! – thierrybm
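
For what it's worth, the two approaches produce the same ordering. Here is a small sketch using a hand-written list of band names (the list is invented for the example) rather than the file:

```python
from collections import Counter

# Band names as they might come out of the parsing step (made-up sample).
names = ["Beatles", "Nirvana", "Beatles", "U2", "Beatles", "U2"]

counts = Counter(names)

# A Counter is a dict subclass, so it can be sorted by value like any dictionary.
# sorted() returns a new list of (band, count) tuples; the dictionary is unchanged.
by_sorted = sorted(counts.items(), key=lambda t: t[1], reverse=True)

# most_common() returns the same kind of list, already ordered by count.
by_most_common = counts.most_common()

for band, count in by_most_common:
    print("%s: %d" % (band, count))
```

This also covers the question about converting to a list: sorted() accepts the dictionary's items directly and returns a new list, so no separate conversion step is needed.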

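My approach is to use the split() method to break each line of the file into a list of component tokens. Then you can grab the band name (the first token in the list) and start adding names to a dictionary to keep track of the counts: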

import operator

def main():
    band_counts = {}

    # build a dictionary that adds each band as it is listed, then increments the count for re-lists
    with open("albums.txt", "r") as f:
        for line in f:
            line_items = line.split("-")  # break up the line into individual tokens
            band = line_items[0].strip()  # strip() drops surrounding spaces and the newline

            # don't want to add blank lines to the band list
            if not band:
                continue

            if band in band_counts:
                band_counts[band] += 1  # band already in the counts, increment the count
            else:
                band_counts[band] = 1  # if the band was not already in counts, add it with a count of 1

    # create a list of results sorted by count, highest first
    sorted_list = sorted(band_counts.items(), key=operator.itemgetter(1), reverse=True)

    for item in sorted_list:
        print("%s: %d" % (item[0], item[1]))

Notes:

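I followed the advice in this answer to create the sorted results: Sort a Python dictionary by value
If you are new to Python, take a look at Google's Python Class. I found it very useful when I was just starting out: https://developers.google.com/edu/python/?csw=1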

2013-08-07 17:38:11 caffreyd
