
I have a file with three columns (separated by \t; the first column is the word, the second the lemma, and the third the tag). Some lines contain only a dot or a comma. How can I count lemma frequencies across the whole file?

<doc n=1 id="CMP/94/10"> 
<head p="80%"> 
Customs customs tag1 
union union tag2 
in in tag3 
danger danger tag4 
of of tag5 
the the tag6 
</head> 
<head p="80%"> 
New new tag7 
restrictions restriction tag8 
in in tag3 
the the tag6 
. 
Hi hi tag8 

Suppose the user searches for the lemma "in". I want the frequency of "in" and the frequencies of the lemmas immediately before and after each occurrence of "in". So I want the corpus-wide frequencies of "union", "danger", "restriction" and "the". The result should be:

union 1 
danger 1 
restriction 1 
the 2 

How can I do this? I tried using lemma_counter = {} but it doesn't work.

I have no experience with Python, so please correct me if I've got something wrong.

c = open("corpus.vert")

corpus = []

for line in c:
    if not line.startswith("<"):
        corpus.append(line)

lemma = raw_input("Lemma you are looking for: ")

counter = 0
lemmas_before_after = []
for i in range(len(corpus)):
    parsed_line = corpus[i].split("\t")
    if len(parsed_line) > 1:
        if parsed_line[1] == lemma:
            counter += 1  # this counts lemma frequency

            new_list = []

            for j in range(i-1, i+2):
                if j < len(corpus) and j >= 0:
                    parsed_line_with_context = corpus[j].split("\t")
                    found_lemma = parsed_line_with_context[0].replace("\n", "")
                    if len(parsed_line_with_context) > 1:
                        if lemma != parsed_line_with_context[1].replace("\n", ""):
                            lemmas_before_after.append(found_lemma)
                    else:
                        lemmas_before_after.append(found_lemma)

print "list of lemmas ", lemmas_before_after 


lemma_counter = {} 
for i in range(len(corpus)):
    for lemma in lemmas_before_after:
        if parsed_line[1] == lemma:
            if lemma in lemma_counter:
                lemma_counter[lemma] += 1
            else:
                lemma_counter[lemma] = 1

print lemma_counter 


fA = counter 
print "lemma frequency: ", fA 

Answers


This should get you 80% of the way there.

# Let's use some useful pieces of the awesome standard library 
from collections import namedtuple, Counter 

# Define a simple structure to hold the properties of each entry in corpus 
CorpusEntry = namedtuple('CorpusEntry', ['word', 'lemma', 'tag']) 

# Use a context manager ("with...") to automatically close the file when we no 
# longer need it 
with open('corpus.vert') as c:
    corpus = []
    for line in c:
        if len(line.strip()) > 1 and not line.startswith('<'):
            # Remove the newline character and split at tabs
            word, lemma, tag = line.strip().split('\t')
            # Put the obtained values in the structure
            entry = CorpusEntry(word, lemma, tag)
            # Put the structure in the corpus list
            corpus.append(entry)

# It's practical to wrap the counting in a function 
def get_frequencies(lemma):
    # Create a set of indices at which the lemma occurs in corpus. We use a
    # set because it is more efficient for the next part, checking if some
    # index is in this set
    lemma_indices = set()
    # Loop over corpus without manual indexing; enumerate provides information
    # about the current index and the value (some CorpusEntry added earlier).
    for index, entry in enumerate(corpus):
        if entry.lemma == lemma:
            lemma_indices.add(index)

    # Now that we have the indices at which the lemma occurs, we can loop over
    # corpus again and for each entry check if it is either one before or
    # one after the lemma. If so, add the entry's lemma to a new set.
    related_lemmas = set()
    for index, entry in enumerate(corpus):
        before_lemma = index+1 in lemma_indices
        after_lemma = index-1 in lemma_indices
        if before_lemma or after_lemma:
            related_lemmas.add(entry.lemma)

    # Finally, we need to count the number of occurrences of those related
    # lemmas
    counter = Counter()
    for entry in corpus:
        if entry.lemma in related_lemmas:
            counter[entry.lemma] += 1

    return counter

print get_frequencies('in') 
# Counter({'the': 2, 'union': 1, 'restriction': 1, 'danger': 1}) 
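
If you would rather stick with the plain dict you started from (lemma_counter = {}), the final Counter block inside get_frequencies maps onto it directly. A minimal sketch of the same counting step, assuming the related_lemmas set built above:

# Drop-in replacement for the Counter block at the end of get_frequencies
lemma_counter = {}
for entry in corpus:
    if entry.lemma in related_lemmas:
        # dict.get returns 0 for lemmas that have not been counted yet
        lemma_counter[entry.lemma] = lemma_counter.get(entry.lemma, 0) + 1

and then return lemma_counter instead of counter.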

It could be written more concisely (see below), and the algorithm could be improved as well, but it is still O(n); the point is to keep it understandable.

For those interested:

with open('corpus.vert') as c:
    corpus = [CorpusEntry(*line.strip().split('\t')) for line in c
              if len(line.strip()) > 1 and not line.startswith('<')]

def get_frequencies(lemma):
    lemma_indices = {index for index, entry in enumerate(corpus)
                     if entry.lemma == lemma}
    related_lemmas = {entry.lemma for index, entry in enumerate(corpus)
                      if lemma_indices & {index+1, index-1}}
    return Counter(entry.lemma for entry in corpus
                   if entry.lemma in related_lemmas)

And here is a more procedural style, which runs about three times as fast:

def get_frequencies(lemma):
    counter = Counter()
    related_lemmas = set()
    for index, entry in enumerate(corpus):
        counter[entry.lemma] += 1
        if entry.lemma == lemma:
            if index > 0:
                related_lemmas.add(corpus[index-1].lemma)
            if index < len(corpus)-1:
                related_lemmas.add(corpus[index+1].lemma)
    return {lemma: frequency for lemma, frequency in counter.iteritems()
            if lemma in related_lemmas}
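
Usage is the same as before, though this variant returns a plain dict rather than a Counter (key order may vary):

print get_frequencies('in')
# {'the': 2, 'union': 1, 'restriction': 1, 'danger': 1}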

Thank you for your reply. I found that my file is not quite what I expected. Some lines contain only a dot or a comma, so the tuple unpacking doesn't work for them. I tried this: 'if not line.startswith('<'):' and 'if len(line) > 1:' but it still gives me the error "need more than 1 value to unpack". – halik 2013-04-29 07:32:32
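
A minimal guard for such lines, assuming the same corpus.vert format as above: only unpack a line when it actually splits into three tab-separated fields:

corpus = []
with open('corpus.vert') as c:
    for line in c:
        fields = line.strip().split('\t')
        # lines holding only a dot or comma split into a single field
        if len(fields) == 3 and not line.startswith('<'):
            corpus.append(CorpusEntry(*fields))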


@halik You have to take into account that each 'line' still contains the newline character ('\n') before it is added to 'corpus', so initially every 'line' has a length greater than 1. I have adjusted my answer. – 2013-04-29 10:13:52
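
A quick illustration of that point:

line = ".\n"             # what iterating over the file yields for a dot line
print len(line)          # 2 -> the newline makes the len(line) > 1 test pass
print len(line.strip())  # 1 -> strip the newline first, then test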
