2012-06-05 62 views
-2

I have to process the .txt files in the subfolders of a folder, laid out like this:
New Folder > Folder 1 to 6 > xx.txt & yy.txt (files present in each folder)
Each file contains two columns:

arg his 
asp gln 
glu his 

arg his 
glu arg 
arg his 
glu asp 

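Now what I have to do is:
1) Count the number of occurrences of each word in each file, and average that count by dividing it by the total no. of lines in that file.
2) Then take the values obtained in step 1 and divide them by the total no. of files present in the folder (i.e. 2 in this case).
I managed the first part but I can't get the second. I have tried with my code as follows: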

import os, glob

for root, dirs, files in os.walk(path):
    asp_count = 0
    glu_count = 0
    lys_count = 0
    arg_count = 0
    his_count = 0
    acid_count = 0
    base_count = 0
    count = 0  # number of .txt files seen in this folder
    listOfFile = glob.iglob(os.path.join(root, '*.txt'))
    for filename in listOfFile:
        lineCount = 0
        asp_count_col1 = 0
        asp_count_col2 = 0
        glu_count_col1 = 0
        glu_count_col2 = 0
        lys_count_col1 = 0
        lys_count_col2 = 0
        arg_count_col1 = 0
        arg_count_col2 = 0
        his_count_col1 = 0
        his_count_col2 = 0
        count += 1
        # inp was not defined in the posted excerpt; assumed to be the open file
        with open(filename) as inp:
            for line in map(str.split, inp):
                lineCount += 1
                k = line[4]
                m = line[6]
                if k == 'ASP':
                    asp_count_col1 += 1
                elif m == 'ASP':
                    asp_count_col2 += 1
                if k == 'GLU':
                    glu_count_col1 += 1
                elif m == 'GLU':
                    glu_count_col2 += 1
                if k == 'LYS':
                    lys_count_col1 += 1
                elif m == 'LYS':
                    lys_count_col2 += 1
                if k == 'ARG':
                    arg_count_col1 += 1
                elif m == 'ARG':
                    arg_count_col2 += 1
                if k == 'HIS':
                    his_count_col1 += 1
                elif m == 'HIS':
                    his_count_col2 += 1
        # per-file average: occurrences in both columns / lines in the file
        asp_count = (float(asp_count_col1 + asp_count_col2))/lineCount
        glu_count = (float(glu_count_col1 + glu_count_col2))/lineCount
        lys_count = (float(lys_count_col1 + lys_count_col2))/lineCount
        arg_count = (float(arg_count_col1 + arg_count_col2))/lineCount
        his_count = (float(his_count_col1 + his_count_col2))/lineCount

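Up to this point I am able to get the average for each file. But how can I get the average for each subfolder (i.e. divide by count, the total number of files)? The problem is the second part; the first part is done. The code above takes the average for each file, but I want to add up these averages and divide the sum by the total no. of files present in the subfolder to get a new average.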

+1

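When defining a spec, it helps to be either very precise or very redundant. For example, "each subfolder" could mean many things. Step #1 could use an example (e.g. 'arg arg\nhis arg' would result in {'arg': 3/2, 'his': 1/2}) and be called "average amino acids per pair". It would also help to give some background on why they are paired: presumably it's two strands of DNA? – ninjagecko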
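
To make the example in the comment above concrete, here is a minimal sketch of step 1 applied to one file's text. The helper name `per_line_average` is made up for illustration, and it assumes that blank lines should not count as lines:

```python
from collections import Counter

def per_line_average(text):
    """Return {word: occurrences / number of non-empty lines} for one file's text."""
    lines = [line for line in text.splitlines() if line.strip()]
    counts = Counter(word for line in lines for word in line.split())
    # counts is empty when there are no lines, so no division by zero occurs
    return {word: count / float(len(lines)) for word, count in counts.items()}

result = per_line_average('arg arg\nhis arg')
print(result)  # 'arg' maps to 1.5 and 'his' to 0.5
```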

+0

@ninjagecko, but that's not my concern. I just want to focus on the numbers, not on the "pairing" or the amino acid stuff. – Ovisek

+1

I don't understand where the problem is. Is it how to count the files in a subfolder? –

Answers

0

Your use of os.walk together with glob.iglob is bogus. Use either one or the other, not both. Here's how I would do it:

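import os, os.path, re, pprint, sys
#...
for root, dirs, files in os.walk(path):
    counts = {}   # word -> occurrences over all .txt files in this directory
    nlines = 0    # total lines over all .txt files in this directory
    for f in filter(lambda n: re.search(r'\.txt$', n), files):
        # os.walk yields bare file names, so join them with root before opening
        for l in open(os.path.join(root, f), 'rt'):
            nlines += 1
            for k in l.split():
                counts[k] = counts[k]+1 if k in counts else 1
    for k, v in counts.items():
        counts[k] = float(v)/nlines

    sys.stdout.write('Frequencies for directory %s:\n' % root)
    pprint.pprint(counts)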
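
Note that this pools the counts of all files in a directory and divides by the total number of lines. That is not the same number as the average of the per-file averages the question asks for whenever the files differ in length. A small sketch using the two sample files from the question (3 and 4 lines) shows the difference; the helper names are hypothetical:

```python
from collections import Counter

# The two sample files from the question, as lists of lines.
file_a = ["arg his", "asp gln", "glu his"]
file_b = ["arg his", "glu arg", "arg his", "glu asp"]

def word_counts(lines):
    """Count every whitespace-separated word in a list of lines."""
    return Counter(word for line in lines for word in line.split())

def pooled_frequency(files, word):
    """Occurrences over all files divided by the total number of lines."""
    total = sum(word_counts(lines)[word] for lines in files)
    nlines = sum(len(lines) for lines in files)
    return total / float(nlines)

def mean_of_file_averages(files, word):
    """Average each file on its own, then average those averages."""
    averages = [word_counts(lines)[word] / float(len(lines)) for lines in files]
    return sum(averages) / len(averages)

files = [file_a, file_b]
print(pooled_frequency(files, "arg"))       # 4 occurrences / 7 lines
print(mean_of_file_averages(files, "arg"))  # (1/3 + 3/4) / 2
```

The two results agree only when every file has the same number of lines.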
1
import os
from collections import Counter

# the sample data in the question also contains 'gln'; extend this set as needed
aminoAcids = set('asp glu lys arg his'.split())

filesToCounts = {}

for root, dirs, files in os.walk(subfolderPath):
    for file in files:
        if file.endswith('.txt'):
            path = os.path.join(root, file)
            with open(path) as f:
                acidsInFile = f.read().split()

            assert all(a in aminoAcids for a in acidsInFile)
            # key by full path: the same file names occur in every subfolder
            filesToCounts[path] = Counter(acidsInFile)

def averageOfCounts(counts):
    numberOfAcids = sum(counts.values())
    assert numberOfAcids % 2 == 0
    numberOfAcidPairs = numberOfAcids // 2
    # float() avoids integer division under Python 2
    return dict((acid, float(acidCount)/numberOfAcidPairs) for acid, acidCount in counts.items())

filesToAverages = dict((file, averageOfCounts(counts)) for file, counts in filesToCounts.items())
0

I like ninjagecko's answer, but I understood the question differently. Taking his code as a starting point, I suggest the following:

import os
from collections import Counter, defaultdict

# the sample data in the question also contains 'gln'; extend this set as needed
aminoAcids = set('asp glu lys arg his'.split())

subfolderFreqs = {}

for root, dirs, files in os.walk(subfolderPath):
    cumulativeFreqs = defaultdict(float)  # acid -> sum of per-file averages
    fileCount = 0
    for file in files:
        if file.endswith('.txt'):
            fileCount += 1
            path = os.path.join(root, file)
            with open(path) as f:
                acidsInFile = f.read().split()

            counts = Counter(acidsInFile)
            assert aminoAcids.issuperset(counts)
            # two acids per line, so this equals the number of lines
            numberOfAcidPairs = len(acidsInFile) // 2
            for acid, acidCount in counts.items():
                cumulativeFreqs[acid] += float(acidCount)/numberOfAcidPairs
    if fileCount:
        # step 2: divide the summed per-file averages by the number of files
        subfolderFreqs[root] = {acid: cumulative/fileCount for acid, cumulative in cumulativeFreqs.items()}

print(subfolderFreqs)
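
To try this approach out without touching real data, the sketch below wraps the same idea in a function and runs it on a throwaway directory holding the two sample files from the question. The function name `subfolder_averages` and the folder name are made up for illustration. Unlike the code above, it divides by the number of non-empty lines rather than by half the word count, and it skips empty files; both are assumptions about what the asker wants.

```python
import os
import tempfile
from collections import Counter, defaultdict

def subfolder_averages(top):
    """Map each folder containing .txt files to {word: mean of per-file averages}."""
    result = {}
    for root, dirs, files in os.walk(top):
        totals = defaultdict(float)  # word -> sum of per-file averages
        file_count = 0
        for name in files:
            if not name.endswith('.txt'):
                continue
            with open(os.path.join(root, name)) as f:
                lines = [line.split() for line in f if line.strip()]
            if not lines:
                continue  # assumption: empty files do not count towards the average
            file_count += 1
            counts = Counter(word for words in lines for word in words)
            for word, count in counts.items():
                totals[word] += count / float(len(lines))
        if file_count:
            result[root] = {w: t / file_count for w, t in totals.items()}
    return result

# Build a temporary tree with the two sample files from the question.
top = tempfile.mkdtemp()
folder = os.path.join(top, 'Folder 1')
os.mkdir(folder)
with open(os.path.join(folder, 'xx.txt'), 'w') as f:
    f.write('arg his\nasp gln\nglu his\n')
with open(os.path.join(folder, 'yy.txt'), 'w') as f:
    f.write('arg his\nglu arg\narg his\nglu asp\n')

averages = subfolder_averages(top)
print(averages[folder])
```

For 'arg' this gives (1/3 + 3/4) / 2: the file averages are computed first and only then averaged over the two files.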