os.walk and average values


I have to process .txt files that sit in subfolders inside one folder, like this:
New Folder > Folder 1 to 6 > xx.txt & yy.txt (files present in each folder)
Each file contains two columns, for example:

arg his 
asp gln 
glu his

and

arg his 
glu arg 
arg his 
glu asp

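What I have to do is:
1) Count the occurrences of each word in each file, and average that count by dividing it by the total number of lines in that file.
2) Then take the values obtained in step 1 and divide them by the total number of files present in the folder (i.e. 2 in this case).
I have tried this with my code, shown below. It works for the first step, but I cannot get the second.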

import glob
import os

for root, dirs, files in os.walk(path):
    asp_count = 0
    glu_count = 0
    lys_count = 0
    arg_count = 0
    his_count = 0
    acid_count = 0
    base_count = 0
    count = 0  # number of .txt files seen in this folder
    listOfFile = glob.iglob(os.path.join(root, '*.txt'))
    for filename in listOfFile:
        lineCount = 0
        asp_count_col1 = 0
        asp_count_col2 = 0
        glu_count_col1 = 0
        glu_count_col2 = 0
        lys_count_col1 = 0
        lys_count_col2 = 0
        arg_count_col1 = 0
        arg_count_col2 = 0
        his_count_col1 = 0
        his_count_col2 = 0
        count += 1
        with open(filename) as inp:
            for line in map(str.split, inp):
                lineCount += 1
                # the two residue columns, at the indices used in the posted code
                k = line[4]
                m = line[6]
                if k == 'ASP':
                    asp_count_col1 += 1
                elif m == 'ASP':
                    asp_count_col2 += 1
                if k == 'GLU':
                    glu_count_col1 += 1
                elif m == 'GLU':
                    glu_count_col2 += 1
                if k == 'LYS':
                    lys_count_col1 += 1
                elif m == 'LYS':
                    lys_count_col2 += 1
                if k == 'ARG':
                    arg_count_col1 += 1
                elif m == 'ARG':
                    arg_count_col2 += 1
                if k == 'HIS':
                    his_count_col1 += 1
                elif m == 'HIS':
                    his_count_col2 += 1
        # per-file averages: occurrences divided by the number of lines
        asp_count = (float(asp_count_col1 + asp_count_col2))/lineCount
        glu_count = (float(glu_count_col1 + glu_count_col2))/lineCount
        lys_count = (float(lys_count_col1 + lys_count_col2))/lineCount
        arg_count = (float(arg_count_col1 + arg_count_col2))/lineCount
        his_count = (float(his_count_col1 + his_count_col2))/lineCount

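Up to this point I am able to get the average for each file. But how can I get the average for each subfolder (i.e. divide by count, the total number of files)? The problem is the second part; the first part is done. The code above computes the average for each file. What I want is to add up these per-file averages and divide the sum by the total number of files present in the subfolder, which gives a new average.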
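To make the two steps concrete, here is a minimal sketch that runs them on the two sample files shown above, with their contents inlined as strings. It assumes Python 3 and that every non-empty line counts as one line; the names `per_file_average` and `folder_average` are hypothetical and only serve this illustration.

```python
from collections import Counter, defaultdict

# The two sample files from the question, inlined as strings (illustration only).
sample_files = {
    "xx.txt": "arg his\nasp gln\nglu his\n",
    "yy.txt": "arg his\nglu arg\narg his\nglu asp\n",
}

def per_file_average(text):
    """Step 1: occurrences of each word divided by the number of lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    counts = Counter(word for ln in lines for word in ln.split())
    return {word: n / len(lines) for word, n in counts.items()}

def folder_average(file_averages):
    """Step 2: sum the per-file averages and divide by the number of files."""
    totals = defaultdict(float)
    for averages in file_averages:
        for word, avg in averages.items():
            totals[word] += avg
    return {word: total / len(file_averages) for word, total in totals.items()}

file_averages = [per_file_average(text) for text in sample_files.values()]
folder_avg = folder_average(file_averages)
for word in sorted(folder_avg):
    print(word, round(folder_avg[word], 4))
```

For `arg`, the first file gives 1/3 and the second gives 3/4, so the folder value is (1/3 + 3/4) / 2, roughly 0.5417. A word that is missing from one file (such as `gln`) contributes 0 for that file.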

Source

2012-06-05 Ovisek

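When defining a spec, it helps to be either very precise or very redundant. For example, "each subfolder" could mean many things. Step #1 could use an example (e.g. 'arg arg\nhis arg' would result in {'arg': 3/2, 'his': 1/2}) and could be called "average amino acids per pair". It would also help to give some context on why they are paired: presumably these are two strands of DNA? – ninjagecko

@ninjagecko, but that is not my concern. I just want to focus on the numbers, not on the "pairing" or the amino acid details. – Ovisek

I don't understand where the problem is. Is it how to count the files in a subfolder? –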

Your use of os.walk together with glob.iglob is bogus. Use one or the other, not both. Here is how I would do it:

import os, os.path, re, pprint, sys
#...
for root, dirs, files in os.walk(path):
    counts = {}
    nlines = 0
    for f in filter(lambda n: re.search(r'\.txt$', n), files):
        # os.walk yields bare file names, so join them with root before opening
        with open(os.path.join(root, f), 'rt') as fh:
            for l in fh:
                nlines += 1
                for k in l.split():
                    counts[k] = counts[k] + 1 if k in counts else 1
    # word frequency per line, pooled over all .txt files in this directory
    for k, v in counts.items():
        counts[k] = float(v) / nlines

    sys.stdout.write('Frequencies for directory %s:\n' % root)
    pprint.pprint(counts)
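For comparison, the glob-only route might look like the sketch below. It assumes Python 3.5 or later, where `glob.glob` accepts `recursive=True` and `**` matches any number of nested directories; the helper name `txt_files_by_folder` is hypothetical.

```python
import glob
import os
from collections import defaultdict

def txt_files_by_folder(top):
    """Map each directory under `top` to the .txt files it contains (glob only)."""
    by_folder = defaultdict(list)
    pattern = os.path.join(top, "**", "*.txt")
    for path in glob.glob(pattern, recursive=True):
        # Group each matching file under the directory that holds it.
        by_folder[os.path.dirname(path)].append(path)
    return dict(by_folder)
```

Grouping by `os.path.dirname` recovers the per-folder view that `os.walk` gives directly, which is one reason the walk-only version above is arguably simpler here.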

Source

2012-06-05 06:59:21

import os
from collections import Counter

aminoAcids = set('asp glu lys arg his'.split())

filesToCounts = {}

for root, dirs, files in os.walk(subfolderPath):
    for file in files:
        if file.endswith('.txt'):
            path = os.path.join(root, file)
            with open(path) as f:
                acidsInFile = f.read().split()

            assert all(a in aminoAcids for a in acidsInFile)
            # key by full path: the same file name occurs in several folders
            filesToCounts[path] = Counter(acidsInFile)

def averageOfCounts(counts):
    numberOfAcids = sum(counts.values())
    assert numberOfAcids % 2 == 0
    # float division, so the result is also correct on Python 2
    numberOfAcidPairs = numberOfAcids / 2.0
    return dict((acid, acidCount / numberOfAcidPairs) for acid, acidCount in counts.items())

filesToAverages = dict((file, averageOfCounts(counts)) for file, counts in filesToCounts.items())
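As a quick check of the per-pair average, the sketch below applies the same calculation to the example from the comments ('arg arg\nhis arg'). The function `average_of_counts` is a standalone snake_case restatement written for this illustration, not part of the answer's code.

```python
from collections import Counter

def average_of_counts(counts):
    """Occurrences of each word divided by the number of two-word lines (pairs)."""
    number_of_words = sum(counts.values())
    assert number_of_words % 2 == 0, "expected two words per line"
    number_of_pairs = number_of_words / 2.0
    return {word: n / number_of_pairs for word, n in counts.items()}

# Two lines, four words: 'arg' appears three times and 'his' once.
example = Counter("arg arg\nhis arg".split())
result = average_of_counts(example)
print(result)
```

With two pairs in total, `arg` comes out as 3/2 and `his` as 1/2, matching the values given in the comment.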

Source

2012-06-05 07:04:45 ninjagecko

I like ninjagecko's answer, but I understand the question differently. Taking his code as a starting point, I propose the following:

import os
from collections import Counter, defaultdict

aminoAcids = set('asp glu lys arg his'.split())

subfolderFreqs = {}

for root, dirs, files in os.walk(subfolderPath):
    cumulativeFreqs = defaultdict(float)
    fileCount = 0
    for file in files:
        if file.endswith('.txt'):
            fileCount += 1
            path = os.path.join(root, file)
            with open(path) as f:
                acidsInFile = f.read().split()

            counts = Counter(acidsInFile)
            assert aminoAcids.issuperset(counts)
            # two acids per line, so this equals the number of lines
            numberOfAcidPairs = len(acidsInFile) / 2.0
            for acid, acidCount in counts.items():
                cumulativeFreqs[acid] += acidCount / numberOfAcidPairs
    if fileCount:
        # average of the per-file frequencies over the files in this subfolder
        subfolderFreqs[root] = {acid: cumulative / fileCount for acid, cumulative in cumulativeFreqs.items()}

print(subfolderFreqs)
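To try the per-subfolder average without touching real data, the following sketch builds a throwaway directory containing the two sample files from the question and runs the same idea on it. It is an illustration under assumptions: the function name `subfolder_averages` is hypothetical, lines are counted directly instead of halving the word count, empty files are skipped, and no amino-acid whitelist is enforced (the sample contains `gln`).

```python
import os
import tempfile
from collections import Counter, defaultdict

def subfolder_averages(top):
    """For each directory under `top`, average the per-file word frequencies."""
    result = {}
    for root, dirs, files in os.walk(top):
        cumulative = defaultdict(float)
        file_count = 0
        for name in files:
            if not name.endswith(".txt"):
                continue
            with open(os.path.join(root, name)) as fh:
                lines = [ln.split() for ln in fh if ln.strip()]
            if not lines:
                continue  # skip empty files to avoid dividing by zero
            file_count += 1
            counts = Counter(word for words in lines for word in words)
            for word, n in counts.items():
                cumulative[word] += n / float(len(lines))
        if file_count:
            result[root] = {w: c / file_count for w, c in cumulative.items()}
    return result

# Build a temporary tree holding the two sample files from the question.
with tempfile.TemporaryDirectory() as top:
    folder = os.path.join(top, "Folder 1")
    os.makedirs(folder)
    samples = {
        "xx.txt": "arg his\nasp gln\nglu his\n",
        "yy.txt": "arg his\nglu arg\narg his\nglu asp\n",
    }
    for name, text in samples.items():
        with open(os.path.join(folder, name), "w") as fh:
            fh.write(text)
    averages = subfolder_averages(top)
    folder_result = averages[folder]

print(sorted(folder_result.items()))
```

Directories without any .txt file (here the top-level temporary directory) are left out of the result, mirroring the `if fileCount:` guard above.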

Source

2012-06-05 08:41:05
