2016-04-25 62 views

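Counting the number of lines in multiple CSV files, skipping blank lines

I need to get the length of each CSV file in '/dir/', not counting blank lines. I tried this: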

import os, csv, itertools, glob

# To filter the empty lines
def filterfalse(predicate, iterable):
    # filterfalse(lambda x: x%2, range(10)) --> 0 2 4 6 8
    if predicate is None:
        predicate = bool
    for x in iterable:
        if not predicate(x):
            yield x

# To read each file in '/dir/', compute the length and write the output 'count.csv'
with open('count.csv', 'w') as out:
    file_list = glob.glob('/dir/*')
    for file_name in file_list:
        with open(file_name, 'r') as f:
            filt_f1 = filterfalse(lambda line: line.startswith('\n'), f)
            count = sum(1 for line in f if (filt_f1))
            out.write('{c} {f}\n'.format(c = count, f = file_name))

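I get the output I want, but unfortunately the length reported for each file in '/dir/' includes the blank lines.

To see where the blank lines come from, I read file.csv as file.txt, and it looks like this: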

*text,favorited,favoriteCount,... 
"Retweeted user (@user):... 
'empty row' 
Do Operators...* 

Answers

1

I would suggest using pandas.

import pandas 

# Reads csv file and converts it to pandas dataframe. 
df = pandas.read_csv('myfile.csv') 

# Removes rows where data is missing. 
df.dropna(inplace=True) 

# Gets the number of remaining data rows (+1 for the header line) and displays it.
df_length = len(df) + 1
print('The length of the CSV file is', df_length)

Documentation: http://pandas.pydata.org/pandas-docs/version/0.18.0/

1

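Your filterfalse() function works correctly. It is exactly the same as the function named ifilterfalse in the standard library's itertools module (filterfalse in Python 3), so it's unclear why you don't just use that instead of writing your own. One of its main advantages is that it has already been tested and debugged. (Built-ins are usually faster as well, because many of them are written in C.)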
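For reference, here is a minimal sketch of the library version applied to this kind of count. It assumes Python 3, where the function is called itertools.filterfalse; the sample text and the is_blank helper are made up for illustration and are not from your code.

```python
import io
from itertools import filterfalse

def is_blank(line):
    # Hypothetical helper: true for lines that contain only whitespace.
    return not line.strip()

# Stand-in for an open file; iterating a real file yields lines the same way.
sample = io.StringIO("text,favorited\n\nfirst row\n   \nsecond row\n")

# filterfalse keeps the items for which the predicate is false,
# i.e. the non-blank lines.
non_blank_count = sum(1 for _ in filterfalse(is_blank, sample))
print(non_blank_count)
```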

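The problem is that you are not using the generator function correctly:

  1. Since it returns a generator object, you need to iterate over the values it will potentially yield, using code like for line in filt_f1.

  2. The predicate function argument you gave it does not correctly handle lines that contain other leading whitespace characters, such as spaces and tabs, so the lambda you pass it needs to be modified to handle those cases.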
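Both points can be checked in isolation. This is only a small sketch with made-up sample lines; the names broken_count, old_predicate_count and fixed_count are illustrative and do not appear in your code.

```python
from itertools import filterfalse

lines = ["header\n", "\n", "  \t\n", "data\n"]

# Point 1: a generator object is always truthy, so testing it in an
# `if` clause filters nothing and every line gets counted.
filt = filterfalse(lambda line: line.startswith('\n'), lines)
broken_count = sum(1 for line in lines if filt)

# Point 2: startswith('\n') misses whitespace-only lines such as "  \t\n",
# whereas `not line.strip()` treats them as blank.
old_predicate_count = sum(
    1 for _ in filterfalse(lambda line: line.startswith('\n'), lines)
)
fixed_count = sum(
    1 for _ in filterfalse(lambda line: not line.strip(), lines)
)

print(broken_count, old_predicate_count, fixed_count)
```

Here the first count includes every line, the second still includes the whitespace-only line, and only the third skips both kinds of blank line.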

The code below has both of these changes.

import os, csv, itertools, glob

# To filter the empty lines
def filterfalse(predicate, iterable):
    # filterfalse(lambda x: x%2, range(10)) --> 0 2 4 6 8
    if predicate is None:
        predicate = bool
    for x in iterable:
        if not predicate(x):
            yield x

# To read each file in '/dir/', compute the length and write the output 'count.csv'
with open('count.csv', 'w') as out:
    file_list = glob.glob('/dir/*')
    for file_name in file_list:
        with open(file_name, 'r') as f:
            filt_f1 = filterfalse(lambda line: not line.strip(), f)  # CHANGED
            count = sum(1 for line in filt_f1)  # CHANGED
            out.write('{c} {f}\n'.format(c=count, f=file_name))

Thanks, it partly works (i.e. I can still find some blank lines) – user2278505