2017-05-07 33 views
0

CSV: reading the values of a column

I need to parse a CSV file.

Input: file name +

Index | writer | year | words 
    0  | Philip | 1994 | this is first row 
    1  | Heinz | 2000 | python is wonderful (new line) second line 
    2  | Thomas | 1993 | i don't like this 
    3  | Heinz | 1898 | this is another row 
    .  |  .  | . |  . 
    .  |  .  | . |  . 
    N  | Fritz | 2014 | i hate man united 

Output: a list of all the words that belong to the given name

l = ['python is wonderful second line', 'this is another row'] 

What have I tried?

import csv 
import sys 

class artist:
    def __init__(self, name, file):
        self.file = file
        self.name = name
        self.list = []

    def extractText(self):
        with open(self.file, 'rb') as f:
            reader = csv.reader(f)
            temp = list(reader)
        k = len(temp)
        for i in range(1, k):
            s = temp[i]
            if s[1] == self.name:
                self.list.append(str(s[3]))


if __name__ == '__main__':
    # arguments
    inputFile = str(sys.argv[1])
    Heinz = artist('Heinz', inputFile)
    Heinz.extractText()
    print(Heinz.list)

The output is:

["python is wonderful\r\nsecond line", 'this is another row'] 

How do I get rid of the \r\n in cells that contain several lines of words, and can the loop be improved? It is extremely slow.

Answers

1

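This should at least be faster, because it processes each row as the file is being read, and it strips out the unwanted carriage return and newline characters if they are present.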

with open(self.file) as csv_fh:
    for n in csv.reader(csv_fh):
        if n[1] == self.name:
            self.list.append(n[3].replace('\r\n', ' '))

1

You can simply use pandas to get the list:

import pandas
df = pandas.read_csv('test1.csv')
index = df[df['writer'] == "Heinz"].index.tolist()  # row labels of the rows written by that name
l = list()
for i in index:
    # join the lines of a multi-line cell with single spaces, then append to the list
    l.append(' '.join(df.iloc[i, 3].splitlines()))
l

Output:

['python is wonderful second line', 'this is another row']

+0

This is not what I want. I need the words of one specific writer/artist, not all of the words. –

+0

@TonyTannous Updated the answer to handle a specific writer. –

1

To get rid of the newlines in s[3], I suggest ' '.join(s[3].splitlines()). See the documentation for "".splitlines, and also "".translate.

Improving the loop:
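As a rough illustration of the two string methods mentioned above, here is a minimal sketch assuming Python 3. The variable names (`cell`, `joined`, `table`, `translated`) are made up for the example, and the sample value is the one from the question's output.

```python
# Sample cell value, as the csv module returns it for a multi-line field.
cell = "python is wonderful\r\nsecond line"

# splitlines() breaks on \r\n, \n and \r alike; join the pieces with one space.
joined = " ".join(cell.splitlines())

# str.translate maps characters one by one: \r is deleted, \n becomes a space.
table = str.maketrans({"\r": None, "\n": " "})
translated = cell.translate(table)

print(joined)
print(translated)
```

Both variants give the same string here. The splitlines version also copes with a bare \r as a line ending, which this particular translate table would simply drop without inserting a space.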

def extractText(self):
    # 'rb' suits the Python 2 csv module; on Python 3 use open(self.file, newline='')
    with open(self.file, 'rb') as f:
        for s in csv.reader(f):
            if s[1] == self.name:
                self.list.append(str(s[3]))

This saves one pass over the data.

But do consider @Tiny.D's suggestion and give pandas a try.
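The single-pass idea can be tried out without a file on disk. The following is only a sketch under a few assumptions: Python 3, a comma-delimited file (the question shows the table with pipes, the real delimiter is not stated), and a hypothetical helper name `words_by_writer`. The sample rows are taken from the question's table.

```python
import csv
import io


def words_by_writer(csv_file, name):
    """Collect column 3 (words) of every row whose column 1 (writer) matches."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    words = []
    for row in reader:  # rows are parsed one at a time, the file is never buffered whole
        if len(row) > 3 and row[1] == name:
            # join the lines of a multi-line cell with single spaces
            words.append(" ".join(row[3].splitlines()))
    return words


# Hypothetical in-memory sample; with a real file use open(path, newline="").
sample = io.StringIO(
    "Index,writer,year,words\r\n"
    "0,Philip,1994,this is first row\r\n"
    '1,Heinz,2000,"python is wonderful\r\nsecond line"\r\n'
    "2,Thomas,1993,i don't like this\r\n"
    "3,Heinz,1898,this is another row\r\n",
    newline="",
)
heinz_words = words_by_writer(sample, "Heinz")
print(heinz_words)
```

Only the matching cells are kept in memory, so the cost of the list grows with the number of rows for that writer rather than with the size of the file.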

+0

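But then I would be holding the whole text in every object before removing some of the rows, wouldn't I? I need only the specific words, not all of them. –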

+0

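The original code copies the entire file contents into memory with 'temp = list(reader)'; here each row is checked for s[1] == self.name as it is read, and most rows are discarded. – tiwo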

0

To collapse runs of whitespace you can use a regular expression, and to speed things up a little, try a list comprehension:

import re

def extractText(self):
    RE_WHITESPACE = re.compile(r'[ \t\r\n]+')
    # 'rU' (universal newlines) is a Python 2 idiom; on Python 3 use open(self.file, newline='')
    with open(self.file, 'rU') as f:
        reader = csv.reader(f)

        # skip the first line
        next(reader)

        # put all of the words into a list if the artist matches
        self.list = [RE_WHITESPACE.sub(' ', s[3])
                     for s in reader if s[1] == self.name]
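For anyone who wants to run the regex-plus-comprehension approach on its own, here is a self-contained sketch. It assumes Python 3, a comma-delimited file whose header row names the columns `writer` and `words`, and a hypothetical function name `extract_text`; it uses `csv.DictReader` instead of positional indexes, which is a swap made for readability and not part of the answer above.

```python
import csv
import io
import re

# Compiled once at import time; \s+ matches any run of whitespace, including \r\n.
RE_WHITESPACE = re.compile(r"\s+")


def extract_text(csv_file, name):
    """Return the whitespace-normalised 'words' of every row written by `name`."""
    reader = csv.DictReader(csv_file)  # the first row supplies the column names
    return [
        RE_WHITESPACE.sub(" ", row["words"]).strip()
        for row in reader
        if row["writer"] == name
    ]


# Hypothetical in-memory sample; with a real file use open(path, newline="").
sample = io.StringIO(
    "Index,writer,year,words\r\n"
    '1,Heinz,2000,"python  is wonderful\r\nsecond line"\r\n'
    "2,Thomas,1993,i don't like this\r\n"
    "3,Heinz,1898,this is another row\r\n",
    newline="",
)
result = extract_text(sample, "Heinz")
print(result)
```

Compiling the pattern at module level avoids recompiling it on every call, and the comprehension keeps only the cells of the requested writer.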