Reading column values from a CSV file

I need to parse a CSV file.

Input: a file name, plus data like this:

Index | writer | year | words 
    0  | Philip | 1994 | this is first row 
    1  | Heinz | 2000 | python is wonderful (new line) second line 
    2  | Thomas | 1993 | i don't like this 
    3  | Heinz | 1898 | this is another row 
    .  |  .  | . |  . 
    .  |  .  | . |  . 
    N  | Fritz | 2014 | i hate man united

Output: a list of all the words belonging to a given name

l = ['python is wonderful second line', 'this is another row']

What have I tried?

import csv 
import sys 

class artist: 
    def __init__(self, name, file): 
     self.file = file 
     self.name = name 
     self.list = [] 

    def extractText(self): 
     with open(self.file, 'rb') as f: 
      reader = csv.reader(f) 
      temp = list(reader) 
     k = len(temp) 
     for i in range(1, k): 
      s = temp[i] 
      if s[1] == self.name: 
       self.list.append(str(s[3])) 


if __name__ == '__main__': 
    # arguments 
    inputFile = str(sys.argv[1]) 
    Heinz = artist('Heinz', inputFile) 
    Heinz.extractText() 
    print(Heinz.list)

The output is:

["python is wonderful\r\nsecond line", 'this is another row']

How can I get rid of the \r\n in multi-line cells that contain words, and can the loop be improved, since it is extremely slow?

Source

2017-05-07 Tony Tannous

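This should at least be faster, because you parse the file while you are reading it, and then strip out the unwanted carriage return and newline characters if they are present.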

with open(self.file) as csv_fh: 
    for n in csv.reader(csv_fh): 
        if n[1] == self.name: 
            self.list.append(n[3].replace('\r\n', ' '))
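For readers on Python 3, here is a minimal standalone sketch of the same streaming idea. The function name `words_by_writer` is hypothetical, and the sketch assumes the column layout from the question (writer in column 1, words in column 3) and a file with a header row. Opening the file with `newline=''` matters here: without it, Python 3 translates `\r\n` to `\n` while reading, so `replace('\r\n', ' ')` would not find anything to replace.

```python
import csv


def words_by_writer(path, name):
    """Return the words cell of every row whose writer equals `name`.

    Hypothetical helper; assumes writer is column 1 and words is column 3.
    """
    words = []
    # newline='' leaves line endings untouched, so a '\r\n' inside a
    # quoted cell reaches us exactly as it is stored in the file.
    with open(path, newline='') as csv_fh:
        reader = csv.reader(csv_fh)
        next(reader, None)  # skip the header row
        for row in reader:  # rows are consumed one at a time
            if row[1] == name:
                words.append(row[3].replace('\r\n', ' '))
    return words
```

Calling `words_by_writer(sys.argv[1], 'Heinz')` would then take the place of building an `artist` object and calling `extractText()`.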

Source

2017-05-07 23:37:33 salparadise

You can simply use pandas to get the list:

import pandas 
df = pandas.read_csv('test1.csv') 
index = df[df['writer'] == "Heinz"].index.tolist() # get the specific name's index 
l = list() 
for i in index: 
    l.append(df.iloc[i, 3].replace('\n','')) # get the cell and strip new line '\n', append to list. 
l

Output:

['python is wonderful second line', 'this is another row']
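If adding pandas as a dependency is not an option, the idea of addressing columns by header name rather than by position can also be sketched with the standard library's `csv.DictReader`. This is an alternative, not what the answer above does. The function name `words_for` is hypothetical, and the sketch assumes the header row names the columns `writer` and `words`, as in the question's sample.

```python
import csv


def words_for(path, writer):
    """Collect the 'words' column for one writer, looking columns up by name.

    Hypothetical helper; assumes header names 'writer' and 'words'.
    """
    with open(path, newline='') as fh:
        return [
            # split() with no arguments splits on any run of whitespace,
            # so line breaks and repeated spaces collapse to one space.
            ' '.join(row['words'].split())
            for row in csv.DictReader(fh)
            if row['writer'] == writer
        ]
```

Looking columns up by name keeps the code working if the column order in the file changes.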

Source

2017-05-07 23:27:13

This is not what I want. I need the words of one specific writer/artist, not all of the words. –

@TonyTannous I updated the answer for a specific writer. –

To get rid of the line breaks in s[3], I suggest ' '.join(s[3].splitlines()). See the documentation for "".splitlines, and also "".translate.

Improving the loop:
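A small sketch of both suggestions on the cell value from the question, assuming Python 3 (where `str.maketrans` builds the table that `translate` expects). The variable names are for illustration only.

```python
cell = "python is wonderful\r\nsecond line"

# splitlines() breaks on '\n', '\r\n' and '\r' (among other line
# boundaries), so joining the pieces with a space replaces each line
# break with exactly one space.
joined = ' '.join(cell.splitlines())

# translate() maps one character at a time: delete '\r', turn '\n'
# into a space.
table = str.maketrans({'\r': None, '\n': ' '})
translated = cell.translate(table)

print(joined)
print(translated)
```

Both calls yield the same string for this input. They differ on a bare `'\r'` line ending: `splitlines()` treats it as a line break, while this particular translation table simply deletes it.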

def extractText(self): 
    # 'rb' is the Python 2 idiom; on Python 3 use open(self.file, newline='') 
    with open(self.file, 'rb') as f: 
        for s in csv.reader(f): 
            if s[1] == self.name: 
                self.list.append(str(s[3]))

This saves one pass over the data.

But please consider @Tiny.D's suggestion and give pandas a try.

Source

2017-05-07 23:33:47 tiwo

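But then I would be holding the entire text in every object before deleting some rows, wouldn't I? I only need the specific words, not all of them. –

The original code copies the entire file contents into memory with 'temp = list(reader)'; here each row is checked for s[1] == self.name, and most rows are discarded. – tiwo

To collapse multiple whitespace characters you can use a regular expression, and to speed things up a bit, try a list comprehension: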

import re 

def extractText(self): 
    RE_WHITESPACE = re.compile(r'[ \t\r\n]+') 
    with open(self.file, 'rU') as f: 
     reader = csv.reader(f) 

     # skip the first line 
     next(reader) 

     # put all of the words into a list if the artist matches 
     self.list = [RE_WHITESPACE.sub(' ', s[3]) 
        for s in reader if s[1] == self.name]
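The snippet above is written for Python 2; the `'rU'` mode is no longer accepted by `open()` as of Python 3.11. Below is a minimal sketch of how the whole class might look on Python 3. The names `Artist` and `extract_text` are only PEP 8 style renamings of the question's `artist` and `extractText`, and the sketch assumes every data row has at least four columns.

```python
import csv
import re

# One or more spaces, tabs, carriage returns or newlines.
RE_WHITESPACE = re.compile(r'[ \t\r\n]+')


class Artist:
    def __init__(self, name, file):
        self.file = file
        self.name = name
        self.list = []

    def extract_text(self):
        # On Python 3, csv.reader expects a text-mode file opened
        # with newline=''.
        with open(self.file, newline='') as f:
            reader = csv.reader(f)
            next(reader, None)  # skip the header row, if there is one
            # Keep only the matching rows and collapse any run of
            # whitespace (including '\r\n') into a single space.
            self.list = [RE_WHITESPACE.sub(' ', s[3])
                         for s in reader if s[1] == self.name]
```

Usage mirrors the question: create `Artist('Heinz', sys.argv[1])`, call `extract_text()`, then read the `list` attribute.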

Source

2017-05-07 23:39:28
