2015-12-15 44 views
0

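Joining lines that share a key

I am learning Python, but I don't have much programming experience. I want to build a routine that imports a CSV file, iterates over the lines that share a particular key, and joins those lines into a single line.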

CSV file:

'0001','key1','name' 
'0002','key1','age' 
'0001','key2','name' 
'0002','key2','age' 

The resulting file should be:

['0001','key1','name','0002','key1','age'] 
['0001','key2','name','0002','key2','age'] 

How can I do this?

Answers

3

Read the CSV:

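import csv

# newline='' is what the csv module expects for text files in Python 3.
# quotechar="'" assumes the fields are wrapped in single quotes, as in the question.
with open('my_csv.txt', newline='') as f:
    my_list = list(csv.reader(f, quotechar="'"))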

At this point, my_list is presumably a list of lists like the following:

[['0001', 'key1', 'name'], ['0002', 'key1', 'age'], ['0001', 'key2', 'name'], ['0002', 'key2', 'age']] 

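Create a dictionary in which each key[number] from the lists becomes a dictionary key, and each value is the concatenation of the lists that share that key: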

dict_of_lists = {}

for item in my_list:
    _, key, _ = item  # the key is the second field
    if key in dict_of_lists:
        dict_of_lists[key] = dict_of_lists[key] + item
    else:
        dict_of_lists[key] = item

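If you don't care about the order of the list items:

list(dict_of_lists.values())

Output (the order may differ, since older Python versions do not guarantee dictionary order):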

[['0001', 'key2', 'name', '0002', 'key2', 'age'], ['0001', 'key1', 'name', '0002', 'key1', 'age']] 

If you do care about the order:

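A minimal sketch, assuming "order" here means sorted by key. The sample rows from the question are inlined so the snippet runs on its own, and `setdefault` with `extend` is just one possible way to build the dictionary:

```python
# Rows from the question, already parsed into lists
my_list = [
    ['0001', 'key1', 'name'],
    ['0002', 'key1', 'age'],
    ['0001', 'key2', 'name'],
    ['0002', 'key2', 'age'],
]

dict_of_lists = {}
for item in my_list:
    # Group on the second field, appending the whole row to that key's list
    dict_of_lists.setdefault(item[1], []).extend(item)

# Assumption: "order" means sorted by key; sorted() iterates the dict's keys
ordered = [dict_of_lists[key] for key in sorted(dict_of_lists)]
print(ordered)
```
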
Output:

[['0001', 'key1', 'name', '0002', 'key1', 'age'], ['0001', 'key2', 'name', '0002', 'key2', 'age']] 
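
To turn lists like these into the file the question asks for, one possible sketch uses `csv.writer`. The helper name `write_joined` is hypothetical, and an in-memory buffer stands in for a real file so the snippet is self-contained:

```python
import csv
import io


def write_joined(rows, fileobj):
    """Write each joined row as one comma-separated line."""
    csv.writer(fileobj, lineterminator='\n').writerows(rows)


joined_rows = [
    ['0001', 'key1', 'name', '0002', 'key1', 'age'],
    ['0001', 'key2', 'name', '0002', 'key2', 'age'],
]

# StringIO stands in for a file opened with open(..., 'w', newline='')
buffer = io.StringIO()
write_joined(joined_rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file object opened in text mode with `newline=''` in place of the buffer.
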
1

If you can afford to keep all the entries in RAM, using a defaultdict to 'bucket' the entries by key is one approach (assuming a file named 'file.csv'):

from collections import defaultdict

#this defaultdict acts as a Python dictionary, but creates an empty list
# automatically in case the key doesn't exist
entriesByKey = defaultdict(list)

with open("file.csv") as f:
    for line in f:
        #strips trailing whitespace and splits the line into a list
        # using "," as a separator
        entry = line.rstrip().split(",")
        #the key is the second field in each entry
        key = entry[1]
        #concatenate entry to its respective key 'bucket'
        entriesByKey[key] += entry

#Now, we create a list of concatenated lines by key, sorting them
# so that the keys appear in order
out = [entriesByKey[key] for key in sorted(entriesByKey.keys())]

#pretty-print the output :-)
import pprint
pprint.pprint(out)

For your input, this program outputs:

[["'0001'", "'key1'", "'name'", "'0002'", "'key1'", "'age'"], 
["'0001'", "'key2'", "'name'", "'0002'", "'key2'", "'age'"]] 

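The only thing missing is stripping the single quotes from each entry (and perhaps formatting the output to your own liking instead of just using pprint()). If you can guarantee that your input is well-formed and that the fields are always single-quoted (or, more precisely, that the first and last character of every field in an entry never matter), you can do this by adding the following after the key = entry[1] line: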

entry = [field[1:-1] for field in entry] 

This strips the first and last character of each field.
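
As a rough sketch of how that line fits into the loop, the snippet below runs the same bucketing over an in-memory list of lines (a stand-in for the contents of 'file.csv') so that it can be tried without a file. The name `entries_by_key` is only illustrative:

```python
from collections import defaultdict

# Stand-in for the lines of 'file.csv', trailing whitespace included
lines = [
    "'0001','key1','name' \n",
    "'0002','key1','age' \n",
    "'0001','key2','name' \n",
    "'0002','key2','age' \n",
]

entries_by_key = defaultdict(list)
for line in lines:
    entry = line.rstrip().split(",")
    # The key still carries its quotes here, which is fine for grouping
    key = entry[1]
    # Drop the first and last character (the single quotes) of every field
    entry = [field[1:-1] for field in entry]
    entries_by_key[key] += entry

out = [entries_by_key[key] for key in sorted(entries_by_key)]
print(out)
```

Note that the slice removes whatever the outer characters are, so an unquoted field would lose real data.
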

0

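Assuming your CSV file does not actually contain the single quotes (and that they are only there for presentation here), this should work: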

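import pandas as pd

# Read every field as a string; the file has no header row
Data = pd.read_csv('Test.csv', header=None, dtype=str)

with open('Result.csv', 'w') as f:
    # Group rows by the second column (label 1) and flatten each group into one line.
    # Iterating the groups keeps the key column in each group.
    for _, group in Data.groupby(1):
        f.write(','.join(group.values.ravel()) + '\n')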

The output is stored in Result.csv.