2013-07-23 54 views
1

Combining multiple CSV files into a single one

I have CSV files whose data is formatted as follows:

file1.csv

ID,NAME 
001,Jhon 
002,Doe 

file2.csv

ID,SCHOOLS_ATTENDED 
001,my Nice School 
002,His lovely school 

file3.csv

ID,SALARY 
001,25 
002,40 

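The ID field is the primary key used to look up records.

What is the most efficient way to read 3 to 4 such files, fetch the corresponding data, and store it in another CSV file with the header (ID,NAME,SCHOOLS_ATTENDED,SALARY)?

Each file is between 100 and 200 MB in size.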

+0

Why would someone downvote this??? – Volatil3

+0

Maybe because it shows a lack of research effort? It wasn't me, though. –

+0

I think this is a duplicate question. You should always search before opening a new question. By the way, it wasn't me! http://stackoverflow.com/questions/17586573/python-combing-data-from-different-csv-files-into-one/17588521#17588521 –

Answers

3

A few hundred megabytes isn't that much. Why not go for a simple approach using the csv module and collections.defaultdict:

import csv 
from collections import defaultdict 

result = defaultdict(dict) 
fieldnames = {"ID"} 

for csvfile in ("file1.csv", "file2.csv", "file3.csv"):
    with open(csvfile, newline="") as infile:
        reader = csv.DictReader(infile)
        for row in reader:
            id = row.pop("ID")
            for key in row:
                fieldnames.add(key)  # wasteful, but I don't care enough
                result[id][key] = row[key]

The resulting defaultdict looks like this:

>>> result 
defaultdict(<class 'dict'>,
{'001': {'NAME': 'Jhon', 'SCHOOLS_ATTENDED': 'my Nice School', 'SALARY': '25'},
 '002': {'NAME': 'Doe', 'SCHOOLS_ATTENDED': 'His lovely school', 'SALARY': '40'}})

You can then write this out to a single CSV file (not my prettiest work, but good enough):

with open("out.csv", "w", newline="") as outfile: 
    writer = csv.DictWriter(outfile, sorted(fieldnames))
    writer.writeheader()
    for item in result:
        result[item]["ID"] = item
        writer.writerow(result[item])

out.csv then contains

ID,NAME,SALARY,SCHOOLS_ATTENDED 
001,Jhon,25,my Nice School 
002,Doe,40,His lovely school 
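
Note that sorting the field names yields ID,NAME,SALARY,SCHOOLS_ATTENDED, not the ID,NAME,SCHOOLS_ATTENDED,SALARY order the question asks for. Here is a minimal sketch of one way to keep the columns in the order they are first seen, assuming every file has an ID column and well-formed rows. The function name merge_by_id and its parameters are made up for illustration and are not part of the answer above.

```python
import csv
from collections import defaultdict


def merge_by_id(paths, out_path, key="ID"):
    """Join CSV files on `key`, keeping columns in first-seen order."""
    merged = defaultdict(dict)
    fieldnames = [key]  # a list (not a set), so column order is preserved
    for path in paths:
        with open(path, newline="") as infile:
            reader = csv.DictReader(infile)
            # Record any column names we have not seen yet, in file order.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                merged[row[key]].update(row)
    with open(out_path, "w", newline="") as outfile:
        # restval fills the gap when an ID is missing from one of the files.
        writer = csv.DictWriter(outfile, fieldnames, restval="")
        writer.writeheader()
        writer.writerows(merged.values())
    return fieldnames
```

Calling merge_by_id(["file1.csv", "file2.csv", "file3.csv"], "out.csv") on the sample files should write the header in the requested order. Like the code above, it holds all rows in memory, which is usually fine for a few hundred MB.
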
+0

Thank you, but your code gives the error **csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)** – Volatil3

+1

@Volatil3: I only just noticed that you're on Python 3; I've edited the program accordingly. Please try again. –

+0

I just noticed that the delimiter is **~** – Volatil3
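
If the files really are delimited by ~ rather than commas, the csv readers accept a delimiter argument. A minimal sketch, assuming a single-character delimiter and a header row; the helper name read_rows is hypothetical and only here to show the argument:

```python
import csv


def read_rows(path, delimiter="~"):
    """Return the rows of a delimited text file as a list of dicts."""
    with open(path, newline="") as infile:
        # Extra keyword arguments such as delimiter are passed on
        # to the underlying csv.reader.
        return list(csv.DictReader(infile, delimiter=delimiter))
```

The same delimiter="~" argument can be passed to csv.DictReader in the loop of the answer above, and to csv.DictWriter if the output should use ~ as well.
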

0

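Below is working code that combines multiple csv files whose names contain a specific keyword into one final csv file. I have set the default keyword to "file", but you can set it to an empty string if you want to combine all the csv files in folder_path. The code takes the header from your first csv file and uses it as the header of the final combined csv file. It ignores the headers of all other csv files.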

import csv
import glob
import os


def Combine_multiple_csv_files_thatContainsKeywordInTheirNames_into_one_csv_file(folder_path, keyword='file'):
    # Takes the header only from the 1st csv; all other csv headers are skipped and their data is appended to the final csv.
    out_path = os.path.join(folder_path, "Combined_csv.csv")
    pattern = os.path.join(folder_path, "*" + keyword + "*.csv")
    # glob returns paths that include folder_path; sort them so the "first" file is predictable,
    # and leave out the output file in case it matches the pattern on a later run.
    fileNames = sorted(f for f in glob.glob(pattern) if os.path.basename(f) != "Combined_csv.csv")
    # newline='' keeps the csv writer from producing doubled line breaks.
    with open(out_path, "w", newline='') as fout:
        print('Combining multiple csv files into 1')
        csv_write_file = csv.writer(fout, delimiter=',')
        for num, fileName in enumerate(fileNames):
            with open(fileName, mode='rt', newline='') as read_file:
                csv_read_file = csv.reader(read_file, delimiter=',')  # yields one list per row
                if num > 0:
                    next(csv_read_file, None)  # ignore header
                csv_write_file.writerows(csv_read_file)