2014-01-09 36 views
3

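Extract oddly arranged data from a CSV and convert it using Python

I have an oddly structured CSV file in which the header names and their corresponding values are laid out as shown below, and I need to convert it into another CSV file: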

,,,Completed Milling Job,,,,,, # row 1 

,,,,Extended Report,,,,, 

,,Job Spec numerical control,,,,,,, 

Job Number,3456,,,,,, Operator Id,clipper, 

Coder Machine Name,Caterpillar,,,,,,Job Start time,3/12/2013 6:22, 

Machine type,Stepper motor,,,,,,Job end time,3/12/2013 9:16, 

I need to extract the data from this structure and create another CSV file with the following structure:

Status,Job Number,Coder Machine Name,Machine type, Operator Id,Job Start time,Job end time,,, # header 
Completed Milling Job,3456,Caterpillar,Stepper motor,clipper,3/12/2013 6:22,3/12/2013 9:16,,, # data row 

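Note that a new column header called "Status" has been added, and its value is taken from the first row of the original CSV file. The remaining column names in the output file are taken from the original file.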

Any ideas would be greatly appreciated. Thanks.

+0

The original file format is as follows: – user3130236

+0

Does the original file contain multiple jobs, or is there a separate file for each job? – mmdanziger

+0

Each job has its own file, so all I want to extract from a given file is a single row. – user3130236

Answers

0

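Assuming the files are all laid out exactly alike (at least in terms of capitalization), this should work, although I can only vouch for it on the exact data you provided: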

#!/usr/bin/python
import glob
from sys import argv

# argv[1] is a glob pattern for the input files, argv[2] is the output file name
with open(argv[2], 'w') as g:
    g.write("Status,Job Number,Coder Machine Name,Machine type, Operator Id,Job Start time,Job end time\n")
    for fname in glob.glob(argv[1]):
        with open(fname) as f:
            status = f.readline().strip().strip(',')
            f.readline()  # "Extended Report" line, not needed
            f.readline()  # "Job Spec numerical control" line, not needed
            s = f.readline()
            job_no = s.split('Job Number,')[1].split(',')[0]
            op_id = s.split('Operator Id,')[1].strip().strip(',')
            s = f.readline()
            machine_name = s.split('Coder Machine Name,')[1].split(',')[0]
            start_t = s.split('Job Start time,')[1].strip().strip(',')
            s = f.readline()
            machine_type = s.split('Machine type,')[1].split(',')[0]
            end_t = s.split('Job end time,')[1].strip().strip(',')
        # one output row per input file
        g.write(",".join([status, job_no, machine_name, machine_type, op_id, start_t, end_t]) + "\n")

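It takes a glob argument (such as Job*.data) and an output file name, and should build what you need. Just save it as 'so.py' or something similar and run it as python so.py <data_files_wildcarded> output.csv. Quote the wildcard so the shell passes it to the script unexpanded; if the shell expands a pattern that matches more than one file, argv[2] ends up being one of the input files instead of the output name.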
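If the exact string splitting feels fragile, the same per-file extraction can be sketched with the standard csv module. This is only a sketch under the assumption that every file has the layout shown in the question (status on the first non-blank line, two title lines, then key/value cells). The names `FIELDS` and `extract_row` are made up for illustration, and a missing key would raise a KeyError.

```python
import csv

# Column names expected in the input, in the order wanted in the output.
# (Assumption: these match the keys in the question's sample exactly.)
FIELDS = ["Job Number", "Coder Machine Name", "Machine type",
          "Operator Id", "Job Start time", "Job end time"]

def extract_row(lines):
    """Return [status, value for each name in FIELDS] from one job file."""
    rows = [[cell.strip() for cell in row] for row in csv.reader(lines)]
    rows = [row for row in rows if any(row)]  # drop blank lines
    # The status is the only non-empty cell on the first line.
    status = next(cell for cell in rows[0] if cell)
    found = {}
    # Skip the status line and the two title lines, then look at each
    # cell together with its right-hand neighbour.
    for row in rows[3:]:
        for key, value in zip(row, row[1:]):
            if key in FIELDS:
                found[key] = value
    return [status] + [found[name] for name in FIELDS]

sample = [
    ",,,Completed Milling Job,,,,,,",
    ",,,,Extended Report,,,,,",
    ",,Job Spec numerical control,,,,,,,",
    "Job Number,3456,,,,,, Operator Id,clipper,",
    "Coder Machine Name,Caterpillar,,,,,,Job Start time,3/12/2013 6:22,",
    "Machine type,Stepper motor,,,,,,Job end time,3/12/2013 9:16,",
]
row = extract_row(sample)
print(",".join(row))
```

In a real script you would pass an open file object instead of the `sample` list and write each returned row with csv.writer.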

+0

Thank you very much. I will try this code. – user3130236

+0

If you find the answer useful, please upvote it and/or click the check mark to accept it. – mmdanziger

+0

Sure, absolutely. Thanks. – user3130236

0

The following solution works for any CSV file that follows the same pattern as the one shown. It is a seriously nasty format.

I found the problem interesting and worked on it over my lunch break. Here is the code:

COMMA = ',' 
NEWLINE = '\n' 

def _kvpairs_from_line(line):
    line = line.strip()
    values = [item.strip() for item in line.split(COMMA)]

    i = 0
    while i < len(values):
        if not values[i]:
            i += 1  # advance past empty value
        else:
            # yield pair of values; a key in the last cell gets an empty value
            value = values[i+1] if i + 1 < len(values) else ''
            yield (values[i], value)
            i += 2  # advance past pair

def kvpairs_by_column_then_row(lines): 
    """ 
    Given a series of lines, where each line is comma-separated values 
    organized as key/value pairs like so: 
     key_1,value_1,key_n+1,value_n+1,... 
     key_2,value_2,key_n+2,value_n+2,... 
     ... 
     key_n,value_n,key_n+n,value_n+n,... 

    Yield up key/value pairs taken from the first column, then from the second column 
    and so on. 
    """ 
    pairs = [_kvpairs_from_line(line) for line in lines]
    done = [False for _ in pairs]
    while not all(done):
        for i in range(len(pairs)):
            if not done[i]:
                try:
                    key_value_tuple = next(pairs[i])
                    yield key_value_tuple
                except StopIteration:
                    done[i] = True

STATUS = "Status" 
columns = [STATUS] 

d = {} 

with open("data.csv", "rt") as f: 
    # get an iterator that lets us pull lines conveniently from file 
    itr = iter(f) 

    # pull first line and collect status 
    line = next(itr) 
    lst = line.split(COMMA) 
    d[STATUS] = lst[3] 

    # pull next lines and make sure the file is what we expected 
    line = next(itr) 
    assert "Extended Report" in line 
    line = next(itr) 
    assert "Job Spec numerical control" in line 

    # pull all remaining lines and save in a list 
    lines = [line.strip() for line in f] 

for key, value in kvpairs_by_column_then_row(lines): 
    columns.append(key) 
    d[key] = value 

with open("output.csv", "wt") as f: 
    # write column headers line 
    line = COMMA.join(columns) 
    f.write(line + NEWLINE) 
    # write data row 
    line = COMMA.join(d[key] for key in columns) 
    f.write(line + NEWLINE)
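The "first column of pairs, then second column" ordering can also be sketched more compactly with `itertools.zip_longest`. This is an alternative illustration rather than a drop-in replacement for the code above: it assumes Python 3, and the helper names `pairs_in_line` and `column_major_pairs` are made up for the example.

```python
import csv
from itertools import zip_longest

def pairs_in_line(fields):
    """Return (key, value) pairs from one parsed row, skipping empty cells."""
    cells = [c.strip() for c in fields]
    pairs = []
    i = 0
    while i < len(cells):
        if cells[i]:
            # A key in the very last cell gets an empty value.
            value = cells[i + 1] if i + 1 < len(cells) else ""
            pairs.append((cells[i], value))
            i += 2
        else:
            i += 1
    return pairs

def column_major_pairs(lines):
    """Yield pairs from the first pair-column of every line, then the second, ..."""
    rows = [pairs_in_line(fields) for fields in csv.reader(lines)]
    # zip_longest transposes the rows; shorter rows are padded with None.
    for column in zip_longest(*rows):
        for pair in column:
            if pair is not None:
                yield pair

sample = [
    "Job Number,3456,,,,,, Operator Id,clipper,",
    "Coder Machine Name,Caterpillar,,,,,,Job Start time,3/12/2013 6:22,",
    "Machine type,Stepper motor,,,,,,Job end time,3/12/2013 9:16,",
]
ordered = list(column_major_pairs(sample))
print([key for key, _ in ordered])
```

On the sample lines from the question this produces the keys in the same order as the requested output header, which is why the column-then-row traversal is used at all.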