2016-06-17 59 views
1

How can I split a data file into multiple parts and keep the comments in each split file?

I have a data file like this:

# coating file for detector A/R 
# column 1 is the angle of incidence (degrees) 
# column 2 is the wavelength (microns) 
# column 3 is the transmission probability 
# column 4 is the reflection probability 
     14.2000  0.531000 0.0618000  0.938200 
     14.2000  0.532000 0.0790500  0.920950 
     14.2000  0.533000 0.0998900  0.900110 
# it has lots of other lines 
# datafile can be obtained from pastebin 

The link to the input data file is: http://pastebin.com/NaNbEm3E

I want to create 20 files from this input, each with the same comments.

That is:

#out1.txt 
#comments 
    first part of one-twentieth data 

# out2.txt 
# given comments 
    second part of one-twentieth data 

# and so on upto out20.txt 

How can we do this in Python?

My initial attempt is this:

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 
# Author : Bhishan Poudel 
# Date  : May 23, 2016 


# Imports 
from __future__ import print_function 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 

# read in comments from the file 
infile = 'filecopy_multiple.txt' 
outfile = 'comments.txt' 
comments = [] 
with open(infile, 'r') as fi, open(outfile, 'a') as fo:
    for line in fi.readlines():
        if line.startswith('#'):
            comments.append(line)
            print(line)
            fo.write(line)


#============================================================================== 
# read in a file 
# 
infile = infile 
colnames = ['angle', 'wave','trans','refl'] 
print('{} {} {} {}'.format('\nreading file : ', infile, '','')) 
df = pd.read_csv(infile, sep=r'\s+', header=None, skiprows=0,
                 comment='#', names=colnames, usecols=(0, 1, 2, 3))
print('{} {} {} {}'.format('length of df : ', len(df),'','')) 


# write 20 files 
nfiles = 20
nrows = len(df) // nfiles
# integer division keeps the group keys as whole numbers on Python 3 as well
groups = df.groupby(np.arange(len(df.index)) // nrows)
for (frameno, frame) in groups:
    frame.to_csv("output_%s.csv" % frameno, index=None, header=None, sep='\t')

So far I have the twenty split files. I just want to copy the comment lines into each of them. The question is: how do I do that?

There should be an easier way than creating another 20 output files that contain only the comments and then appending the twenty split files to them.

Some useful links:
How to split a dataframe column into multiple columns
How to split a DataFrame column in python
Split a large pandas dataframe

+0

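It's not very clear why you need pandas/DataFrames in this case... Do you want to keep the existing file format, or do you want to save the split files as normal CSV or HDF5 files? – MaxU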

+0

@MaxU I want to save the split files as normal CSV files, so that each of the twenty output files has the same header comments as the input file. –

+0

Does your original CSV file fit into memory, or do you have to read it line by line? – MaxU

Answers

2

UPDATE: optimized code

fn = r'D:\download\input.txt' 

with open(fn, 'r') as f: 
    data = f.readlines() 

comments_lines = 0 
for line in data:
    if line.strip().startswith('#'):
        comments_lines += 1
    else:
        break

nfiles = 20
chunk_size = (len(data)-comments_lines)//nfiles

for i in range(nfiles):
    with open('d:/temp/output_{:02d}.txt'.format(i), 'w') as f:
        f.write(''.join(data[:comments_lines] + data[comments_lines+i*chunk_size:comments_lines+(i+1)*chunk_size]))
        if i == nfiles - 1 and len(data) > comments_lines+(i+1)*chunk_size:
            f.write(''.join(data[comments_lines+(i+1)*chunk_size:]))
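
The code above puts every leftover row into the last file. As a variation, here is a minimal sketch that spreads the remainder over the first few files instead, so file sizes differ by at most one row. The function name `split_with_comments` and its `pattern` parameter are made up for illustration, and the sketch assumes that the comment block sits at the top of the file and that the whole file fits in memory.

```python
def split_with_comments(infile, nfiles=20, pattern='output_{:02d}.txt'):
    """Split infile into nfiles pieces, repeating the leading comment block.

    Hypothetical helper: assumes comments appear only at the top of the file.
    """
    with open(infile) as f:
        lines = f.readlines()

    # Count the leading comment lines (the header block).
    n_comments = 0
    for line in lines:
        if not line.lstrip().startswith('#'):
            break
        n_comments += 1

    header, data = lines[:n_comments], lines[n_comments:]

    # base rows per file, plus one extra row for the first `extra` files.
    base, extra = divmod(len(data), nfiles)

    outnames = []
    start = 0
    for i in range(nfiles):
        stop = start + base + (1 if i < extra else 0)
        name = pattern.format(i + 1)
        with open(name, 'w') as fo:
            fo.writelines(header)            # same comments in every file
            fo.writelines(data[start:stop])  # this file's share of the rows
        outnames.append(name)
        start = stop
    return outnames
```

Calling `split_with_comments('input.txt')` would write `output_01.txt` through `output_20.txt` into the current directory and return their names.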

Original answer:

comments = [] 
data = [] 

with open('input.txt', 'r') as f: 
    data = f.readlines() 

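# collect the leading comment lines
i = 0
for line in data:
    if line.strip().startswith('#'):
        comments.append(line)
        i += 1
    else:
        break

data[:] = data[i:]

# rows per output file (at least 1, so the range step is never zero)
chunk_size = max(1, len(data) // 20)

# note: if len(data) is not a multiple of 20, the remainder goes into one extra, shorter file
i = 0
for x in range(0, len(data), chunk_size):
    with open('output_{:02d}.txt'.format(i), 'w') as f:
        # write the whole chunk, not just its first 20 lines
        f.write(''.join(comments + data[x:x+chunk_size]))
    i += 1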
+0

Traceback (most recent call last): File "split_file_with_comments.py", line 25, in data = [line] + f.readlines() ValueError: Mixing iteration and read methods would lose data –

+0

@BhishanPoudel, I tested it under Python 3; let me test it under Python 2... – MaxU

+0

@MaxU I am using macOS 10.9, and this code shows the same error with both python2 and python3. I only removed the D:\download\ and d:/temp/ names. python3 again shows the ValueError –
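
The traceback quoted in the comments points at a line of the form `data = [line] + f.readlines()`, which calls a read method on a file that is already being iterated over. On Python 2 that combination raises exactly this `ValueError`. A minimal sketch that sidesteps the problem by using iteration only is shown here. The function name `read_comments_and_data` is hypothetical, and the sketch assumes that only the leading `#` lines count as the header, so any later `#` line is kept as data.

```python
def read_comments_and_data(path):
    """Return (comments, data) using only iteration over the file object.

    Hypothetical helper: the header is the run of '#' lines at the very top.
    """
    comments, data = [], []
    with open(path) as f:
        in_header = True
        for line in f:  # never mixed with f.readline()/f.readlines()
            if in_header and line.lstrip().startswith('#'):
                comments.append(line)
            else:
                in_header = False
                data.append(line)
    return comments, data
```

Because the file is consumed in a single `for` loop, the same code behaves identically on Python 2 and Python 3.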

2

This should do it

# Store comments in this to use for all files
comments = []

# Create a new sub list for each of the 20 files
data = []
for _ in range(20):
    data.append([])

# Track line number
index = 0

# open input file
with open('input.txt', 'r') as fi:
    # fetch all lines at once so I can count them.
    lines = fi.readlines()

    # Loop to gather initial comments
    # (startswith also handles '#comment' without a space, and blank lines)
    while index < len(lines) and lines[index].lstrip().startswith('#'):
        comments.append(lines[index])
        index += 1

    # Calculate how many lines of data
    numdata = len(lines) - len(comments)

    for i in range(index, len(lines)):
        # Calculate which of the 20 files I'm working with
        # (integer division, so the result is a valid list index on Python 3 too)
        filenum = (i - index) * 20 // numdata
        # Append line to appropriately tracked sub list
        data[filenum].append(lines[i])

for i in range(1, len(data) + 1):
    # Open output file
    with open('output{}.txt'.format(i), 'w') as fo:
        # Write comments
        for c in comments:
            fo.write(c)
        # Write data
        for line in data[i-1]:
            fo.write(line)
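
To see how the file-number formula in this answer distributes rows, the following sketch counts how many data lines land in each bucket when line `k` (counted from the first data line) is sent to bucket `k * nfiles // numdata`. The helper name `bucket_sizes` is made up for this illustration, and the division is taken as integer division.

```python
from collections import Counter


def bucket_sizes(numdata, nfiles):
    """Rows per file when row k goes to file k * nfiles // numdata."""
    counts = Counter(k * nfiles // numdata for k in range(numdata))
    return [counts[b] for b in range(nfiles)]


# 10 data lines over 4 files: the sizes differ by at most one row.
print(bucket_sizes(10, 4))  # [3, 2, 3, 2]
```

Unlike a fixed chunk size, this mapping never leaves a large remainder for the last file.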
+0

@piRSquared_非常感谢。 –
