2015-12-12 22 views
1

Reading columns with missing items in Python

I have a data file, part of which is shown below. The data is space-separated, but the amount of spacing is not uniform.

AAA B  C D E F G H I J 
AAA B  C D E F G H I J 
AAA B  C D E F G H I J 

I use

AAA,B,C,D,E,F,G,H,I,J = line.split() 

to read the data.

Recently I received new data in which columns D and/or I and/or J are sometimes missing.
The columns look like this:

AAA B C D E F G H I J 
AAA B C  E F G H  J 
AAA B C  E F G H    

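The only columns that matter to me are B, E, F and G. I can't use line.split() any more, because the number of variables on the left-hand side now changes from line to line. Can the script be rewritten so that it reads every variant of the input data? Any suggestions?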

+0

So, does a line with missing data have a space where the value should be, or no space at all, with the next column shifted to the left? –

+0

What is the format of the file? Comma-separated or tab-separated, by the look of it? – Llopis

+0

If whitespace is the delimiter, this cannot be done. –
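
To illustrate the point made in the comment above, here is a minimal sketch, using made-up one-letter rows, of why plain whitespace splitting cannot tell which column is missing when nothing marks the gap. The variable names are illustrative only.

```python
# Hypothetical rows: the second one lacks column D, the third lacks column I.
full = "AAA B C D E F G H I J"
no_d = "AAA B C E F G H I J"
no_i = "AAA B C D E F G H J"

for row in (full, no_d, no_i):
    fields = row.split()  # any run of whitespace counts as one separator
    print(len(fields), fields)

# Both short rows yield nine fields, so the field count alone cannot say
# whether D or I was dropped, and fixed tuple unpacking fails on them.
try:
    AAA, B, C, D, E, F, G, H, I, J = no_d.split()
except ValueError as exc:
    print("unpacking failed:", exc)
```

Some extra information is therefore needed: a marker such as '-', a fixed separator width, or fixed column positions.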

Answers

1

You can use the CSV-reading capabilities of pandas or numpy:

import numpy as np 
data = np.genfromtxt(
    'data.txt', 
    missing_values=['-', ], 
    names=['AAA', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'] 
) 
print(data['AAA']) 

Or with pandas:

import pandas as pd 
data = pd.read_csv(
    'data.txt', 
    sep=r'\s+', 
    na_values='-', 
    names=['AAA', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'], 
) 

print(data['AAA']) 
+0

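The data is space-separated. The dashes are only there to show the missing data. As shown earlier, all the remaining data is separated by exact spaces. –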

+0

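Yes. For numpy, whitespace separation is the default; for pandas, the regular expression '\s+' handles the column separator. For numpy, 'missing_values=['-', ]' replaces '-' with NaN. The same goes for pandas' 'na_values'. – MaxNoe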
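
The same idea can be sketched without either library, assuming (as in the answer above) that every missing value is written as '-', so each line still has ten whitespace-separated fields. The names `parse_line`, `MISSING` and `NAMES` are illustrative and not part of numpy or pandas.

```python
MISSING = "-"  # assumed marker for an absent value
NAMES = ["AAA", "B", "C", "D", "E", "F", "G", "H", "I", "J"]

def parse_line(line, names=NAMES, missing=MISSING):
    """Map column names to values, turning the missing marker into None."""
    fields = line.split()  # runs of whitespace act as a single separator
    if len(fields) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(fields)}")
    return {name: (None if value == missing else value)
            for name, value in zip(names, fields)}

# Made-up sample in which '-' stands in for the missing columns.
sample = [
    "AAA B  C D E F G H I J",
    "AAA B  C - E F G H - J",
    "AAA B  C - E F G H - -",
]
rows = [parse_line(line) for line in sample]
for row in rows:
    # Only B, E, F and G matter to the asker; D is shown to expose the gaps.
    print(row["B"], row["E"], row["F"], row["G"], row["D"])
```

This only works because the marker keeps the field count constant; without a marker the columns cannot be realigned this way.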

1

If the amount of space between items is fixed, and a missing item is just a single space, you can do this:

>>> s="AAA  B  C     E  F  G  H     J " 
>>> s.split("  ") 
['AAA', 'B', 'C', '', ' E', 'F', 'G', 'H', '', ' J '] 

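Edit

Assuming the space between two consecutive items is constant throughout the file, here is my suggestion.

Take this file as an example: missing.txt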

AAA B  C D E F G H I J 
AAA B  C D E F G H I J 
AAA B  C  E F G H  J 
AAA B  C  E F G H 

100 2  3 4 5 6 7 8 9 10 
100 2  3  5 6 7 8 9 10 
100 2  3  5 6 7 8  10 
100 2  3  5 6 7 8   

100.1 2.1  3.1 4.1 5.1 6.1 7.1 8.1 9.1 10.1 
100.1 2.1  3.1  5.1 6.1 7.1 8.1 9.1 10.1 
100.1 2.1  3.1  5.1 6.1 7.1 8.1  10.1 
100.1 2.1  3.1  5.1 6.1 7.1 8.1   

hello this  is a example of a normal file right? 
hello this  is  example of a normal file right? 
hello this  is  example of a normal  right? 
hello this  is  example of a normal   

and use this function

def read_data_line(path_file, data_size=10, line_format=None, temp_char="@", ignore=True):
    """Generator to read data_size items per line from a file that may have some missing.

    path_file:   path to the file
    line_format: list with the amount of space between each 2 consecutive items
    temp_char:   character that this function will use as a placeholder for
                 the missing data during processing
    data_size:   number of items expected per line of the file
    ignore:      in case 'line_format' is not given, ignore all lines that
                 don't have the correct format; otherwise the first line is
                 expected to have the correct format, so it can be used as
                 a model for the rest of the file

    Expected format of the content of the file:
    A B  C D E F G H I J

    with A,B,...,J strings without spaces or 'temp_char', or numbers

    This function assumes that the space between 2 consecutive
    items is constant throughout the file

    usage

    >>> datos = list(read_data_line("/some_folder/some_file.txt"))

    or

    >>> for line in read_data_line("/some_folder/some_file.txt"):
    ...     print(line)
    """
    with open(path_file, "r") as data_raw:  # the usual way of managing files
        for line in data_raw:  # read the file one line at a time
            datos = line.split()
            if not line_format and len(datos) == data_size:
                # All the data is present: take this line's spacing as the norm
                line = line.strip()
                for d in datos:
                    line = line.replace(d, temp_char, 1)
                line_format = [len(x) for x in line.split(temp_char)[1:-1]]
            if len(datos) < data_size:  # missing data
                if line_format:
                    for t in line_format:
                        line = line.replace(" " * t, temp_char, 1)
                    datos = list(map(str.strip, line.split(temp_char)))
                else:
                    if ignore:
                        continue
                    raise RuntimeError("Impossible to determine the structure of the file")
            yield datos

Output

>>> for x in read_data_line("missing.txt"): 
    print(x) 


['AAA', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'] 
['AAA', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'] 
['AAA', 'B', 'C', '', 'E', 'F', 'G', 'H', '', 'J'] 
['AAA', 'B', 'C', '', 'E', 'F', 'G', 'H'] 
[''] 
['100', '2', '3', '4', '5', '6', '7', '8', '9', '10'] 
['100', '2', '3', '', '5', '6', '7', '8', '9', '10'] 
['100', '2', '3', '', '5', '6', '7', '8', '', '10'] 
['100', '2', '3', '', '5', '6', '7', '8', '', ''] 
[''] 
['100.1', '2.1', '3.1', '4.1', '5.1', '6.1', '7.1', '8.1', '9.1', '10.1'] 
['100.1', '2.1', '3.1', '', '5.1', '6.1', '7.1', '8.1', '9.1', '10.1'] 
['100.1', '2.1', '3.1', '', '5.1', '6.1', '7.1', '8.1', '', '10.1'] 
['100.1', '2.1', '3.1', '', '5.1', '6.1', '7.1', '8.1', '', ''] 
[''] 
['hello', 'this', 'is', 'a', 'example', 'of', 'a', 'normal', 'file', 'right?'] 
['hello', 'this', 'is', '', 'example', 'of', 'a', 'normal', 'file', 'right?'] 
['hello', 'this', 'is', '', 'example', 'of', 'a', 'normal', '', 'right?'] 
['hello', 'this', 'is', '', 'example', 'of', 'a', 'normal', '', ''] 
>>> 

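Hopefully this solves your problem, provided the spacing between your data items is consistent and each missing item is replaced by a single space (as in the example).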

+0

No. The spacing is not fixed... –

+0

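So the only thing you know is that there is at least one space between data items, and that missing data has no marker such as "-", right? Is the amount of space between consecutive items at least consistent across all lines of the file? I mean, is the space between AAA and B always the same? And between B and C? If so, I can do something... – Copperfield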

+0

I made some adjustments, how about that? – Copperfield

0

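You can still do something very similar:

a,_,b,_,c,_,d,_,e = "A  B  C    E".split(' ') 

You put one _ between the letters for each extra space (here the letters are two spaces apart, and d comes back as '' because D is missing). Or, if your missing data is not replaced by a space, split on the number of spaces between each letter and then do what you did before (this example applies if there are 3 spaces between each data item):

AAA,B,C,D,E,F,G,H,I,J = line.split('   ') 

A missing letter is filled in with '', which is the result of two separators sitting side by side.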
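
A small sketch of the second variant, assuming every pair of columns is separated by exactly three spaces and a missing value is simply left empty. The sample row and the helper `split_fixed_sep` are hypothetical.

```python
SEP = " " * 3  # assumed: exactly three spaces between neighbouring columns

def split_fixed_sep(line, sep=SEP, ncols=10):
    """Split on a fixed-width separator and pad short rows with ''."""
    fields = line.rstrip("\n").split(sep)
    # Trailing empty columns may have been trimmed from the line; pad them.
    fields += [""] * (ncols - len(fields))
    return fields

# Column D is empty, so two separators sit side by side after C.
row = SEP.join(["AAA", "B", "C", "", "E", "F", "G", "H", "I", "J"])
AAA, B, C, D, E, F, G, H, I, J = split_fixed_sep(row)
print(repr(D), B, E, F, G)
```

The padding step keeps the ten-way unpacking valid even when the last columns are absent and the trailing separators were stripped from the line.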

+0

Thanks, everyone –

0

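Thanks for your answers; I found a solution to my problem. Since the data is written in fixed-width columns (e.g. %8.3f), I think the following code is the only way to read input whose set of columns varies. I don't know whether it is the best solution.

data_raw = """AAA B C D E F  G  H  I J 
AAA B C  E F  G  H  I J 
AAA B C  E F  G  H  """ 
for line in data_raw.splitlines(): 
    # Slice offsets as posted; they presumably follow the real file's
    # column widths rather than the schematic sample above.
    aaa = line[0:2].strip() 
    b = line[4:6].strip() 
    c = line[7:10].strip() 
    d = line[11:14].strip() 
    e = line[15:16].strip() 
    f = line[17:20].strip() 
    g = line[21:26].strip() 
    h = line[27:32].strip() 
    i = line[37:38].strip() 
    j = line[39:40].strip() 
    print(b, e, f, g) 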

Output:

B E F G 
B E F G 
B E F G 
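
A more general sketch of the same fixed-width idea: describe each column by a width and slice every line at the resulting offsets, so a blank field stays in its own column. The widths below are made up to fit the small sample and are not the real file's layout; `parse_fixed_width` is a hypothetical helper.

```python
from itertools import accumulate

NAMES = ["AAA", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
WIDTHS = [4, 2, 2, 2, 2, 2, 2, 2, 2, 2]  # assumed column widths, in characters

def parse_fixed_width(line, names=NAMES, widths=WIDTHS):
    """Slice a line at fixed offsets; blank fields come back as ''."""
    ends = list(accumulate(widths))   # exclusive end offset of each column
    starts = [0] + ends[:-1]          # start offset of each column
    # Pad so that lines whose trailing blanks were trimmed still slice cleanly.
    padded = line.rstrip("\n").ljust(ends[-1])
    return {n: padded[s:e].strip() for n, s, e in zip(names, starts, ends)}

# Sample rows built explicitly so the blank columns keep their width.
sample = [
    "AAA B C D E F G H I J",
    "AAA B C" + " " * 3 + "E F G H" + " " * 3 + "J",  # D and I blank
    "AAA B C" + " " * 3 + "E F G H",                  # D, I and J blank
]
rows = [parse_fixed_width(line) for line in sample]
for row in rows:
    print(row["B"], row["E"], row["F"], row["G"])
```

This relies on every value, present or not, occupying the same character positions on every line.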
+0

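If your data is always in the same position (regardless of missing data) and every field, even a missing one, is always the same number of characters long, fine... – Copperfield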