2016-11-23 24 views
0

Very long strings in a NumPy array in Python

I am reading data from a CSV file using data = numpy.recfromtxt('table.csv', delimiter=';', dtype=str)

The table looks like this:

Name; Birthdate; Biography 
John; 1990; Lorem ipsum dolor sit amet, consectetur adipiscing elit. Hanc ergo intuens debet institutum illud quasi signum absolvere. Scrupulum, inquam, abeunti; Quae diligentissime contra Aristonem dicuntur a Chryippo. Quo tandem modo? 

Python and NumPy seem to have a problem with this long string. Any ideas on how to solve this?

+3

What kind of _problems_ are you talking about? You should clarify that a bit. – Lafexlos

+0

'recfromtxt' uses the more common 'genfromtxt'. The first line has 2 delimiters. The second has 3. How many fields do you expect? – hpaulj

Answers

1

You can use Python's pandas package.

Here is a simple idea of how to use it:

import pandas as pd 

data = pd.read_csv("file.csv", delimiter = ";") 

Hope this is what you want...

+0

What does this produce? – hpaulj

0

Please use the pandas package to read from the CSV:

import pandas as pd
data = pd.read_csv('table.csv')

pandas can handle long strings as well.

0

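I had no problem reading it, so perhaps your question is about formatting the text in a way that suits printing. Here are a couple of options.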

>>> import textwrap 
>>> a = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Hanc ergo intuens debet institutum illud quasi signum absolvere. Scrupulum, inquam, abeunti; Quae diligentissime contra Aristonem dicuntur a Chryippo. Quo tandem modo?" 
>>> txt = textwrap.wrap(a, width=70) 
>>> print(("{}\n"*len(txt)).format(*txt)) 
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Hanc ergo 
intuens debet institutum illud quasi signum absolvere. Scrupulum, 
inquam, abeunti; Quae diligentissime contra Aristonem dicuntur a 
Chryippo. Quo tandem modo? 

Or perhaps this one...

>>> txt2 = "\n".join([i for i in txt]) 
>>> print(txt2) 
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Hanc ergo 
intuens debet institutum illud quasi signum absolvere. Scrupulum, 
inquam, abeunti; Quae diligentissime contra Aristonem dicuntur a 
Chryippo. Quo tandem modo? 
>>>  
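A third option, not part of the original answer: textwrap.fill wraps and joins in a single call, so it should give the same text as joining the result of textwrap.wrap with newlines. A minimal sketch using the same string:

```python
import textwrap

a = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. Hanc ergo "
     "intuens debet institutum illud quasi signum absolvere. Scrupulum, "
     "inquam, abeunti; Quae diligentissime contra Aristonem dicuntur a "
     "Chryippo. Quo tandem modo?")

# fill() returns one string with newlines inserted at the wrap points.
wrapped = textwrap.fill(a, width=70)
print(wrapped)
```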
0

The error is:

In [67]: np.recfromtxt('stack40765849.txt', delimiter=';', dtype=str) 
--------------------------------------------------------------------------- 
ValueError        Traceback (most recent call last) 
<ipython-input-67-eab6d3192d4d> in <module>() 
----> 1 np.recfromtxt('stack40765849.txt', delimiter=';', dtype=str) 

/usr/lib/python3/dist-packages/numpy/lib/npyio.py in recfromtxt(fname, **kwargs) 
    1949  kwargs.setdefault("dtype", None) 
    1950  usemask = kwargs.get('usemask', False) 
-> 1951  output = genfromtxt(fname, **kwargs) 
    1952  if usemask: 
    1953   from numpy.ma.mrecords import MaskedRecords 
... 
ValueError: Some errors were detected ! 
    Line #2 (got 4 columns instead of 3) 

(Note that recfromtxt uses genfromtxt, which has been discussed a lot.)

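The problem is not the length of the string but the number of delimiters. The first line (a header?) has 2, which implies that you want 3 columns or fields. But the second line has 3; the extra one is probably part of the text.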
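One way to check this before calling recfromtxt is to count the fields on each line. The following is only a sketch, not part of the original answer: the sample text is a shortened copy of the table in the question, and the helper name fields_per_line is made up for illustration.

```python
import io

# Shortened copy of the table from the question (assumption: same layout).
SAMPLE = (
    "Name; Birthdate; Biography\n"
    "John; 1990; Lorem ipsum dolor sit amet. Scrupulum, inquam, abeunti; Quae diligentissime.\n"
)

def fields_per_line(handle, delimiter=";"):
    """Return the number of delimiter-separated fields on each non-empty line."""
    return [line.count(delimiter) + 1 for line in handle if line.strip()]

counts = fields_per_line(io.StringIO(SAMPLE))
print(counts)  # [3, 4]
```

The second count of 4 corresponds to the "got 4 columns instead of 3" message in the traceback.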

Treating the first line as field names produces the same error.

np.recfromtxt('stack40765849.txt', delimiter=';', dtype=str,names=True) 

pandas loads the file as:

In [74]: data=pandas.read_csv('stack40765849.txt',delimiter=';') 
In [75]: data 
Out[75]: 
     Name           Birthdate \ 
John 1990 Lorem ipsum dolor sit amet, consectetur adipi... 

               Biography 
John Quae diligentissime contra Aristonem dicuntur... 

It doesn't raise an error, but the result doesn't look right.

==================

If I change the ; inside the text to a .:

In [82]: np.genfromtxt('stack40765849_1.txt', delimiter=';', dtype=None, names=True)
Out[82]: 
array((b'John', 1990, b' Lorem ipsum dolor sit amet, consectetur adipiscing elit. Hanc ergo intuens debet institutum illud quasi signum absolvere. Scrupulum, inquam, abeunti. Quae diligentissime contra Aristonem dicuntur a Chryippo. Quo tandem modo?'), 
     dtype=[('Name', 'S4'), ('Birthdate', '<i4'), ('Biography', 'S225')]) 

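I get a structured array (almost like a recarray) with 3 fields; the last one is very long - the full text. (b'...' denotes a bytestring in Py3; it does not appear in the Py2 display.)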
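If the bytestrings get in the way, individual values can be decoded to ordinary Py3 strings. A small sketch, assuming the file is UTF-8 encoded (the original does not say); the two values are shortened copies of the fields shown above:

```python
# Field values as they appear in the structured array (biography shortened).
name = b'John'
biography = b' Lorem ipsum dolor sit amet'

# decode() turns bytes into str; strip() drops the leading space
# left over from the '; ' separator in the file.
name_text = name.decode('utf-8')
biography_text = biography.decode('utf-8').strip()

print(repr(name_text))       # 'John'
print(repr(biography_text))  # 'Lorem ipsum dolor sit amet'
```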

pandas produces something similar:

In [83]: data=pandas.read_csv('stack40765849_1.txt',delimiter=';') 
In [84]: data 
Out[84]: 
    Name Birthdate           Biography 
0 John  1990 Lorem ipsum dolor sit amet, consectetur adipi... 

Loading it properly as Py3 unicode:

In [91]: np.recfromtxt('stack40765849_1.txt', delimiter=';', dtype='U4,i,U255', names=True)
Out[91]: 
rec.array(('John', 1990, ' Lorem ipsum dolor sit amet, consectetur adipiscing elit. Hanc ergo intuens debet institutum illud quasi signum absolvere. Scrupulum, inquam, abeunti. Quae diligentissime contra Aristonem dicuntur a Chryippo. Quo tandem modo?'), 
      dtype=[('Name', '<U4'), ('Birthdate', '<i4'), ('Biography', '<U255')]) 
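If editing the file is not an option, another possibility is to split each line yourself and limit the number of splits, so that any extra ; stays inside the last field. This is a sketch under assumptions, not something from the original answer: it uses plain Python rather than genfromtxt, the sample is a shortened copy of the question's table, and parse_rows is a hypothetical helper name.

```python
import io

# Shortened copy of the table from the question (assumption: same layout).
SAMPLE = (
    "Name; Birthdate; Biography\n"
    "John; 1990; Lorem ipsum dolor sit amet. Scrupulum, inquam, abeunti; Quae diligentissime.\n"
)

def parse_rows(handle, delimiter=";", n_fields=3):
    """Split each line into at most n_fields parts; return (header, rows)."""
    header = None
    rows = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        # maxsplit = n_fields - 1 leaves later delimiters inside the last field.
        parts = [part.strip() for part in line.split(delimiter, n_fields - 1)]
        if header is None:
            header = parts
        else:
            rows.append(tuple(parts))
    return header, rows

header, rows = parse_rows(io.StringIO(SAMPLE))
print(header)      # ['Name', 'Birthdate', 'Biography']
print(rows[0][2])  # the biography, including its own ';'
```

The tuples could then be converted (for example the year with int()) and placed in a structured array with a dtype like the one used above.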