2016-03-27 33 views
1

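ValueError in pandas read_csv with a "::" separator

I get a ValueError when running the code below on the file "ratings.dat". I have tried the same code on another file that uses "," as the separator, without any problems. However, pandas seems to fail when the separator is "::".

Is there something wrong with the code I typed?

Code: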

import pandas as pd 
import numpy as np 

r_cols = ['userId', 'movieId', 'rating'] 
r_types = {'userId': np.str, 'movieId': np.str, 'rating': np.float64} 
ratings = pd.read_csv(
     r'C:\\Users\\Admin\\OneDrive\\Documents\\_Learn!\\' 
     r'Learn Data Science\\Data Sets\\MovieLens\\ml-1m\\ratings.dat', 
     sep='::', names=r_cols, usecols=range(3), dtype=r_types 
     ) 

m_cols = ['movieId', 'title'] 
m_types = {'movieId': np.str, 'title': np.str} 
movies = pd.read_csv(
     r'C:\\Users\\Admin\\OneDrive\\Documents\\_Learn!\\' 
     r'Learn Data Science\\Data Sets\\MovieLens\\ml-1m\\movies.dat', 
     sep='::', names=m_cols, usecols=range(2), dtype=m_types 
     ) 

ratings = pd.merge(movies, ratings) 
ratings.head() 

Sample from "ratings.dat":

1::1287::5::978302039 
1::2804::5::978300719 
1::594::4::978302268 
1::919::4::978301368 
1::595::5::978824268 

Error output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-19-a2649e528fb9> in <module>()
     7   r'C:\\Users\\Admin\\OneDrive\\Documents\\_Learn!\\' 
     8   r'Learn Data Science\\Data Sets\\MovieLens\\ml-1m\\ratings.dat', 
----> 9   sep='::', names=r_cols, usecols=range(3), dtype=r_types 
    10  ) 
    11 
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines) 
    496      skip_blank_lines=skip_blank_lines) 
    497 
--> 498   return _read(filepath_or_buffer, kwds) 
    499 
    500  parser_f.__name__ = name 
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds) 
    273 
    274  # Create the parser. 
--> 275  parser = TextFileReader(filepath_or_buffer, **kwds) 
    276 
    277  if (nrows is not None) and (chunksize is not None): 
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds) 
    584 
    585   # might mutate self.engine 
--> 586   self.options, self.engine = self._clean_options(options, engine) 
    587   if 'has_index_names' in kwds: 
    588    self.options['has_index_names'] = kwds['has_index_names'] 
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in _clean_options(self, options, engine) 
    663       msg += " (Note the 'converters' option provides"\ 
    664        " similar functionality.)" 
--> 665      raise ValueError(msg) 
    666     del result[arg] 
    667 
ValueError: Falling back to the 'python' engine because the 'c' engine does not support regex separators, but this causes 'dtype' to be ignored as it is not supported by the 'python' engine. (Note the 'converters' option provides similar functionality.) 

Answer

3

If you read the last line of the traceback carefully, you will probably see why it fails. I have split it into two lines:

ValueError: Falling back to the 'python' engine because the 'c' engine does not support regex separators,

but this causes 'dtype' to be ignored as it is not supported by the 'python' engine. (Note the 'converters' option provides similar functionality.)

So the separator '::' is interpreted as a regular expression. As the pandas documentation says about sep:

Regular expressions are accepted and will force use of the python parsing engine

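(Emphasis mine.)

Therefore, pandas will use the Python engine to read the data. The second half of the error message then says that, because the Python engine is used, dtype is ignored. (Presumably the C engine is backed by NumPy, which can handle dtypes, whereas the Python engine apparently does not deal with dtypes.)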


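How to fix it

You can either remove the dtype argument from your call to read_csv (you will still get a warning), or do something about the separator.

The second option seems tricky: escaping the separator or using a raw string does not help. Apparently, any separator longer than one character is interpreted by pandas as a regular expression. That may be an unfortunate decision on pandas's part.

One way to avoid all this is to use a single ':' as the separator and skip every other (empty) column. For example: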

ratings = pd.read_csv(filename, sep=':', names=r_cols,
                      usecols=[0, 2, 4], dtype=r_types)

(Or use usecols=range(0, 5, 2) if you are set on using range.)
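The error message itself points to the converters option as a substitute for dtype. As a minimal sketch of the first option above (dropping dtype), the snippet below requests the Python engine explicitly and converts each column with converters. The in-memory sample, the name r_converters, and the 'timestamp' column name are assumptions made for illustration; they are not part of the original question. Newer pandas releases may also accept dtype with the Python engine, so the exact error may not reproduce there.

```python
import io

import pandas as pd

# A few lines in the same format as ratings.dat (copied from the question).
sample = io.StringIO(
    "1::1287::5::978302039\n"
    "1::2804::5::978300719\n"
    "1::594::4::978302268\n"
)

# Name all four fields so that usecols can select by name.
# 'timestamp' is an assumed label for the fourth field.
all_cols = ['userId', 'movieId', 'rating', 'timestamp']
r_cols = ['userId', 'movieId', 'rating']

# converters maps a column name to a function applied to each raw value.
r_converters = {'userId': str, 'movieId': str, 'rating': float}

ratings = pd.read_csv(
    sample,
    sep='::',
    engine='python',      # requested explicitly, so nothing has to fall back
    header=None,
    names=all_cols,
    usecols=r_cols,
    converters=r_converters,
)
print(ratings)
```

For the real file, the StringIO object would be replaced by the path to ratings.dat.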


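Addendum

The OP rightly raised the point about fields that contain a single ':' character. There may be a way around that, but otherwise you can take a two-step approach and use NumPy's genfromtxt instead: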

# genfromtxt needs a NumPy structured dtype, not a column-to-type dict like r_types
# for Python 3, we need U10 for strings
dtype = np.dtype([('userId', 'U10'), ('movieId', 'U10'),
                  ('rating', np.float64)])
data = np.genfromtxt(filename, dtype=dtype, names=r_cols,
                     delimiter='::', usecols=list(range(3)))
ratings = pd.DataFrame(data)
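Another way to handle fields that contain a single ':' is to split each line on the literal string '::' in plain Python and build the DataFrame from the resulting lists. This is only a sketch and not part of the original answer. The two sample lines are made up in the layout of movies.dat, and the names sample and rows are hypothetical.

```python
import io

import pandas as pd

# Made-up lines in the movies.dat layout; the second title contains a single ':'.
sample = io.StringIO(
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "12::Dracula: Dead and Loving It (1995)::Comedy|Horror\n"
)

m_cols = ['movieId', 'title']

# str.split treats '::' as a literal string, not as a regular expression,
# so a lone ':' inside a field is left alone. Keep only the first two fields.
rows = [
    line.rstrip('\n').split('::')[:len(m_cols)]
    for line in sample
    if line.strip()
]

movies = pd.DataFrame(rows, columns=m_cols)
print(movies)
```

Every value comes out as a string here, so numeric columns such as rating would still need an explicit conversion afterwards (for example with astype). For a real file, you would iterate over the opened file instead of the StringIO object.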

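One of the data records has a ":" inside a data field. As a result, Python keeps throwing a C error: "Expected 5 fields in line 12, saw 6". Is there any way to handle this? – Cloud

At that point, I would probably open the data file in a text editor, check whether the file contains, for example, any commas or semicolons, and then replace every '::' with ','. Assuming, of course, that I have access to the file. – Evert

@Cloud You may want to ask about your case on the pandas mailing list (how to avoid '::' being interpreted as a regular expression; backslashes did not work when I tested this), or raise an issue on the pandas GitHub page. – Evert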
