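ValueError in pandas read_csv with a "::" separator
I get a ValueError when I run the code below on the file "ratings.dat". I have tried the same code on another file that uses "," as the separator, without any problems. However, pandas seems to fail when the separator is "::".
Am I writing the code incorrectly?
Code: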
import pandas as pd
import numpy as np
r_cols = ['userId', 'movieId', 'rating']
r_types = {'userId': np.str, 'movieId': np.str, 'rating': np.float64}
ratings = pd.read_csv(
    r'C:\\Users\\Admin\\OneDrive\\Documents\\_Learn!\\'
    r'Learn Data Science\\Data Sets\\MovieLens\\ml-1m\\ratings.dat',
    sep='::', names=r_cols, usecols=range(3), dtype=r_types
)
m_cols = ['movieId', 'title']
m_types = {'movieId': np.str, 'title': np.str}
movies = pd.read_csv(
    r'C:\\Users\\Admin\\OneDrive\\Documents\\_Learn!\\'
    r'Learn Data Science\\Data Sets\\MovieLens\\ml-1m\\movies.dat',
    sep='::', names=m_cols, usecols=range(2), dtype=m_types
)
ratings = pd.merge(movies, ratings)
ratings.head()
Sample of "ratings.dat":
1::1287::5::978302039
1::2804::5::978300719
1::594::4::978302268
1::919::4::978301368
1::595::5::978824268
Error output:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-19-a2649e528fb9> in <module>()
7 r'C:\\Users\\Admin\\OneDrive\\Documents\\_Learn!\\'
8 r'Learn Data Science\\Data Sets\\MovieLens\\ml-1m\\ratings.dat',
----> 9 sep='::', names=r_cols, usecols=range(3), dtype=r_types
10 )
11
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
496 skip_blank_lines=skip_blank_lines)
497
--> 498 return _read(filepath_or_buffer, kwds)
499
500 parser_f.__name__ = name
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
273
274 # Create the parser.
--> 275 parser = TextFileReader(filepath_or_buffer, **kwds)
276
277 if (nrows is not None) and (chunksize is not None):
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
584
585 # might mutate self.engine
--> 586 self.options, self.engine = self._clean_options(options, engine)
587 if 'has_index_names' in kwds:
588 self.options['has_index_names'] = kwds['has_index_names']
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in _clean_options(self, options, engine)
663 msg += " (Note the 'converters' option provides"\
664 " similar functionality.)"
--> 665 raise ValueError(msg)
666 del result[arg]
667
ValueError: Falling back to the 'python' engine because the 'c' engine does not support regex separators, but this causes 'dtype' to be ignored as it is not supported by the 'python' engine. (Note the 'converters' option provides similar functionality.)
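The error message itself points at one way out: choose the 'python' engine explicitly and use 'converters' instead of 'dtype'. The following is a minimal sketch of that idea, not a definitive fix. It reads a few rows copied from the sample above through io.StringIO as a stand-in for the real file path, and the fourth column name 'timestamp' is an assumption made only for illustration. Plain str and float are used in place of np.str, which was removed in NumPy 1.24.

```python
import io

import pandas as pd

# A few rows copied from the ratings.dat sample in the question.
SAMPLE = (
    "1::1287::5::978302039\n"
    "1::2804::5::978300719\n"
    "1::594::4::978302268\n"
)

# 'timestamp' is an assumed name for the fourth field, used only so that
# every column in the file has a name.
all_cols = ['userId', 'movieId', 'rating', 'timestamp']
r_cols = ['userId', 'movieId', 'rating']

ratings = pd.read_csv(
    io.StringIO(SAMPLE),   # stand-in for the real path to ratings.dat
    sep='::',              # multi-character separators are treated as regexes
    engine='python',       # selected explicitly, so no fallback is needed
    header=None,
    names=all_cols,
    usecols=r_cols,        # keep only the first three columns
    # converters play the role that dtype played in the original call
    converters={'userId': str, 'movieId': str, 'rating': float},
)

print(ratings)
```

Newer pandas releases may accept 'dtype' together with the 'python' engine, in which case the original call might only need engine='python' added; that depends on the installed version, so check it before relying on it.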
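One of my data records has a ":" inside a data field. As a result, Python keeps throwing a C error: "Expected 5 fields in line 12, saw 6". Is there any way to handle this? – Cloud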
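One way to investigate a field-count error like this is to split the lines by hand before pandas sees them. The sketch below uses only the standard library; the helper name split_records and the two malformed sample lines are hypothetical, made up for illustration. Because str.split treats the separator literally rather than as a regular expression, a lone ":" inside a field is left alone, and lines with the wrong number of fields are collected for inspection instead of raising.

```python
def split_records(lines, sep='::', expected_fields=4):
    """Split each line on the literal separator.

    Returns (good_rows, bad_rows). bad_rows holds (line_number, raw_line)
    pairs whose field count differs from expected_fields, so they can be
    inspected or repaired by hand.
    """
    good_rows, bad_rows = [], []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip('\r\n')
        if not line:
            continue  # skip blank lines
        fields = line.split(sep)  # literal split, not a regex
        if len(fields) == expected_fields:
            good_rows.append(fields)
        else:
            bad_rows.append((line_number, line))
    return good_rows, bad_rows


sample = [
    "1::1287::5::978302039\n",
    "1::2804::5::978300719\n",
    "1::59:4::4::978302268\n",         # hypothetical: lone ':' inside a field
    "1::919::4::978301368::extra\n",   # hypothetical: one field too many
]

good, bad = split_records(sample)
print(len(good), "rows parsed;", "problem lines:", bad)
```

For a real file, the same function can be fed an open file object, and the good rows can then be passed to pd.DataFrame(good, columns=[...]) with whatever column names fit the data.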
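At that point, I would probably open the data file in a text editor, check whether the file contains, for example, any commas or semicolons, and then replace every '::' with ','. Assuming, of course, that I have access to the file. – Evert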
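The search-and-replace idea can also be done in code rather than in a text editor. The following is a rough sketch under stated assumptions: the helper to_single_char_delimiter and its list of candidate delimiters are hypothetical, the input is a two-row excerpt of the sample data, and 'timestamp' is again an assumed column name. The helper refuses to pick a delimiter that already occurs in the data, which matters for files whose text fields may contain commas.

```python
import io

import pandas as pd


def to_single_char_delimiter(text, old='::', candidates=(',', ';', '\t', '|')):
    """Replace `old` with the first candidate that does not occur in `text`.

    Returns (new_text, delimiter). Raises ValueError if every candidate
    already appears somewhere in the data.
    """
    for delimiter in candidates:
        if delimiter not in text:
            return text.replace(old, delimiter), delimiter
    raise ValueError('every candidate delimiter already occurs in the data')


# Two rows copied from the ratings.dat sample in the question.
raw = "1::1287::5::978302039\n1::2804::5::978300719\n"
converted, delimiter = to_single_char_delimiter(raw)

# With a single-character separator the default C engine is used,
# so dtype can be passed as in the original code.
ratings = pd.read_csv(
    io.StringIO(converted),
    sep=delimiter,
    header=None,
    names=['userId', 'movieId', 'rating', 'timestamp'],  # 'timestamp' is assumed
    usecols=['userId', 'movieId', 'rating'],
    dtype={'userId': str, 'movieId': str, 'rating': float},
)

print(ratings)
```

For the real file, raw would come from reading ratings.dat into a string first; for a large file it may be better to do the replacement line by line and write the result to a new file.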
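@Cloud You may want to ask about your case on the Pandas mailing list (how to avoid '::' being interpreted as a regular expression; backslashes did not work when I tested this), or raise an issue on the Pandas GitHub page. – Evert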