2016-11-27 71 views

index 5688 is out of bounds for axis 0 with size 3706

I am learning to build a recommender system. I get the data from http://grouplens.org/datasets/movielens/

import numpy as np 
import pandas as pd 
header = ['user_id', 'item_id', 'rating', 'timestamp'] 
df = pd.read_csv('ml-1m/ratings.dat', sep='::', names=header) 
n_users = df.user_id.unique().shape[0] 
n_items = df.item_id.unique().shape[0] 
print ('Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_items)) 

Number of users = 6040 | Number of movies = 3706

from sklearn import cross_validation as cv 
train_data, test_data = cv.train_test_split(df, test_size=0.25) 

Then I try to build two user-item matrices, one for training and another for testing:

train_data_matrix = np.zeros((n_users, n_items)) 
for line in train_data.itertuples(): 
    train_data_matrix[line[1]-1, line[2]-1] = line[3] 

test_data_matrix = np.zeros((n_users, n_items)) 
for line in test_data.itertuples(): 
    test_data_matrix[line[1]-1, line[2]-1] = line[3] 

And I get this (full traceback):

IndexError        Traceback (most recent call last) 
<ipython-input-39-180dea01cdf8> in <module>() 
     2 train_data_matrix = np.zeros((n_users, n_items)) 
     3 for line in train_data.itertuples(): 
----> 4  train_data_matrix[line[1]-1, line[2]-1] = line[3] 
     5 
     6 test_data_matrix = np.zeros((n_users, n_items)) 

IndexError: index 5688 is out of bounds for axis 0 with size 3706 

What is wrong?

P.S.

train_data.head() 
        user_id  item_id  rating   timestamp
483019     2968     2268       5   971107926
943582     5689     3615       3   963719230
116153      752     1147       5   975458000
103250      686     1704       5   975601762
235333     1425     3752       4  1023560349

PSS

for line in train_data.itertuples(): 
    print (line) 
Pandas(Index=483019, user_id=2968, item_id=2268, rating=5, timestamp=971107926) 
Pandas(Index=943582, user_id=5689, item_id=3615, rating=3, timestamp=963719230) 
Pandas(Index=116153, user_id=752, item_id=1147, rating=5, timestamp=975458000) 
Pandas(Index=103250, user_id=686, item_id=1704, rating=5, timestamp=975601762) 

Answer


The error message tells us that train_data_matrix has shape (3706, N), while line[1]-1 is 5688.

IndexError: index 5688 is out of bounds for axis 0 with size 3706 
train_data_matrix[line[1]-1, line[2]-1] = line[3] 

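So the question is: why is line[1] equal to 5689? Or, more broadly, why does train_data.itertuples() produce a line with a value that large?

I wonder whether you should be using this instead: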

train_data_matrix[line[0]-1, line[1]-1] 

I'm not familiar with itertuples. What are the elements of line? What is the full shape of train_data?
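For reference, here is a minimal sketch of what itertuples() yields. The three rows are copied from the train_data.head() output in the question; the small frame named sample is only an illustration, not the asker's data. Position 0 of each tuple is the DataFrame index label, so line[1] is user_id and line[2] is item_id.

```python
import pandas as pd

# Three rows copied from train_data.head() in the question; illustrative only.
sample = pd.DataFrame(
    {
        "user_id": [2968, 5689, 1425],
        "item_id": [2268, 3615, 3752],
        "rating": [5, 3, 4],
        "timestamp": [971107926, 963719230, 1023560349],
    },
    index=[483019, 943582, 235333],
)

first = next(sample.itertuples())

# Position 0 is the index label; the columns follow in declaration order.
print(first[0], first[1], first[2], first[3])

# Named access avoids positional mix-ups.
print(first.user_id, first.item_id, first.rating)
```

Under this reading, line[0]-1 would index by the row label (483019 in the first row), which is far outside either axis, so switching to line[0] and line[1] would probably not help.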


train_data_matrix is a matrix of unique user ids against movie ids. 5689 is a user id; see train_data.head() – Edward


I answered my own question – Edward


But the rows of the matrix are indexed by row number, not by user id. – hpaulj
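One way to act on this, sketched under assumptions: map each raw id to a contiguous 0-based position before indexing the matrix. The sample rows are taken from the question's train_data.head(); pd.factorize is an existing pandas function, while the names user_pos, item_pos, and matrix are made up for this example. Note that in the question's own sample, item_id 3752 is already larger than the 3706 columns allocated from the unique count, so item_id - 1 cannot be used directly as a column index.

```python
import numpy as np
import pandas as pd

# Rows copied from train_data.head() in the question; illustrative only.
df = pd.DataFrame(
    {
        "user_id": [2968, 5689, 752, 686, 1425],
        "item_id": [2268, 3615, 1147, 1704, 3752],
        "rating": [5, 3, 5, 5, 4],
    }
)

# factorize returns (codes, uniques); codes are 0-based positions
# in order of first appearance, so they always fit the matrix.
user_pos, user_ids = pd.factorize(df["user_id"])
item_pos, item_ids = pd.factorize(df["item_id"])

# One row per distinct user, one column per distinct item.
matrix = np.zeros((len(user_ids), len(item_ids)))
matrix[user_pos, item_pos] = df["rating"].to_numpy()

print(matrix.shape)
```

If this approach is used with a train/test split, factorizing the full frame before splitting should keep the positions consistent across both matrices.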
