2016-04-26 28 views
3

scikit-learn PCA transform returns incorrect reduced length

I am trying to apply PCA in my code. I train on my data with the following code:

def gather_train():
    train_data = np.array([])
    train_labels = np.array([])
    with open(training_info, "r") as traincsv:
        for line in traincsv:
            current_image = "train\\{}".format(line.strip().split(",")[0])
            print "Reading data from: {}".format(current_image)
            train_labels = np.append(train_labels, int(line.strip().split(",")[1]))
            with open(current_image, "rb") as img:
                train_data = np.append(train_data, np.fromfile(img, dtype=np.uint8).reshape(-1, 784)/255.0)
    train_data = train_data.reshape(len(train_labels), 784)
    return train_data, train_labels

def get_PCA_train(data): 
    print "\nFitting PCA. Components: {} ...".format(PCA_components) 
    pca = decomposition.PCA(n_components=PCA_components).fit(data) 
    print "\nReducing data to {} components ...".format(PCA_components) 
    data_reduced = pca.fit_transform(data) 
    return data_reduced 

def get_PCA_test(data): 
    print "\nFitting PCA. Components: {} ...".format(PCA_components) 
    pca = decomposition.PCA(n_components=PCA_components).fit(data) 
    print "\nReducing data to {} components ...".format(PCA_components) 
    data_reduced = pca.transform(data) 
    return data_reduced 

def gather_test(imgfile):
    # input is a single file; unlike gather_train, which gathers everything at once
    with open(imgfile, "rb") as img:
        return np.fromfile(img, dtype=np.uint8).reshape(-1, 784)/255.0

... 

train_data, train_labels = gather_train()
train_data_reduced = get_PCA_train(train_data) 
print train_data.ndim, train_data.shape 
print train_data_reduced.ndim, train_data_reduced.shape 

It prints the following, as expected:

2 (1000L, 784L) 
2 (1000L, 300L) 

However, when I start reducing my test data:

test_data = gather_test(image_file) 
# image_file is 784 bytes (28x28) of pixel values; 1 byte = 1 pixel value 
test_data_reduced = get_PCA_test(test_data) 
print test_data.ndim, test_data.shape 
print test_data_reduced.ndim, test_data_reduced.shape 

The output is:

2 (1L, 784L) 
2 (1L, 1L) 

This causes an error later on:

ValueError: X.shape[1] = 1 should be equal to 300, the number of features at training time

Why does test_data_reduced have shape (1, 1) instead of (1, 300)? I have tried using fit_transform for the training data and transform only for the test data, but I still get the same error.

+1

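What does your data look like? Can you post a sample? You are applying PCA incorrectly, though: you should fit_transform on the training data, then only transform the test data. When you refit on the test data, you are essentially ignoring your training data. Also, you should post more complete code: how do you define train_data and test_data? – flyingmeatball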

+0

What @flyingmeatball said is correct. This happens because you are retraining your PCA model on the test data. – ncfirth

+0

@flyingmeatball I added more code. The flow here is that 'train_data' and 'test_data' are similar, except that 'test_data' is a single entry. – jowayow

Answer

1

The calls to PCA should look roughly like this:

pca = decomposition.PCA(n_components=PCA_components).fit(train_data) 
data_reduced = pca.transform(test_data) 

First call fit on the training data, then call transform on the test data you want to reduce.
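
As a fuller illustration, here is a minimal sketch of that fit-once, transform-many pattern. It assumes random arrays as stand-ins for the real image files, and fit_pca is a hypothetical helper name, not something from the question. The likely reason for the (1, 1) shape is that a PCA fitted on a single sample can keep at most min(n_samples, n_features) = 1 component; the scikit-learn version used in the question appears to truncate silently, while newer releases raise a ValueError instead.

```python
import numpy as np
from sklearn import decomposition

PCA_components = 300  # same name as in the question

# Hypothetical stand-in data: random values instead of the real image files.
rng = np.random.RandomState(0)
train_data = rng.rand(1000, 784)  # 1000 training images, 784 pixels each
test_data = rng.rand(1, 784)      # a single test image


def fit_pca(train, n_components):
    """Fit PCA once, on the training data only, and return the fitted model."""
    return decomposition.PCA(n_components=n_components).fit(train)


pca = fit_pca(train_data, PCA_components)

# Reuse the same fitted object for both sets; never call fit on the test data.
train_data_reduced = pca.transform(train_data)
test_data_reduced = pca.transform(test_data)

print(train_data_reduced.shape)  # (1000, 300)
print(test_data_reduced.shape)   # (1, 300)
```

In the question's code, this would mean having get_PCA_train return the fitted pca object alongside the reduced data, and having get_PCA_test accept that object rather than building a new one.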