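Adding a column to a dataset in Python

I want to add my predicted data back into my original dataset in Python. I think I'm supposed to use pandas with assign and pd.DataFrame, but after reading all the documentation I still can't work out how to write this (sorry, I'm new and only just starting to learn to code). My code is below; I only need help with the code that adds my predictions back to the dataset. Thanks for your help!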

# Importing the libraries 
import numpy as np 
import matplotlib.pyplot as plt 
import pandas as pd 

# Importing the dataset 
dataset = pd.read_csv('Social_Network_Ads.csv') 
X = dataset.iloc[:, [2, 3]].values 
y = dataset.iloc[:, 4].values 

# Splitting the dataset into the Training set and Test set 
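# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)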

# Feature Scaling X_train and X_test 
from sklearn.preprocessing import StandardScaler 
sc = StandardScaler() 
X_train = sc.fit_transform(X_train) 
X_test = sc.transform(X_test) 

# Feature scaling all the independent variables (every row) with the scaler fitted on X_train
whole_dataset = sc.transform(X) 

# Fitting classifier to the Training set 
# Create your Naive Bayes here 
from sklearn.naive_bayes import GaussianNB 
classifier = GaussianNB() 
classifier.fit(X_train, y_train) 

# Predicting the Test set results 
y_pred = classifier.predict_proba(X_test) 

# Predicting the results for the whole dataset 
y_pred2 = classifier.predict_proba(whole_dataset) 

# Add y_pred2 predictions back to the dataset 
???

Source

2017-06-15 zipline86

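Now that I look at what you're trying to do, I think you've misunderstood what is happening. You split the dataset into training and test data. You then train on the training data and predict on the test data. Then you try to assign the result back to all rows of the original dataset. For example, you have 400 rows in the dataset but only 100 rows in y_pred, so the assignment fails because the lengths differ. What you want to do is 'y_pred = classifier.predict_proba(X)' and then assign it: 'dataset['predict_class_1'], dataset['predict_class_2'] = y_pred[:,0], y_pred[:,1]' – EdChum

Thank you very much, I'll try that! :) I changed the code a little, and it now predicts all 400 rows. I can't upload the data file here, but it can be downloaded from https://www.superdatascience.com/machine-learning/ in the Section 18 Naive Bayes zip file. The CSV file is called Social_Network_Ads.csv. I hope I can get it working :) – zipline86

@EdChum It works! Thank you! – zipline86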

You can just do dataset['prediction'] = y_pred to add a new column.

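Pandas supports a simple syntax for adding new columns. Here it will add a new column, and it will probably take a view on the numpy array returned from sklearn, so it should be nice and fast.

Edit

Looking at your code and data, you've misunderstood what train_test_split does. It splits the original dataset, which has 400 rows, 3/4 to 1/4, so your X training data contains 300 rows and your test data 100 rows. You then try to assign the result back to your original dataset, which has 400 rows. First, the row counts don't match. Second, what predict_proba returns is a matrix of predicted class probabilities, one column per class. So what you want to do after training is predict on the original dataset and assign the result back as 2 columns by sub-selecting each column: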

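# The classifier was fitted on scaled features, so scale X with the same scaler before predicting
y_pred = classifier.predict_proba(sc.transform(X))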

Now assign this back:

dataset['predict_class_1'],dataset['predict_class_2'] = y_pred[:,0],y_pred[:,1]
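
To make the row counts concrete, here is a minimal end-to-end sketch of the approach above. It does not use Social_Network_Ads.csv; the 400-row DataFrame, its column names (Age, EstimatedSalary, Purchased) and the rule used to generate the labels are all made up for illustration. It also assumes the features should be scaled with the same StandardScaler before predicting, since the classifier is fitted on scaled data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real CSV: 400 rows, two features, a binary label.
rng = np.random.default_rng(0)
dataset = pd.DataFrame({
    "Age": rng.integers(18, 60, size=400),
    "EstimatedSalary": rng.integers(15000, 150000, size=400),
})
dataset["Purchased"] = (dataset["Age"] > 40).astype(int)

X = dataset[["Age", "EstimatedSalary"]].values
y = dataset["Purchased"].values

# 300 rows go to training, 100 rows to testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

sc = StandardScaler()
classifier = GaussianNB()
classifier.fit(sc.fit_transform(X_train), y_train)

# Predictions on the test split cover only 100 rows,
# so they cannot fill a column of the 400-row DataFrame.
test_proba = classifier.predict_proba(sc.transform(X_test))

# Predictions on every row, scaled with the same scaler, match the DataFrame's length.
all_proba = classifier.predict_proba(sc.transform(X))

# One new column per class; the column order follows classifier.classes_.
dataset["predict_class_1"] = all_proba[:, 0]
dataset["predict_class_2"] = all_proba[:, 1]
```

Because predict_proba returns a 2-D array with one column per class, each DataFrame column gets one slice of it. Trying to store the whole 2-D array under a single column name is a likely cause of a "wrong number of items" ValueError.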

Source

2017-06-15 08:42:25 EdChum

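I tried that, but then I got this error: ValueError: Wrong number of items passed 2, placement implies 1. Any idea why this happens? Thanks! – zipline86

You need to add your raw data and code to your question so that we can reproduce this problem – EdChum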

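There are several solutions; the answer from EdChum already covers one. As far as I know, pandas has two other ways to do this.

Since you didn't provide the data you are using, here is a very simple example.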

import pandas as pd
import numpy as np
np.random.seed(1)
df = pd.DataFrame(np.random.randn(10), columns=['raw'])
# assign() returns a new DataFrame with the extra column appended at the end
df = df.assign(cube_raw=df['raw']**3)
# insert() adds the column in place at position 1
df.insert(1, 'square_raw', df['raw']**2)

df
        raw  square_raw   cube_raw
0  1.624345    2.638498   4.285832
1 -0.611756    0.374246  -0.228947
2 -0.528172    0.278965  -0.147342
3 -1.072969    1.151262  -1.235268
4  0.865408    0.748930   0.648130
5 -2.301539    5.297080 -12.191435
6  1.744812    3.044368   5.311849
7 -0.761207    0.579436  -0.441071
8  0.319039    0.101786   0.032474
9 -0.249370    0.062186  -0.015507

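Just remember that df.assign() does not work in place, so you need to assign its result back to a variable.

Personally, I like df.insert() best, because it lets you specify where the new column goes (with the loc parameter).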
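
The difference between the two methods can be checked directly. This is a small sketch on a made-up three-row frame; the names result, columns_before, returned and columns_after are only there for illustration.

```python
import pandas as pd

df = pd.DataFrame({"raw": [1.0, 2.0, 3.0]})

# assign() returns a new DataFrame; df is unchanged until the result is rebound.
result = df.assign(cube_raw=df["raw"] ** 3)
columns_before = list(df.columns)  # df still has only the 'raw' column here
df = result

# insert() modifies df in place and returns None; loc=1 makes the new column the second one.
returned = df.insert(1, "square_raw", df["raw"] ** 2)
columns_after = list(df.columns)
```

After this runs, columns_after holds raw, square_raw, cube_raw in that order: insert() placed its column at position 1, while assign() appended its column at the end.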

Source

2017-06-15 09:21:53 CDtoday

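I tried creating df = dataset and then df.assign(y_pred), but then I got this TypeError: assign() takes 1 positional argument but 2 were given. Any idea how I can fix this? Thanks! – zipline86

@zipline86 The format of 'df.assign()' should be like 'df.assign(_varname_ = content)'. You may want to look at the link in the answer for more details. – CDtoday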
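
As a rough illustration of that keyword format, the sketch below uses a made-up 3x2 array in place of a real predict_proba result. The column names follow the ones suggested earlier in the thread, and positional_error is a helper name used only here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"raw": [0.5, 1.5, 2.5]})

# Hypothetical stand-in for a predict_proba result: one row per sample, one column per class.
y_pred = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])

# Passing the array positionally fails, because assign() only accepts keyword arguments.
try:
    df.assign(y_pred)
    positional_error = None
except TypeError as exc:
    positional_error = exc

# Each keyword names a new column; each value is a 1-D slice with one entry per row.
df = df.assign(predict_class_1=y_pred[:, 0], predict_class_2=y_pred[:, 1])
```

The keyword becomes the column name, so a 2-D probability array needs one keyword per class column.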
