Create new columns based on distinct values and count them

Sorry if the title is not clear enough. Let me explain what I am trying to achieve.

I have this DataFrame; let's call it df.

id | Area 
A one 
A two 
A one 
B one 
B one 
C one 
C two 
D one 
D one 
D two 
D three
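To reproduce this without the CSV file, here is a minimal sketch that builds the same table inline. Constructing the frame by hand is an assumption made for illustration; the actual data is read from a file.

```python
import pandas as pd

# Rebuild the example table inline (assumption: the real data comes from a CSV file).
df = pd.DataFrame({
    "id":   ["A", "A", "A", "B", "B", "C", "C", "D", "D", "D", "D"],
    "Area": ["one", "two", "one", "one", "one", "one", "two",
             "one", "one", "two", "three"],
})
print(df)
```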

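I want to create a new DataFrame based on the values in the existing one. First, I want to count the entries for each distinct id in df. For example, id A has 3 entries, B has 2 entries, and so on. Then I want to create a new DataFrame from those counts.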

The new DataFrame, let's call it df_new, would look like this:

id | count 
A 3 
B 2 
C 2 
D 4

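Next, I want to create new columns based on the values in df['Area']. In this example, df['Area'] contains 3 distinct values (one, two, three). I want to count how many times each id appears in each area. For example, id A appears twice in area one, once in area two, and zero times in area three. I then want to add these counts as new columns named one, two, and three.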

df_new:

id | count | one | two | three 
A 3  2  1  0 
B 2  2  0  0 
C 2  1  1  0 
D 4  2  1  1

I have written my own code to produce df_new, but I believe pandas has better built-in functions for this kind of data extraction. Here is my code.

import pandas as pd

# Read the data
df = pd.read_csv('test_data.csv', sep=',')
df.columns = ['id', 'Area']  # Rename
# Count the total number of Area entries per id
df_new = pd.DataFrame({'count': df.groupby("id")["Area"].count()})
# Reset index
df_new = df_new.reset_index()
# Loop over the rows, counting and creating a new column for each area in df['Area']
for i in range(0, len(df)):
    # Get the id
    idx = df['id'][i]
    # Get the area name
    area_name = str(df["Area"][i])
    # Retrieve the index of a particular id
    current_index = df_new.loc[df_new['id'] == idx, ].index[0]
    # If the area name already exists as a column
    if area_name in df_new.columns:
        # Then +1 at the location of the idx (index)
        # (this chained assignment is what triggers the warning)
        df_new[area_name][current_index] += 1
    # If it does not exist in the columns yet
    elif area_name not in df_new.columns:
        # Create an empty one with zeros
        df_new[area_name] = 0
        # Then +1 at the location of the idx (index)
        df_new[area_name][current_index] += 1

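The code is long and hard to read. It also triggers the warning "A value is trying to be set on a copy of a slice from a DataFrame". I would like to learn how to write this more efficiently.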

Thanks

Source

2017-08-22 Niche.P

You can use df.groupby(...).count() for the first part and pd.crosstab for the second. Then join them with pd.concat:

In [1246]: pd.concat([df.groupby('id').count().rename(columns={'Area': 'count'}),
                      pd.crosstab(df.id, df.Area)], axis=1)
Out[1246]:
    count  one  three  two
id
A       3    2      0    1
B       2    2      0    0
C       2    1      0    1
D       4    2      1    1
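Note that pd.crosstab sorts the area columns alphabetically (one, three, two), while the question asks for one, two, three with id as a regular column. A self-contained sketch of one way to get that layout follows; the inline data and the hard-coded column order are assumptions for illustration.

```python
import pandas as pd

# Example data rebuilt inline (assumption: the question loads it from a CSV file).
df = pd.DataFrame({
    "id":   ["A", "A", "A", "B", "B", "C", "C", "D", "D", "D", "D"],
    "Area": ["one", "two", "one", "one", "one", "one", "two",
             "one", "one", "two", "three"],
})

# Part 1: number of rows per id.
counts = df.groupby("id").count().rename(columns={"Area": "count"})

# Part 2: number of rows per (id, Area) pair, one column per area.
areas = pd.crosstab(df["id"], df["Area"])

# Put the area columns in the order the question asks for
# (hard-coded here for illustration).
areas = areas[["one", "two", "three"]]

# Join side by side on the shared id index, then turn id back into a column.
df_new = pd.concat([counts, areas], axis=1).reset_index()
print(df_new)
```

If the set of areas is not known in advance, the reordering line can simply be dropped, leaving the alphabetical order.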

Here is the first part, using df.groupby:

df.groupby('id').count().rename(columns={'Area' : 'count'}) 

    count
id
A       3
B       2
C       2
D       4

And here is the second part, with pd.crosstab:

pd.crosstab(df.id, df.Area) 

Area  one  three  two
id
A       2      0    1
B       2      0    0
C       1      0    1
D       2      1    1

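For the second part, you can also use pd.get_dummies and take a dot product (dtype=int keeps the indicators numeric, so the product yields counts rather than booleans on recent pandas versions):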

(pd.get_dummies(df.id, dtype=int).T).dot(pd.get_dummies(df.Area, dtype=int))

   one  three  two
A    2      0    1
B    2      0    0
C    1      0    1
D    2      1    1
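Separately, on the warning mentioned in the question: it comes from chained indexing of the form df_new[area_name][current_index] += 1, where the assignment may land on a temporary copy. If the loop is kept for some reason, a hedged sketch of the same logic with a single .loc call looks like this (the inline data is an assumption for illustration; the crosstab approach above is still the shorter route):

```python
import pandas as pd

# Example data rebuilt inline (assumption: the question loads it from a CSV file).
df = pd.DataFrame({
    "id":   ["A", "A", "A", "B", "B", "C", "C", "D", "D", "D", "D"],
    "Area": ["one", "two", "one", "one", "one", "one", "two",
             "one", "one", "two", "three"],
})

# Total number of rows per id, with id as a regular column.
df_new = df.groupby("id")["Area"].count().rename("count").reset_index()

for idx, area_name in zip(df["id"], df["Area"]):
    # Create the area column, filled with zeros, the first time it is seen.
    if area_name not in df_new.columns:
        df_new[area_name] = 0
    # Select row and column in one .loc call instead of chaining [] lookups.
    df_new.loc[df_new["id"] == idx, area_name] += 1

print(df_new)
```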

Source

2017-08-22 02:49:09

Oh wow, that's great. Thanks, I'll accept your answer in 7 minutes.

One more question: is it possible to use crosstab to produce binary values instead of counts? That is, 1 if an id has been to that area and 0 if it has never been there?

@Niche.P OK, I see. It is: 'pd.crosstab(df.id, df.Area).astype(bool).astype(int)'
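A small sketch of that binary variant on the example data (built inline here as an assumption, since the CSV file is not shown). Casting the counts to bool maps every non-zero count to True, and casting back to int gives 1/0 indicators:

```python
import pandas as pd

# Example data rebuilt inline (assumption: the question loads it from a CSV file).
df = pd.DataFrame({
    "id":   ["A", "A", "A", "B", "B", "C", "C", "D", "D", "D", "D"],
    "Area": ["one", "two", "one", "one", "one", "one", "two",
             "one", "one", "two", "three"],
})

# Counts per (id, Area); columns come out sorted: one, three, two.
areas = pd.crosstab(df["id"], df["Area"])

# Non-zero counts become True, then 1; zero counts become False, then 0.
visited = areas.astype(bool).astype(int)
print(visited)
```

Writing (areas > 0).astype(int) gives the same result and may read more explicitly.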
