2017-07-17 185 views
1

Pandas newbie here, so apologies if the solution is obvious.

I have a DataFrame (see below) listing different movie scenes, along with the environment for each scene in the movie:

import pandas as pd 
data = [{'movie' : 'movie_X', 'scene' : '1', 'environment' : 'home'}, 
     {'movie' : 'movie_X', 'scene' : '2', 'environment' : 'car'}, 
     {'movie' : 'movie_X', 'scene' : '3', 'environment' : 'home'}, 
     {'movie' : 'movie_Y', 'scene' : '1', 'environment' : 'home'}, 
     {'movie' : 'movie_Y', 'scene' : '2', 'environment' : 'office'}, 
     {'movie' : 'movie_Z', 'scene' : '1', 'environment' : 'boat'}, 
     {'movie' : 'movie_Z', 'scene' : '2', 'environment' : 'beach'}, 
     {'movie' : 'movie_Z', 'scene' : '3', 'environment' : 'home' }] 
myDF = pd.DataFrame(data) 

In this case, a movie can belong to multiple genres. I have a dictionary (below) stating which genres each movie belongs to:

genreDict = {'movie_X' : ['romance', 'action'], 
      'movie_Y' : ['comedy', 'romance', 'action'], 
      'movie_Z' : ['horror', 'thriller', 'romance']} 

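I would like to group myDF by this dictionary for each movie, and specifically to be able to tell how many times a particular environment turns up in a particular genre (for example, in the genre horror, 'boat' is counted once, 'beach' is counted once, and 'home' is counted once). What is the best and most efficient way to go about this? I tried mapping the dictionary onto the DataFrame and then grouping by the column of lists: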

myDF['genres'] = myDF['movie'].map(genreDict) 

which returns:

     movie scene environment                       genres
0  movie_X     1        home            [romance, action]
1  movie_X     2         car            [romance, action]
2  movie_X     3        home            [romance, action]
3  movie_Y     1        home    [comedy, romance, action]
4  movie_Y     2      office    [comedy, romance, action]
5  movie_Z     1        boat  [horror, thriller, romance]
6  movie_Z     2       beach  [horror, thriller, romance]
7  movie_Z     3        home  [horror, thriller, romance]

However, grouping by that column gives me an error saying that lists are unhashable. Hope you can all help :)

+0

Can you post your desired dataset? – MaxU

Answers

0

If speed matters on a larger DataFrame, you can use numpy to repeat the rows by the lengths of the lists, with numpy.repeat, numpy.concatenate and Index.values:

import numpy as np

# assumes myDF['genres'] already holds the mapped lists from the question
# get length of lists in column genres
l = myDF['genres'].str.len()
# convert column to numpy array
vals = myDF['genres'].values
# repeat index by lengths
idx = np.repeat(myDF.index, l)
# expand rows by duplicated index values
myDF = myDF.loc[idx]
# flatten the column of lists
myDF['genres'] = np.concatenate(vals)
# default monotonic index (0,1,2...)
myDF = myDF.reset_index(drop=True)
print (myDF)
   environment    movie scene    genres
0         home  movie_X     1   romance
1         home  movie_X     1    action
2          car  movie_X     2   romance
3          car  movie_X     2    action
4         home  movie_X     3   romance
5         home  movie_X     3    action
6         home  movie_Y     1    comedy
7         home  movie_Y     1   romance
8         home  movie_Y     1    action
9       office  movie_Y     2    comedy
10      office  movie_Y     2   romance
11      office  movie_Y     2    action
12        boat  movie_Z     1    horror
13        boat  movie_Z     1  thriller
14        boat  movie_Z     1   romance
15       beach  movie_Z     2    horror
16       beach  movie_Z     2  thriller
17       beach  movie_Z     2   romance
18        home  movie_Z     3    horror
19        home  movie_Z     3  thriller
20        home  movie_Z     3   romance

Then use groupby and aggregate with size:

df1 = myDF.groupby(['genres','environment']).size().reset_index(name='count')
print (df1) 
      genres environment  count
0     action         car      1
1     action        home      3
2     action      office      1
3     comedy        home      1
4     comedy      office      1
5     horror       beach      1
6     horror        boat      1
7     horror        home      1
8    romance       beach      1
9    romance        boat      1
10   romance         car      1
11   romance        home      4
12   romance      office      1
13  thriller       beach      1
14  thriller        boat      1
15  thriller        home      1
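
For comparison, here is a minimal sketch of the same expand-then-count steps using DataFrame.explode. This is not part of the answer above: explode only exists in pandas 0.25 and later, and the names scenes, genre_dict, long_df and counts are chosen here purely for illustration. No claim is made about how its speed compares with the numpy approach.

```python
import pandas as pd

data = [{'movie': 'movie_X', 'scene': '1', 'environment': 'home'},
        {'movie': 'movie_X', 'scene': '2', 'environment': 'car'},
        {'movie': 'movie_X', 'scene': '3', 'environment': 'home'},
        {'movie': 'movie_Y', 'scene': '1', 'environment': 'home'},
        {'movie': 'movie_Y', 'scene': '2', 'environment': 'office'},
        {'movie': 'movie_Z', 'scene': '1', 'environment': 'boat'},
        {'movie': 'movie_Z', 'scene': '2', 'environment': 'beach'},
        {'movie': 'movie_Z', 'scene': '3', 'environment': 'home'}]
genre_dict = {'movie_X': ['romance', 'action'],
              'movie_Y': ['comedy', 'romance', 'action'],
              'movie_Z': ['horror', 'thriller', 'romance']}

scenes = pd.DataFrame(data)
# one list of genres per scene, as in the question
scenes['genres'] = scenes['movie'].map(genre_dict)

# explode turns each list element into its own row (requires pandas >= 0.25)
long_df = scenes.explode('genres').reset_index(drop=True)

# count how often each environment appears within each genre
counts = (long_df.groupby(['genres', 'environment'])
                 .size()
                 .reset_index(name='count'))
print(counts)
```

Once the lists are gone, every value in the genres column is a plain string, so grouping no longer hits the unhashable-list error.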

2

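Non-scalar objects generally cause problems in pandas. Beyond that, you need to tidy your data so that your subsequent steps become easier (the main operations on tabular structures are usually defined on tidy datasets). You need a dataset where, instead of listing all genres in a single row, each genre has its own row.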

Here is one possible way to achieve that:

genre_df = pd.DataFrame(myDF['movie'].map(genreDict).tolist()) 

df = myDF.join(genre_df.stack().rename('genre').reset_index(level=1, drop=True)) 
df 
Out: 
  environment    movie scene     genre
0        home  movie_X     1   romance
0        home  movie_X     1    action
1         car  movie_X     2   romance
1         car  movie_X     2    action
2        home  movie_X     3   romance
2        home  movie_X     3    action
3        home  movie_Y     1    comedy
3        home  movie_Y     1   romance
3        home  movie_Y     1    action
4      office  movie_Y     2    comedy
4      office  movie_Y     2   romance
4      office  movie_Y     2    action
5        boat  movie_Z     1    horror
5        boat  movie_Z     1  thriller
5        boat  movie_Z     1   romance
6       beach  movie_Z     2    horror
6       beach  movie_Z     2  thriller
6       beach  movie_Z     2   romance
7        home  movie_Z     3    horror
7        home  movie_Z     3  thriller
7        home  movie_Z     3   romance

Once you have a structure like this, it is much easier to group or cross-tabulate your data:

df.groupby('genre').size() 
Out: 
genre
action      5
comedy      2
horror      3
romance     8
thriller    3
dtype: int64

pd.crosstab(df['genre'], df['environment']) 
Out: 
environment  beach  boat  car  home  office
genre
action           0     0    1     3       1
comedy           0     0    0     1       1
horror           1     1    0     1       0
romance          1     1    1     4       1
thriller         1     1    0     1       0
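
As a small sketch of how such a cross-tabulation can be used, the block below rebuilds a tidy genre/environment table directly from the question's data, reads single counts with .loc, and shows that groupby with size and unstack gives the same table. The variable names (rows, tidy, table, same_table) are illustrative and not from the original answer.

```python
import pandas as pd

data = [{'movie': 'movie_X', 'scene': '1', 'environment': 'home'},
        {'movie': 'movie_X', 'scene': '2', 'environment': 'car'},
        {'movie': 'movie_X', 'scene': '3', 'environment': 'home'},
        {'movie': 'movie_Y', 'scene': '1', 'environment': 'home'},
        {'movie': 'movie_Y', 'scene': '2', 'environment': 'office'},
        {'movie': 'movie_Z', 'scene': '1', 'environment': 'boat'},
        {'movie': 'movie_Z', 'scene': '2', 'environment': 'beach'},
        {'movie': 'movie_Z', 'scene': '3', 'environment': 'home'}]
genre_dict = {'movie_X': ['romance', 'action'],
              'movie_Y': ['comedy', 'romance', 'action'],
              'movie_Z': ['horror', 'thriller', 'romance']}

# one (genre, environment) row per scene and genre of its movie
rows = [(genre, scene['environment'])
        for scene in data
        for genre in genre_dict[scene['movie']]]
tidy = pd.DataFrame(rows, columns=['genre', 'environment'])

# genre x environment table of counts
table = pd.crosstab(tidy['genre'], tidy['environment'])

# look up a single cell: how often 'boat' appears in horror movies
horror_boat = table.loc['horror', 'boat']

# the same table via groupby: count pairs, then pivot environment into columns
same_table = (tidy.groupby(['genre', 'environment'])
                  .size()
                  .unstack(fill_value=0))
print(table)
```

crosstab is the shorter spelling; the groupby form can be handy when you already need the long counts for something else.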

Here is a great read on the subject by Hadley Wickham: Tidy Data