I think you need to add another column to the id_vars parameter of melt:
import pandas as pd

# df is the DataFrame from the question
df = df.rename(columns={'overallrating':'rating'})
tidy = pd.melt(df,
               id_vars=['skunumber','category'],
               var_name='dimension',
               value_name='length')
tidy = tidy.drop_duplicates()
print (tidy)
   skunumber category dimension  length
0        123     Cat1    rating       1
1        124     cat1    rating       2
4        123     Cat1   reviews      20
5        124     cat1   reviews      23
# parentheses let the method chain span several lines
messy1 = (tidy.set_index(['skunumber','category','dimension'])
              .length
              .unstack()
              .reset_index())
messy1.columns.name = None
print (messy1)
   skunumber category  rating  reviews
0        123     Cat1       1       20
1        124     cat1       2       23
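As a self-contained sketch of the melt and unstack round trip, the snippet below rebuilds sample data from the printed output (the question's DataFrame is not shown here, so these values are an assumption). Instead of renaming the column, which leaves two columns both called rating, it relabels the dimension values after melting. That is a substitution made for illustration and not the exact code above.

```python
import pandas as pd

# Assumed sample data, reconstructed from the printed output
df = pd.DataFrame({
    "skunumber": [123, 124],
    "category": ["Cat1", "cat1"],
    "overallrating": [1, 2],
    "rating": [1, 2],
    "reviews": [20, 23],
})

keys = ["skunumber", "category"]

# Wide -> long, keeping both key columns as identifiers
tidy = pd.melt(df, id_vars=keys, var_name="dimension", value_name="length")

# Treat 'overallrating' as 'rating', then drop the rows that became identical
tidy["dimension"] = tidy["dimension"].replace({"overallrating": "rating"})
tidy = tidy.drop_duplicates()

# Long -> wide again
messy = (
    tidy.set_index(keys + ["dimension"])["length"]
    .unstack()
    .reset_index()
)
messy.columns.name = None
print(messy)
```

This only works while every (skunumber, category, dimension) combination is unique after dropping duplicates; otherwise unstack cannot reshape.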
Another, simpler solution uses stack, drop_duplicates (which by default keeps only the first value) and finally unstack:
df = df.rename(columns={'overallrating':'rating'})
# parentheses let the method chain span several lines
tidy = (df.set_index(['skunumber','category'])
          .stack()
          .drop_duplicates()
          .unstack()
          .reset_index())
print (tidy)
   skunumber category  rating  reviews
0        123     Cat1       1       20
1        124     cat1       2       23
With real data, you can easily get:
ValueError: Index contains duplicate entries, cannot reshape
In that case, use the solution below or the one in another answer:
df = pd.DataFrame({'category': ['Cat1', 'Cat1', 'cat1'],
                   'overallrating': [1, 5, 3],
                   'skunumber': [123, 123, 124],
                   'reviews': [20, 30, 23],
                   'rating': [4, 2, 2]})
print (df)
  category  overallrating  rating  reviews  skunumber
0     Cat1              1       4       20        123
1     Cat1              5       2       30        123
2     cat1              3       2       23        124
You need groupby on the columns that formed the index before reset_index (here skunumber and category), together with an aggregate function such as mean, sum, max, min, first...
df = df.rename(columns={'overallrating':'rating'})
tidy = df.groupby(['skunumber','category'])['rating'].max().unstack().reset_index()
print (tidy)
   skunumber category  rating  rating
0        123     Cat1       5       4
1        124     cat1       3       2
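To see where the ValueError comes from and how aggregation removes it, here is a hedged sketch on the same three-row sample. It works on the melted (long) data and relabels overallrating as rating after melting, so no duplicate column names are involved. The choice of max is only an example; the right aggregate depends on the data.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["Cat1", "Cat1", "cat1"],
    "overallrating": [1, 5, 3],
    "skunumber": [123, 123, 124],
    "reviews": [20, 30, 23],
    "rating": [4, 2, 2],
})

keys = ["skunumber", "category"]
tidy = df.melt(id_vars=keys, var_name="dimension", value_name="length")
tidy["dimension"] = tidy["dimension"].replace({"overallrating": "rating"})

# (123, 'Cat1', ...) now occurs several times, so a plain unstack fails
failed = False
try:
    tidy.set_index(keys + ["dimension"])["length"].unstack()
except ValueError as err:
    failed = True
    print("unstack failed:", err)

# Collapse the duplicates to one value per cell first, then reshape
wide = (
    tidy.groupby(keys + ["dimension"])["length"]
    .max()
    .unstack()
    .reset_index()
)
wide.columns.name = None
print(wide)
```

Unlike the output above, this yields a single rating column per key, because both rating columns are aggregated together.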
EDIT (in response to a comment):
If there are duplicates, you need an aggregate function such as max, first, sum or mean with groupby:
print (df)
   skunumber category  overallrating  rating  reviews  color colorShade
0        123     Cat1              1       1       12  White        Red
1        123     Cat1              1       4       20   Pink      Green
2        124     cat1              2       2       23  Black       Blue
df = df.rename(columns={'overallrating':'rating', 'colorShade':'color'})
g = df.groupby(['skunumber','category'])
tidy1 = g['rating'].max().unstack()
print (tidy1)
                    rating  rating
skunumber category
123       Cat1           1       4
124       cat1           2       2
tidy2 = g['color'].first().unstack()
print (tidy2)
                    color color
skunumber category
123       Cat1      White   Red
124       cat1      Black  Blue
Then concat the results together:
df = pd.concat([tidy1, tidy2],axis=1).reset_index()
print (df)
   skunumber category  rating  rating  color color
0        123     Cat1       1       4  White   Red
1        124     cat1       2       2  Black  Blue
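The same idea can be sketched without renaming columns to duplicate names, which keeps the result easier to address by label. This variant is an assumption about what is acceptable for the real data: numeric columns are aggregated with max, string columns with first, and the two results are concatenated side by side.

```python
import pandas as pd

df = pd.DataFrame({
    "skunumber": [123, 123, 124],
    "category": ["Cat1", "Cat1", "cat1"],
    "overallrating": [1, 1, 2],
    "rating": [1, 4, 2],
    "reviews": [12, 20, 23],
    "color": ["White", "Pink", "Black"],
    "colorShade": ["Red", "Green", "Blue"],
})

g = df.groupby(["skunumber", "category"])

# One aggregate per column type: max for numbers, first for strings
ratings = g[["overallrating", "rating"]].max()
colors = g[["color", "colorShade"]].first()

# Both pieces share the (skunumber, category) index, so they align on it
out = pd.concat([ratings, colors], axis=1).reset_index()
print(out)
```

The values match the concat output above; only the column labels stay distinct.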
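Another solution, with pd.lreshape:
# df here is the original DataFrame (with overallrating and colorShade), not the concat result
tidy = pd.lreshape(df, {'rating':['rating','overallrating'], 'color':['color','colorShade']})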
print (tidy)
  category  reviews  skunumber  color  rating
0     Cat1        1        123  White       1
1     Cat1       20        123   Pink       4
2     cat1       23        124  Black       2
3     Cat1        1        123    Red       1
4     Cat1       20        123  Green       1
5     cat1       23        124   Blue       2
tidy = tidy.drop_duplicates(['category','skunumber'])
print (tidy)
  category  reviews  skunumber  color  rating
0     Cat1        1        123  White       1
2     cat1       23        124  Black       2
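For reference, a runnable sketch of the lreshape route on the sample printed earlier (reviews values are taken from that DataFrame). pd.lreshape stacks each listed group of columns into one target column and repeats the remaining columns; the column order of the result is not relied on here.

```python
import pandas as pd

df = pd.DataFrame({
    "skunumber": [123, 123, 124],
    "category": ["Cat1", "Cat1", "cat1"],
    "overallrating": [1, 1, 2],
    "rating": [1, 4, 2],
    "reviews": [12, 20, 23],
    "color": ["White", "Pink", "Black"],
    "colorShade": ["Red", "Green", "Blue"],
})

# Each list of source columns is stacked into the column named by its key
long_df = pd.lreshape(df, {
    "rating": ["rating", "overallrating"],
    "color": ["color", "colorShade"],
})
print(long_df)

# Keep only the first row for every (category, skunumber) pair
tidy = long_df.drop_duplicates(["category", "skunumber"])
print(tidy)
```

Note that drop_duplicates on the key columns keeps the first row per key and discards the other ratings and colors, so use it only when the first value is the one you want.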
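Thanks Jezrael, your solution works on my sample data. I will try it with the actual data.
Jezrael, when I tried it with the actual data I got the error **"ValueError: Index contains duplicate entries, cannot reshape"**.
I also tried the other solution you gave, but I have more columns, and some of them are string columns.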