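I have a function that takes the data for a particular year and returns a dataframe. How can I flatten the individual pandas dataframes and stack them to build a new dataframe?
For example: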
DF
year fruit license grade
1946 apple XYZ 1
1946 orange XYZ 1
1946 apple PQR 3
1946 orange PQR 1
1946 grape XYZ 2
1946 grape PQR 1
..
2014 grape LMN 1
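Notes: 1) A particular license value exists in only one year, and only once per fruit (e.g. XYZ exists only for 1946, and only once each for apple, orange and grape). 2) The grade values are categorical.
I realize the function below is not a very efficient way to reach the goal, but it is what I have working at the moment.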
from itertools import combinations

import numpy as np
import pandas as pd

def func(df, year):
    # 1. Filter out only the data for the year needed
    df_year = df[df['year'] == year]
    '''
    2. Transform DataFrame to the form:
               XYZ  PQR  ..  LMN
        apple    1    3        1
        orange   1    1        3
        grape    2    1        1
    Note that 'LMN' is just used for representation purposes.
    It won't logically appear here because it can only appear for the year 2014.
    '''
    df_year = df_year.pivot(index='fruit', columns='license', values='grade')
    # 3. Remove all fruits (rows) that have ANY NaN values
    df_year = df_year.dropna(axis=0, how="any")
    # 4. Some additional filtering
    # 5. Function to calculate similarity between fruits
    def similarity_score(fruit1, fruit2):
        # fruit1 and fruit2 are NumPy arrays of grades, one entry per license
        agreements = np.sum(((fruit1 == 1) & (fruit2 == 1)) |
                            ((fruit1 == 3) & (fruit2 == 3)))
        disagreements = np.sum(((fruit1 == 1) & (fruit2 == 3)) |
                               ((fruit1 == 3) & (fruit2 == 1)))
        return (((agreements - disagreements) / float(len(fruit1))) + 1) / 2
    # 6. Create Network dataframe, one row per pair of fruits
    network_df = pd.DataFrame(columns=['Source', 'Target', 'Weight'])
    for i, c in enumerate(combinations(df_year.index, 2)):
        c1 = df_year.loc[c[0]].values
        c2 = df_year.loc[c[1]].values
        network_df.loc[i] = [c[0], c[1], similarity_score(c1, c2)]
    return network_df
Running the above gives:
df_1946=func(df,1946)
df_1946.head()
Source Target Weight
Apple Orange 0.6
Apple Grape 0.3
Orange Grape 0.7
I want to flatten the above into a single row:
(Apple,Orange) (Apple,Grape) (Orange,Grape)
1946 0.6 0.3 0.7
Note that the real data will not have 3 columns but around 5000.
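One possible way to do this flattening step is sketched below. It assumes the per-year result has exactly the `Source`, `Target` and `Weight` columns shown above; the helper name `flatten_network` is hypothetical, and the pair labels are stored as a pandas `MultiIndex` on the columns rather than as plain tuples, which is a design choice and not a requirement.

```python
import pandas as pd

def flatten_network(network_df, year):
    """Turn a Source/Target/Weight frame into a single row labelled by `year`."""
    # Index the weights by (Source, Target) and name the Series after the year.
    weights = network_df.set_index(['Source', 'Target'])['Weight'].rename(year)
    # A one-column frame transposed gives one row whose columns are the pairs.
    return weights.to_frame().T

# Illustrative data copied from the example output above.
network_df = pd.DataFrame({
    'Source': ['Apple', 'Apple', 'Orange'],
    'Target': ['Orange', 'Grape', 'Grape'],
    'Weight': [0.6, 0.3, 0.7],
})
row_1946 = flatten_network(network_df, 1946)
print(row_1946)
```

A single cell can then be looked up with a tuple, for example `row_1946.loc[1946, ('Apple', 'Orange')]`.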
Finally, I want to stack the flattened dataframe rows to get something like:
df_all_years
(Apple,Orange) (Apple,Grape) (Orange,Grape)
1946 0.6 0.3 0.7
1947 0.7 0.25 0.8
..
2015 0.75 0.3 0.65
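What is the best way to do this?
'(Apple, Orange)' - is that a string or a tuple? – MaxU
A tuple. You can use anything you like, as long as there is a way to tell which combination a particular cell represents. – Melsauce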
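Given tuple-like column labels, the stacking step could be sketched as follows. This is a minimal sketch, not a definitive answer: the helper name `stack_years` and the dictionary `network_by_year` are hypothetical, and the two small frames only reuse the 1946 and 1947 values from the example in the question.

```python
import pandas as pd

def stack_years(network_by_year):
    """Combine {year: Source/Target/Weight frame} into one frame, one row per year."""
    series_by_year = {
        year: net.set_index(['Source', 'Target'])['Weight']
        for year, net in network_by_year.items()
    }
    # Concatenating along columns aligns the Series on their (Source, Target) labels;
    # pairs missing in a given year become NaN. Transpose so that years are rows.
    return pd.concat(series_by_year, axis=1).T.sort_index()

# Illustrative per-year results; in practice these might come from
# something like {year: func(df, year) for year in range(1946, 2016)}.
pairs = {'Source': ['Apple', 'Apple', 'Orange'], 'Target': ['Orange', 'Grape', 'Grape']}
network_by_year = {
    1946: pd.DataFrame({**pairs, 'Weight': [0.6, 0.3, 0.7]}),
    1947: pd.DataFrame({**pairs, 'Weight': [0.7, 0.25, 0.8]}),
}
df_all_years = stack_years(network_by_year)
print(df_all_years)
```

Building all the Series first and concatenating once avoids growing a dataframe row by row, which tends to matter with roughly 5000 columns per year.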