2017-08-19 41 views

I have a function that takes in data for a particular year and returns a DataFrame. How can I flatten the individual pandas DataFrames into single rows and stack those rows to build a new DataFrame?

For example:

df

year  fruit   license  grade
1946  apple   XYZ      1
1946  orange  XYZ      1
1946  apple   PQR      3
1946  orange  PQR      1
1946  grape   XYZ      2
1946  grape   PQR      1
..
2014  grape   LMN      1

Notes: 1) A particular license value exists only in one particular year, and only once per fruit (e.g. XYZ appears only in 1946, once each for apple, orange and grape). 2) The grade values are categorical.

I realize the function below is not very efficient at achieving the intended goal, but it is what I currently have working.

from itertools import combinations

import numpy as np
import pandas as pd


def func(df, year):
    # 1. Filter out only the data for the year needed
    df_year = df[df['year'] == year]
    '''
    2. Transform DataFrame to the form:
            XYZ  PQR  ..  LMN
    apple    1    3        1
    orange   1    1        3
    grape    2    1        1
    Note that 'LMN' is just used for representation purposes.
    It won't logically appear here because it can only appear for the year 2014.
    '''
    df_year = df_year.pivot(index='fruit', columns='license', values='grade')

    # 3. Remove all licenses (columns) that have ANY NaN values
    df_year = df_year.dropna(axis=1, how="any")

    # 4. Some additional filtering

    # 5. Function to calculate similarity between fruits
    def similarity_score(fruit1, fruit2):
        fruit1, fruit2 = np.asarray(fruit1), np.asarray(fruit2)
        agreements = np.sum(((fruit1 == 1) & (fruit2 == 1)) |
                            ((fruit1 == 3) & (fruit2 == 3)))

        disagreements = np.sum(((fruit1 == 1) & (fruit2 == 3)) |
                               ((fruit1 == 3) & (fruit2 == 1)))

        return ((agreements - disagreements) / float(len(fruit1)) + 1) / 2

    # 6. Create network DataFrame
    network_df = pd.DataFrame(columns=['Source', 'Target', 'Weight'])

    for i, c in enumerate(combinations(df_year, 2)):
        c1 = df_year[[c[0]]].values.tolist()
        c2 = df_year[[c[1]]].values.tolist()
        c1 = [item for sublist in c1 for item in sublist]
        c2 = [item for sublist in c2 for item in sublist]
        network_df.loc[i] = [c[0], c[1], similarity_score(c1, c2)]

    return network_df
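For reference, a standalone, syntax-corrected copy of the score function behaves like this on toy grade vectors (the numbers here are made up for illustration, not taken from my data):

```python
import numpy as np

def similarity_score(fruit1, fruit2):
    # Count positions where both grades agree on 1 or on 3,
    # subtract disagreements (1 vs 3), and rescale into [0, 1]
    fruit1, fruit2 = np.asarray(fruit1), np.asarray(fruit2)
    agreements = np.sum(((fruit1 == 1) & (fruit2 == 1)) |
                        ((fruit1 == 3) & (fruit2 == 3)))
    disagreements = np.sum(((fruit1 == 1) & (fruit2 == 3)) |
                           ((fruit1 == 3) & (fruit2 == 1)))
    return ((agreements - disagreements) / float(len(fruit1)) + 1) / 2

print(similarity_score([1, 3, 1, 2], [1, 3, 1, 2]))  # 0.875: 3 agreements, 0 disagreements over 4
```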

Running the above gives:

df_1946=func(df,1946) 
df_1946.head() 

Source Target Weight 
Apple  Orange  0.6 
Apple  Grape  0.3 
Orange Grape  0.7 

I would like to flatten the above into a single row:

 (Apple,Orange) (Apple,Grape) (Orange,Grape) 
1946  0.6    0.3   0.7 

Note that the above will not have just 3 columns, but in practice around 5,000 columns.

Finally, I would like to stack the transformed DataFrame rows to get something like:

df_all_years

 (Apple,Orange) (Apple,Grape) (Orange,Grape) 
1946  0.6    0.3   0.7 
1947  0.7    0.25   0.8 
.. 
2015  0.75   0.3   0.65 

What is the best way to do this?
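To make the target shape concrete, here is a minimal sketch of the flattening for one year (toy numbers from the example above; `set_index` on the pair plus a transpose is just one possible way):

```python
import pandas as pd

# Toy per-year result in the shape returned by func(df, 1946)
df_1946 = pd.DataFrame({'Source': ['Apple', 'Apple', 'Orange'],
                        'Target': ['Orange', 'Grape', 'Grape'],
                        'Weight': [0.6, 0.3, 0.7]})

# Index by the (Source, Target) pair, then transpose into a single row
row = df_1946.set_index(['Source', 'Target'])['Weight']
flat = row.to_frame(name=1946).T  # columns are (Source, Target) pairs
```

Here `flat.loc[1946, ('Apple', 'Orange')]` gives 0.6.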


'(Apple, Orange)' – is it a string or a tuple? – MaxU


A tuple. You can use whatever you like, as long as there is a way to tell which combination a particular cell represents. – Melsauce

Answers


I would arrange the computation a bit differently. Instead of looping over the years:

for year in range(1946, 2015): 
    partial_result = func(df, year) 

and then concatenating the partial results, you can get better performance by doing as much work as possible on the whole DataFrame, df, before calling df.groupby(...). Moreover, if you can express the computation in terms of built-in aggregators such as sum and count, it can be done much faster than with a custom function passed to groupby/apply.
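For what it's worth, once each year's result is a single row, stacking them is essentially one concat plus a transpose (a sketch with made-up weights; in practice each Series would come from the per-year computation):

```python
import pandas as pd

# Hypothetical one-row-per-year results, keyed by (Source, Target) pairs
partials = {
    1946: pd.Series({('Apple', 'Orange'): 0.6, ('Orange', 'Grape'): 0.7}),
    1947: pd.Series({('Apple', 'Orange'): 0.7, ('Orange', 'Grape'): 0.8}),
}

# Concatenate along columns, then transpose so years become the index
df_all_years = pd.concat(partials, axis=1).T
```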

import itertools as IT 
import numpy as np 
import pandas as pd 
np.random.seed(2017) 

def make_df():
    N = 10000
    df = pd.DataFrame({'fruit': np.random.choice(['Apple', 'Orange', 'Grape'], size=N),
                       'grade': np.random.choice([1, 2, 3], p=[0.7, 0.1, 0.2], size=N),
                       'year': np.random.choice(range(1946, 1950), size=N)})
    df['manufacturer'] = (df['year'].astype(str) + '-'
                          + df.groupby(['year', 'fruit'])['fruit'].cumcount().astype(str))
    df = df.sort_values(by=['year'])
    return df

def similarity_score(df):
    """
    Compute the score between each pair of columns in df
    """
    agreements = {}
    disagreements = {}
    for col in IT.combinations(df, 2):
        fruit1 = df[col[0]].values
        fruit2 = df[col[1]].values
        agreements[col] = (((fruit1 == 1) & (fruit2 == 1))
                           | ((fruit1 == 3) & (fruit2 == 3)))
        disagreements[col] = (((fruit1 == 1) & (fruit2 == 3))
                              | ((fruit1 == 3) & (fruit2 == 1)))
    agreements = pd.DataFrame(agreements, index=df.index)
    disagreements = pd.DataFrame(disagreements, index=df.index)
    numerator = agreements.astype(int) - disagreements.astype(int)
    grouped = numerator.groupby(level='year')
    total = grouped.sum()
    count = grouped.count()
    score = ((total / count) + 1) / 2
    return score

df = make_df() 
df2 = df.set_index(['year','fruit','manufacturer'])['grade'].unstack(['fruit']) 
df2 = df2.dropna(axis=0, how="any") 

print(similarity_score(df2)) 

which yields

         Apple               Grape
         Grape    Orange    Orange
year
1946  0.629111  0.650426  0.641900
1947  0.644388  0.639344  0.633039
1948  0.613117  0.630566  0.616727
1949  0.634176  0.635379  0.637786

I have edited the question and defined both df and func so that you can get a better idea of what's going on. Happy to provide more information. – Melsauce


Here is one way to do the regular pandas pivot-table reshaping you're referring to; it can handle the roughly 5,000 columns (formed by combining two initially separate categories) fast enough (the bottleneck step took about 20 seconds on my quad-core MacBook), though for much larger scales there are certainly faster strategies. The data in this example are quite sparse (5K columns, from 5K random samples over 70 rows of years [1947-2016]), so execution time might be a few seconds longer with a more fully populated DataFrame.

from itertools import chain 
import pandas as pd 
import numpy as np 
import random # using python3 .choices() 
import re 

# Make bivariate data w/ 5000 total combinations (1000x5 categories)
# Also choose 5,000 randomly; some combinations may have >1 values or NaN
random_sample_data = np.array(
    [random.choices(['Apple', 'Orange', 'Lemon', 'Lime'] +
                    ['of Fruit' + str(i) for i in range(1000)],
                    k=5000),
     random.choices(['Grapes', 'Are Purple', 'And Make Wine',
                     'From the Yeast', 'That Love Sugar'],
                    k=5000),
     [random.random() for _ in range(5000)]]
).T
df = pd.DataFrame(random_sample_data, columns=[
    "Source", "Target", "Weight"])
df['Year'] = random.choices(range(1947, 2017), k=df.shape[0])

# Three views of resulting df in jupyter notebook: 
df 
df[df.Year == 1947] 
df.groupby(["Source", "Target"]).count().unstack() 


To flatten the data grouped by year, since groupby needs a function to apply, you can use a temporary df as an intermediary:

  1. Collapse all of df.groupby("Year") into single rows, in which each of the columns "Target", "Source" (to be expanded later), and "Weight" holds an entire per-year Series.
  2. Use zip and pd.core.reshape.util.cartesian_product to create an empty, appropriately shaped pivot df that will become the final table, populated from df_temp.

For example,

df_temp = df.groupby("Year").apply(
    lambda s: pd.DataFrame([(s.Target, s.Source, s.Weight)],
                           columns=["Target", "Source", "Weight"])
).sort_index()
df_temp.index = df_temp.index.droplevel(1)  # reduce MultiIndex to 1-d

# Predetermine all possible pairwise column category combinations
product_ts = [*zip(*pd.core.reshape.util.cartesian_product(
    [df.Target.unique(), df.Source.unique()]
))]

ts_combinations = [str(x + ' ' + y) for (x, y) in product_ts] 

ts_combinations 


Finally, use simple nested iteration (again, not the fastest approach, though pd.DataFrame.iterrows may help speed it up, as shown). Because the random sampling was done with replacement, I had to handle multiple values per cell, so you may want to remove the second condition below; this is the step in which the three separate per-year DataFrames are compressed into single rows, with all cells filled via the pivoted ("Weight") x ("Target"-"Source") relationship.

df_pivot = pd.DataFrame(np.zeros((70, 5000)),
                        columns=ts_combinations)
df_pivot.index = df_temp.index

for year, values in df_temp.iterrows():

    for (target, source, weight) in zip(*values):

        bivar_pair = str(target + ' ' + source)
        curr_weight = df_pivot.loc[year, bivar_pair]

        if curr_weight == 0.0:
            df_pivot.loc[year, bivar_pair] = [weight]
        # append additional values if encountered
        elif type(curr_weight) == list:
            df_pivot.loc[year, bivar_pair] = str(curr_weight + [weight])


# Spotcheck: 
# Verifies matching data in pivoted table vs. original for Target+Source 
# combination "And Make Wine of Fruit614" across all 70 years 1947-2016 
df 
df_pivot['And Make Wine of Fruit614'] 
df[(df.Year == 1947) & (df.Target == 'And Make Wine') & (df.Source == 'of Fruit614')]
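As a footnote: if each (Year, Target, Source) combination held at most one Weight, the year-by-pair table that the nested loops build could be sketched directly with pivot_table (toy numeric data below; aggfunc='mean' is an assumed policy for any duplicates):

```python
import pandas as pd

# Toy frame in the Source/Target/Weight/Year shape used above
toy = pd.DataFrame({'Source': ['Apple', 'Orange', 'Apple'],
                    'Target': ['Orange', 'Grape', 'Orange'],
                    'Weight': [0.6, 0.7, 0.65],
                    'Year':   [1946, 1946, 1947]})

# One row per year, one column per (Target, Source) pair
pivoted = toy.pivot_table(index='Year', columns=['Target', 'Source'],
                          values='Weight', aggfunc='mean')
```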