2017-08-19 41 views

I have a function that takes in data for a particular year and returns a DataFrame. How can I flatten the individual pandas DataFrames into single rows and stack those rows to build a new DataFrame?

For example:

df

year  fruit   license  grade
1946  apple   XYZ      1
1946  orange  XYZ      1
1946  apple   PQR      3
1946  orange  PQR      1
1946  grape   XYZ      2
1946  grape   PQR      1
..
2014  grape   LMN      1

Notes: 1) A particular license value exists only in one particular year, and only once per fruit (e.g. XYZ appears only in 1946, once each for apple, orange and grape). 2) The grade values are categorical.

I realize the function below is not very efficient at achieving the intended goal, but it is what I currently have working.

from itertools import combinations

import numpy as np
import pandas as pd


def func(df, year):
    # 1. Filter out only the data for the year needed
    df_year = df[df['year'] == year]
    '''
    2. Transform DataFrame to the form:
            XYZ  PQR  ..  LMN
    apple    1    3        1
    orange   1    1        3
    grape    2    1        1
    Note that 'LMN' is just used for representation purposes.
    It won't logically appear here because it can only appear for the year 2014.
    '''
    df_year = df_year.pivot(index='fruit', columns='license', values='grade')

    # 3. Remove all licenses (columns) that have ANY NaN values
    df_year = df_year.dropna(axis=1, how="any")

    # 4. Some additional filtering

    # 5. Function to calculate similarity between fruits
    def similarity_score(fruit1, fruit2):
        fruit1, fruit2 = np.asarray(fruit1), np.asarray(fruit2)
        agreements = np.sum(((fruit1 == 1) & (fruit2 == 1)) |
                            ((fruit1 == 3) & (fruit2 == 3)))

        disagreements = np.sum(((fruit1 == 1) & (fruit2 == 3)) |
                               ((fruit1 == 3) & (fruit2 == 1)))

        return ((agreements - disagreements) / float(len(fruit1)) + 1) / 2

    # 6. Create network DataFrame
    network_df = pd.DataFrame(columns=['Source', 'Target', 'Weight'])

    for i, c in enumerate(combinations(df_year, 2)):
        c1 = df_year[[c[0]]].values.tolist()
        c2 = df_year[[c[1]]].values.tolist()
        c1 = [item for sublist in c1 for item in sublist]
        c2 = [item for sublist in c2 for item in sublist]
        network_df.loc[i] = [c[0], c[1], similarity_score(c1, c2)]

    return network_df
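For reference, a standalone, syntax-corrected copy of the score function behaves like this on toy grade vectors (the numbers here are made up for illustration, not taken from my data):

```python
import numpy as np

def similarity_score(fruit1, fruit2):
    # Count positions where both grades agree on 1 or on 3,
    # subtract disagreements (1 vs 3), and rescale into [0, 1]
    fruit1, fruit2 = np.asarray(fruit1), np.asarray(fruit2)
    agreements = np.sum(((fruit1 == 1) & (fruit2 == 1)) |
                        ((fruit1 == 3) & (fruit2 == 3)))
    disagreements = np.sum(((fruit1 == 1) & (fruit2 == 3)) |
                           ((fruit1 == 3) & (fruit2 == 1)))
    return ((agreements - disagreements) / float(len(fruit1)) + 1) / 2

print(similarity_score([1, 3, 1, 2], [1, 3, 1, 2]))  # 0.875: 3 agreements, 0 disagreements over 4
```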

Running the above gives:

df_1946=func(df,1946) 
df_1946.head() 

Source Target Weight 
Apple  Orange  0.6 
Apple  Grape  0.3 
Orange Grape  0.7 

I would like to flatten the above into a single row:

 (Apple,Orange) (Apple,Grape) (Orange,Grape) 
1946  0.6    0.3   0.7 

Note that the above will not have just 3 columns, but in practice around 5,000 columns.

Finally, I would like to stack the transformed DataFrame rows to get something like:

df_all_years

 (Apple,Orange) (Apple,Grape) (Orange,Grape) 
1946  0.6    0.3   0.7 
1947  0.7    0.25   0.8 
.. 
2015  0.75   0.3   0.65 

What is the best way to do this?
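To make the target shape concrete, here is a minimal sketch of the flattening for one year (toy numbers from the example above; `set_index` on the pair plus a transpose is just one possible way):

```python
import pandas as pd

# Toy per-year result in the shape returned by func(df, 1946)
df_1946 = pd.DataFrame({'Source': ['Apple', 'Apple', 'Orange'],
                        'Target': ['Orange', 'Grape', 'Grape'],
                        'Weight': [0.6, 0.3, 0.7]})

# Index by the (Source, Target) pair, then transpose into a single row
row = df_1946.set_index(['Source', 'Target'])['Weight']
flat = row.to_frame(name=1946).T  # columns are (Source, Target) pairs
```

Here `flat.loc[1946, ('Apple', 'Orange')]` gives 0.6.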


'(Apple, Orange)' – is it a string or a tuple? – MaxU


A tuple. You can use whatever you like, as long as there is a way to tell which combination a particular cell represents. – Melsauce

Answers


I would arrange the computation a bit differently. Instead of looping over the years:

for year in range(1946, 2015): 
    partial_result = func(df, year) 

and then concatenating the partial results, you can get better performance by doing as much work as possible on the whole DataFrame, df, before calling df.groupby(...). Moreover, if you can express the computation in terms of built-in aggregators such as sum and count, it can be done much faster than with a custom function passed to groupby/apply.
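For what it's worth, once each year's result is a single row, stacking them is essentially one concat plus a transpose (a sketch with made-up weights; in practice each Series would come from the per-year computation):

```python
import pandas as pd

# Hypothetical one-row-per-year results, keyed by (Source, Target) pairs
partials = {
    1946: pd.Series({('Apple', 'Orange'): 0.6, ('Orange', 'Grape'): 0.7}),
    1947: pd.Series({('Apple', 'Orange'): 0.7, ('Orange', 'Grape'): 0.8}),
}

# Concatenate along columns, then transpose so years become the index
df_all_years = pd.concat(partials, axis=1).T
```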

import itertools as IT 
import numpy as np 
import pandas as pd 
np.random.seed(2017) 

def make_df():
    N = 10000
    df = pd.DataFrame({'fruit': np.random.choice(['Apple', 'Orange', 'Grape'], size=N),
                       'grade': np.random.choice([1, 2, 3], p=[0.7, 0.1, 0.2], size=N),
                       'year': np.random.choice(range(1946, 1950), size=N)})
    df['manufacturer'] = (df['year'].astype(str) + '-'
                          + df.groupby(['year', 'fruit'])['fruit'].cumcount().astype(str))
    df = df.sort_values(by=['year'])
    return df

def similarity_score(df):
    """
    Compute the score between each pair of columns in df
    """
    agreements = {}
    disagreements = {}
    for col in IT.combinations(df, 2):
        fruit1 = df[col[0]].values
        fruit2 = df[col[1]].values
        agreements[col] = (((fruit1 == 1) & (fruit2 == 1))
                           | ((fruit1 == 3) & (fruit2 == 3)))
        disagreements[col] = (((fruit1 == 1) & (fruit2 == 3))
                              | ((fruit1 == 3) & (fruit2 == 1)))
    agreements = pd.DataFrame(agreements, index=df.index)
    disagreements = pd.DataFrame(disagreements, index=df.index)
    numerator = agreements.astype(int) - disagreements.astype(int)
    grouped = numerator.groupby(level='year')
    total = grouped.sum()
    count = grouped.count()
    score = ((total / count) + 1) / 2
    return score

df = make_df() 
df2 = df.set_index(['year','fruit','manufacturer'])['grade'].unstack(['fruit']) 
df2 = df2.dropna(axis=0, how="any") 

print(similarity_score(df2)) 

which yields

         Apple               Grape
         Grape    Orange    Orange
year
1946  0.629111  0.650426  0.641900
1947  0.644388  0.639344  0.633039
1948  0.613117  0.630566  0.616727
1949  0.634176  0.635379  0.637786

I have edited the question and defined both df and func so that you can get a better idea of what's going on. Happy to provide more information. – Melsauce


Here is one way to do the regular pandas pivot-table reshaping you're referring to; it can handle the roughly 5,000 columns (formed by combining two initially separate categories) fast enough (the bottleneck step took about 20 seconds on my quad-core MacBook), though for much larger scales there are certainly faster strategies. The data in this example are quite sparse (5K columns, from 5K random samples over 70 rows of years [1947-2016]), so execution time might be a few seconds longer with a more fully populated DataFrame.

from itertools import chain 
import pandas as pd 
import numpy as np 
import random # using python3 .choices() 
import re 

# Make bivariate data w/ 5000 total combinations (1000x5 categories)
# Also choose 5,000 randomly; some combinations may have >1 values or NaN
random_sample_data = np.array(
    [random.choices(['Apple', 'Orange', 'Lemon', 'Lime'] +
                    ['of Fruit' + str(i) for i in range(1000)],
                    k=5000),
     random.choices(['Grapes', 'Are Purple', 'And Make Wine',
                     'From the Yeast', 'That Love Sugar'],
                    k=5000),
     [random.random() for _ in range(5000)]]
).T
df = pd.DataFrame(random_sample_data, columns=[
    "Source", "Target", "Weight"])
df['Year'] = random.choices(range(1947, 2017), k=df.shape[0])

# Three views of resulting df in jupyter notebook: 
df 
df[df.Year == 1947] 
df.groupby(["Source", "Target"]).count().unstack() 


To flatten the data grouped by year, since groupby needs a function to apply, you can use a temporary df as an intermediary:

  1. Collapse all of df.groupby("Year") into single rows, in which each of the columns "Target", "Source" (to be expanded later), and "Weight" holds an entire per-year Series.
  2. Use zip and pd.core.reshape.util.cartesian_product to create an empty, appropriately shaped pivot df that will become the final table, populated from df_temp.

For example,

df_temp = df.groupby("Year").apply(
    lambda s: pd.DataFrame([(s.Target, s.Source, s.Weight)],
                           columns=["Target", "Source", "Weight"])
).sort_index()
df_temp.index = df_temp.index.droplevel(1)  # reduce MultiIndex to 1-d

# Predetermine all possible pairwise column category combinations
product_ts = [*zip(*pd.core.reshape.util.cartesian_product(
    [df.Target.unique(), df.Source.unique()]
))]

ts_combinations = [str(x + ' ' + y) for (x, y) in product_ts] 

ts_combinations 


Finally, use simple nested iteration (again, not the fastest approach, though pd.DataFrame.iterrows may help speed it up, as shown). Because the random sampling was done with replacement, I had to handle multiple values per cell, so you may want to remove the second condition below; this is the step in which the three separate per-year DataFrames are compressed into single rows, with all cells filled via the pivoted ("Weight") x ("Target"-"Source") relationship.

df_pivot = pd.DataFrame(np.zeros((70, 5000)),
                        columns=ts_combinations)
df_pivot.index = df_temp.index

for year, values in df_temp.iterrows():

    for (target, source, weight) in zip(*values):

        bivar_pair = str(target + ' ' + source)
        curr_weight = df_pivot.loc[year, bivar_pair]

        if curr_weight == 0.0:
            df_pivot.loc[year, bivar_pair] = [weight]
        # append additional values if encountered
        elif type(curr_weight) == list:
            df_pivot.loc[year, bivar_pair] = str(curr_weight + [weight])


# Spotcheck: 
# Verifies matching data in pivoted table vs. original for Target+Source 
# combination "And Make Wine of Fruit614" across all 70 years 1947-2016 
df 
df_pivot['And Make Wine of Fruit614'] 
df[(df.Year == 1947) & (df.Target == 'And Make Wine') & (df.Source == 'of Fruit614')]
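As a footnote: if each (Year, Target, Source) combination held at most one Weight, the year-by-pair table that the nested loops build could be sketched directly with pivot_table (toy numeric data below; aggfunc='mean' is an assumed policy for any duplicates):

```python
import pandas as pd

# Toy frame in the Source/Target/Weight/Year shape used above
toy = pd.DataFrame({'Source': ['Apple', 'Orange', 'Apple'],
                    'Target': ['Orange', 'Grape', 'Orange'],
                    'Weight': [0.6, 0.7, 0.65],
                    'Year':   [1946, 1946, 1947]})

# One row per year, one column per (Target, Source) pair
pivoted = toy.pivot_table(index='Year', columns=['Target', 'Source'],
                          values='Weight', aggfunc='mean')
```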