2016-09-12 62 views

Compare 2 pandas dataframes, row by row, cell by cell

I have 2 dataframes, df1 and df2, and want to do the following, storing the results in df3:

for each row in df1:
    for each row in df2:
        create a new row in df3 (called "df1-1, df2-1" or whatever) to store results
        for each cell (column) in df1:
            for the cell in df2 whose column name is the same as for the cell in df1:
                compare the cells (using some comparing function func(a,b)) and,
                depending on the result of the comparison, write result into the
                appropriate column of the "df1-1, df2-1" row of df3

For example, something like:

df1
A    B     C       D
foo  bar   foobar  7
gee  whiz  herp    10

df2
A    B    C       D
zoo  car  foobar  8

df3
df1-df2  A              B               C                    D
foo-zoo  func(foo,zoo)  func(bar,car)   func(foobar,foobar)  func(7,8)
gee-zoo  func(gee,zoo)  func(whiz,car)  func(herp,foobar)    func(10,8)

I have started with this:

for r1 in df1.iterrows(): 
    for r2 in df2.iterrows(): 
     for c1 in r1: 
      for c2 in r2: 

but I am not sure how to proceed and would appreciate some help.

Since you are applying func to columns of the same name, could you iterate only over the columns and use vectorization, e.g. df3['A'] = func(df1['A'], df2['A']), and so on? – StarFox

@StarFox Interesting, so I could do something like: for column in df3: df3[column] = func(df1[column], df2[column])? – Zubo

Sure! That is the power of pandas/numpy (vectorization, in general). I will provide some examples below and we can go from there. – StarFox

Answers

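So, to continue the discussion from the comments: you can use vectorization, which is one of the selling points of a library like pandas or numpy. Ideally, you should never need to call iterrows(). To be a bit more explicit about my suggestion: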

# with df1 and df2 provided as above, an example 
df3 = df1['A'] * 3 + df2['A'] 

# recall that df2 only has the one row, so index alignment leaves a NaN at index 1
df3
0    foofoofoozoo
1             NaN
Name: A, dtype: object

# more generally 

# we know that df1 and df2 share column names, so we can initialize df3 with those names 
df3 = pd.DataFrame(columns=df1.columns) 
for colName in df1: 
    df3[colName] = func(df1[colName], df2[colName]) 
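The loop above lines rows up by index, so a one-row df2 only pairs with the first row of df1. If every row of df1 should be compared with every row of df2, as in the df3 sketched in the question, one option is to build all row pairs first and then apply func column by column. The following is only a sketch under assumptions: the equality-based `func`, the `_key` helper column, and the `_1`/`_2` suffixes are illustrative choices, not part of the original code.

```python
import pandas as pd

# Sample frames copied from the question.
df1 = pd.DataFrame({"A": ["foo", "gee"], "B": ["bar", "whiz"],
                    "C": ["foobar", "herp"], "D": [7, 10]})
df2 = pd.DataFrame({"A": ["zoo"], "B": ["car"], "C": ["foobar"], "D": [8]})

def func(a, b):
    # Hypothetical comparison: element-wise equality of two Series.
    return a.eq(b)

# Pair every row of df1 with every row of df2 via a constant join key.
# Shared column names get the suffixes "_1" (df1) and "_2" (df2).
pairs = df1.assign(_key=1).merge(df2.assign(_key=1), on="_key",
                                 suffixes=("_1", "_2"))

# Label each pair like "foo-zoo", then compare same-named columns.
labels = pd.Index(pairs["A_1"] + "-" + pairs["A_2"], name="df1-df2")
df3 = pd.DataFrame(index=labels)
for col in df1.columns:
    df3[col] = func(pairs[col + "_1"], pairs[col + "_2"]).to_numpy()
```

Each column comparison is still vectorized; only the pairing step grows with len(df1) * len(df2), so this may get memory-heavy for large frames.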

Now, you can even apply different functions to different columns by, say, creating lambda functions and then zipping them with the column names:

# some example functions 
colAFunc = lambda x, y: x + y 
colBFunc = lambda x, y: x - y
.... 
columnFunctions = [colAFunc, colBFunc, ...] 

# initialize df3 as above 
df3 = pd.DataFrame(columns=df1.columns) 
for func, colName in zip(columnFunctions, df1.columns): 
    df3[colName] = func(df1[colName], df2[colName]) 
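A variant of the zip approach is to key the functions by column name, which avoids relying on the order of the list matching the order of the columns. This is a minimal sketch; the sample data and the two lambdas are made up for illustration.

```python
import pandas as pd

# Small illustrative frames with matching indexes.
df1 = pd.DataFrame({"A": ["foo", "gee"], "D": [7, 10]})
df2 = pd.DataFrame({"A": ["zoo", "zoo"], "D": [8, 8]})

# Hypothetical per-column comparison functions, keyed by column name.
column_functions = {
    "A": lambda x, y: x + "-" + y,  # string column: concatenate
    "D": lambda x, y: x - y,        # numeric column: difference
}

df3 = pd.DataFrame(index=df1.index)
for col_name, col_func in column_functions.items():
    df3[col_name] = col_func(df1[col_name], df2[col_name])
```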

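The only "gotcha" that comes to mind is that you need to make sure your function is applicable to the data in the column. For example, if you were to do something like df1['A'] - df2['A'] (with df1, df2 as you provided), it would raise a TypeError, since subtracting one string from another is undefined. Just something to be aware of.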
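One way to guard against that is to check the column dtype before choosing the operation. The sketch below assumes a hypothetical helper, `safe_difference`, that subtracts numeric columns and falls back to an equality check for everything else; the fallback is just one possible choice.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df1 = pd.DataFrame({"A": ["foo", "gee"], "D": [7, 10]})
df2 = pd.DataFrame({"A": ["zoo", "zoo"], "D": [8, 8]})

def safe_difference(a, b):
    # Hypothetical helper: subtract only when both columns are numeric,
    # otherwise fall back to an element-wise equality check.
    if is_numeric_dtype(a) and is_numeric_dtype(b):
        return a - b
    return a.eq(b)

df3 = pd.DataFrame({col: safe_difference(df1[col], df2[col])
                    for col in df1.columns})
```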


Edit (re: your comment): That is doable as well. Iterate over the larger of the dfX.columns so you do not run into a KeyError, and throw an if statement in there:

# all the other jazz
# let's say df1 is [['A', 'B', 'C']] and df2 is [['A', 'B', 'C', 'D']]
# so iterate over df2 columns
for colName in df2:
    if colName not in df1:
        df3[colName] = np.nan  # be sure to import numpy as np
    else:
        df3[colName] = func(df1[colName], df2[colName])
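If neither frame's columns are a superset of the other's, iterating over the union of both column sets covers either direction. This is a sketch with made-up data and an assumed equality-based `func`; columns present in only one frame are filled with NaN.

```python
import numpy as np
import pandas as pd

# Illustrative frames: "C" exists only in df1, "D" only in df2.
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})
df2 = pd.DataFrame({"A": [1, 0], "B": [3, 4], "D": [7, 8]})

def func(a, b):
    # Hypothetical comparison: element-wise equality.
    return a.eq(b)

# Union of the column names from both frames (returned in sorted order).
all_columns = df1.columns.union(df2.columns)

df3 = pd.DataFrame(index=df1.index)
for col_name in all_columns:
    if col_name in df1 and col_name in df2:
        df3[col_name] = func(df1[col_name], df2[col_name])
    else:
        df3[col_name] = np.nan  # column is missing from one of the frames
```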

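Yes, this is very useful and I have accepted it as the answer, thanks a lot for taking the time! Could this be modified for the case where the number of columns is not equal? I.e., there might be columns in df1 that do not exist in df2; the comparison function should then just output something like N/A. – Zubo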