2017-05-07 284 views
3

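pandas - conditionally merge dataframes on multiple columns

I have 2 dataframes, and I want to take one column from one of them and create a new column in the second, based on the values in multiple (other) columns.

First dataframe (df1):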

import numpy as np 
import pandas as pd 

df1 = pd.DataFrame({'cond': np.repeat([1,2], 5), 
        'point': np.tile(np.arange(1,6), 2), 
        'value1': np.random.rand(10), 
        'unused1': np.random.rand(10)}) 

    cond point unused1 value1 
0  1  1 0.923699 0.103046 
1  1  2 0.046528 0.188408 
2  1  3 0.677052 0.481349 
3  1  4 0.464000 0.807454 
4  1  5 0.180575 0.962032 
5  2  1 0.941624 0.437961 
6  2  2 0.489738 0.026166 
7  2  3 0.739453 0.109630 
8  2  4 0.338997 0.415101 
9  2  5 0.310235 0.660748 

And the second (df2):

df2 = pd.DataFrame({'cond': np.repeat([1,2], 10), 
        'point': np.tile(np.arange(1,6), 4), 
        'value2': np.random.rand(20)}) 

    cond point value2 
0  1  1 0.990252 
1  1  2 0.534813 
2  1  3 0.407325 
3  1  4 0.969288 
4  1  5 0.085832 
5  1  1 0.922026 
6  1  2 0.567615 
7  1  3 0.174402 
8  1  4 0.469556 
9  1  5 0.511182 
10  2  1 0.219902 
11  2  2 0.761498 
12  2  3 0.406981 
13  2  4 0.551322 
14  2  5 0.727761 
15  2  1 0.075048 
16  2  2 0.159903 
17  2  3 0.726013 
18  2  4 0.848213 
19  2  5 0.284404 

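df1['value1'] contains a value for each combination of cond and point.

I want to create a new column (new_column) in df2 that contains the values from df1['value1'], taking for each row the value whose cond and point match across the 2 dataframes.

So my desired output looks like this: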

cond point value2 new_column 
0  1  1 0.990252 0.103046 
1  1  2 0.534813 0.188408 
2  1  3 0.407325 0.481349 
3  1  4 0.969288 0.807454 
4  1  5 0.085832 0.962032 
5  1  1 0.922026 0.103046 
6  1  2 0.567615 0.188408 
7  1  3 0.174402 0.481349 
8  1  4 0.469556 0.807454 
9  1  5 0.511182 0.962032 
10  2  1 0.219902 0.437961 
11  2  2 0.761498 0.026166 
12  2  3 0.406981 0.109630 
13  2  4 0.551322 0.415101 
14  2  5 0.727761 0.660748 
15  2  1 0.075048 0.437961 
16  2  2 0.159903 0.026166 
17  2  3 0.726013 0.109630 
18  2  4 0.848213 0.415101 
19  2  5 0.284404 0.660748 

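In this example I could just use tile/repeat, but in reality df1['value1'] does not fit so neatly into the other dataframe. So I need to do it based on matching cond and point.

I have tried merging them, but 1) the numbers don't seem to match and 2) I don't want to bring over any of the unused columns from df1: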

df1.merge(df2, left_on=['cond', 'point'], right_on=['cond', 'point'])

What is the right way to add this new column without having to iterate through the 2 dataframes?

Answers

2

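Option 1
For elegance and speed with pure pandas, we can use lookup.
This produces the same output as all the other options, shown below.

The idea is to represent the lookup data as a two-dimensional array and index into it with the lookup values.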

d1 = df1.set_index(['cond', 'point']).value1.unstack() 
df2.assign(new_column=d1.lookup(df2.cond, df2.point)) 
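
A note for readers on newer pandas: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so Option 1 will not run there as written. The following is only a sketch of the same idea (a 2-D table indexed by position) using get_indexer. The seeded sample frames and the name result are my own stand-ins, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Stand-ins for the question's frames (seeded so the run is repeatable).
rng = np.random.default_rng(0)
df1 = pd.DataFrame({'cond': np.repeat([1, 2], 5),
                    'point': np.tile(np.arange(1, 6), 2),
                    'value1': rng.random(10)})
df2 = pd.DataFrame({'cond': np.repeat([1, 2], 10),
                    'point': np.tile(np.arange(1, 6), 4),
                    'value2': rng.random(20)})

# Same 2-D table as in Option 1: rows are cond, columns are point.
d1 = df1.set_index(['cond', 'point'])['value1'].unstack()

# Translate labels to integer positions; get_indexer returns -1 for
# labels it cannot find, which would silently pick the last row/column.
rows = d1.index.get_indexer(df2['cond'].to_numpy())
cols = d1.columns.get_indexer(df2['point'].to_numpy())
if (rows < 0).any() or (cols < 0).any():
    raise KeyError('df2 has a cond or point label that is missing from df1')

# Positional indexing into the underlying array replaces lookup.
result = df2.assign(new_column=d1.to_numpy()[rows, cols])
```

A (cond, point) pair that is absent from df1 while both labels exist individually comes back as NaN, because unstack fills the missing cell.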

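Option 2
We can do the same thing with numpy to improve performance, provided the values are laid out the same way they are presented in df1 (sorted by cond, then point, with both starting at 1). This is very fast!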

a = df1.value1.values.reshape(2, -1) 
df2.assign(new_column=a[df2.cond.values - 1, df2.point.values - 1]) 
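
Option 2 relies on the row order of df1. If that cannot be guaranteed, one possible variation is to fill a dense table by key first. This is a sketch under the assumption that cond and point are small positive integers starting at 1; the names table and result are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame({'cond': np.repeat([1, 2], 5),
                    'point': np.tile(np.arange(1, 6), 2),
                    'value1': rng.random(10)})
df2 = pd.DataFrame({'cond': np.repeat([1, 2], 10),
                    'point': np.tile(np.arange(1, 6), 4),
                    'value2': rng.random(20)})

# Shuffle df1 to show that its row order no longer matters.
df1 = df1.sample(frac=1, random_state=0).reset_index(drop=True)

# Dense table with one cell per (cond, point); cells never written stay NaN.
table = np.full((df1['cond'].max(), df1['point'].max()), np.nan)
table[df1['cond'].to_numpy() - 1,
      df1['point'].to_numpy() - 1] = df1['value1'].to_numpy()

# Same fancy indexing as Option 2, now against the key-addressed table.
result = df2.assign(
    new_column=table[df2['cond'].to_numpy() - 1, df2['point'].to_numpy() - 1])
```

Keys in df2 that are larger than anything in df1 raise an IndexError here, and the table size grows with the largest key, so this only suits compact integer keys.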

Option 3
The canonical answer is to use merge with the 'left' parameter.
But we need to prep df1 a bit to nail the output.

d1 = df1[['cond', 'point', 'value1']].rename(columns={'value1': 'new_column'}) 
df2.merge(d1, 'left') 
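
Since the question mentions that the real data is less tidy, it may help to let merge check the keys. The sketch below uses the validate and indicator parameters of merge. Dropping the last row of df1 is my own contrivance to produce unmatched rows, and the names merged, unmatched and result are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame({'cond': np.repeat([1, 2], 5),
                    'point': np.tile(np.arange(1, 6), 2),
                    'value1': rng.random(10),
                    'unused1': rng.random(10)})
df2 = pd.DataFrame({'cond': np.repeat([1, 2], 10),
                    'point': np.tile(np.arange(1, 6), 4),
                    'value2': rng.random(20)})

# Contrived gap: remove (cond=2, point=5) from the lookup side.
d1 = (df1.iloc[:-1][['cond', 'point', 'value1']]
         .rename(columns={'value1': 'new_column'}))

# validate raises MergeError if a key is duplicated in d1;
# indicator adds a '_merge' column saying where each row came from.
merged = df2.merge(d1, how='left', on=['cond', 'point'],
                   validate='many_to_one', indicator=True)

# Rows of df2 that found no partner in d1.
unmatched = merged[merged['_merge'] == 'left_only']
result = merged.drop(columns='_merge')
```

With how='left' the result keeps every row of df2, and unmatched rows simply carry NaN in new_column.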

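Option 4
I thought this one would be fun: build a mapping dictionary and a series to map.
Good for small data, not for large data. See the timings below.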

c1 = df1.cond.values.tolist() 
p1 = df1.point.values.tolist() 
v1 = df1.value1.values.tolist() 
m = {(c, p): v for c, p, v in zip(c1, p1, v1)} 

c2 = df2.cond.values.tolist() 
p2 = df2.point.values.tolist() 
i2 = df2.index.values.tolist() 
s2 = pd.Series({i: (c, p) for i, c, p in zip(i2, c2, p2)}) 

df2.assign(new_column=s2.map(m)) 
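
For comparison, the same dictionary idea can be written without the intermediate Series. This is a sketch rather than the answerer's code; it assumes missing keys should become NaN, and result is an illustrative name.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame({'cond': np.repeat([1, 2], 5),
                    'point': np.tile(np.arange(1, 6), 2),
                    'value1': rng.random(10)})
df2 = pd.DataFrame({'cond': np.repeat([1, 2], 10),
                    'point': np.tile(np.arange(1, 6), 4),
                    'value2': rng.random(20)})

# Map each (cond, point) tuple in df1 to its value1.
m = dict(zip(zip(df1['cond'].tolist(), df1['point'].tolist()),
             df1['value1'].tolist()))

# Look up every df2 row in plain Python; unknown keys fall back to NaN.
keys = zip(df2['cond'].tolist(), df2['point'].tolist())
result = df2.assign(new_column=[m.get(k, np.nan) for k in keys])
```

This still loops in Python once per row, so the caveat about large data applies to it as well.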

OUTPUT

cond point value2 new_column 
0  1  1 0.990252 0.103046 
1  1  2 0.534813 0.188408 
2  1  3 0.407325 0.481349 
3  1  4 0.969288 0.807454 
4  1  5 0.085832 0.962032 
5  1  1 0.922026 0.103046 
6  1  2 0.567615 0.188408 
7  1  3 0.174402 0.481349 
8  1  4 0.469556 0.807454 
9  1  5 0.511182 0.962032 
10  2  1 0.219902 0.437961 
11  2  2 0.761498 0.026166 
12  2  3 0.406981 0.109630 
13  2  4 0.551322 0.415101 
14  2  5 0.727761 0.660748 
15  2  1 0.075048 0.437961 
16  2  2 0.159903 0.026166 
17  2  3 0.726013 0.109630 
18  2  4 0.848213 0.415101 
19  2  5 0.284404 0.660748 

Timings
Small data

%%timeit 
a = df1.value1.values.reshape(2, -1) 
df2.assign(new_column=a[df2.cond.values - 1, df2.point.values - 1]) 
1000 loops, best of 3: 304 µs per loop 

%%timeit 
d1 = df1.set_index(['cond', 'point']).value1.unstack() 
df2.assign(new_column=d1.lookup(df2.cond, df2.point)) 
100 loops, best of 3: 1.8 ms per loop 

%%timeit 
c1 = df1.cond.values.tolist() 
p1 = df1.point.values.tolist() 
v1 = df1.value1.values.tolist() 
m = {(c, p): v for c, p, v in zip(c1, p1, v1)} 

c2 = df2.cond.values.tolist() 
p2 = df2.point.values.tolist() 
i2 = df2.index.values.tolist() 
s2 = pd.Series({i: (c, p) for i, c, p in zip(i2, c2, p2)}) 

df2.assign(new_column=s2.map(m)) 
1000 loops, best of 3: 719 µs per loop 

%%timeit 
d1 = df1[['cond', 'point', 'value1']].rename(columns={'value1': 'new_column'}) 
df2.merge(d1, 'left') 
100 loops, best of 3: 2.04 ms per loop 

%%timeit 
df = pd.merge(df2, df1.drop('unused1', axis=1), 'left') 
df.rename(columns={'value1': 'new_column'}) 
100 loops, best of 3: 2.01 ms per loop 

%%timeit 
df = df2.join(df1.drop('unused1', axis=1).set_index(['cond', 'point']), on=['cond', 'point']) 
df.rename(columns={'value1': 'new_column'}) 
100 loops, best of 3: 2.15 ms per loop 

Large data

df2 = pd.concat([df2] * 10000, ignore_index=True) 

%%timeit 
a = df1.value1.values.reshape(2, -1) 
df2.assign(new_column=a[df2.cond.values - 1, df2.point.values - 1]) 
1000 loops, best of 3: 1.93 ms per loop 

%%timeit 
d1 = df1.set_index(['cond', 'point']).value1.unstack() 
df2.assign(new_column=d1.lookup(df2.cond, df2.point)) 
100 loops, best of 3: 5.58 ms per loop 

%%timeit 
c1 = df1.cond.values.tolist() 
p1 = df1.point.values.tolist() 
v1 = df1.value1.values.tolist() 
m = {(c, p): v for c, p, v in zip(c1, p1, v1)} 

c2 = df2.cond.values.tolist() 
p2 = df2.point.values.tolist() 
i2 = df2.index.values.tolist() 
s2 = pd.Series({i: (c, p) for i, c, p in zip(i2, c2, p2)}) 

df2.assign(new_column=s2.map(m)) 
10 loops, best of 3: 135 ms per loop 

%%timeit 
d1 = df1[['cond', 'point', 'value1']].rename(columns={'value1': 'new_column'}) 
df2.merge(d1, 'left') 
100 loops, best of 3: 13.4 ms per loop 

%%timeit 
df = pd.merge(df2, df1.drop('unused1', axis=1), 'left') 
df.rename(columns={'value1': 'new_column'}) 
10 loops, best of 3: 19.8 ms per loop 

%%timeit 
df = df2.join(df1.drop('unused1', axis=1).set_index(['cond', 'point']), on=['cond', 'point']) 
df.rename(columns={'value1': 'new_column'}) 
100 loops, best of 3: 18.2 ms per loop 
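
The %%timeit cells above need IPython. For anyone who wants to repeat a comparison in a plain script, here is a rough sketch with the standard-library timeit module. The function names with_numpy and with_merge are mine, only two of the options are covered, and the numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame({'cond': np.repeat([1, 2], 5),
                    'point': np.tile(np.arange(1, 6), 2),
                    'value1': rng.random(10),
                    'unused1': rng.random(10)})
df2 = pd.DataFrame({'cond': np.repeat([1, 2], 10),
                    'point': np.tile(np.arange(1, 6), 4),
                    'value2': rng.random(20)})


def with_numpy():
    # Option 2: reshape value1 into a (cond, point) grid and index it.
    a = df1['value1'].to_numpy().reshape(2, -1)
    return df2.assign(
        new_column=a[df2['cond'].to_numpy() - 1, df2['point'].to_numpy() - 1])


def with_merge():
    # Option 3: left merge against the trimmed, renamed df1.
    d1 = (df1[['cond', 'point', 'value1']]
          .rename(columns={'value1': 'new_column'}))
    return df2.merge(d1, how='left')


for fn in (with_numpy, with_merge):
    # Best of 3 repeats, 20 calls each, reported per call.
    best = min(timeit.repeat(fn, number=20, repeat=3)) / 20
    print(f'{fn.__name__}: {best * 1e6:.0f} us per loop')
```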

Thanks @jezrael. You too. – piRSquared

2

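You can use merge with a left join, drop to remove the unused1 column, and finally rename the column:

Note: the on parameter can be omitted if the only columns with the same names in both DataFrames are the join columns. If more column names are shared, add on=['cond', 'point'].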

df = pd.merge(df2, df1.drop('unused1', axis=1), 'left') 
df = df.rename(columns={'value1': 'new_column'}) 
print (df) 
    cond point value2 new_column 
0  1  1 0.990252 0.103046 
1  1  2 0.534813 0.188408 
2  1  3 0.407325 0.481349 
3  1  4 0.969288 0.807454 
4  1  5 0.085832 0.962032 
5  1  1 0.922026 0.103046 
6  1  2 0.567615 0.188408 
7  1  3 0.174402 0.481349 
8  1  4 0.469556 0.807454 
9  1  5 0.511182 0.962032 
10  2  1 0.219902 0.437961 
11  2  2 0.761498 0.026166 
12  2  3 0.406981 0.109630 
13  2  4 0.551322 0.415101 
14  2  5 0.727761 0.660748 
15  2  1 0.075048 0.437961 
16  2  2 0.159903 0.026166 
17  2  3 0.726013 0.109630 
18  2  4 0.848213 0.415101 
19  2  5 0.284404 0.660748 
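
To illustrate the note about the on parameter: the sketch below renames the value columns so that both frames share a third column name, which is a hypothetical situation not present in the question. The names left, right, implicit and explicit and the suffixes are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame({'cond': np.repeat([1, 2], 5),
                    'point': np.tile(np.arange(1, 6), 2),
                    'value1': rng.random(10)})
df2 = pd.DataFrame({'cond': np.repeat([1, 2], 10),
                    'point': np.tile(np.arange(1, 6), 4),
                    'value2': rng.random(20)})

# Hypothetical clash: both frames now have a column called 'value'.
left = df2.rename(columns={'value2': 'value'})
right = df1.rename(columns={'value1': 'value'})

# Without on, merge joins on ALL shared names: cond, point and value.
# Nothing new is brought over because every column is a join key.
implicit = left.merge(right, how='left')

# With on, only cond and point are keys; the clashing column gets suffixes.
explicit = left.merge(right, how='left', on=['cond', 'point'],
                      suffixes=('_df2', '_df1'))
```

Here implicit has only the columns cond, point and value, while explicit has value_df2 and value_df1 side by side.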

Another solution uses join (a left join by default) together with set_index + drop:

df = df2.join(df1.drop('unused1', axis=1).set_index(['cond', 'point']), on=['cond', 'point']) 
df = df.rename(columns={'value1': 'new_column'}) 
print (df) 
    cond point value2 new_column 
0  1  1 0.990252 0.103046 
1  1  2 0.534813 0.188408 
2  1  3 0.407325 0.481349 
3  1  4 0.969288 0.807454 
4  1  5 0.085832 0.962032 
5  1  1 0.922026 0.103046 
6  1  2 0.567615 0.188408 
7  1  3 0.174402 0.481349 
8  1  4 0.469556 0.807454 
9  1  5 0.511182 0.962032 
10  2  1 0.219902 0.437961 
11  2  2 0.761498 0.026166 
12  2  3 0.406981 0.109630 
13  2  4 0.551322 0.415101 
14  2  5 0.727761 0.660748 
15  2  1 0.075048 0.437961 
16  2  2 0.159903 0.026166 
17  2  3 0.726013 0.109630 
18  2  4 0.848213 0.415101 
19  2  5 0.284404 0.660748