2016-05-13 98 views
0

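Conditional merge of multiple other pandas DataFrames

I have four pandas DataFrames (A, B, C and D). A has a series of timestamps and a column that refers to one of the other DataFrames: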

A 

Timestamp Source 
----------- ------ 
2012-4-3  B 
2013-12-20 C 
2012-3-5  C 
2014-12-7 D 
2012-7-10 B 
... 

The other DataFrames hold more data:

B 

Timestamp Foo Bar 
----------- ---- ---- 
2012-1-1 1.5 1.3 
2012-1-2 2.3 5.6 
2012-1-3 3.4 3.3 
... 
2014-3-31 0.8 2.1 

C 

Timestamp Foo Bar 
----------- ---- ---- 
2012-1-1 9.2 5.6 
2012-1-2 4.8 7.6 
2012-1-3 2.7 6.4 
... 
2014-3-31 7.0 6.5 

D 

Timestamp Foo Bar 
----------- ---- ---- 
2012-1-1 6.8 4.2 
2012-1-2 4.2 9.3 
2012-1-3 5.5 0.7 
... 
2014-3-31 6.3 2.0 

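I want to construct a single DataFrame from A in which the Foo and Bar values come from the DataFrame named in A's Source column.

Not all timestamps in A appear in the other three DataFrames; in those cases I want Foo and Bar to be np.nan. Not all timestamps in B, C and D appear in A, and those should not appear in the final DataFrame.

My current approach is to iterate over each row of A and pull the values from the corresponding Source DataFrame: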

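import numpy as np

srcs = {'B': B, 'C': C, 'D': D}
A['Foo'] = np.nan
A['Bar'] = np.nan

for i in range(len(A)):
    ts = A['Timestamp'].iloc[i]
    src = srcs[A['Source'].iloc[i]]
    match = src[src.Timestamp == ts]  # rows of the source with this timestamp
    if len(match) > 0:
        # write through A.loc; assigning to A.iloc[i].Foo only changes a copy
        A.loc[A.index[i], 'Foo'] = match.Foo.iloc[0]
        A.loc[A.index[i], 'Bar'] = match.Bar.iloc[0]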

There must be a more efficient, more pandas-idiomatic way to do this (?)

+0

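Hmm, one approach would be to add a Source column to each df, set to B, C and D respectively, and then merge everything on both timestamp and source; not sure how messy that would get – EdChum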

+0

Wouldn't that result in a DF with 6 separate columns (e.g. 'Foo_x', 'Bar_x', 'Foo_y', 'Bar_y', 'Foo', 'Bar')? How would I then combine them into two columns ('Foo' and 'Bar')? –

Answers

2

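It looks like you could do this with a MultiIndex. Your index would consist of the timestamp and the source. You can build it with the set_index method on the DataFrame.

Below is some code that creates some fake DataFrames, each with a MultiIndex.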

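# Imports for creating fake data
from random import random
from random import choice

import numpy as np
import pandas as pd

# Names of the source DataFrames that A's Source column can refer to
others = ['B', 'C', 'D']

# Setup the sample data
A = pd.DataFrame({'TimeStamp':range(20), 'Source':[choice(others) for i in range(20)]})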
# Create the MultiIndex on A 
A.set_index(['TimeStamp', 'Source'], inplace=True) 
A['Bar'] = [np.nan] * len(A) 
A['Foo'] = [np.nan] * len(A) 

B = pd.DataFrame({'TimeStamp':range(5), 
        'Foo':[random()*5+5 for i in range(5)], 
        'Bar':[random()*5+5 for i in range(5)]}) 
C = pd.DataFrame({'TimeStamp':range(5,10), 
        'Foo':[random()*5+5 for i in range(5)], 
        'Bar':[random()*5+5 for i in range(5)]}) 
D = pd.DataFrame({'TimeStamp':range(10,15), 
        'Foo':[random()*5+5 for i in range(5)], 
        'Bar':[random()*5+5 for i in range(5)]}) 

sources = {'B':B, 'C':C, 'D':D} 

# create the MultiIndex on the Source data sets 
for s, df in sources.items(): 
    df['Source'] = [s]*len(df) 
    df.set_index(['TimeStamp', 'Source'], inplace=True) 

Now you can index the source data sets (B, C and D) using A's index.

for s, df in sources.items():

    # The source data set reindexed by A's index; rows whose label is
    # missing from df are filled with NaN. (reindex is used because
    # df.loc[A.index] raises a KeyError on missing labels in pandas 1.0
    # and later.)
    temp = df.reindex(A.index)
    temp = temp.dropna() # dropping the NaN rows leaves you with
                         # only the values in df matching the index in A
    if len(temp) > 0:
        # now assign the data to A, selecting the columns by name
        A.loc[temp.index, ['Foo', 'Bar']] = temp[['Foo', 'Bar']].to_numpy()

print(A)

The result looks like this:

                       Bar       Foo
TimeStamp Source
0         D            NaN       NaN
1         C            NaN       NaN
2         D            NaN       NaN
3         B       7.927154  8.581380
4         B       7.638422  5.970348
5         D            NaN       NaN
6         C       6.938001  6.417248
7         B            NaN       NaN
8         C       5.131940  9.144621
9         B            NaN       NaN
10        D       9.186963  5.991877
11        D       8.070543  7.735040
12        C            NaN       NaN
13        B            NaN       NaN
14        C            NaN       NaN
15        D            NaN       NaN
16        C            NaN       NaN
17        C            NaN       NaN
18        C            NaN       NaN
19        B            NaN       NaN
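
The per-source loop can probably also be collapsed into a single lookup: stack the sources into one frame keyed by (Source, TimeStamp) and reindex it by the pairs that A asks for. The following is only a minimal sketch under assumptions; the miniature frames and the names `stacked`, `wanted` and `result` are made up for illustration and are not part of the answer above.

```python
import pandas as pd

# Hypothetical miniature data; integer timestamps as in the fake data above.
A = pd.DataFrame({'TimeStamp': [0, 1, 2, 3], 'Source': ['B', 'C', 'B', 'D']})
B = pd.DataFrame({'TimeStamp': [0, 1], 'Foo': [1.0, 2.0], 'Bar': [10.0, 20.0]})
C = pd.DataFrame({'TimeStamp': [1, 2], 'Foo': [3.0, 4.0], 'Bar': [30.0, 40.0]})
D = pd.DataFrame({'TimeStamp': [5], 'Foo': [5.0], 'Bar': [50.0]})

# Stack the sources; `keys` labels every row with the name of its source,
# giving a MultiIndex with the levels (Source, TimeStamp).
stacked = pd.concat(
    [df.set_index('TimeStamp') for df in (B, C, D)],
    keys=['B', 'C', 'D'],
    names=['Source', 'TimeStamp'],
)

# The (Source, TimeStamp) pairs requested by A, in A's row order.
wanted = pd.MultiIndex.from_frame(A[['Source', 'TimeStamp']])

# reindex returns one row per requested pair; pairs that do not exist
# in `stacked` come back as NaN rows.
result = stacked.reindex(wanted).reset_index()
print(result)
```

This relies on each (Source, TimeStamp) pair being unique in the stacked frame, since reindex does not accept a duplicated index.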
1

Setup

import pandas as pd
from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`

texta = """Timestamp Source 
2012-4-3  B 
2012-4-2  B 
2013-12-20 C 
2012-3-5  C 
2014-12-7 D 
2012-7-10 B""" 

A = pd.read_csv(StringIO(texta), sep=r'\s+', parse_dates=[0])

textb = """Timestamp Foo Bar 
2012-1-1 1.5 1.3 
2012-4-3 3.1 4.1 
2012-1-2 2.3 5.6 
2012-1-3 3.4 3.3 
2014-3-31 0.8 2.1""" 

B = pd.read_csv(StringIO(textb), sep=r'\s+', parse_dates=[0])

textc = """Timestamp Foo Bar 
2012-1-1 9.2 5.6 
2012-3-5 4.8 7.6 
2012-1-2 4.8 7.6 
2012-1-3 2.7 6.4 
2014-3-31 7.0 6.5""" 

C = pd.read_csv(StringIO(textc), sep=r'\s+', parse_dates=[0])

textd = """Timestamp Foo Bar 
2012-1-1 6.8 4.2 
2012-1-2 4.2 9.3 
2012-1-3 5.5 0.7 
2014-3-31 6.3 2.0""" 

D = pd.read_csv(StringIO(textd), sep=r'\s+', parse_dates=[0])

Then I just pd.concat B, C and D:

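# keys= labels the rows of each frame with the name of its source
bdf = pd.concat([B, C, D], keys=['B', 'C', 'D'])
# drop the original row numbers, keeping only the source label as the index
bdf.reset_index(level=1, inplace=True, drop=True)
bdf.index.name = 'Source'
# turn the source label into a regular column
bdf.reset_index(inplace=True)

print(bdf)

Combined, it looks like this: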

   Source  Timestamp  Foo  Bar
0       B 2012-01-01  1.5  1.3
1       B 2012-04-03  3.1  4.1
2       B 2012-01-02  2.3  5.6
3       B 2012-01-03  3.4  3.3
4       B 2014-03-31  0.8  2.1
5       C 2012-01-01  9.2  5.6
6       C 2012-03-05  4.8  7.6
7       C 2012-01-02  4.8  7.6
8       C 2012-01-03  2.7  6.4
9       C 2014-03-31  7.0  6.5
10      D 2012-01-01  6.8  4.2
11      D 2012-01-02  4.2  9.3
12      D 2012-01-03  5.5  0.7
13      D 2014-03-31  6.3  2.0
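
A stacked frame like this could also be built by tagging each frame with its own name before concatenating, which is close to what the comment under the question suggests. This is a sketch only; the miniature frames and the `srcs` dictionary are assumptions made for illustration, not the sample data used in this answer.

```python
import pandas as pd

# Hypothetical miniature frames standing in for B, C and D.
B = pd.DataFrame({'Timestamp': pd.to_datetime(['2012-01-01', '2012-04-03']),
                  'Foo': [1.5, 3.1], 'Bar': [1.3, 4.1]})
C = pd.DataFrame({'Timestamp': pd.to_datetime(['2012-03-05']),
                  'Foo': [4.8], 'Bar': [7.6]})
D = pd.DataFrame({'Timestamp': pd.to_datetime(['2012-01-01']),
                  'Foo': [6.8], 'Bar': [4.2]})
srcs = {'B': B, 'C': C, 'D': D}

# assign() returns a copy with the extra column, so B, C and D are not modified.
bdf = pd.concat(
    [df.assign(Source=name) for name, df in srcs.items()],
    ignore_index=True,  # renumber the rows 0..n-1
)

# True if any (Source, Timestamp) pair occurs more than once; such
# duplicates would multiply rows in a later left merge.
has_duplicates = bdf.duplicated(['Source', 'Timestamp']).any()
print(bdf)
print(has_duplicates)
```

Here the Source column ends up last rather than first; the column order does not affect a merge on column names.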

Finally

A simple merge:

A.merge(bdf, how='left') 

looks like this:

   Timestamp Source  Foo  Bar
0 2012-04-03      B  3.1  4.1
1 2012-04-02      B  NaN  NaN
2 2013-12-20      C  NaN  NaN
3 2012-03-05      C  4.8  7.6
4 2014-12-07      D  NaN  NaN
5 2012-07-10      B  NaN  NaN
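
The merge above joins on every column the two frames share, which here is Timestamp and Source. If you prefer to spell that out and guard against surprises, something like the following sketch may help. The miniature `A` and `bdf` are hypothetical stand-ins, and `on`, `validate` and `indicator` are optional `DataFrame.merge` parameters that the original call does not use.

```python
import pandas as pd

# Hypothetical miniature data, not the full sample from this answer.
A = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2012-04-03', '2012-04-02', '2012-03-05']),
    'Source': ['B', 'B', 'C'],
})
bdf = pd.DataFrame({
    'Source': ['B', 'C', 'C'],
    'Timestamp': pd.to_datetime(['2012-04-03', '2012-03-05', '2012-04-03']),
    'Foo': [3.1, 4.8, 9.9],
    'Bar': [4.1, 7.6, 9.9],
})

merged = A.merge(
    bdf,
    how='left',                  # keep every row of A, in A's order
    on=['Timestamp', 'Source'],  # name the join keys instead of inferring them
    validate='many_to_one',      # raise if bdf repeats a (Timestamp, Source) pair
    indicator=True,              # adds a `_merge` column: 'both' or 'left_only'
)
print(merged)
```

Rows marked `left_only` in the `_merge` column are the timestamps of A that the named source does not contain, which is where Foo and Bar stay NaN.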