2017-04-10 60 views

Extract numbers from a sentence and compute the average.

I have a sheet with the following data:

team1,team2,outcome 
AA,BB,BB won by 90 runs 
AA,CC,AA won by 19 runs (D/L method) 
CC,BB,CC won by 26 runs (D/L method) 
AA,BB,BB won by 56 runs 
CC,BB,CC won by 18 runs 

I need to pull out the numeric values and compute their average, grouped by team1 and team2.

This is what I have so far. There is a lot of junk data, so I keep only the records whose outcome mentions 'runs':

df[df['outcome'].str.contains('runs',na=False)].head() 

The result I want:

team1 , team2 , AVG(NUMERIC COLUMN FROM 'OUTCOME') 

Please advise!

Answer


You can use str.extract and cast to int first, then groupby and aggregate with mean:

# raw string so that \d is not treated as a string escape
df.outcome = df.outcome.str.extract(r'(\d+)', expand=False).astype(int)
print (df.groupby(['team1','team2'], as_index=False)['outcome'].mean()) 
    team1 team2 outcome 
0 AA BB  73 
1 AA CC  19 
2 CC BB  22 

A similar solution, which leaves the original column untouched:

# raw string so that \d is not treated as a string escape
s = df.outcome.str.extract(r'(\d+)', expand=False).astype(int)
print (s.groupby([df['team1'],df['team2']]).mean().reset_index()) 
    team1 team2 outcome 
0 AA BB  73 
1 AA CC  19 
2 CC BB  22 
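
For reference, here is a self-contained sketch that combines the filter from the question with the extract/groupby approach above. It rebuilds the sample data inline. The extra "No result" row and the new column name runs are assumptions added for illustration, not part of the original data. The row is there because astype(int) would fail on the missing value that extract returns for a row without digits, so filtering first matters.

```python
import io

import pandas as pd

# Sample data from the question, plus one hypothetical junk row ("No result")
csv_text = """team1,team2,outcome
AA,BB,BB won by 90 runs
AA,CC,AA won by 19 runs (D/L method)
CC,BB,CC won by 26 runs (D/L method)
AA,BB,BB won by 56 runs
CC,BB,CC won by 18 runs
AA,CC,No result
"""
df = pd.read_csv(io.StringIO(csv_text))

# Keep only rows whose outcome mentions 'runs'; na=False drops missing outcomes
runs_only = df[df['outcome'].str.contains('runs', na=False)].copy()

# Extract the first run of digits and cast it to int ('runs' is an assumed column name)
runs_only['runs'] = runs_only['outcome'].str.extract(r'(\d+)', expand=False).astype(int)

# Average the extracted margin per (team1, team2) pair
result = runs_only.groupby(['team1', 'team2'], as_index=False)['runs'].mean()
print(result)
```

With the sample rows, the averages work out to 73 for AA/BB, 19 for AA/CC and 22 for CC/BB. Depending on the pandas version, they may be printed as floats.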

Thanks, I will try this. Could you tell me what expand=False means? – ANI


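It is only there to avoid the warning: 'FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame) but in a future version of pandas this will be changed to expand=True (return DataFrame)' – jezrael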


Great, it worked, thanks a lot :) – ANI
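
To illustrate the expand question from the comments, here is a small sketch of how the return type differs when the pattern has a single capture group. The two sample strings are taken from the question's data. The variable names are only illustrative.

```python
import pandas as pd

s = pd.Series(['BB won by 90 runs', 'AA won by 19 runs (D/L method)'])

# expand=False with one capture group returns a Series
as_series = s.str.extract(r'(\d+)', expand=False)

# expand=True always returns a DataFrame, here with a single column named 0
as_frame = s.str.extract(r'(\d+)', expand=True)

print(type(as_series).__name__)
print(type(as_frame).__name__)
```

Because the answer calls groupby on the extracted values directly, the Series form from expand=False is the convenient one there.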