2017-05-18
1

Pandas DataFrame: returning a datetime column in an np.where statement

My initial DataFrame (df):

 column1  column2 column3 column4 
0 criteria_1 criteria_a 1/5/2017  5 
1 criteria_1 criteria_b 2/3/2017  3 
2 criteria_1 criteria_a 1/10/2017  10 
3 criteria_1 criteria_b 2/7/2017  7 
4 criteria_1 criteria_b 2/11/2017  11 
5 criteria_1 criteria_a 1/13/2017  13  

My code:

import numpy as np
import pandas as pd

df = pd.read_csv("C:/Users/Desktop/maxtest.csv")
df['column3'] = pd.to_datetime(df['column3'])
df['max_column3'] = df.groupby(['column1','column2'])['column3'].transform('max')
df['max_column4'] = df.groupby(['column1','column2'])['column4'].transform('max')
# This line raises the TypeError: one branch is datetime64, the other is int
df['test'] = np.where(df['column3'] < df['max_column3'], df['column3'], df['max_column4'])

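Problem:

I am creating a df['test'] column and want it to return df['column3'] when the np.where condition is True. When I try this, I get a "TypeError: invalid type promotion" error.

I'm not entirely sure what is causing the error.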

+2

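I think the problem is that you are mixing result types in np.where. Sometimes it returns a datetime, and other times it returns a str or an int. Pandas DataFrames and NumPy ndarrays need a single dtype per column. I was able to resolve the error by calling .astype(str) on df.column3.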
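To make the comment concrete, here is a minimal sketch on a two-row frame made up for illustration (it is not the asker's CSV). It assumes that NumPy refuses to find a common dtype for datetime64 and int64, which is what the reported error suggests; the `try` block only records whether that happens rather than relying on it. Casting the datetime branch to str lets both branches share one object-dtype result.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame; column names follow the question.
df = pd.DataFrame({
    "column3": pd.to_datetime(["2017-01-05", "2017-01-13"]),
    "max_column3": pd.to_datetime(["2017-01-13", "2017-01-13"]),
    "max_column4": [13, 13],
})
mask = df["column3"] < df["max_column3"]

# Mixing a datetime64 branch with an int64 branch: record whether NumPy rejects it.
try:
    np.where(mask, df["column3"], df["max_column4"])
    promotion_failed = False
except TypeError:
    promotion_failed = True

# Casting the datetime branch to str gives both branches a common (object) dtype.
as_text = np.where(mask, df["column3"].astype(str), df["max_column4"])
```

Note that the dates in `as_text` are plain strings after this cast, not datetimes.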

Answers

0

See my comment above for the explanation.

df['column3'] = pd.to_datetime(df['column3'])
df['max_column3'] = df.groupby(['column1','column2'])['column3'].transform('max')
df['max_column4'] = df.groupby(['column1','column2'])['column4'].transform('max')
# Cast the datetime branch to str so both branches can share one object-dtype result
df['test'] = np.where(df['column3'] < df['max_column3'], df['column3'].astype(str), df['max_column4'])

Output:

 column1  column2 column3 column4 max_column3 max_column4 \ 
0 criteria_1 criteria_a 2017-01-05  5 2017-01-13   13 
1 criteria_1 criteria_b 2017-02-03  3 2017-02-11   11 
2 criteria_1 criteria_a 2017-01-10  10 2017-01-13   13 
3 criteria_1 criteria_b 2017-02-07  7 2017-02-11   11 
4 criteria_1 criteria_b 2017-02-11  11 2017-02-11   11 
5 criteria_1 criteria_a 2017-01-13  13 2017-01-13   13 

     test 
0 2017-01-05 
1 2017-02-03 
2 2017-01-10 
3 2017-02-07 
4   11 
5   13 
0

If you want to keep the values as datetimes rather than strings, you can do this:

df['test'] = df.apply(lambda x: x.column3 if x.column3 < x.max_column3 else x.max_column4, axis=1) 

df 
Out[1291]: 
     column1  column2 column3 column4 max_column3 max_column4 \ 
0 criteria_1 criteria_a 2017-01-05  5 2017-01-13   13 
1 criteria_1 criteria_b 2017-02-03  3 2017-02-11   11 
2 criteria_1 criteria_a 2017-01-10  10 2017-01-13   13 
3 criteria_1 criteria_b 2017-02-07  7 2017-02-11   11 
4 criteria_1 criteria_b 2017-02-11  11 2017-02-11   11 
5 criteria_1 criteria_a 2017-01-13  13 2017-01-13   13 

        test 
0 2017-01-05 00:00:00 
1 2017-02-03 00:00:00 
2 2017-01-10 00:00:00 
3 2017-02-07 00:00:00 
4     11 
5     13 
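A row-wise apply can be slow on large frames. As a possible vectorized alternative (a sketch, not part of the original answer), the datetime column can be cast to object dtype first, so that Series.where can mix Timestamp objects and integers in one column. The two-row frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical two-row frame; column names follow the question.
df = pd.DataFrame({
    "column3": pd.to_datetime(["2017-01-05", "2017-01-13"]),
    "max_column3": pd.to_datetime(["2017-01-13", "2017-01-13"]),
    "max_column4": [13, 13],
})
mask = df["column3"] < df["max_column3"]

# Keep column3 where the condition holds, otherwise take max_column4.
# The object cast keeps the dates as Timestamp objects instead of strings.
df["test"] = df["column3"].astype(object).where(mask, df["max_column4"])
```

The resulting column has object dtype, so datetime accessors such as `.dt` are not available on it as a whole.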
0

I ended up using a regular function instead:

import pandas as pd
import numpy as np

df = pd.read_csv("C:/Users/andre_000/Desktop/maxtest.csv")
df['column3'] = pd.to_datetime(df['column3'])
df['max_column3'] = df.groupby(['column1','column2'])['column3'].transform('max')
df['max_column4'] = df.groupby(['column1','column2'])['column4'].transform('max')


def func(row):
    # Return the row's own date if it is earlier than the group maximum,
    # otherwise fall back to the group's maximum of column4.
    if row['column3'] < row['max_column3']:
        return row['column3']
    else:
        return row['max_column4']


df = df.assign(test=df.apply(func, axis=1))
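For anyone who wants to try this without the CSV file, the sketch below inlines the six sample rows from the question. The `io.StringIO` wrapper and the explicit `format="%m/%d/%Y"` are assumptions added here (the sample dates appear to be month-first); the rest follows the approach above.

```python
import io
import pandas as pd

# Sample rows from the question, inlined so that no file on disk is needed.
csv_text = """column1,column2,column3,column4
criteria_1,criteria_a,1/5/2017,5
criteria_1,criteria_b,2/3/2017,3
criteria_1,criteria_a,1/10/2017,10
criteria_1,criteria_b,2/7/2017,7
criteria_1,criteria_b,2/11/2017,11
criteria_1,criteria_a,1/13/2017,13
"""

df = pd.read_csv(io.StringIO(csv_text))
# Assumption: the dates are month/day/year.
df["column3"] = pd.to_datetime(df["column3"], format="%m/%d/%Y")

# Broadcast each group's maximum back onto every row of that group.
grouped = df.groupby(["column1", "column2"])
df["max_column3"] = grouped["column3"].transform("max")
df["max_column4"] = grouped["column4"].transform("max")


def func(row):
    # The row's own date if it is earlier than the group maximum,
    # otherwise the group's maximum of column4.
    if row["column3"] < row["max_column3"]:
        return row["column3"]
    return row["max_column4"]


df = df.assign(test=df.apply(func, axis=1))
```

Rows whose date equals the group maximum receive the integer from max_column4, so the test column ends up holding a mix of Timestamp objects and integers.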