2017-03-16

Pandas DataFrame: checking specific columns for repeated data

I have a pandas DataFrame that looks like this:

import numpy as np
import pandas as pd

raw_data = {'Series_Date': ['2017-03-10', '2017-03-13', '2017-03-14', '2017-03-15'],
            'SP': [35.6, 56.7, 41, 41],
            '1M': [-7.8, 56, 56, -3.4],
            '3M': [24, -31, 53, 5]}
df = pd.DataFrame(raw_data, columns=['Series_Date', 'SP', '1M', '3M'])
print(df)

I want to run a test on only certain columns of this DataFrame, namely the column names in this collection:

check = {'1M', 'SP'}
print(check)

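For these columns, I want to know whether each value is the same as the previous day's value. The output DataFrame should contain the series date and a comment, for example (in this case):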

output_data = {'Series_Date': ['2017-03-14', '2017-03-15'],
               'Comment': ["Value for 1M data is same as previous day",
                           "Value for SP data is same as previous day"]}
output_data_df = pd.DataFrame(output_data, columns=['Series_Date', 'Comment'])
print(output_data_df)

Could you please give me some help on how to approach this?

Answers

0

I'm not sure this is the cleanest way, but it works:

check = {'1M', 'SP'}
# Last value seen for each checked column (None before the first row).
prev_dict = {c: None for c in check}

def check_prev_value(row):
    # Collect one message per repeating column, so that two columns
    # repeating on the same day are both reported.
    msgs = []
    for column in sorted(check):
        if row[column] == prev_dict[column]:
            msgs.append('Value for %s data is same as previous day' % column)
        prev_dict[column] = row[column]
    return '; '.join(msgs)

df['comment'] = df.apply(check_prev_value, axis=1)

# Keep only the rows that produced a comment.
output_data_df = df[df['comment'] != ""]
output_data_df = output_data_df[["Series_Date", "comment"]].reset_index(drop=True)

Your input:

  Series_Date    SP    1M  3M
0  2017-03-10  35.6  -7.8  24
1  2017-03-13  56.7  56.0 -31
2  2017-03-14  41.0  56.0  53
3  2017-03-15  41.0  -3.4   5

The output is:

  Series_Date                                    comment
0  2017-03-14  Value for 1M data is same as previous day
1  2017-03-15  Value for SP data is same as previous day
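For comparison, the same check can probably be done without a row-wise apply or any global state. The following is only a sketch under the assumption that "previous day" means "previous row"; the names same_as_prev and vectorized_output_df are made up for this example and do not come from the answer above.

```python
import pandas as pd

raw_data = {
    'Series_Date': ['2017-03-10', '2017-03-13', '2017-03-14', '2017-03-15'],
    'SP': [35.6, 56.7, 41, 41],
    '1M': [-7.8, 56, 56, -3.4],
    '3M': [24, -31, 53, 5],
}
df = pd.DataFrame(raw_data, columns=['Series_Date', 'SP', '1M', '3M'])

check = ['1M', 'SP']  # a list here, so the column order is deterministic

# True where a value equals the value in the previous row of the same column.
# The first row is compared against NaN, which is never equal.
same_as_prev = df[check].eq(df[check].shift())

# One output row per (date, column) pair that repeats.
rows = [
    (date, 'Value for %s data is same as previous day' % col)
    for col in check
    for date in df.loc[same_as_prev[col], 'Series_Date']
]
vectorized_output_df = (
    pd.DataFrame(rows, columns=['Series_Date', 'Comment'])
    .sort_values('Series_Date')
    .reset_index(drop=True)
)
print(vectorized_output_df)
```

Because nothing is stored between calls, this version can be re-run on the same DataFrame without resetting anything first.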

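Thanks, but what if I want to check other columns, such as SP only, or SP and 3M? I want the columns to be tested according to whatever is in the 'check' list. – sg91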


I updated the code. It now looks at the columns that appear in the list. – AndreyF

0

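The following does more or less what you want. For each item in check, a column item_ok is added to the original DataFrame, indicating whether the value is the same as on the previous day: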

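from datetime import timedelta

# Gap between each row's date and the previous row's date.
df['Date_diff'] = pd.to_datetime(df['Series_Date']).diff()
for item in check:
    # True when the value is unchanged and the previous row is exactly one day earlier.
    df[item + '_ok'] = (df[item].diff() == 0) & (df['Date_diff'] == timedelta(1))
# Keep the rows where at least one checked column repeats.
df_output = df.loc[df[[item + '_ok' for item in check]].any(axis=1)]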
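One thing worth noting about the Date_diff condition: it only accepts rows that are exactly one calendar day after the previous row, so a value repeated across a weekend (Friday 2017-03-10 to Monday 2017-03-13, say) would not be flagged. A small sketch of that behaviour follows; the frame demo and the names same_value, one_day_gap, strict and relaxed are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical data: 'SP' repeats across a weekend and again on the next day.
demo = pd.DataFrame({
    'Series_Date': ['2017-03-10', '2017-03-13', '2017-03-14'],
    'SP': [41.0, 41.0, 41.0],
})

date_diff = pd.to_datetime(demo['Series_Date']).diff()
same_value = demo['SP'].diff() == 0
one_day_gap = date_diff == pd.Timedelta(days=1)

# Friday -> Monday is a 3-day gap, so only the last row passes both tests.
strict = (same_value & one_day_gap).tolist()
# Comparing against the previous row only, regardless of the gap.
relaxed = same_value.tolist()
print(strict)   # [False, False, True]
print(relaxed)  # [False, True, True]
```

Whether the strict or the relaxed reading is right depends on what "previous day" should mean for the data; the question does not say.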
0

Reference: this answer

cols = ['1M','SP'] 
for col in cols: 
    df[col + '_dup'] = df[col].groupby((df[col] != df[col].shift()).cumsum()).cumcount() 

The output columns contain an integer greater than zero wherever a repeat is found.

df: 

  Series_Date    SP    1M  3M  1M_dup  SP_dup
0  2017-03-10  35.6  -7.8  24       0       0
1  2017-03-13  56.7  56.0 -31       0       0
2  2017-03-14  41.0  56.0  53       1       0
3  2017-03-15  41.0  -3.4   5       0       1

Slice to find the dups:

col = 'SP' 
dup_df = df[df[col + '_dup'] > 0][['Series_Date', col + '_dup']] 

dup_df: 

  Series_Date  SP_dup
3  2017-03-15       1
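In case the groupby/cumsum/cumcount one-liner is hard to read, here is a step-by-step sketch of what it does, using the 1M values from the question. The intermediate names changed, run_id and position are made up for this breakdown.

```python
import pandas as pd

# The 1M column from the question.
s = pd.Series([-7.8, 56.0, 56.0, -3.4], name='1M')

# True at the start of each run of equal values (the first row is compared to NaN).
changed = s != s.shift()
# A running count of run starts gives every row in the same run the same label.
run_id = changed.cumsum()
# Position inside the run: 0 for the first row of a run, then 1, 2, ...
position = s.groupby(run_id).cumcount()

print(changed.tolist())   # [True, True, False, True]
print(run_id.tolist())    # [1, 2, 2, 3]
print(position.tolist())  # [0, 0, 1, 0]
```

A position greater than zero therefore means "same value as the row before", and a longer run of identical values would count up as 1, 2, 3 and so on.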

Here is a function version of the above (with the added ability to handle multiple columns):

import pandas as pd
import numpy as np

def find_repeats(df, col_list, date_col='Series_Date'):
    dummy_df = df[[date_col, *col_list]].copy()
    dates = dummy_df[date_col]
    date_series = []
    code_series = []
    if len(col_list) > 1:
        for col in col_list:
            # Position of each row within its run of consecutive equal values.
            these_repeats = df[col].groupby((df[col] != df[col].shift()).cumsum()).cumcount().values
            repeat_idx = list(np.where(these_repeats > 0)[0])
            date_arr = dates.iloc[repeat_idx]
            code_arr = [col] * len(date_arr)
            date_series.extend(list(date_arr))
            code_series.extend(code_arr)
        # One row per (date, column) pair that repeats.
        return pd.DataFrame({date_col: date_series, 'col_dup': code_series}).sort_values(date_col).reset_index(drop=True)
    else:
        col = col_list[0]
        dummy_df[col + '_dup'] = df[col].groupby((df[col] != df[col].shift()).cumsum()).cumcount()
        return dummy_df[dummy_df[col + '_dup'] > 0].reset_index(drop=True)

find_repeats(df, ['1M']) 

  Series_Date    1M  1M_dup
0  2017-03-14  56.0       1

find_repeats(df, ['1M', 'SP']) 

  Series_Date col_dup
0  2017-03-14      1M
1  2017-03-15      SP

Here is another way, using pandas diff:

def find_repeats(df, col_list, date_col='Series_Date'):
    code_list = []
    dates = []

    for col in col_list:
        # A diff of exactly 0 means the value equals the one in the previous row.
        these_dates = df[date_col].iloc[np.where(df[col].diff().values == 0)[0]].values
        code_arr = [col] * len(these_dates)
        dates.extend(list(these_dates))
        code_list.extend(code_arr)
    return pd.DataFrame({date_col: dates, 'val_repeat': code_list}).sort_values(date_col).reset_index(drop=True)
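To get from that result to the comment text asked for in the question, something like the following should do. It is only a sketch: the frame repeats below is a hand-written stand-in for what find_repeats(df, ['1M', 'SP']) is expected to return on the sample data, so that the snippet runs on its own.

```python
import pandas as pd

# Stand-in for the frame returned by find_repeats(df, ['1M', 'SP']) on the sample data.
repeats = pd.DataFrame({
    'Series_Date': ['2017-03-14', '2017-03-15'],
    'val_repeat': ['1M', 'SP'],
})

# Build the comment text from the column name stored in 'val_repeat'.
output_data_df = pd.DataFrame({
    'Series_Date': repeats['Series_Date'],
    'Comment': 'Value for ' + repeats['val_repeat'] + ' data is same as previous day',
})
print(output_data_df)
```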