2017-01-07 17 views
1

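Finding outliers in the data

I am trying to find outliers in a seconds column using the standard deviation. I have two dataframes, shown below. The outliers I am looking for are values more than 1.5 standard deviations away from the mean for that day of the week. My current code is below the dataframes.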

DF1:

name dateTime    Seconds 
joe  2015-02-04 12:12:12 54321.0202 
john 2015-01-02 13:13:13 12345.0101 
joe  2015-02-04 12:12:12 54321.0202 
john 2015-01-02 13:13:13 12345.0101 
joe  2015-02-04 12:12:12 54321.0202 
john 2015-01-02 13:13:13 12345.0101 
joe  2015-02-04 12:12:12 54321.0202 
john 2015-01-02 13:13:13 12345.0101 
joe  2015-02-04 12:12:12 54321.0202 
john 2015-01-02 13:13:13 12345.0101 
joe  2015-02-04 12:12:12 54321.0202 
joe  2015-01-02 13:13:13 12345.0101 

Current output: DF2

name day standardDev  mean   count 
Joe mon 22326.502700  40900.730647 1886 
     tue 9687.486726  51166.213836 159 
john mon 10072.707891  41380.035108 883 
     tue 5499.475345  26985.938776 196 

Expected output:

DF2

name day standardDev  mean   count  events 
Joe mon 22326.502700  40900.730647 1886  [2015-02-04 12:12:12, 2015-02-04 12:12:13] 
     tue 9687.486726  51166.213836 159  [2015-02-04 12:12:12, 2015-02-04 12:12:14] 
john mon 10072.707891  41380.035108 883  [2015-01-02 13:13:13, 2015-01-02 13:13:15] 
     tue 5499.475345  26985.938776 196  [2015-01-02 13:13:13, 2015-01-02 13:13:18] 

CODE:

import glob

import numpy as np
import pandas as pd

allFiles = glob.glob(folderPath + "/*.csv")
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, names=['EventTime', "IpAddress", "Hostname", "TargetUserName", "AuthenticationPackageName", "TargetDomainName", "EventReceivedTime"])
    df = df.iloc[1:]  # skip the first row of each file (.ix has been removed from pandas)
    list_.append(df)
df = pd.concat(list_)
df['DateTime'] = pd.to_datetime(df['EventTime'])
df['day_of_week'] = df.DateTime.dt.strftime('%a')
# seconds elapsed since midnight for each event
df['seconds'] = pd.to_timedelta(df.DateTime.dt.time.astype(str)).dt.seconds
# named aggregation replaces the nested-dict form, which newer pandas versions reject;
# np.std defaults to the population standard deviation (ddof=0)
print(df.groupby(['TargetUserName', 'day_of_week'])['seconds'].agg(mean='mean', std=lambda x: np.std(x), count='count'))
+0

Maybe `df1[df1.groupby(pd.DatetimeIndex(df.dateTime).dayofweek)['Seconds'].apply(lambda x: x > (1.5 * x.std() + x.mean()))]`? – Abdou

+0

What exactly do you mean by "I am not sure how to get to the expected output"? – Amjad

+0

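I am trying to figure out how to add the events column and track every event that is more than 1.5 standard deviations above or below the mean. Ideally, I would like any row that falls outside that time range to be added, with its full data, to the events column as a list of events. – johnnyb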

Answers

1

This is slightly adapted from the pandas docs. I did not create columns for the mean & std, but you can easily add them if you want to see them.

import numpy as np
import pandas as pd

np.random.seed(1111)
df = pd.DataFrame({'name': ['joe', 'john'] * 30,
                   'dateTime': pd.date_range('1-1-2015', periods=60),
                   'Seconds': np.random.randn(60) + 5000.})

grp = df.groupby(['name',df.dateTime.dt.dayofweek])['Seconds'] 
df['zscore'] = grp.transform(lambda x: (x-x.mean())/x.std()) 

df[ df['zscore'].abs() > 1.5 ] 
Out[79]: 
     Seconds dateTime name zscore 
1 4998.927011 2015-01-02 john -1.522488 
42 5001.275866 2015-02-12 joe 1.636829 
58 4999.124550 2015-02-28 joe -1.624945 

df.head(10) 
Out[80]: 
     Seconds dateTime name zscore 
0 4998.699990 2015-01-01 joe -0.959960 
1 4998.927011 2015-01-02 john -1.522488 
2 5000.790199 2015-01-03 joe 0.263690 
3 4999.121735 2015-01-04 john -1.005137 
4 5001.501822 2015-01-05 joe 1.132407 
5 4999.976071 2015-01-06 john 0.678951 
6 5000.275949 2015-01-07 joe 0.650297 
7 4999.033607 2015-01-08 john -0.964222 
8 4998.419685 2015-01-09 joe -1.328744 
9 4999.796325 2015-01-10 john 1.224198 
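
The asker also wanted one row per name and day with an `events` list. The following is only a sketch of one way to get there from the z-score approach above, not part of the original answer. It reuses the same synthetic data; the `day`, `mean`, `std`, `summary`, and `events` names are chosen here for illustration, and the threshold of 1.5 comes from the question.

```python
import numpy as np
import pandas as pd

# Same synthetic data as in the answer above.
np.random.seed(1111)
df = pd.DataFrame({'name': ['joe', 'john'] * 30,
                   'dateTime': pd.date_range('1-1-2015', periods=60),
                   'Seconds': np.random.randn(60) + 5000.})

# Day of week as a plain column (0 = Monday ... 6 = Sunday); the name is illustrative.
df['day'] = df['dateTime'].dt.dayofweek
grp = df.groupby(['name', 'day'])['Seconds']

# Per-group statistics broadcast back onto every row.
df['mean'] = grp.transform('mean')
df['std'] = grp.transform('std')  # sample std (ddof=1), same as x.std() above
df['zscore'] = (df['Seconds'] - df['mean']) / df['std']

# One row per (name, day) with the summary statistics.
summary = grp.agg(['std', 'mean', 'count'])

# Timestamps of rows more than 1.5 standard deviations from their group mean.
outliers = df[df['zscore'].abs() > 1.5]
events = outliers.groupby(['name', 'day'])['dateTime'].apply(list)

# Align to the summary index; groups without outliers get an empty list.
summary['events'] = events.reindex(summary.index)
summary['events'] = summary['events'].apply(lambda v: v if isinstance(v, list) else [])

print(summary)
```

Keeping the row-level `zscore` column and building the list only at the end means the full outlier rows stay available in `outliers` if more than the timestamp is needed.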
+0

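Does this compute the zscore for each user, for each day of the week for that user? I am trying to find, based on their time patterns, who falls within 1.5 for a specific day of the week. – johnnyb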

+1

Yes. You can check a specific person/day of the week like this: `df[(df.dateTime.dt.dayofweek == 1) & (df.name == 'joe')]` and see if that makes it clearer. – JohnE
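
As a rough sketch of the check suggested in this comment, the z-score for a single person and day can be recomputed by hand and compared with the grouped result. The variable names `sub`, `manual`, `sample_std`, and `population_std` are illustrative only. The sketch also shows that `Series.std()` uses the sample standard deviation (`ddof=1`), while `np.std` on a NumPy array defaults to the population form (`ddof=0`), so the two will not give identical thresholds.

```python
import numpy as np
import pandas as pd

# Same synthetic data and grouped z-score as in the answer.
np.random.seed(1111)
df = pd.DataFrame({'name': ['joe', 'john'] * 30,
                   'dateTime': pd.date_range('1-1-2015', periods=60),
                   'Seconds': np.random.randn(60) + 5000.})
grp = df.groupby(['name', df.dateTime.dt.dayofweek])['Seconds']
df['zscore'] = grp.transform(lambda x: (x - x.mean()) / x.std())

# Rows for joe on Tuesdays (dayofweek == 1; Monday is 0).
sub = df[(df.dateTime.dt.dayofweek == 1) & (df.name == 'joe')]

# Recompute the z-score for just this group and compare with the grouped result.
manual = (sub['Seconds'] - sub['Seconds'].mean()) / sub['Seconds'].std()
print(np.allclose(manual, sub['zscore']))

# Sample vs population standard deviation for the same group.
sample_std = sub['Seconds'].std()                    # ddof=1
population_std = np.std(sub['Seconds'].to_numpy())   # ddof=0
print(sample_std, population_std)
```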