2016-02-08 39 views

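Populate elements of a column based on multiple conditions

I have a data frame and I want to remove any week that contains an outlier. I would be happy if I could flag the whole week as an outlier, because I know how to subset from there. I have not been able to come up with a proper solution. I have been thinking that I either need to loop over the weeks, or write a separate function that handles a single outlier week and apply it with `sapply`. I have not managed to make either of these approaches work.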

date <- seq(as.Date("2015-01-01"), length.out = 365, by = "1 day")
df <- data.frame(date = date)                    # keeps date as Date; cbind() would coerce it to numeric
df$dow <- as.factor(weekdays(df$date))           # day of week
set.seed(1115)
df$var1 <- rnorm(365, 1912, 40795)
stdev <- sd(df$var1, na.rm = TRUE)
avg <- mean(df$var1, na.rm = TRUE)
df$LB <- avg - (2.75 * stdev)                    # lower bound
df$UB <- avg + (2.75 * stdev)                    # upper bound
df$outlier <- ifelse(df$var1 < df$LB | df$var1 > df$UB, 1, 0)
df$weeknum <- as.numeric(format(df$date, "%U"))  # week of year, weeks start on Sunday
head(df, 17)

> head(df, 17)
         date       dow       var1        LB       UB outlier weeknum
1  2015-01-01  Thursday  -7828.412 -114675.6 120479.8       0       0
2  2015-01-02    Friday  25674.456 -114675.6 120479.8       0       0
3  2015-01-03  Saturday -33588.871 -114675.6 120479.8       0       0
4  2015-01-04    Sunday -54418.175 -114675.6 120479.8       0       1
5  2015-01-05    Monday -10002.002 -114675.6 120479.8       0       1
6  2015-01-06   Tuesday  34050.390 -114675.6 120479.8       0       1
7  2015-01-07 Wednesday -37584.648 -114675.6 120479.8       0       1
8  2015-01-08  Thursday  84048.878 -114675.6 120479.8       0       1
9  2015-01-09    Friday -24801.346 -114675.6 120479.8       0       1
10 2015-01-10  Saturday  33974.637 -114675.6 120479.8       0       1
11 2015-01-11    Sunday  77432.088 -114675.6 120479.8       0       2
12 2015-01-12    Monday 128196.236 -114675.6 120479.8       1       2
13 2015-01-13   Tuesday   9740.418 -114675.6 120479.8       0       2
14 2015-01-14 Wednesday  26539.887 -114675.6 120479.8       0       2
15 2015-01-15  Thursday  12172.834 -114675.6 120479.8       0       2
16 2015-01-16    Friday   1032.544 -114675.6 120479.8       0       2
17 2015-01-17  Saturday  76870.095 -114675.6 120479.8       0       2

In the example above, the desired output would be a 1 in the outlier column for every row where weeknum = 2.


Something like `df[df$weeknum == 2 & df$outlier == 1, ]`? – Jimbou


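The only reason weeknum = 2 should be the subset is that an outlier occurs in that week, at row 12. The code I want to write would find outliers in any week and flag that entire week as an outlier. The data set has 365 rows, so the example above is just the first 17 rows, which happen to contain one outlier. –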

Answers


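The answer comes down to testing one vector against another. Once I realized that, I was able to refine my search and find a suitable answer here.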

The code needed to flag every element correctly is:

# Subset containing only the outlier rows
out.df <- df[which(df$outlier == 1), ]
# Flag every row whose weeknum appears among the outlier rows' weeknum values:
# %in% returns TRUE for a match, so those rows get 1 and all others get 0.
df$outlier <- ifelse(df$weeknum %in% out.df$weeknum, 1, 0)

This gives the result:

> head(df, 17)
         date       dow       var1        LB       UB outlier weeknum
1  2015-01-01  Thursday  -7828.412 -114675.6 120479.8       0       0
2  2015-01-02    Friday  25674.456 -114675.6 120479.8       0       0
3  2015-01-03  Saturday -33588.871 -114675.6 120479.8       0       0
4  2015-01-04    Sunday -54418.175 -114675.6 120479.8       0       1
5  2015-01-05    Monday -10002.002 -114675.6 120479.8       0       1
6  2015-01-06   Tuesday  34050.390 -114675.6 120479.8       0       1
7  2015-01-07 Wednesday -37584.648 -114675.6 120479.8       0       1
8  2015-01-08  Thursday  84048.878 -114675.6 120479.8       0       1
9  2015-01-09    Friday -24801.346 -114675.6 120479.8       0       1
10 2015-01-10  Saturday  33974.637 -114675.6 120479.8       0       1
11 2015-01-11    Sunday  77432.088 -114675.6 120479.8       1       2
12 2015-01-12    Monday 128196.236 -114675.6 120479.8       1       2
13 2015-01-13   Tuesday   9740.418 -114675.6 120479.8       1       2
14 2015-01-14 Wednesday  26539.887 -114675.6 120479.8       1       2
15 2015-01-15  Thursday  12172.834 -114675.6 120479.8       1       2
16 2015-01-16    Friday   1032.544 -114675.6 120479.8       1       2
17 2015-01-17  Saturday  76870.095 -114675.6 120479.8       1       2

which is what I was after.
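As a possible shortcut, the same week-level flag can probably be computed without the intermediate subset, by taking the per-week maximum of the 0/1 outlier column with base R's `ave()`. The sketch below is only an illustration: the `toy` data frame and the `flag_in`, `flag_ave` and `outlier_week` names are made up for the example and are not part of the original data.

```r
# Toy data (hypothetical): three weeks, with a single outlier day in week 2
toy <- data.frame(
  weeknum = c(0, 0, 1, 1, 1, 2, 2, 2),
  outlier = c(0, 0, 0, 0, 0, 0, 1, 0)
)

# Approach used above: 1 for rows whose week appears among the outlier rows
flag_in <- ifelse(toy$weeknum %in% toy$weeknum[toy$outlier == 1], 1, 0)

# Alternative: maximum of the 0/1 flag within each week, repeated for every row
flag_ave <- ave(toy$outlier, toy$weeknum, FUN = max)

# Both approaches should mark the same rows
all(flag_in == flag_ave)

# Store the result in a new column so the day-level flag is not overwritten
toy$outlier_week <- flag_ave
toy
```

Writing the result to a separate column keeps the original day-level flag available, which may be useful if the individual outlier days need to be inspected later.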


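You said "the desired output would be a 1 in the outlier column for every row where weeknum = 2." Do you really need an outlier column? It seems you could simply subset the data.frame based on the value of the weeknum column, like this: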

df <- df[!(df$weeknum==2),]
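Hard-coding `weeknum == 2` only covers the 17-row excerpt. A minimal sketch of the same subsetting idea for any number of outlier weeks could look like the following, assuming `outlier` and `weeknum` columns like those in the question; the `toy`, `bad_weeks` and `clean` names are invented for illustration.

```r
# Toy data (hypothetical): outliers fall in weeks 1 and 3
toy <- data.frame(
  weeknum = c(0, 1, 1, 2, 2, 3, 3),
  outlier = c(0, 1, 0, 0, 0, 0, 1)
)

# Weeks that contain at least one outlier row
bad_weeks <- unique(toy$weeknum[toy$outlier == 1])

# Keep only the rows from weeks with no outliers
clean <- toy[!(toy$weeknum %in% bad_weeks), ]
clean
```

This drops the affected weeks directly, so no week-level flag column is needed if removal is the only goal.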