2012-09-22 29 views

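Using strptime results in duplicate dates in a data set

First off, sorry for not providing any reproducible data, but I can't figure out how to reproduce this problem. I'll do my best to list what I have done along with any relevant information. Any troubleshooting ideas would be greatly appreciated.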

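My problem is this:

I have a large time series data set that I read into R with read.csv. I eventually convert it to a zoo object, but for now I have it as a data frame. Looking at the data with str, I get: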

> str(Met) 
'data.frame': 568354 obs. of 18 variables: 
$ time_local       : Factor w/ 568354 levels "2006-08-06 03:15:00",..: 1 2 3  4 5 6 7 8 9 10 ... 

Note: Met$time_local is the only column I care about here, so I have removed all the other columns from the str readout.

If I search for duplicates using

Dup<-Met$time_local[duplicated(Met$time_local)] 

I get nothing:

str(Dup) 
Factor w/ 568354 levels "2006-08-06 03:15:00",..: 

If I use strptime

MetStrp<-strptime(Met$time_local, "%Y-%m-%d %H:%M:%S") 
str(MetStrp) 
POSIXlt[1:568354], format: "2006-08-06 03:15:00" "2006-08-06 03:20:00" "2006-08-06 03:25:00" ... 

to convert the date/time data to a POSIXlt or POSIXct object and then search for duplicates

Dup<-MetStrp[duplicated(MetStrp)] 
> head(Dup) 
[1] "2007-03-11 02:00:00" "2007-03-11 02:05:00" "2007-03-11 02:10:00" 
[4] "2007-03-11 02:15:00" "2007-03-11 02:20:00" "2007-03-11 02:25:00" 
> str(Dup) 
POSIXlt[1:60], format: "2007-03-11 02:00:00" "2007-03-11 02:05:00" "2007-03-11 02:10:00" ... 

I now have 60 duplicates (which throws things off when I create a zoo object).

Interestingly, if I change the POSIXlt format to POSIXct

ct<-as.POSIXct(MetStrp) 
str(ct) 
POSIXct[1:568354], format: "2006-08-06 03:15:00" "2006-08-06 03:20:00" "2006-08-06 03:25:00" ... 

I get the same duplicates, but offset by one hour:

Dup<-ct[duplicated(ct)] 
> head(Dup) 
[1] "2007-03-11 01:00:00 PST" "2007-03-11 01:05:00 PST" "2007-03-11 01:10:00 PST" 
[4] "2007-03-11 01:15:00 PST" "2007-03-11 01:20:00 PST" "2007-03-11 01:25:00 PST" 
> str(Dup) 
POSIXct[1:60], format: "2007-03-11 01:00:00" "2007-03-11 01:05:00" "2007-03-11 01:10:00" ... 

If I then look for the locations of the duplicates using

Dup_loc<-which(duplicated(MetStrp) | duplicated(MetStrp,fromLast=TRUE)) 

I get 120 duplicate locations, which end up being a combination of the POSIXlt and POSIXct duplicates.

str(Dup_loc) 
int [1:120] 62470 62471 62472 62473 62474 62475 62476 62477 62478 62479 ... 

The POSIXct dates always fall in hour 1-2, while the POSIXlt dates always fall in hour 2-3.

To view the duplicates:
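The count of 120 follows from how the two duplicated() calls combine: the forward pass flags only the later copy of each pair, the fromLast pass flags only the earlier copy, and their union flags both. A minimal sketch on a made-up character vector (the values are illustrative, not taken from Met):

```r
# Hypothetical toy vector: "a" and "b" each occur twice, "c" once.
x <- c("a", "b", "a", "c", "b")

later_copies   <- duplicated(x)                   # flags positions 3 and 5
earlier_copies <- duplicated(x, fromLast = TRUE)  # flags positions 1 and 2

# Union of both passes: every member of every duplicated group.
all_members <- which(later_copies | earlier_copies)
print(all_members)
```

With 60 duplicated pairs, this union would contain 2 x 60 = 120 positions, which matches the length of Dup_loc.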

Test<-MetStrp[Dup_loc] 


>Test 
[1] "2007-03-11 01:00:00" "2007-03-11 01:05:00" "2007-03-11 01:10:00" 
[4] "2007-03-11 01:15:00" "2007-03-11 01:20:00" "2007-03-11 01:25:00" 
[7] "2007-03-11 01:30:00" "2007-03-11 01:35:00" "2007-03-11 01:40:00" 
[10] "2007-03-11 01:45:00" "2007-03-11 01:50:00" "2007-03-11 01:55:00" 
[13] "2007-03-11 02:00:00" "2007-03-11 02:05:00" "2007-03-11 02:10:00" 
[16] "2007-03-11 02:15:00" "2007-03-11 02:20:00" "2007-03-11 02:25:00" 
[19] "2007-03-11 02:30:00" "2007-03-11 02:35:00" "2007-03-11 02:40:00" 
[22] "2007-03-11 02:45:00" "2007-03-11 02:50:00" "2007-03-11 02:55:00" 
[25] "2008-03-09 01:00:00" "2008-03-09 01:05:00" "2008-03-09 01:10:00" 
[28] "2008-03-09 01:15:00" "2008-03-09 01:20:00" "2008-03-09 01:25:00" 
[31] "2008-03-09 01:30:00" "2008-03-09 01:35:00" "2008-03-09 01:40:00" 
[34] "2008-03-09 01:45:00" "2008-03-09 01:50:00" "2008-03-09 01:55:00" 
[37] "2008-03-09 02:00:00" "2008-03-09 02:05:00" "2008-03-09 02:10:00" 
[40] "2008-03-09 02:15:00" "2008-03-09 02:20:00" "2008-03-09 02:25:00" 
[43] "2008-03-09 02:30:00" "2008-03-09 02:35:00" "2008-03-09 02:40:00" 
[46] "2008-03-09 02:45:00" "2008-03-09 02:50:00" "2008-03-09 02:55:00" 
[49] "2009-03-08 01:00:00" "2009-03-08 01:05:00" "2009-03-08 01:10:00" 
[52] "2009-03-08 01:15:00" "2009-03-08 01:20:00" "2009-03-08 01:25:00" 
[55] "2009-03-08 01:30:00" "2009-03-08 01:35:00" "2009-03-08 01:40:00" 
[58] "2009-03-08 01:45:00" "2009-03-08 01:50:00" "2009-03-08 01:55:00" 
[61] "2009-03-08 02:00:00" "2009-03-08 02:05:00" "2009-03-08 02:10:00" 
[64] "2009-03-08 02:15:00" "2009-03-08 02:20:00" "2009-03-08 02:25:00" 
[67] "2009-03-08 02:30:00" "2009-03-08 02:35:00" "2009-03-08 02:40:00" 
[70] "2009-03-08 02:45:00" "2009-03-08 02:50:00" "2009-03-08 02:55:00" 
[73] "2010-03-14 01:00:00" "2010-03-14 01:05:00" "2010-03-14 01:10:00" 
[76] "2010-03-14 01:15:00" "2010-03-14 01:20:00" "2010-03-14 01:25:00" 
[79] "2010-03-14 01:30:00" "2010-03-14 01:35:00" "2010-03-14 01:40:00" 
[82] "2010-03-14 01:45:00" "2010-03-14 01:50:00" "2010-03-14 01:55:00" 
[85] "2010-03-14 02:00:00" "2010-03-14 02:05:00" "2010-03-14 02:10:00" 
[88] "2010-03-14 02:15:00" "2010-03-14 02:20:00" "2010-03-14 02:25:00" 
[91] "2010-03-14 02:30:00" "2010-03-14 02:35:00" "2010-03-14 02:40:00" 
[94] "2010-03-14 02:45:00" "2010-03-14 02:50:00" "2010-03-14 02:55:00" 
[97] "2011-03-13 01:00:00" "2011-03-13 01:05:00" "2011-03-13 01:10:00" 
[100] "2011-03-13 01:15:00" "2011-03-13 01:20:00" "2011-03-13 01:25:00" 
[103] "2011-03-13 01:30:00" "2011-03-13 01:35:00" "2011-03-13 01:40:00" 
[106] "2011-03-13 01:45:00" "2011-03-13 01:50:00" "2011-03-13 01:55:00" 
[109] "2011-03-13 02:00:00" "2011-03-13 02:05:00" "2011-03-13 02:10:00" 
[112] "2011-03-13 02:15:00" "2011-03-13 02:20:00" "2011-03-13 02:25:00" 
[115] "2011-03-13 02:30:00" "2011-03-13 02:35:00" "2011-03-13 02:40:00" 
[118] "2011-03-13 02:45:00" "2011-03-13 02:50:00" "2011-03-13 02:55:00" 

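As far as I can see, there are no duplicated time stamps in the output above. So I'm not sure what is going on, but something is off.

As far as I can tell, all I have done is convert a factor column into a date/time column. So I don't understand why zoo reports a duplicate error, or why duplicated finds duplicates when none appear to exist.

Again, any thoughts on this issue would be greatly appreciated.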

Answers

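I have three words for you: "Daylight Saving Time". Based on the evidence provided, I predict that March 11, 2007 was the date of the daylight saving transition in your locale. Notice that the duplicates occur in the 1-2 AM time frame.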
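The effect can be sketched with generated time stamps rather than the original data. The zone name "America/Los_Angeles" is an assumption (the question only shows "PST"), and how R handles the nonexistent 02:00-02:55 wall-clock times is platform-dependent (they may come back as NA or shifted by an hour), so only the UTC result is predictable here:

```r
# Wall-clock stamps every 5 minutes across the 2007-03-11 spring-forward night.
# Generated in UTC so that the 02:00-02:55 stamps are included.
fmt <- "%Y-%m-%d %H:%M:%S"
stamps <- format(seq(as.POSIXct("2007-03-11 00:00:00", tz = "UTC"),
                     by = "5 min", length.out = 48), fmt)

# Parsed as UTC: there is no DST, so all 48 instants are distinct.
ct_utc <- as.POSIXct(strptime(stamps, fmt, tz = "UTC"))
n_dup_utc <- sum(duplicated(ct_utc))

# Parsed in a DST-observing zone (assumed zone name): 02:00-02:55 does not
# exist on that date, so the outcome depends on the platform.
ct_local <- as.POSIXct(strptime(stamps, fmt, tz = "America/Los_Angeles"))
n_dup_local <- sum(duplicated(ct_local))

print(c(utc = n_dup_utc, local = n_dup_local))
```

Any nonzero count in the local column comes from the parsing step, since the input strings themselves are all unique.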

Nice work. I think that is the root of the problem. However, setting tz="" when using strptime does not seem to solve it. – Vinterwoo

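Forget all the duplicated etc. calls. Just convert to POSIXct and then look at 'diff(x)'. If you let R parse the dates properly, it will get the answer right across the DST change. –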
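A minimal sketch of that diff check, using four made-up stamps at the 5-minute spacing seen in the question. Parsing as UTC is an assumption made so the result is the same on every platform; with a DST-observing zone, a step other than 300 seconds would mark the transition:

```r
fmt <- "%Y-%m-%d %H:%M:%S"

# Hypothetical stamps spanning the 2007 spring-forward boundary.
x <- c("2007-03-11 01:50:00", "2007-03-11 01:55:00",
       "2007-03-11 02:00:00", "2007-03-11 02:05:00")

ct <- as.POSIXct(x, format = fmt, tz = "UTC")

# Step between consecutive observations, forced to seconds.
step_secs <- as.numeric(diff(ct), units = "secs")

# Positions where the step is not the expected 5 minutes.
irregular <- which(step_secs != 300)
print(step_secs)
print(irregular)
```

Repeated instants show up as a step of 0, and skipped or shifted hours show up as unusually large or negative steps.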

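Setting tz="UTC", though, does eliminate the DST issue and removes the duplicates, making zoo happy again. Thanks DWin, and thanks to everyone who looked into this – Vinterwoo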
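As a rough sketch of that fix applied to a data frame shaped like Met: the toy rows and the value column are hypothetical, and treating the recorded wall-clock times as UTC is an assumption that only makes sense if the logger did not itself switch to daylight time.

```r
fmt <- "%Y-%m-%d %H:%M:%S"

# Hypothetical stand-in for Met: a factor time column plus one made-up measurement.
Met <- data.frame(
  time_local = factor(c("2007-03-11 01:55:00",
                        "2007-03-11 02:00:00",
                        "2007-03-11 02:05:00")),
  value = c(1.2, 1.4, 1.3)
)

# Convert factor -> character -> POSIXct, parsing as UTC so that no
# wall-clock time falls into a DST gap.
Met$time_utc <- as.POSIXct(as.character(Met$time_local), format = fmt, tz = "UTC")

n_dup <- sum(duplicated(Met$time_utc))
print(n_dup)

# With the zoo package installed, the index can then be used directly:
# z <- zoo::zoo(Met$value, order.by = Met$time_utc)
```

The zoo call is left commented out so the sketch runs without the package.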
