2012-10-17 30 views
4

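GROUP BY on a column with multiple repeating groups of data

I need to group some data by date and location, including identifying the date ranges that have no location. I have got part of the way there: I managed to generate a list of all dates in the range together with their locations.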

  • date1 Location1
  • date2 Location1
  • date3 Location1
  • date4 Unknown
  • date5 Unknown
  • date6 Unknown
  • date7 Location2
  • date8 Location2
  • date9 Location2
  • date10 Location2
  • date11 Location1
  • date12 Location1
  • date13 Location1

Using a normal GROUP BY (showing MIN(date) and MAX(date)), I would get something like this:

  • Location1, date1, date13
  • Location2, date7, date10
  • Unknown, date4, date6
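
For reference, the plain grouping described above looks roughly like the sketch below. The table variable @LocationDates, its column names, and the sample dates are placeholders made up for illustration, not the real schema.

```sql
-- Placeholder table and data (assumptions, not the real schema).
DECLARE @LocationDates TABLE (aDate date, aLocation varchar(32));

INSERT INTO @LocationDates VALUES
('20000101', 'Location1'),
('20010101', 'Location1'),
('20040101', 'Unknown'),
('20070101', 'Location2'),
('20110101', 'Location1');

-- One row per location: the two separate Location1 runs collapse into a
-- single range, which is the behaviour I am trying to avoid.
SELECT aLocation, MIN(aDate) AS firstDate, MAX(aDate) AS lastDate
FROM @LocationDates
GROUP BY aLocation;
```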

But I would like this:

  • Location1, date1, date3
  • Unknown, date4, date6
  • Location2, date7, date10
  • Location1, date11, date13

I also need to filter out short Unknown ranges, but that is secondary.

I hope this makes sense; it seems like it should be easy.


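This is a gaps-and-islands problem. I don't know how to add the tag, though... maybe it will get applied automatically now that this comment mentions it? http://stackoverflow.com/questions/tagged/gaps-and-islands – Anssssss

Answers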

1

Take a look at the islands-and-gaps problem and at Itzik Ben-Gan's work on it. There is a set-based way to get the result you want.
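
One common set-based formulation (a minimal sketch, not necessarily the exact approach this answer has in mind) subtracts a per-location ROW_NUMBER from an overall ROW_NUMBER. Within a run of consecutive rows for the same location the difference stays constant, so it can serve as a grouping key. The table variable @Visits, its columns, and the sample rows are assumptions for illustration, and the sketch assumes one row per date.

```sql
-- Hypothetical sample data; names and values are assumptions.
DECLARE @Visits TABLE (aDate date, aLocation varchar(32));

INSERT INTO @Visits VALUES
('20000101', 'Location1'),
('20010101', 'Location1'),
('20040101', 'Unknown'),
('20050101', 'Unknown'),
('20070101', 'Location2'),
('20110101', 'Location1'),
('20120101', 'Location1');

;WITH Grouped AS
(
    SELECT
        aLocation,
        aDate,
        -- overall position minus position within the location:
        -- constant inside each consecutive run of one location
        ROW_NUMBER() OVER (ORDER BY aDate)
      - ROW_NUMBER() OVER (PARTITION BY aLocation ORDER BY aDate) AS grp
    FROM @Visits
)
SELECT aLocation, MIN(aDate) AS aStart, MAX(aDate) AS aEnd
FROM Grouped
GROUP BY aLocation, grp   -- grp alone is not unique, so group by both
ORDER BY aStart;
```

Because it only needs ROW_NUMBER, this form should also run on versions before SQL Server 2012.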

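I was looking into using ROW_NUMBER or RANK, but then I stumbled upon LAG and LEAD (introduced in SQL Server 2012), which are nice. My solution is below. It could definitely be simplified, but writing it as several CTEs makes my thought process (flawed as it may be) easier to follow: I just slowly transform the data into what I want. If you want to see what each new CTE produces, uncomment the commented-out select statements one at a time.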

create table Junk 
(aDate Datetime, 
aLocation varchar(32)) 

insert into Junk values 
('2000', 'Location1'), 
('2001', 'Location1'), 
('2002', 'Location1'), 
('2004', 'Unknown'), 
('2005', 'Unknown'), 
('2006', 'Unknown'), 
('2007', 'Location2'), 
('2008', 'Location2'), 
('2009', 'Location2'), 
('2010', 'Location2'), 
('2011', 'Location1'), 
('2012', 'Location1'), 
('2013', 'Location1'), 
('2014', 'Location3') 


;WITH StartsMiddlesAndEnds AS 
(
    select 
    aLocation, 
    aDate, 
    CASE(LAG(aLocation) OVER (ORDER BY aDate, aLocation)) WHEN aLocation THEN 0 ELSE 1 END [isStart], 
    CASE(LEAD(aLocation) OVER (ORDER BY aDate, aLocation)) WHEN aLocation THEN 0 ELSE 1 END [isEnd] 
    from Junk 
) 
--select * from StartsMiddlesAndEnds 
,NumberedStartsAndEnds AS --let's get rid of the rows that are in the middle of consecutive date groups 
(
    select 
    aLocation, 
    aDate, 
    isStart, 
    isEnd, 
    ROW_NUMBER() OVER(ORDER BY aDate, aLocation) i 
    FROM StartsMiddlesAndEnds 
    WHERE NOT(isStart = 0 AND isEnd = 0) --it is a middle row 
) 
--select * from NumberedStartsAndEnds 
,CombinedStartAndEnds AS --now let's put the start and end dates in the same row 
(
    select 
    rangeStart.aLocation, 
    rangeStart.aDate [aStart], 
    rangeEnd.aDate [aEnd] 
    FROM NumberedStartsAndEnds rangeStart 
    join NumberedStartsAndEnds rangeEnd ON rangeStart.aLocation = rangeEnd.aLocation 
    WHERE rangeStart.i = rangeEnd.i - 1 --consecutive rows 
    and rangeStart.isStart = 1 
    and rangeEnd.isEnd = 1 
) 
--select * from CombinedStartAndEnds 
,OneDateIntervals AS --don't forget the cases where a single row is both a start and end 
(
    select 
    aLocation, 
    aDate [aStart], 
    aDate [aEnd] 
    FROM NumberedStartsAndEnds 
    WHERE isStart = 1 and isEnd = 1 
) 
--select * from OneDateIntervals 
select aLocation, DATEPART(YEAR, aStart) [start], DATEPART(YEAR, aEnd) [end] from OneDateIntervals 
UNION 
select aLocation, DATEPART(YEAR, aStart) [start], DATEPART(YEAR, aEnd) [end] from CombinedStartAndEnds 
ORDER BY DATEPART(YEAR, aStart) 

And it produces:

aLocation start end 
Location1 2000 2002 
Unknown 2004 2006 
Location2 2007 2010 
Location1 2011 2013 
Location3 2014 2014 
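
The question also mentions dropping short Unknown ranges. A minimal sketch against the same Junk table follows, assuming "short" means an Unknown run with fewer than some number of rows; the @MinRows variable and its value are hypothetical. To keep it compact it uses the ROW_NUMBER-difference grouping rather than the CTE chain above.

```sql
-- Assumption: an Unknown run is "short" when it has fewer than @MinRows rows.
DECLARE @MinRows int = 3;  -- hypothetical threshold

;WITH Grouped AS
(
    SELECT
        aLocation,
        aDate,
        -- constant within each consecutive run of the same location
        ROW_NUMBER() OVER (ORDER BY aDate, aLocation)
      - ROW_NUMBER() OVER (PARTITION BY aLocation ORDER BY aDate) AS grp
    FROM Junk
)
SELECT aLocation, MIN(aDate) AS aStart, MAX(aDate) AS aEnd
FROM Grouped
GROUP BY aLocation, grp
-- keep every known-location range, and Unknown ranges only when long enough
HAVING aLocation <> 'Unknown' OR COUNT(*) >= @MinRows
ORDER BY aStart;
```

Note that this only hides the short Unknown ranges; it does not merge the ranges on either side of a removed one, even if they share a location.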

Don't have SQL Server 2012? Then you can still get the same StartsMiddlesAndEnds CTE using ROW_NUMBER:

;WITH NumberedRows AS 
(
    SELECT aLocation, aDate, ROW_NUMBER() OVER (ORDER BY aDate, aLocation) [i] FROM Junk 
) 
,StartsMiddlesAndEnds AS 
(
    select 
    currentRow.aLocation, 
    currentRow.aDate, 
    CASE upperRow.aLocation WHEN currentRow.aLocation THEN 0 ELSE 1 END [isStart], 
    CASE lowerRow.aLocation WHEN currentRow.aLocation THEN 0 ELSE 1 END [isEnd] 
    from 
    NumberedRows currentRow 
    left outer join NumberedRows upperRow on upperRow.i = currentRow.i-1 
    left outer join NumberedRows lowerRow on lowerRow.i = currentRow.i+1 
) 
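-- the remaining CTEs (NumberedStartsAndEnds onward) and the final select follow unchanged from the first query
select * from StartsMiddlesAndEnds 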

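It looks good, but we're still on 2008, and I got a Parallel Data Warehouse error (that's a new one for me). Unfortunately, I also left out some other important details, which I will outline separately. – Deadeye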


I added a way to do it on 2008 (i.e. without using LAG and LEAD). – Anssssss