Identifying the boundaries of N groups

2014-01-09
0

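I have spent quite a while working on the following problem: identifying the boundaries of N groups.

Suppose you have N groups, each with multiple records, and every record has its own starting and ending point.

In other words: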

ID | GroupName | StartingPoint | EndingPoint | seq (row_number) | desired_seq
___|___________|_______________|_____________|__________________|____________
1  | Grp1      | 2014-01-06    | 2014-01-07  | 1                | 1
2  | Grp1      | 2014-01-07    | 2014-01-08  | 2                | 2
3  | Grp2      | 2014-01-08    | 2014-01-09  | 1                | 1
4  | Grp1      | 2014-01-09    | 2014-01-10  | 3                | 1
5  | Grp2      | 2014-01-10    | 2014-01-11  | 2                | 1

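As you can see, the starting point of each consecutive record is the same as the ending point of the previous one.

Basically, I want to get the minimum and maximum dates for each group. As soon as a record with a different group name appears, it should be treated as the start of a new group and the sequence should reset.

The row_number() function is not sufficient for this task because it does not reflect changes in the group name. (I have included a seq column in the sample data showing the values that row_number() produces.)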
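To make the difference between the two numbering schemes concrete, here is a minimal Python sketch (not part of the original SQL). The function names `partitioned_row_number` and `reset_on_change` are hypothetical, chosen only for illustration; the data is the sample table above.

```python
from collections import defaultdict

# (group name, starting point) pairs taken from the sample data.
sample = [
    ("Grp1", "2014-01-06"),
    ("Grp1", "2014-01-07"),
    ("Grp2", "2014-01-08"),
    ("Grp1", "2014-01-09"),
    ("Grp2", "2014-01-10"),
]

def partitioned_row_number(rows):
    """Mimics row_number() over (partition by GroupName order by StartingPoint)."""
    counters = defaultdict(int)
    numbers = []
    for name, _start in sorted(rows, key=lambda r: r[1]):
        counters[name] += 1          # the counter never resets for a given name
        numbers.append(counters[name])
    return numbers

def reset_on_change(rows):
    """Numbering that restarts at 1 whenever the group name changes."""
    numbers = []
    previous_name = None
    count = 0
    for name, _start in sorted(rows, key=lambda r: r[1]):
        count = count + 1 if name == previous_name else 1
        numbers.append(count)
        previous_name = name
    return numbers

print(partitioned_row_number(sample))  # the seq column
print(reset_on_change(sample))         # the desired_seq column
```

The first list continues counting Grp1 across the interruption by Grp2, which is why the partitioned numbering alone cannot mark where a new run begins.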

Expected result based on the sample data:

1 | Grp1 | 2014-01-06 | 2014-01-08
2 | Grp2 | 2014-01-08 | 2014-01-09
3 | Grp1 | 2014-01-09 | 2014-01-10
4 | Grp2 | 2014-01-10 | 2014-01-11

What I have tried:

;with cte as (
    select *
         , row_number() over (partition by GroupName order by StartingPoint) as seq
    from table1
)
select t1.*   -- t1.* only: selecting * from both sides would duplicate column names in #temp2
into #temp2
from cte t1
left join cte t2 on t1.id = t2.id and t1.seq = t2.seq - 1

select *
     , (select StartingPoint from #temp2 t2 where t1.id = t2.id and t2.seq = (select MIN(seq) from #temp2)) as Oldest
     , (select StartingPoint from #temp2 t2 where t1.id = t2.id and t2.seq = (select MAX(seq) from #temp2)) as MostRecent
from #temp2 t1
+0

Judging from the table, it seems you could just use 'MIN' and 'MAX', unless I'm missing something. – Zane

Answers

3

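This is a gaps-and-islands problem with subgroups. The trick is to group by the difference between two ROW_NUMBER() values, one partitioned and one unpartitioned. Within a run of consecutive rows from the same group both counters advance by one, so the difference stays constant; it changes as soon as a row from another group intervenes.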

WITH t AS (
    SELECT
        GroupName,
        StartingPoint,
        EndingPoint,
        ROW_NUMBER() OVER (PARTITION BY GroupName ORDER BY StartingPoint)
          - ROW_NUMBER() OVER (ORDER BY StartingPoint) AS SubGroupId
    FROM #test
)
SELECT
    ROW_NUMBER() OVER (ORDER BY MIN(StartingPoint)) AS SortOrderId,
    GroupName          AS GroupName,
    MIN(StartingPoint) AS GroupStartingPoint,
    MAX(EndingPoint)   AS GroupEndingPoint
FROM t
GROUP BY GroupName, SubGroupId
ORDER BY SortOrderId
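As a rough illustration of why the row-number difference works, the following Python sketch replays the same idea on the question's sample rows. It is only a sketch of the technique, not a replacement for the query; the function name `islands_by_row_number_difference` is hypothetical, and it assumes the dates are ISO-formatted strings so that string comparison matches date order.

```python
from collections import defaultdict

# Sample rows from the question: (id, group name, starting point, ending point).
rows = [
    (1, "Grp1", "2014-01-06", "2014-01-07"),
    (2, "Grp1", "2014-01-07", "2014-01-08"),
    (3, "Grp2", "2014-01-08", "2014-01-09"),
    (4, "Grp1", "2014-01-09", "2014-01-10"),
    (5, "Grp2", "2014-01-10", "2014-01-11"),
]

def islands_by_row_number_difference(rows):
    """Group rows by (name, partitioned row number - overall row number)."""
    ordered = sorted(rows, key=lambda r: r[2])       # ORDER BY StartingPoint
    per_group_rn = defaultdict(int)
    islands = {}                                     # (name, sub_group_id) -> [min start, max end]
    for overall_rn, (_id, name, start, end) in enumerate(ordered, start=1):
        per_group_rn[name] += 1                      # partitioned ROW_NUMBER()
        sub_group_id = per_group_rn[name] - overall_rn
        key = (name, sub_group_id)
        if key not in islands:
            islands[key] = [start, end]
        else:
            islands[key][0] = min(islands[key][0], start)
            islands[key][1] = max(islands[key][1], end)
    # Number the islands by earliest starting point, like SortOrderId.
    by_start = sorted(
        ((name, span[0], span[1]) for (name, _sub), span in islands.items()),
        key=lambda t: t[1],
    )
    return [(i, name, s, e) for i, (name, s, e) in enumerate(by_start, start=1)]

for island in islands_by_row_number_difference(rows):
    print(island)
```

On the sample data the differences come out as 0, 0, -2, -1, -3 for ids 1 to 5, so only the first two Grp1 rows share a key and are merged.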
0

Not sure, but maybe:

SELECT DISTINCT 
    GroupName, 
    MIN(StartingPoint) OVER (PARTITION BY GroupName ORDER BY Id), 
    MAX(EndingPoint) OVER (PARTITION BY GroupName ORDER BY Id) 
FROM table1 

Because partitioning does not reduce the number of rows, the result contains duplicate rows, which DISTINCT then removes.

0

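This is much easier with the lag() function in SQL Server 2012. My approach to these problems is to find where each group starts by assigning each row a flag of 1 or 0, and then to take the cumulative sum of the 1s to get a new group ID.

In SQL Server 2008, you can do this with correlated subqueries (or joins):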

with table1_flag as (
      select t1.*,
             -- 1 when no row of the same group ends where this row starts,
             -- i.e. this row starts a new island; 0 when it continues one
             (case when exists (select 1
                                from table1 t2
                                where t2.groupname = t1.groupname and
                                      t2.endingpoint = t1.startingpoint
                               )
                   then 0 else 1
              end) as groupstartflag
      from table1 t1
     ),
     table1_flag_cum as (
      select tf.*,
             -- running total of the start flags within each group name
             (select sum(groupstartflag)
              from table1_flag tf2
              where tf2.groupname = tf.groupname and
                    tf2.startingpoint <= tf.startingpoint
             ) as groupnum
      from table1_flag tf
     )
select groupnum, groupname,
       min(startingpoint) as startingpoint, max(endingpoint) as endingpoint
from table1_flag_cum
group by groupnum, groupname;
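The flag-and-running-total idea can also be sketched outside SQL. The Python below is a minimal illustration under stated assumptions, not the answer's query: it compares each row's group name with the previous row's (the role lag() would play), which is one possible way to define the start flag, and the function name `islands_by_start_flag` is hypothetical.

```python
# Sample rows from the question: (id, group name, starting point, ending point).
rows = [
    (1, "Grp1", "2014-01-06", "2014-01-07"),
    (2, "Grp1", "2014-01-07", "2014-01-08"),
    (3, "Grp2", "2014-01-08", "2014-01-09"),
    (4, "Grp1", "2014-01-09", "2014-01-10"),
    (5, "Grp2", "2014-01-10", "2014-01-11"),
]

def islands_by_start_flag(rows):
    """Flag rows that start a new run, then use the running total as the island id."""
    ordered = sorted(rows, key=lambda r: r[2])   # ISO date strings sort chronologically
    islands = []
    previous_name = None                         # stands in for LAG(GroupName)
    island_id = 0
    for _id, name, start, end in ordered:
        start_flag = 1 if name != previous_name else 0
        island_id += start_flag                  # cumulative sum of the flags
        if start_flag:
            islands.append([island_id, name, start, end])
        else:
            islands[-1][2] = min(islands[-1][2], start)
            islands[-1][3] = max(islands[-1][3], end)
        previous_name = name
    return [tuple(island) for island in islands]

for island in islands_by_start_flag(rows):
    print(island)
```

Because the running total here is taken over all rows in date order rather than per group name, the island id doubles as the overall sort order.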
+0

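Thanks for your help. I tested the query on [SQLFiddle](http://sqlfiddle.com/#!3/87a45/2), but could not adapt it to my requirements. Your query returns 07-10 for Grp1 and 08-11 for Grp2, which means Grp2 is contained within Grp1.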

+0

@Kiril . . . It includes 'groupname' in every comparison, including the final 'group by'. The groups should not interfere with each other.

+0

Hmm. That is what I expected; however, I am still getting multiple groups associated with the same date.