2017-10-05 32 views
2

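SQL: split data by consecutively increasing sequence, then subset each block by a pattern

I have data in which I am trying to identify patterns. However, the data in each table is incomplete (rows are missing). I would like to split the table into blocks of complete data and then identify the pattern within each block. I have a column, called sequence, that I can use to determine whether the data is complete.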

The data looks like this:

Sequence  Position 
1    open 
2    closed 
3    open 
4    open 
5    closed 
8    closed 
9    open 
11    open 
13    closed 
14    open 
15    open 
18    closed 
19    open 
20    closed 

First, I want to split the data into complete parts:

Sequence  Position 
    1    open 
    2    closed 
    3    open 
    4    open 
    5    closed 
--------------------------- 
    8    closed 
    9    open 
--------------------------- 
    11    open 
--------------------------- 
    13    closed 
    14    open 
    15    open 
--------------------------- 
    18    closed 
    19    open 
    20    closed 
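
The split shown above is the usual "gaps and islands" grouping: within a run of consecutive sequence numbers, the sequence value minus the row number is constant, so that difference can serve as a block id. The following is a minimal sketch only, assuming SQLite (through Python's sqlite3 module, SQLite 3.25 or later for window functions) as a stand-in for the real database, and a hypothetical table name `t` and column alias `block_id` that do not come from the question.

```python
import sqlite3

# Sample rows from the question; the table name `t` is an assumption.
rows = [
    (1, 'open'), (2, 'closed'), (3, 'open'), (4, 'open'), (5, 'closed'),
    (8, 'closed'), (9, 'open'), (11, 'open'), (13, 'closed'), (14, 'open'),
    (15, 'open'), (18, 'closed'), (19, 'open'), (20, 'closed'),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (Sequence INTEGER PRIMARY KEY, Position TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

# Sequence - ROW_NUMBER() stays constant while Sequence increases by 1
# per row, and jumps whenever a sequence number is missing.
query = """
SELECT Sequence,
       Position,
       Sequence - ROW_NUMBER() OVER (ORDER BY Sequence) AS block_id
FROM t
ORDER BY Sequence
"""

# Collect the sequence numbers of each block, in order of appearance.
blocks = {}
for seq, pos, block_id in conn.execute(query):
    blocks.setdefault(block_id, []).append(seq)
conn.close()

islands = list(blocks.values())
print(islands)
```

The same `Sequence - ROW_NUMBER()` expression should carry over to SQL Server unchanged, though that is not tested here.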

Then I want to identify the pattern closed, open, ..., open, closed, i.e. going from closed to open for n rows (where n is at least 1) and then back to closed.

From the sample data, this would leave:

 Sequence  Position 
     2    closed 
     3    open 
     4    open 
     5    closed 
    --------------------------- 
     18    closed 
     19    open 
     20    closed 

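This gives me a final table that I can analyse, since I know it contains no broken sequences. If it is easier to work with, I also have another column in which position is stored as a binary value.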

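The tables are large, so although I think I could write loops to work out the result, I don't think that approach would be efficient enough. Alternatively, I could pull the whole table into R and derive the result table there, but that requires pulling everything into R first, so I am wondering whether this is feasible in SQL.

EDIT: Here is different, more representative sample data: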

Sequence  Position 
    1    open 
    2    closed 
    3    open 
    4    open 
    5    closed 
    8    closed 
    9    open 
    11    open 
    13    closed 
    14    open 
    15    open 
    18    closed 
    19    open 
    20    closed 
    21    closed 
    22    closed 
    23    closed 
    24    open 
    25    open 
    26    closed 
    27    open 

Note that this should give the same result as before, plus:

23    closed 
    24    open 
    25    open 
    26    closed 

21, 22 and 27 are excluded because they do not fit the closed, open, ..., open, closed pattern.

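But if we also had 28 closed, we would want 27 and 28 as well, because there is no gap and the pattern fits. If instead of 28 it were 29 closed, we would not want 27 and 29 (because although the pattern is right, the sequence is broken).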

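To add some context, think of a machine that goes from stopped, to running, to stopped. We log this data, but there are gaps in the recording, which show up as breaks in the sequence. Besides data being lost in the middle of a stop, run, stop cycle, the recording sometimes starts when the machine is already running, or ends before the machine has stopped. I don't want that data, because it is not a complete stop, run, stop cycle. I only want the complete cycles in which the sequence is continuous. That way I can turn my original data set into one made up of complete cycles, one after another.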

+0

I suggest you set up a SQL Fiddle or Rextester. –

+0

What do you actually mean by split? Separate tables for each block? –

+0

No, just a 'select' to filter the data – Olivia

Answers

1

You can use this.

DECLARE @MyTable TABLE (Sequence INT, Position VARCHAR(10)) 

INSERT INTO @MyTable 
VALUES 
(1,'open'), 
(2,'closed') , 
(3,'open'), 
(4,'open'), 
(5,'closed'), 
(8,'closed'), 
(9,'open'), 
(11,'open'), 
(13,'closed'), 
(14,'open') , 
(15,'open'), 
(18,'closed'), 
(19,'open'), 
(20,'closed'), 
(21,'closed'), 
(22,'closed'), 
(23,'closed'), 
(24,'open'), 
(25,'open'), 
(26,'closed'), 
(27,'open') 


;WITH CTE AS(
    SELECT * , 
     CASE WHEN Position ='closed' AND LAG(Position) OVER(ORDER BY [Sequence]) ='closed' THEN 1 ELSE 0 END CloseMark 
    FROM @MyTable 
) 
,CTE_2 AS 
(
    SELECT 
     [New_Sequence] = [Sequence] + (SUM(CloseMark) OVER(ORDER BY [Sequence] ROWS UNBOUNDED PRECEDING)) 
     , [Sequence] 
     , Position 
    FROM CTE 
) 
,CTE_3 AS (
    SELECT *, 
    RN = ROW_NUMBER() OVER(ORDER BY [New_Sequence]) 
    FROM CTE_2 
) 
,CTE_4 AS 
(
    SELECT ([New_Sequence] - RN) G 
    , MIN(CASE WHEN Position = 'closed' THEN [Sequence] END) MinCloseSq 
    , MAX(CASE WHEN Position = 'closed' THEN [Sequence] END) MaxCloseSq 
    FROM CTE_3 
    GROUP BY ([New_Sequence] - RN) 
) 
SELECT 
    CTE.Sequence, CTE.Position 
FROM CTE_4 
    INNER JOIN CTE ON (CTE.Sequence BETWEEN CTE_4.MinCloseSq AND CTE_4.MaxCloseSq) 
WHERE 
    CTE_4.MaxCloseSq > CTE_4.MinCloseSq 
    AND (CTE_4.MaxCloseSq IS NOT NULL AND CTE_4.MinCloseSq IS NOT NULL) 

Result:

Sequence Position 
----------- ---------- 
2   closed 
3   open 
4   open 
5   closed 
---   --- 
18   closed 
19   open 
20   closed 
---   --- 
23   closed 
24   open 
25   open 
26   closed 
+0

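This doesn't seem to work for my real data. My data has much longer runs of repeated closed and/or open rows, but the format is the same. Could that be the cause? Say, 1000 closed, then a thousand open, and so on – Olivia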

+0

Can you add more test data? –

+0

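Sorry, I've noticed that the problem is in my data. I created the sequence using 'round(((round(convert(float,datetime),5) - 42961.58227)* 99999.97 + 1),1)' but noticed some duplicate/odd dates, so I'm just going to remove them and try again, though – Olivia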

0

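I think there is actually a fairly simple way to look at this. You can identify the sequence numbers of the closing rows by:

  • Looking at the previous closed row
  • Looking at the cumulative count of opens for the previous closed row and the current closed row
  • Doing arithmetic to ensure that all the intermediate opens are in the data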

This turns into the query:

select c.*
from (select t.*,
             -- sequence and running open count of the previous 'closed' row
             lag(sequence) over (partition by position order by sequence) as prev_sequence,
             lag(cume_opens) over (partition by position order by sequence) as prev_cume_opens
      from (select t.*,
                   sum(case when position = 'open' then 1 else 0 end) over (order by sequence) as cume_opens
            from t
           ) t
     ) c
where position = 'closed' and
      (cume_opens - prev_cume_opens) = sequence - prev_sequence - 1 and  -- no missing rows between the two closes
      sequence > prev_sequence + 1;  -- at least one open in between

Now that you have identified the sequences, you can join back to get the original rows:

select t.*
from t join
     (select c.*
      from (select t.*,
                   lag(sequence) over (partition by position order by sequence) as prev_sequence,
                   lag(cume_opens) over (partition by position order by sequence) as prev_cume_opens
            from (select t.*,
                         sum(case when position = 'open' then 1 else 0 end) over (order by sequence) as cume_opens
                  from t
                 ) t
           ) c
      where position = 'closed' and
            (cume_opens - prev_cume_opens) = sequence - prev_sequence - 1 and
            sequence > prev_sequence + 1
     ) seqs
     on t.sequence between seqs.prev_sequence and seqs.sequence;

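I admit I have not tested this. However, I do think the idea is sound. One caveat is that it will select multiple "closed" periods for each sequence group.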
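
Since this answer is described as untested, here is a small sketch that exercises the same idea (previous closed row, cumulative open count, arithmetic check) on the sample data from the question's edit. It is only an illustration under assumptions: SQLite (via Python's sqlite3 module, SQLite 3.25 or later) stands in for the real database, the table name `t` and the helper `find_cycles` are hypothetical, the position value is taken to be 'closed' as in the sample data, and both `lag` calls are ordered by sequence.

```python
import sqlite3

# For every 'closed' row, compare it with the previous 'closed' row:
# the number of opens between them must equal the number of sequence
# values between them (no missing rows), and must be at least one.
CYCLE_QUERY = """
SELECT prev_sequence, sequence
FROM (
    SELECT sequence, position, cume_opens,
           LAG(sequence)   OVER (PARTITION BY position ORDER BY sequence) AS prev_sequence,
           LAG(cume_opens) OVER (PARTITION BY position ORDER BY sequence) AS prev_cume_opens
    FROM (
        SELECT sequence, position,
               SUM(CASE WHEN position = 'open' THEN 1 ELSE 0 END)
                   OVER (ORDER BY sequence) AS cume_opens
        FROM t
    ) AS s
) AS c
WHERE position = 'closed'
  AND cume_opens - prev_cume_opens = sequence - prev_sequence - 1
  AND sequence > prev_sequence + 1
ORDER BY sequence
"""


def find_cycles(rows):
    """Return (start, end) sequence pairs of complete closed..closed cycles."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (sequence INTEGER PRIMARY KEY, position TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = conn.execute(CYCLE_QUERY).fetchall()
    conn.close()
    return result


# Sample data from the question's edit.
sample = [
    (1, 'open'), (2, 'closed'), (3, 'open'), (4, 'open'), (5, 'closed'),
    (8, 'closed'), (9, 'open'), (11, 'open'), (13, 'closed'), (14, 'open'),
    (15, 'open'), (18, 'closed'), (19, 'open'), (20, 'closed'), (21, 'closed'),
    (22, 'closed'), (23, 'closed'), (24, 'open'), (25, 'open'), (26, 'closed'),
    (27, 'open'),
]

cycles = find_cycles(sample)
print(cycles)
```

On this sample the sketch reports the cycles 2 to 5, 18 to 20 and 23 to 26. Appending a row `(28, 'closed')` adds the cycle 26 to 28, while appending `(29, 'closed')` instead adds nothing, which matches the behaviour the question asks for. Row 26 then belongs to two cycles, so a join back to the original rows would return it twice.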
