2017-07-20

OLAP function processing: why is it faster to run M times on N/M partitions than once on N records?

I have a (very large) table like this:

CREATE SET TABLE LOAN 
    (LoanNumber VARCHAR(100), 
    LoanBalance DECIMAL(18,4), 
    RecTimeStamp TIMESTAMP(0) 
) 
PRIMARY INDEX (LoanNumber) 
PARTITION BY RANGE_N 
    (RecTimeStamp BETWEEN 
     TIMESTAMP '2017-01-01 00:00:00+00:00' 
    AND TIMESTAMP '2017-12-31 23:59:59+00:00' 
    EACH INTERVAL '1' DAY 
); 

Typically this table gets rolled up into snapshots. For example, the April month-end snapshot would be:

-- Pretend there is just 2017 data there 
CREATE SET TABLE LOAN_APRIL AS 
    (SELECT * 
     FROM LOAN 
    WHERE RecTimeStamp <= DATE '2017-04-30' 
    QUALIFY row_number() OVER 
      (PARTITION BY LoanNumber 
        ORDER BY RecTimeStamp DESC 
      ) = 1 
) WITH DATA 
PRIMARY INDEX (LoanNumber); 

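This typically takes quite a while to run. While experimenting yesterday, though, I found that it runs much faster when I break it apart like this: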

CREATE SET TABLE LOAN_APRIL_TMP 
    (LoanNumber VARCHAR(100), 
    LoanBalance DECIMAL(18,4), 
    RecTimeStamp TIMESTAMP(0) 
) 
PRIMARY INDEX (LoanNumber); 

CREATE SET TABLE LOAN_APRIL 
    (LoanNumber VARCHAR(100), 
    LoanBalance DECIMAL(18,4), 
    RecTimeStamp TIMESTAMP(0) 
) 
PRIMARY INDEX (LoanNumber); 

INSERT INTO LOAN_APRIL_TMP 
    SELECT * 
     FROM LOAN 
    WHERE RecTimeStamp BETWEEN DATE '2017-01-01' AND DATE '2017-01-31' 
    QUALIFY row_number() OVER 
      (PARTITION BY LoanNumber 
        ORDER BY RecTimeStamp DESC 
      ) = 1; 

INSERT INTO LOAN_APRIL_TMP 
    SELECT * 
     FROM LOAN 
    WHERE RecTimeStamp BETWEEN DATE '2017-02-01' AND DATE '2017-02-28' 
    QUALIFY row_number() OVER 
      (PARTITION BY LoanNumber 
        ORDER BY RecTimeStamp DESC 
      ) = 1; 

INSERT INTO LOAN_APRIL_TMP 
    SELECT * 
     FROM LOAN 
    WHERE RecTimeStamp BETWEEN DATE '2017-03-01' AND DATE '2017-03-31' 
    QUALIFY row_number() OVER 
      (PARTITION BY LoanNumber 
        ORDER BY RecTimeStamp DESC 
      ) = 1; 

INSERT INTO LOAN_APRIL_TMP 
    SELECT * 
     FROM LOAN 
    WHERE RecTimeStamp BETWEEN DATE '2017-04-01' AND DATE '2017-04-30' 
    QUALIFY row_number() OVER 
      (PARTITION BY LoanNumber 
        ORDER BY RecTimeStamp DESC 
      ) = 1; 

INSERT INTO LOAN_APRIL 
    SELECT * 
     FROM LOAN_APRIL_TMP 
    QUALIFY row_number() OVER 
      (PARTITION BY LoanNumber 
        ORDER BY RecTimeStamp DESC 
      ) = 1; 

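I ran these statements sequentially and still got very good execution times, so the speedup does not come from parallel execution. Today I'm going to experiment with loading each piece in parallel.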

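On a broader note, I can't find enough technical documentation to work through these kinds of questions. Is there a good resource for this? I know a lot of it is proprietary, but there must be something that describes, at least at a high level, how these functions are implemented.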

Answer

There can be several reasons. You should check DBQL to see the actual difference in resource usage.

  • The data in the first SELECT is spread across more partitions than the data in the smaller SELECTs.

  • EXPLAIN might show that the spool will not be cached in memory for the large SELECT, whereas it is for the individual ones.

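  • VARCHARs in ORDER BY processing are expanded to CHARs of the defined size. If LoanNumber is actually a VARCHAR(100) (I doubt it is), this will increase the spool, too (but that is a common problem for other queries against this table as well).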
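To compare the two approaches along the lines suggested above, something like the following could be a starting point. It is only a sketch: it assumes query logging (DBQL) is enabled for your user, that you are allowed to read DBC.QryLogV, and that the statements were run today under the current user. The LIKE filter on LOAN_APRIL is just a convenient way to find the statements from the question.

```sql
-- Resource usage of today's LOAN_APRIL loads, one row per logged request.
-- Assumption: DBQL is enabled; rows can show up with a delay because the
-- log is written from a cache.
SELECT QueryID,
       StartTime,
       FirstRespTime,
       AMPCPUTime,      -- CPU seconds summed over all AMPs
       TotalIOCount,    -- logical I/O count
       SpoolUsage,      -- peak spool in bytes
       SUBSTRING(QueryText FROM 1 FOR 100) AS QueryStart
FROM DBC.QryLogV
WHERE UserName = USER
  AND CAST(StartTime AS DATE) = CURRENT_DATE
  AND QueryText LIKE '%LOAN_APRIL%'
ORDER BY StartTime;

-- Prefix a statement with EXPLAIN to see the plan, including the
-- optimizer's notes on spool, without executing it.
EXPLAIN
SELECT *
FROM LOAN
WHERE RecTimeStamp <= DATE '2017-04-30'
QUALIFY ROW_NUMBER() OVER
      (PARTITION BY LoanNumber
        ORDER BY RecTimeStamp DESC
      ) = 1;
```

Summing AMPCPUTime, TotalIOCount and SpoolUsage over the five smaller statements and comparing the totals with the single large statement should show where the difference comes from.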

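OLAP functions have one drawback: they need two spools, i.e. twice the spool size. If this table has many columns or wide rows, it might be more efficient to run ROW_NUMBER against only the table's PK and then join back, similar to this: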

CREATE SET TABLE LOAN_APRIL_TMP 
    (LoanNumber VARCHAR(100), 
    RecTimeStamp TIMESTAMP(0) 
) 
PRIMARY INDEX (LoanNumber) -- same PPI as source table to facilitate fast join back 
PARTITION BY RANGE_N 
    (RecTimeStamp BETWEEN 
     TIMESTAMP '2017-01-01 00:00:00+00:00' 
    AND TIMESTAMP '2017-12-31 23:59:59+00:00' 
    EACH INTERVAL '1' DAY 
); 

INSERT INTO LOAN_APRIL_TMP 
SELECT LoanNumber, RecTimeStamp -- no other columns 
FROM LOAN 
WHERE RecTimeStamp <= DATE '2017-04-30' 
QUALIFY row_number() OVER 
      (PARTITION BY LoanNumber 
        ORDER BY RecTimeStamp DESC 
      ) = 1 
; 

INSERT INTO LOAN_APRIL 
SELECT l.* -- now get all columns 
FROM LOAN AS l 
JOIN LOAN_APRIL_TMP AS tmp 
    ON l.LoanNumber = tmp.LoanNumber 
AND l.RecTimeStamp = tmp.RecTimeStamp;

@YellowBedwetter: Could you add some information later on whether this actually improved performance? – dnoeth
