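How do I handle multiple related log messages?

2015-04-16

I have a service that computes a bunch of things for a project. Users can trigger this calculation several times a day. Each calculation produces some interesting metrics (let's call them A, B, C).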

I report these metrics to a logging service, one log message per metric. The log messages look like this:

date | calculationID1 | projectID1 | metricA | valueA 
date | calculationID1 | projectID1 | metricB | valueB 
date | calculationID1 | projectID1 | metricC | valueC 
date | calculationID2 | projectID2 | metricA | valueA 
date | calculationID2 | projectID2 | metricB | valueB 
date | calculationID2 | projectID2 | metricC | valueC 
date | calculationID3 | projectID1 | metricA | valueA 
date | calculationID3 | projectID1 | metricB | valueB 
date | calculationID3 | projectID1 | metricC | valueC 

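In this example, the project with ID 1 was calculated twice on that particular day. My analytics backend is a Hive cluster, and I want to generate a table with the last reported metrics for each project for each day: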

date | calculationID3 | projectID1 | valueA | valueB | valueC 
date | calculationID2 | projectID2 | valueA | valueB | valueC 

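Obviously this computation is very expensive because I do a lot of joins. My company has a strict logging format, which is why I create one log message per value. Should I create a single log message containing all the metrics to make reporting easier?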

Can anyone point me to best practices for this kind of problem?

Answer

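If the database in use supports the PIVOT clause in SQL (the query below uses Oracle syntax), then the following query can collect the data from the log report.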

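The same result can be obtained without PIVOT, but that way requires a lot of copy-pasting and juggling, and since your profile says "pragmatic with implementation", I guess we don't need to talk about that dirty stuff.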

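To see what happens in the query, you can go through 3 steps:

  • Run the query without PIVOT (just delete the PIVOT keyword and the code that follows it)
  • Then run it as is
  • Compare the results of the first and second steps to see how rows are transposed into columns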

WITH 
    data_table (stamp, calculation_ID, project_ID, metric_name, metric_value) as (select 

     timestamp '2015-01-01 00:00:01', 'calc_ID_1', 'project_WHITE', 'metric_A', 11 from dual union all select 
     timestamp '2015-01-01 00:00:02', 'calc_ID_1', 'project_WHITE', 'metric_B', 21 from dual union all select 
     timestamp '2015-01-01 00:00:03', 'calc_ID_1', 'project_WHITE', 'metric_C', 31 from dual union all select 
     timestamp '2015-01-01 00:01:04', 'calc_ID_2', 'project_WHITE', 'metric_A', 12 from dual union all select 
     timestamp '2015-01-01 00:01:05', 'calc_ID_2', 'project_WHITE', 'metric_B', 22 from dual union all select 
     timestamp '2015-01-01 00:01:06', 'calc_ID_2', 'project_WHITE', 'metric_C', 32 from dual union all select 

     timestamp '2015-01-01 00:00:11', 'calc_ID_3', 'project_BLACK', 'metric_A', 41 from dual union all select 
     timestamp '2015-01-01 00:00:12', 'calc_ID_3', 'project_BLACK', 'metric_B', 51 from dual union all select 
     timestamp '2015-01-01 00:00:13', 'calc_ID_3', 'project_BLACK', 'metric_C', 61 from dual union all select 
     timestamp '2015-01-01 00:01:14', 'calc_ID_4', 'project_BLACK', 'metric_A', 42 from dual union all select 
     timestamp '2015-01-01 00:01:15', 'calc_ID_4', 'project_BLACK', 'metric_B', 52 from dual union all select 
     timestamp '2015-01-01 00:01:16', 'calc_ID_4', 'project_BLACK', 'metric_C', 62 from dual  
) 
SELECT *
FROM (
    SELECT trunc(stamp) AS day,
           calculation_id,
           project_id,
           metric_name,
           metric_value
    FROM (
        -- take only the last log row for (project_ID, metric_name) for every given day
        SELECT t.*,
               rank() OVER (PARTITION BY project_ID, metric_name, trunc(stamp)
                            ORDER BY stamp DESC) calculation_rank
        FROM data_table t
    )
    WHERE calculation_rank = 1
)
PIVOT (
    -- an aggregate function is required here;
    -- SUM can be replaced with something more relevant to custom logic
    SUM(metric_value)
    FOR metric_name IN ('metric_A' AS "Metric A",
                        'metric_B' AS "Metric B",
                        'metric_C' AS "Metric C")
);

Result:

DAY        | CALCULATION_ID | PROJECT_ID    | Metric A | Metric B | Metric C
----------------------------------------------------------------------------
2015-01-01 | calc_ID_4      | project_BLACK | 42       | 52       | 62
2015-01-01 | calc_ID_2      | project_WHITE | 12       | 22       | 32

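In this query calculation_ID is redundant (I only included it to make the example clearer for the reader). However, you can still use it to check the integrity of the logged data, for instance to verify that all metrics in the same group/time period carry the same calculation_ID.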
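
For engines without a PIVOT clause (Hive, mentioned in the question, is assumed here to be one), the PIVOT-free variant mentioned above can be sketched with conditional aggregation. This is only a minimal sketch under assumptions: the table name metric_log is hypothetical, its columns mirror data_table from the query above, and the engine is assumed to support window functions (RANK() OVER) and to_date.

```sql
-- Hypothetical table (an assumption, not from the question):
--   metric_log(stamp TIMESTAMP, calculation_id STRING, project_id STRING,
--              metric_name STRING, metric_value DOUBLE)
SELECT log_day,
       calculation_id,
       project_id,
       -- one CASE expression per metric replaces the PIVOT clause;
       -- MAX plays the same role as SUM in the PIVOT version
       MAX(CASE WHEN metric_name = 'metric_A' THEN metric_value END) AS metric_a,
       MAX(CASE WHEN metric_name = 'metric_B' THEN metric_value END) AS metric_b,
       MAX(CASE WHEN metric_name = 'metric_C' THEN metric_value END) AS metric_c
FROM (
    -- rank the rows so that the last log row for
    -- (project_id, metric_name) on each day gets rank 1
    SELECT to_date(stamp) AS log_day,
           calculation_id,
           project_id,
           metric_name,
           metric_value,
           RANK() OVER (PARTITION BY project_id, metric_name, to_date(stamp)
                        ORDER BY stamp DESC) AS calculation_rank
    FROM metric_log
) ranked
WHERE calculation_rank = 1
GROUP BY log_day, calculation_id, project_id;
```

Every new metric needs one more CASE line, which is the copy-pasting mentioned earlier, but there are no joins: the table is read once and each (day, calculation, project) group collapses into a single row.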

@Lars Schneider, what do you think of this solution? – diziaq