2017-03-04 135 views
6

Amazon Athena not parsing CloudFront logs

I followed the Athena getting started guide and tried to parse my own CloudFront logs. However, the fields are not being parsed.

I used a small test file with the following contents:

#Version: 1.0 
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type 
2016-02-02 07:57:45 LHR5 5001 86.177.253.38 GET d3g47gpj5mj0b.cloudfront.net /foo 404 - Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36 - - Error -tHYQ3YpojqpR8yFHCUg5YW4OC_yw7X0VWvqwsegPwDqDFkIqhZ_gA== d3g47gpj5mj0b.cloudfront.net https 421 0.076 - TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Error 
2016-02-02 07:57:45 LHR5 1158241 86.177.253.38 GET d3g47gpj5mj0b.cloudfront.net /images/posts/cover/404.jpg 200 https://d3g47gpj5mj0b.cloudfront.net/foo Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36 - - Miss oUdDIjmA1ON1GjWmFEKlrbNzZx60w6EHxzmaUdWEwGMbq8V536O4WA== d3g47gpj5mj0b.cloudfront.net https 419 0.440 - TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Miss 

And I created the table with this SQL:

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
    `Date` DATE, 
    Time STRING, 
    Location STRING, 
    Bytes INT, 
    RequestIP STRING, 
    Method STRING, 
    Host STRING, 
    Uri STRING, 
    Status INT, 
    Referrer STRING, 
    os STRING, 
    Browser STRING, 
    BrowserVersion STRING 
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
    WITH SERDEPROPERTIES (
    "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$" 
) LOCATION 's3://test/athena-csv/' 

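But no data comes back:

[Screenshot: Athena query result with no data]

I can see that it returns 4 rows, but the first two should be excluded because they start with #, so it looks as if the regex is not being applied correctly.

What am I doing wrong? Or is the regex wrong (which seems unlikely, since it comes from the docs and looks fine to me)?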
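
One possible cause, offered as a guess rather than a confirmed diagnosis: the tail of the pattern expects a literal `%20` in the user agent, while the sample lines only contain the double-encoded `%2520`. The sketch below uses Python's `re` module as a stand-in for the Java regex engine behind `RegexSerDe` (assuming the two agree on the constructs used here), with the DDL-level `\\s` written as `\s`. The `REQUEST_ID` placeholder and the single-space separators are simplifications.

```python
import re

# The pattern from the question, with the DDL escaping (\\s) reduced to \s.
DOC_PATTERN = re.compile(
    r"^(?!#)" + r"([^ ]+)\s+" * 10
    + r"[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
)

# User agent as it appears in the sample log (spaces double-encoded as %2520).
USER_AGENT = (
    "Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)"
    "%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)"
    "%2520Chrome/47.0.2526.111%2520Safari/537.36"
)

# Fields joined with single spaces, as the sample is displayed above.
# "REQUEST_ID" is a placeholder, not the real x-edge-request-id.
FIELDS = [
    "2016-02-02", "07:57:45", "LHR5", "5001", "86.177.253.38", "GET",
    "d3g47gpj5mj0b.cloudfront.net", "/foo", "404", "-", USER_AGENT,
    "-", "-", "Error", "REQUEST_ID", "d3g47gpj5mj0b.cloudfront.net",
    "https", "421", "0.076", "-", "TLSv1.2",
    "ECDHE-RSA-AES128-GCM-SHA256", "Error",
]
sample_line = " ".join(FIELDS)

# Hypothetical variant in which spaces are encoded once, as the pattern expects.
single_encoded_line = sample_line.replace("%2520", "%20")

no_match = DOC_PATTERN.match(sample_line)
match = DOC_PATTERN.match(single_encoded_line)

print(no_match)                          # None: "%20" never occurs in the line
print(match.group(11), match.group(12))  # the os and Browser capture groups
```

If this reading is right, the documented regex only fits logs whose user agent is encoded once, which would explain why every column comes back empty for this file.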

Answers

8

This is what I ended up with:

CREATE EXTERNAL TABLE logs (
    `date` date, 
    `time` string, 
    `location` string, 
    `bytes` int, 
    `request_ip` string, 
    `method` string, 
    `host` string, 
    `uri` string, 
    `status` int, 
    `referer` string, 
    `useragent` string, 
    `uri_query` string, 
    `cookie` string, 
    `edge_type` string, 
    `edget_requiest_id` string, 
    `host_header` string, 
    `cs_protocol` string, 
    `cs_bytes` int, 
    `time_taken` string, 
    `x_forwarded_for` string, 
    `ssl_protocol` string, 
    `ssl_cipher` string, 
    `result_type` string, 
    `protocol` string 
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES (
    'input.regex' = '^(?!#.*)(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s*(\\S*)' 
) LOCATION 's3://logs' 

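Note that the double backslashes are intentional.

The format of the CloudFront logs changed at some point to add a field (presumably the trailing `protocol` column above, which the regex treats as optional). This handles both older and newer files.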
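
To see how the optional trailing group copes with both layouts, here is a minimal sketch that rebuilds the same shape of pattern in Python (23 required fields plus an optional 24th, which is how I read the regex above). Python's `re` stands in for the Java engine, the field values are shortened from the question's sample, and `HTTP/1.1` is an assumed value for the newer extra field.

```python
import re

# 22 "(\S+)\s+" groups, then a 23rd field and an optional 24th one.
PATTERN = re.compile(r"^(?!#.*)" + r"(\S+)\s+" * 22 + r"(\S+)\s*(\S*)")

# An older-style record with 23 tab-separated fields (values shortened).
OLD_FIELDS = [
    "2016-02-02", "07:57:45", "LHR5", "5001", "86.177.253.38", "GET",
    "d3g47gpj5mj0b.cloudfront.net", "/foo", "404", "-", "Mozilla/5.0",
    "-", "-", "Error", "REQUEST_ID", "d3g47gpj5mj0b.cloudfront.net",
    "https", "421", "0.076", "-", "TLSv1.2",
    "ECDHE-RSA-AES128-GCM-SHA256", "Error",
]
old_line = "\t".join(OLD_FIELDS)

# A newer-style record: the same fields plus one more (assumed value).
new_line = old_line + "\tHTTP/1.1"


def parse(line):
    """Return the 24 captured fields, or None when the whole line does not match."""
    m = PATTERN.fullmatch(line)
    return list(m.groups()) if m else None


print(parse(old_line)[-1] == "")   # the 24th column is empty for old files
print(parse(new_line)[-1])         # the 24th column is filled for new files
print(parse("#Version: 1.0"))      # header lines are rejected by the lookahead
```

Header lines fail the negative lookahead, so they never produce captured fields; whether such lines are dropped or surface as empty rows depends on the SerDe.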

+0

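Works like a charm. They have added a new GUI, so it is the same, except that there is now a wizard that lets you paste the list of columns, like this: 'date date, time string, location string, bytes int, request_ip string, method string, host string, uri string, status int, referer string, useragent string, uri_query string, cookie string, edge_type string, edget_requiest_id string, host_header string, cs_protocol string, cs_bytes int, time_taken string, x_forwarded_for string, ssl_protocol string, ssl_cipher string, result_type string, protocol string' –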

0

The demo didn't work for me either. After playing around a bit, I got the following to work:

CREATE EXTERNAL TABLE IF NOT EXISTS DBNAME.TABLENAME (
    `date` date, 
    `time` string, 
    `location` string, 
    `bytes` int, 
    `request_ip` string, 
    `method` string, 
    `host` string, 
    `uri` string, 
    `status` int, 
    `referer` string, 
    `useragent` string, 
    `uri_query` string, 
    `cookie` string, 
    `edge_type` string, 
    `edget_requiest_id` string, 
    `host_header` string, 
    `cs_protocol` string, 
    `cs_bytes` int, 
    `time_taken` string, 
    `x_forwarded_for` string, 
    `ssl_protocol` string, 
    `ssl_cipher` string, 
    `result_type` string 
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES (
    'serialization.format' = '1', 
    'input.regex' = '^(?!#.*)(?!#.*)([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)$' 
) LOCATION 's3://bucket/logs/'; 

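Replace bucket/logs and DBNAME.TABLENAME with your own values. For some reason it still inserts empty rows for the # lines, but I get the rest of the data.

I think the next step is to try writing a regex for the user agent or the cookies.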
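
As a rough idea of what parsing the user agent could look like outside Athena, here is a small Python sketch. It assumes the user agent is double-encoded, as in the question's sample (`%2520` rather than `%20`), so it decodes twice. The `Name/Version` pattern is an illustrative simplification, not a complete user-agent parser.

```python
import re
from urllib.parse import unquote

# User agent taken from the sample log line in the question.
raw_user_agent = (
    "Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)"
    "%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)"
    "%2520Chrome/47.0.2526.111%2520Safari/537.36"
)

# First pass turns %2520 into %20, second pass turns %20 into a space.
decoded = unquote(unquote(raw_user_agent))

# Collect product tokens of the form Name/Version into a dictionary.
products = dict(re.findall(r"([A-Za-z]+)/([\d.]+)", decoded))

print(decoded)
print(products["Chrome"])
```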

+0

I had to tweak it to work with yours, but it should be fine now. Note: \\s should be \s if you have problems copying/pasting – CoderDan

1

I was pulling my hair out over this, and improved on @CoderDan's answer:

The secret is to separate the values on \t rather than \s in the regex.

CREATE EXTERNAL TABLE IF NOT EXISTS mytablename (
    `date` date, 
    `time` string, 
    `location` string, 
    `bytes` int, 
    `request_ip` string, 
    `method` string, 
    `host` string, 
    `uri` string, 
    `status` int, 
    `referer` string, 
    `useragent` string, 
    `uri_query` string, 
    `cookie` string, 
    `edge_type` string, 
    `edget_request_id` string, 
    `host_header` string, 
    `cs_protocol` string, 
    `cs_bytes` int, 
    `time_taken` int, 
    `x_forwarded_for` string, 
    `ssl_protocol` string, 
    `ssl_cipher` string, 
    `result_type` string, 
    `protocol_version` string 
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES (
    'serialization.format' = '1', 
    'input.regex' = '^(?!#.*)(?!#.*)([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)$' 
) LOCATION 's3://mybucket/myprefix/'; 
+0

Thanks Gregor. Actually '\s' works just as well as '\t', though both need two backslashes. Instead of '([^\t]+)', '([\S]+)' also works. – andrewrjones

+0

Really? For me \s didn't work, but \t did. According to the spec, the format is tab-delimited. –
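
The disagreement about `\s` versus `\t` may come down to how the quoted string is unescaped before it reaches the regex engine. The sketch below is a deliberately simplified model, not Hive's or Athena's actual code: `unescape_sql_literal` is a hypothetical helper that assumes `\t` becomes a real tab, `\\` becomes one backslash, and any other escaped character simply loses its backslash. Under that assumption a single-backslash `\t` still works (a literal tab matches a tab), while a single-backslash `\s` degrades to a plain `s`.

```python
import re

# Hypothetical, simplified model of SQL string-literal unescaping.
KNOWN_ESCAPES = {"t": "\t", "n": "\n", "\\": "\\"}


def unescape_sql_literal(text):
    """Resolve backslash escapes the way the assumed parser would."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            # Known escapes are translated; unknown ones drop the backslash.
            out.append(KNOWN_ESCAPES.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


line = "2016-02-02\t07:57:45\tLHR5"

# Raw strings hold the characters exactly as they would be typed in the DDL.
double_s = unescape_sql_literal(r"(\\S+)\\s+(\\S+)\\s+(\\S+)")
single_s = unescape_sql_literal(r"([^\s]+)\s+([^\s]+)\s+([^\s]+)")
single_t = unescape_sql_literal(r"([^\t]+)\t+([^\t]+)\t+([^\t]+)")

print(re.fullmatch(double_s, line) is not None)  # \\s survives as \s
print(re.fullmatch(single_s, line) is not None)  # \s has become a literal "s"
print(re.fullmatch(single_t, line) is not None)  # \t has become a real tab
```

If the real parser behaves like this model, both comments can be right: `\s` needs the doubled backslash, whereas `\t` happens to work either way.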

2

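Actually, all the answers here have a small mistake: the 4th field must be BIGINT, not INT. Otherwise, requests for files larger than 2 GB are not parsed correctly. After a long discussion with AWS Business Support, it looks like the correct format is: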

CREATE EXTERNAL TABLE your_table_name (
    `Date` DATE, 
    Time STRING, 
    Location STRING, 
    SCBytes BIGINT, 
    RequestIP STRING, 
    Method STRING, 
    Host STRING, 
    Uri STRING, 
    Status INT, 
    Referrer STRING, 
    UserAgent STRING, 
    UriQS STRING, 
    Cookie STRING, 
    ResultType STRING, 
    RequestId STRING, 
    HostHeader STRING, 
    Protocol STRING, 
    CSBytes BIGINT, 
    TimeTaken FLOAT, 
    XForwardFor STRING, 
    SSLProtocol STRING, 
    SSLCipher STRING, 
    ResponseResultType STRING, 
    CSProtocolVersion STRING 
) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
    LOCATION 's3://path_to_your_data_directory' 
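
The BIGINT point is easy to check with arithmetic. Assuming INT is a signed 32-bit integer and BIGINT a signed 64-bit one (as in Hive), the largest INT is 2,147,483,647, just under 2 GiB, so the byte count of a larger response does not fit. A small sketch, with `fits_int` and `fits_bigint` as illustrative helper names:

```python
# Assumed ranges: INT is signed 32-bit, BIGINT is signed 64-bit.
INT_MIN, INT_MAX = -2**31, 2**31 - 1
BIGINT_MIN, BIGINT_MAX = -2**63, 2**63 - 1


def fits_int(value):
    """True when the value can be stored in a signed 32-bit column."""
    return INT_MIN <= value <= INT_MAX


def fits_bigint(value):
    """True when the value can be stored in a signed 64-bit column."""
    return BIGINT_MIN <= value <= BIGINT_MAX


small_response = 1158241        # sc-bytes from the question's sample log
three_gib = 3 * 1024**3         # a hypothetical 3 GiB download

print(fits_int(small_response))   # True
print(fits_int(three_gib))        # False: exceeds 2147483647
print(fits_bigint(three_gib))     # True
```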
+1

Of everything on this page, this is the only one that worked for me. –