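I am looking for a regular expression that matches multi-line log records and can be passed to the Hive RegexSerDe in the form
"input.regex"="the regex goes here"
inside a Hive QL "create external table" statement. The logs in the files that the RegexSerDe must read have the following form: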
2013-02-12 12:03:22,323 [DEBUG] 2636hd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message that can contain any special character, including linebreaks. This one does not have a linebreak. It just has spaces on the same line.
2013-02-12 12:03:24,527 [DEBUG] 265y7d3e-432g-dfg3-dwq3-y4dsfq3ew91b Some other message that can contain any special character, including linebreaks. This one does not have one either. It just has spaces on the same line.
2013-02-12 12:03:24,946 [ERROR] 261rtd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message that can contain any special character, including linebreaks.
This is a special one.
This has a message that is multi-lined.
This is line number 4 of the same log.
Line 5.
2013-02-12 12:03:24,988 [INFO] 2632323e-432g-dfg3-dwq3-y4dsfq3ew91b Another 1-line log
2013-02-12 12:03:25,121 [DEBUG] 263tgd3e-432g-dfg3-dwq3-y4dsfq3ew91b Yet another one line log.
I am using the following code to create the external table:
CREATE EXTERNAL TABLE applogs (logdatetime STRING, logtype STRING, requestid STRING, verbosedata STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES
(
"input.regex" = "(\\A[[0-9:-] ]{19},[0-9]{3}) (\\[[A-Z]*\\]) ([0-9a-z-]*) (.*)?(?=(?:\\A[[0-9:-] ]{19},[0-9]|\\z))",
"output.format.string" = "%1$s \\[%2$s\\] %3$s %4$s"
)
STORED AS TEXTFILE
LOCATION 'hdfs:///logs-application';
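Here's the thing:
It is able to pull the first line of every log, but not the remaining lines of logs that span more than one line. I tried everything I could find in related links: replacing \z with \Z at the end, replacing \A with ^, and replacing \Z or \z with $. Nothing worked. Am I missing something in the %4$s of output.format.string? Or am I not using the regex properly?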
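One possible explanation for this symptom, offered as an assumption rather than a confirmed diagnosis: with STORED AS TEXTFILE the input is normally split into one record per line before the SerDe runs, so the regex only ever sees a single line at a time. The sketch below imitates that in Python, not in Hive. The pattern is a Python rewrite of the one above (Java's [[0-9:-] ] class is written as [0-9: -]), and LINE_PATTERN, parse_line_by_line, and the shortened SAMPLE_LOG are hypothetical names and data used only for illustration.

```python
import re

# Python rewrite of the question's pattern, applied to one line at a time.
# The lookahead is dropped because a single line never contains the next timestamp.
LINE_PATTERN = re.compile(
    r"([0-9: -]{19},[0-9]{3}) (\[[A-Z]*\]) ([0-9a-z-]*) (.*)"
)

# Shortened sample in the same shape as the logs in the question.
SAMPLE_LOG = (
    "2013-02-12 12:03:24,946 [ERROR] 261rtd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message.\n"
    "This is a special one.\n"
    "Line 5.\n"
    "2013-02-12 12:03:24,988 [INFO] 2632323e-432g-dfg3-dwq3-y4dsfq3ew91b Another 1-line log\n"
)


def parse_line_by_line(text):
    """Match each physical line on its own, as a line-oriented reader would."""
    rows = []
    for line in text.splitlines():
        match = LINE_PATTERN.fullmatch(line)
        # A line that does not start with a timestamp yields no row (None here).
        rows.append(match.groups() if match else None)
    return rows


if __name__ == "__main__":
    for row in parse_line_by_line(SAMPLE_LOG):
        print(row)
```

Under that assumption, only lines that start with a timestamp produce a row, and the continuation lines are lost no matter how the end-of-input anchors are written.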
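What the regex does:
It matches the timestamp first, followed by the log type (DEBUG or INFO or whatever), then the ID (a mix of lowercase letters, digits and hyphens), followed by anything at all, until the next timestamp is found or until the end of input is reached (to match the last log entry). I also tried adding /m at the end, in which case the generated table has all NULL values.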
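To make the intended matching concrete, here is a minimal sketch of the same idea applied to the whole file contents as one string, written in Python rather than as a RegexSerDe property. It assumes the entire text is available at once, which may not be how Hive hands data to the SerDe. RECORD_PATTERN, parse_records, and SAMPLE_TEXT are hypothetical names, and Python's \Z stands in for Java's \z.

```python
import re

# Timestamp, log type, request ID, then a lazy "anything" that stops right
# before a newline followed by the next timestamp, or at the end of the input.
RECORD_PATTERN = re.compile(
    r"^([0-9: -]{19},[0-9]{3}) (\[[A-Z]*\]) ([0-9a-z-]*) "
    r"(.*?)(?=\n[0-9: -]{19},[0-9]{3} |\Z)",
    re.MULTILINE | re.DOTALL,  # ^ matches at line starts, . matches newlines
)

SAMPLE_TEXT = (
    "2013-02-12 12:03:24,527 [DEBUG] 265y7d3e-432g-dfg3-dwq3-y4dsfq3ew91b One-line message.\n"
    "2013-02-12 12:03:24,946 [ERROR] 261rtd3e-432g-dfg3-dwq3-y4dsfq3ew91b First line.\n"
    "This is a special one.\n"
    "Line 5.\n"
    "2013-02-12 12:03:24,988 [INFO] 2632323e-432g-dfg3-dwq3-y4dsfq3ew91b Another 1-line log\n"
)


def parse_records(text):
    """Return (timestamp, type, id, message) tuples, one per log entry."""
    # Drop trailing newlines so the last message does not end with one.
    return [m.groups() for m in RECORD_PATTERN.finditer(text.rstrip("\n"))]


if __name__ == "__main__":
    for record in parse_records(SAMPLE_TEXT):
        print(record)
```

Here the multi-line ERROR entry comes back as a single record whose message still contains its embedded line breaks. This only works because the pattern sees the text that follows each entry.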
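Why don't you array that baby? (lol, that's not even a verb, but still...) Can't you put each entry into an array? Then the first line would be at key 0, the second, multi-line item would be at 1, and the other two at 2 and 3. You can call them whatever you like. – user1576978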