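Multiline mixed records in Hadoop

I want to parse log files produced by the FidoNet mailer binkd. They are multi-line and, much worse, interleaved: several instances can write to a single log file. For example: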
27 Dec 16:52:40 [2484] BEGIN, binkd/1.0a-545/Linux -iq /tmp/binkd.conf
+ 27 Dec 16:52:40 [2484] session with 123.45.78.9 (123.45.78.9)
- 27 Dec 16:52:41 [2484] SYS BBSName
- 27 Dec 16:52:41 [2484] ZYZ First LastName
- 27 Dec 16:52:41 [2484] LOC City, Country
- 27 Dec 16:52:41 [2484] NDL 115200,TCP,BINKP
- 27 Dec 16:52:41 [2484] TIME Thu, 27 Dec 2012 21:53:22 +0600
- 27 Dec 16:52:41 [2484] VER binkd/0.9.6a-173/Win32 binkp/1.1
+ 27 Dec 16:52:43 [2484] addr: 2:1234/[email protected]
- 27 Dec 16:52:43 [2484] OPT NDA CRYPT
+ 27 Dec 16:52:43 [2484] Remote supports asymmetric ND mode
+ 27 Dec 16:52:43 [2484] Remote requests CRYPT mode
- 27 Dec 16:52:43 [2484] TRF 0 0
*+ 27 Dec 16:52:43 [1520] done (from 2:456/[email protected], OK, S/R: 0/0 (0/0 bytes))*
+ 27 Dec 16:52:43 [2484] Remote has 0b of mail and 0b of files for us
+ 27 Dec 16:52:43 [2484] pwd protected session (MD5)
- 27 Dec 16:52:43 [2484] session in CRYPT mode
+ 27 Dec 16:52:43 [2484] done (from 2:1234/[email protected], OK, S/R: 0/0 (0/0 bytes))
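So the log file is not only multi-line, with an unpredictable number of lines per session, but records from several sessions can also be interleaved, as when session 1520 finishes in the middle of session 2484. What is the right approach to parsing such a file in Hadoop? Or should I just parse it line by line, merge the lines into records somehow, and then use another set of jobs to write those records to a SQL database?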
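To make the line-by-line option concrete, here is a minimal sketch in plain Python (not Hadoop code). It assumes, based only on the sample above, that the number in square brackets identifies the session and that a message starting with `done (` closes it. The names `LINE_RE` and `group_sessions` are hypothetical and exist only for illustration.

```python
import re
from collections import defaultdict

# Assumed line layout, inferred from the sample only:
#   [optional marker such as + or -] DD Mon HH:MM:SS [id] message
LINE_RE = re.compile(
    r"^(?P<level>[^\w\s])?\s*"
    r"(?P<ts>\d{1,2} \w{3} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<pid>\d+)\] "
    r"(?P<msg>.*)$"
)


def group_sessions(lines):
    """Yield (id, [(timestamp, message), ...]) for each session.

    Lines are grouped by the bracketed id, so interleaved sessions
    are separated again. A session is emitted when its "done (" line
    is seen (an assumption about how binkd ends a session).
    """
    open_sessions = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip lines that do not fit the assumed layout
        pid = m.group("pid")
        open_sessions[pid].append((m.group("ts"), m.group("msg")))
        if m.group("msg").startswith("done ("):
            yield pid, open_sessions.pop(pid)
    # Sessions that never logged "done" (e.g. a truncated log) come last.
    for pid, events in open_sessions.items():
        yield pid, events
```

In MapReduce terms the same idea would presumably become a mapper that emits the bracketed id as the key and the parsed line as the value, with a reducer that assembles each record, since lines of one session may land in different input splits. If the ids are process IDs that get reused over time, the id alone would not be a safe key and would need to be combined with something else, such as the date.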
Thanks.