2012-07-17 124 views
1

Regular expression to capture text

I have a log file with content like this:

2012-07-16 03:20:41,23796160897,Text,id:SAR-23796160897-c0-2-1 sub:000 dlvrd:001 submit date:120715220216 done date:120716032038 stat:DELIVRD err:000 text:,FOTSO TOKAM,SMSCReceiptMsgId=SAR-23796160897-c0-2-1 
2012-07-16 03:20:48,23796160897,Text,id:SAR-23796160897-c0-2-2 sub:000 dlvrd:001 submit date:120715220216 done date:120716032045 stat:DELIVRD err:000 text:,FOTSO TOKAM,SMSCReceiptMsgId=SAR-23796160897-c0-2-2 
2012-05-04 00:07:46,23777603300,Text,id:4FA23EB0 sub:000 dlvrd:001 submit date:120503225018 done date:120504000744 stat:DELIVRD err:000 text:,FLP,SMSCReceiptMsgId=4FA23EB0 
2012-05-04 01:50:18,23796726987,Text,id:4FA23E95 sub:000 dlvrd:001 submit date:120503225014 done date:120504015016 stat:DELIVRD err:000 text:,FLP,SMSCReceiptMsgId=4FA23E95 
2012-05-04 01:50:22,23799757015,Text,id:4FA23EB2 sub:000 dlvrd:001 submit date:120503225018 done date:120504015021 stat:DELIVRD err:000 text:,FLP,SMSCReceiptMsgId=4FA23EB2 
2012-05-04 01:50:48,23799907239,Text,id:4FA23F38 sub:000 dlvrd:001 submit date:120503225042 done date:120504015046 stat:DELIVRD err:000 text:,FLP,SMSCReceiptMsgId=4FA23F38 
2012-05-04 01:50:48,23799896455,Text,id:4FA23D1C sub:000 dlvrd:001 submit date:120503175232 done date:120504015046 stat:DELIVRD err:000 text:,FLP,SMSCReceiptMsgId=4FA23D1C 
2012-05-04 01:50:48,23799896455,Text,id:4FA23F04 sub:000 dlvrd:001 submit date:120503225031 done date:120504015046 stat:DELIVRD err:000 text:,FLP,SMSCReceiptMsgId=4FA23F04 
2012-05-04 01:50:50,23794105044,Text,id:4FA23F55 sub:000 dlvrd:001 submit date:120503225046 done date:120504015048 stat:DELIVRD err:000 text:,FLP,SMSCReceiptMsgId=4FA23F55 
2012-05-04 01:51:19,23796029764,Text,id:4FA23FEE sub:000 dlvrd:001 submit date:120503225114 done date:120504015117 stat:DELIVRD err:000 text:,FLP,SMSCReceiptMsgId=4FA23FEE 
2012-05-04 02:17:51,23775461594,Text,id:4FA24025 sub:000 dlvrd:001 submit date:120503225125 done date:120504021749 stat:DELIVRD err:000 text:,FLP,SMSCReceiptMsgId=4FA24025 
2012-05-04 04:08:02,23777437781,Text,id:4FA23F23 sub:000 dlvrd:001 submit date:120503225037 done date:120504040800 stat:DELIVRD err:000 text:,FLP,SMSCReceiptMsgId=4FA23F23 
2012-05-04 04:50:12,23777970013,Text,id:4FA23E70 sub:000 dlvrd:000 submit date:120503225005 done date:120504045011 stat:EXPIRED err:027 text:,FLP,SMSCReceiptMsgId=4FA23E70 
2012-05-04 04:50:15,23775182832,Text,id:4FA23E7E sub:000 dlvrd:000 submit date:120503225008 done date:120504045014 stat:EXPIRED err:027 text:,FLP,SMSCReceiptMsgId=4FA23E7E 
2012-05-04 04:50:17,23777789644,Text,id:4FA23E80 sub:000 dlvrd:000 submit date:120503225010 done date:120504045016 stat:EXPIRED err:027 text:,FLP,SMSCReceiptMsgId=4FA23E80 
2012-05-04 04:50:21,23777529371,Text,id:4FA23E8F sub:000 dlvrd:000 submit date:120503225013 done date:120504045019 stat:EXPIRED err:027 text:,FLP,SMSCReceiptMsgId=4FA23E8F 
2012-05-04 04:50:21,23777613852,Text,id:4FA23E97 sub:000 dlvrd:000 submit date:120503225014 done date:120504045020 stat:EXPIRED err:027 text:,FLP,SMSCReceiptMsgId=4FA23E97 
2012-05-04 04:50:24,23777407598,Text,id:4FA23EAE sub:000 dlvrd:000 submit date:120503225017 done date:120504045023 stat:EXPIRED err:032 text:,FLP,SMSCReceiptMsgId=4FA23EAE 
2012-05-04 04:50:26,23777736950,Text,id:4FA23EAF sub:000 dlvrd:000 submit date:120503225018 done date:120504045024 stat:EXPIRED err:027 text:,FLP,SMSCReceiptMsgId=4FA23EAF 
2012-05-04 04:50:31,23775834128,Text,id:4FA23ED6 sub:000 dlvrd:000 submit date:120503225024 done date:120504045030 stat:EXPIRED err:027 text:,FLP,SMSCReceiptMsgId=4FA23ED6 
2012-05-04 04:50:36,23777486441,Text,id:4FA23EF3 sub:000 dlvrd:000 submit date:120503225029 done date:120504045035 stat:EXPIRED err:027 text:,FLP,SMSCReceiptMsgId=4FA23EF3 

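Now I want to extract the values of a few specific fields from this content, such as "id, done date, stat", using a regular expression with C# .NET and LINQ.

Please help me if anyone has any idea how to do this.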

+2

Is there a particular language you want to use? – Keppil 2012-07-17 10:19:29

+1

Which regex engine are you planning to use? – 2012-07-17 10:20:22

+3

Why not use a CSV parser? – 2012-07-17 10:20:43

Answers

0

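A CSV parser would probably be better, but you can use this regex and replace id: with whichever other field you want, e.g. done date:(?<donedate>.*?)\s (a group name cannot contain a space, so it is donedate rather than "done date").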

string strRegex = @"id:(?<id>.*?)\s.*?done date:(?<donedate>.*?)\s.*?stat:(?<stat>.*?)\s"; 
RegexOptions myRegexOptions = RegexOptions.IgnoreCase | RegexOptions.Multiline; 
Regex myRegex = new Regex(strRegex, myRegexOptions); 
string strTargetString = @"2012-07-16 03:20:41,23796160897,Text,id:SAR-23796160897-c0-2-1 sub:000 dlvrd:001 submit date:120715220216 done date:120716032038 stat:DELIVRD err:000 text:,FOTSO TOKAM,SMSCReceiptMsgId=SAR-23796160897-c0-2-1"; 
foreach (Match myMatch in myRegex.Matches(strTargetString)) 
{ 
    if (myMatch.Success) 
    { 
        // Add your code here; each named group holds one field value: 
        //myMatch.Groups["id"].Value; 
        //myMatch.Groups["donedate"].Value; 
        //myMatch.Groups["stat"].Value; 
    } 
} 

You can use a single combined regex, id:(?<id>.*?)\s.*?done date:(?<donedate>.*?)\s.*?stat:(?<stat>.*?)\s, and access each group like myMatch.Groups["id"].Value
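
Since the question also mentions LINQ, here is a minimal sketch of how the combined pattern might be applied to a whole log with a LINQ projection. The names Receipt, ReceiptParser, and ParseLog are made up for illustration, and the sketch assumes every field value is followed by whitespace, as in the sample data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical container for the three fields the question asks about.
public sealed class Receipt
{
    public string Id { get; set; }
    public string DoneDate { get; set; }
    public string Stat { get; set; }
}

public static class ReceiptParser
{
    // Same pattern as in the answer: each value runs up to the next whitespace.
    private static readonly Regex FieldRegex = new Regex(
        @"id:(?<id>.*?)\s.*?done date:(?<donedate>.*?)\s.*?stat:(?<stat>.*?)\s",
        RegexOptions.IgnoreCase);

    // Returns one Receipt per matching log line; lines that do not match are skipped.
    public static List<Receipt> ParseLog(string log)
    {
        return log
            .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => FieldRegex.Match(line))
            .Where(m => m.Success)
            .Select(m => new Receipt
            {
                Id = m.Groups["id"].Value,
                DoneDate = m.Groups["donedate"].Value,
                Stat = m.Groups["stat"].Value
            })
            .ToList();
    }
}
```

Calling ReceiptParser.ParseLog(File.ReadAllText(path)) would then give a list that can be filtered further, for example with Where(r => r.Stat == "EXPIRED").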

+0

Is it possible to read all the specific fields together, like "id, done date, stat"? – 2012-07-17 10:49:12

+0

I updated the answer – tsukimi 2012-07-17 11:05:06

+0

Thanks tsukimi, your post was very helpful to me and saved me a lot of time. – 2012-07-17 11:08:41

2

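I don't think a regex will help you much here. Instead, you should split the data into lines and then into columns: as far as I can see, the data can be split into a matrix from which you can easily extract the information you are looking for. You can do this in JavaScript, C#, Java, or any other language.

What I do in practice:

  • Split the data into lines
  • Split each line into columns
  • Then iterate over the lines and pick out the columns you are looking for.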

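    var content = data.Split('\n'); 
    foreach (var line in content) 
    { 
        var cols = line.Split(','); 
        if (cols.Length < 3) continue; // skip blank or incomplete lines 
        var c1 = cols[0]; // e.g. 2012-07-16 03:20:41 
        var c2 = cols[1]; // e.g. 23796160897 
        var c3 = cols[2]; // e.g. Text 
    } 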
    

You can refine the snippet above to suit your needs... this is the best way to do it.
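
The fields the question asks about (id, done date, stat) all sit inside the fourth comma-separated column, so the split approach needs one more step. The following is only a sketch under that assumption; ReceiptFields and GetField are hypothetical names, and the key must be passed in full (for example "done date:", because a bare "date:" would hit "submit date:" first).

```csharp
using System;

public static class ReceiptFields
{
    // Returns the text between "key" and the next space in the fourth column,
    // or null if the line is incomplete or the key is absent.
    public static string GetField(string line, string key)
    {
        var cols = line.Split(',');
        if (cols.Length < 4) return null;   // not a complete record

        var receipt = cols[3];              // "id:... sub:... done date:... stat:..."
        int start = receipt.IndexOf(key, StringComparison.Ordinal);
        if (start < 0) return null;         // key not present in this record

        start += key.Length;
        int end = receipt.IndexOf(' ', start);
        return end < 0
            ? receipt.Substring(start)
            : receipt.Substring(start, end - start);
    }
}
```

Inside the loop above this could be used as ReceiptFields.GetField(line, "stat:") and likewise for "id:" and "done date:".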

1

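It's not clear what you mean by all the fields, or whether the delimiters are constant. With the test data you provided, this will capture most of the information into named groups.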

/// <summary> 
/// Regular expression built for C# on: Tue, Jul 17, 2012, 12:08:12 PM 
/// Using Expresso Version: 3.0.4334, http://www.ultrapico.com 
/// 
/// A description of the regular expression: 
/// 
/// Beginning of line or string 
/// [Date]: A named capture group. [[^,]+] 
///  Any character that is NOT in this class: [,], one or more repetitions 
/// , 
/// [Number]: A named capture group. [[^,]+] 
///  Any character that is NOT in this class: [,], one or more repetitions 
/// , 
/// [Text1]: A named capture group. [[^,]+] 
///  Any character that is NOT in this class: [,], one or more repetitions 
/// , 
/// id: 
///  id: 
/// [ID]: A named capture group. [[^\s]+] 
///  Any character that is NOT in this class: [\s], one or more repetitions 
/// Whitespace 
/// sub: 
///  sub: 
/// [Sub]: A named capture group. [\w+] 
///  Alphanumeric, one or more repetitions 
/// Whitespace 
/// dlvrd: 
///  dlvrd: 
/// [Dlvrd]: A named capture group. [\w+] 
///  Alphanumeric, one or more repetitions 
/// Whitespace 
/// submit\sdate: 
///  submit 
///  Whitespace 
///  date: 
/// [SubmitDate]: A named capture group. [\w+] 
///  Alphanumeric, one or more repetitions 
/// Whitespace 
/// done\sdate: 
///  done 
///  Whitespace 
///  date: 
/// [DoneDate]: A named capture group. [\w+] 
///  Alphanumeric, one or more repetitions 
/// Whitespace 
/// stat: 
///  stat: 
/// [Status]: A named capture group. [\w+] 
///  Alphanumeric, one or more repetitions 
/// Whitespace 
/// err: 
///  err: 
/// [Error]: A named capture group. [\d+] 
///  Any digit, one or more repetitions 
/// Whitespace 
/// 
/// 
/// </summary> 
public static Regex regex = new Regex(
     "^(?<Date>[^,]+),\r\n(?<Number>[^,]+),\r\n(?<Text1>[^,]+),\r\nid:(?"+ 
     "<ID>[^\\s]+)\\s\r\nsub:(?<Sub>\\w+)\\s\r\ndlvrd:(?<Dlvrd>\\w+)\\s"+ 
     "\r\nsubmit\\sdate:(?<SubmitDate>\\w+)\\s\r\ndone\\sdate:(?<DoneD"+ 
     "ate>\\w+)\\s\r\nstat:(?<Status>\\w+)\\s\r\nerr:(?<Error>\\d+)\\s", 
    RegexOptions.Multiline 
    | RegexOptions.ExplicitCapture 
    | RegexOptions.CultureInvariant 
    | RegexOptions.IgnorePatternWhitespace 
    | RegexOptions.Compiled 
    ); 

So with this, you can call:

var matches = regex.Matches(inputData); 

Personally, I'd suggest you restrict the input to a single line of the data and call this instead:

var match = regex.Match(inputLineOfData); 

That means you can:

if (match.Success) 
{ 
    var id = match.Groups["ID"].Value; 
    var submitDate = match.Groups["SubmitDate"].Value; // Parse to DateTime 
    var doneDate = match.Groups["DoneDate"].Value; // Parse to DateTime 

    // etc for 'sub', 'dlvrd', 'Status', 'Error'.. 
}
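
For the "Parse to DateTime" step noted in the comments, a sketch along these lines might work. It assumes the submit date and done date values use the layout yyMMddHHmmss, which is inferred from the sample data (120716032038 next to a log timestamp of 2012-07-16 03:20:41) rather than confirmed anywhere in the thread; ReceiptDates and ParseReceiptDate are hypothetical names.

```csharp
using System;
using System.Globalization;

public static class ReceiptDates
{
    // Assumed layout of "submit date" / "done date" values, e.g. 120716032038.
    private const string Format = "yyMMddHHmmss";

    // Returns null when the value does not fit the assumed layout.
    public static DateTime? ParseReceiptDate(string value)
    {
        DateTime parsed;
        bool ok = DateTime.TryParseExact(
            value, Format, CultureInfo.InvariantCulture,
            DateTimeStyles.None, out parsed);
        return ok ? parsed : (DateTime?)null;
    }
}
```

With that helper, var doneDate = ReceiptDates.ParseReceiptDate(match.Groups["DoneDate"].Value); yields a nullable DateTime instead of a raw string.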