awk regular expressions: the difference between using one with or without a variable

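I have an awk script that behaves differently depending on where I put the regular expression. I obviously intended the logic to be the same in both cases, but it is not. The script analyzes logs in which every transaction has a unique ID. The log lines look like this: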

timestamp (ID) more info

For example:

2014-10-06 05:24:40,035 INFO (4aaaaaaaaabbbbbbcccb) [somestring] body with real information and a key string that determines the type of thransaction 
2014-10-06 05:24:40,035 INFO (4aaaaaaaaabbbbbbcccb) [somestring] body with other information 
2014-10-06 05:24:40,035 INFO (4aaaaaaaaabbbbbbcccb) [somestring] body with more information 
2014-10-06 05:24:40,035 INFO (4xxbbbbbbbbbbbbbcccb) [somestring] this is a different transaction

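I want to process all the log lines of one particular type of transaction and see how long those transactions take. Each transaction is spread over several log lines and is identified by its unique ID. To know whether a transaction is of the type I want, I have to search for a certain string in the first line of that transaction. The log may also contain lines that do not follow the format above.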

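What I want:

Determine whether the current line belongs to a transaction (it has an ID).

Check whether the ID is already registered in the accumulating array.

If it is not, check whether the transaction has the desired type: search for a fixed string in the body of the line.

If it has, register the timestamp, and so on.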

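Here is the code (note that this is a heavily reduced version).

This is the one I want to use: it first checks whether the line is a transaction line and only then whether it is of the right type.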

awk '$4 ~ /^\([:alnum:]/ {
    name = $4; gsub(/[()]|:.*/, "", name); ++matched
    if (!(name in arr)) {
        if ($0 ~ /transaction type/) { arr[name] = 1; print name }
    }
}
END {
    print "Found :" length(arr)
    print "Processed " NR
    print matched " lines matched the filter"
}'

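This script finds only 868 transactions, although there are more than 14k. If I change the script to the code shown below, it finds all 14k transactions, but it only sees the first line of each of them, which is of no use to me.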

awk '/transaction type/ {
    name = $4; gsub(/[()]|:.*/, "", name); ++matched
    if (!(name in arr)) { arr[name] = 1; print name }
}
END {
    print "Found :" length(arr)
    print "Processed " NR
    print matched " lines matched the filter"
}'

Thanks in advance.

Edit

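Shame on me. This question had more than one actual problem. The main one was that the regex did not match the right strings. It is true that the ID string and the transaction-type string are on the same line, but on those lines the ID looks like (aaaaaabbbbbcccc:  ), with two spaces at the end. This makes awk parse "(aaaaaaaabbbbcccc:" and ")" as two different fields. I realized it when I ran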

$4 !~ /regex/ { print $4 }

and a lot of valid IDs showed up.
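To illustrate the field-splitting problem described above, here is a small sketch. The log line is hypothetical: it is rebuilt from the description (an ID followed by a colon and two spaces before the closing parenthesis), and the variable names are only for illustration.

```shell
# Hypothetical log line: the ID is followed by ":" and two spaces before ")"
line='2014-10-06 05:24:40,035 INFO (4aaaaaaaaabbbbbbcccb:  ) [somestring] body'

# With the default field separator awk splits on runs of blanks,
# so the closing parenthesis ends up in a field of its own.
field4=$(printf '%s\n' "$line" | awk '{ print $4 }')
field5=$(printf '%s\n' "$line" | awk '{ print $5 }')

# The gsub() from the question still recovers the bare ID from field 4.
id=$(printf '%s\n' "$line" | awk '{ name = $4; gsub(/[()]|:.*/, "", name); print name }')

echo "field 4: $field4"
echo "field 5: $field5"
echo "id:      $id"
```

Any pattern that expects the closing parenthesis inside $4 will therefore fail on lines of this shape, even though the ID itself is valid.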

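The second problem, which showed up after I fixed the regex, has already been solved by several people here. Having the main regex and the first { on separate lines made awk print every record. I figured it out myself, and later the same day I read the solution here. Amazing.

Thank you very much, everyone. I can accept only one answer, but I learned a lot from all of them.

Source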

2014-10-06 Danielo515

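You might consider logstash with the grok and multiline filters for this kind of job. I am quite unsure what your input looks like, since your example shows only one line format. – Tensibai 2014-10-06 12:35:54

Hello. I cannot install any programs beyond the ones already available. I am only interested in lines that match the format above, so IMO that is not a problem. I do not know what all the lines look like, but that does not matter at all. – Danielo515 2014-10-06 12:53:57

/transaction type/ does not match your sample input lines. That makes it hard to tell what might be wrong. Can you give us actual log lines and the actual string/regex you are matching? – 2014-10-06 12:57:24

Whitespace matters here. This: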

/foo/ { 
    print "found" 
}

means print "found" every time "foo" is present, whereas this:

/foo/ 
{ 
    print "found" 
}

means print the current record every time "foo" is present and print "found" for every single input record, so chances are that when you wrote:

$4 ~ /^\([:alnum:]/ 
{ 
    .... 
}

you really meant to write:

$4 ~ /^\([:alnum:]/ { 
    .... 
}
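The difference between the two layouts can be checked with a short, self-contained sketch. The three input records and the variable names are made up for illustration.

```shell
input='foo one
bar two
foo three'

# Pattern and opening brace on the same line:
# the action runs only for records that match /foo/.
same_line=$(printf '%s\n' "$input" | awk '/foo/ {
    print "found"
}')

# Opening brace on the next line: awk sees two separate rules.
#   /foo/          has no action, so the matching record itself is printed
#   { print ... }  has no pattern, so it runs for every record
next_line=$(printf '%s\n' "$input" | awk '/foo/
{
    print "found"
}')

printf 'same line:\n%s\n\nnext line:\n%s\n' "$same_line" "$next_line"
```

With this input the first form prints "found" twice, while the second prints five lines: each matching record followed by "found", plus one "found" for the non-matching record.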

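Also, chances are you meant to use the POSIX character class [[:alnum:]] instead of the set of characters :, a, l, n, u, m described by the bracket expression [:alnum:]: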

$4 ~ /^\([[:alnum:]]/ { 
    .... 
}

If you fix those things and still need help, provide some testable sample input and the expected output so we can help you more.

Source

2014-10-06 13:24:48

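As for the line break, I honestly thought it was just formatting added for readability when posting here, but it is worth pointing out. I made the same mistake when expanding my example :) – Tensibai 2014-10-07 11:57:22

It is just a syntax error. When you use a POSIX character class, you must put it inside square brackets: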

[[:alnum:]]

Otherwise [:alnum:] is treated as a bracket expression containing the characters : a l m n u
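A quick way to see the effect is to count how many sample IDs each form matches. This is only a sketch: the second ID is invented so that one line starts with a letter from the set, and it assumes an awk that supports POSIX character classes (some awk implementations also print a warning about the single-bracket form, which is discarded here).

```shell
ids='(4aaaaaaaaabbbbbbcccb)
(abbbbbbbbbbbbbbbcccb)
(4xxbbbbbbbbbbbbbcccb)'

# Single brackets: a plain set of the characters : a l n u m,
# so only the ID that starts with "a" matches.
loose=$(printf '%s\n' "$ids" | awk '/^\([:alnum:]/ { n++ } END { print n + 0 }' 2>/dev/null)

# Double brackets: the POSIX class, any letter or digit after "(".
strict=$(printf '%s\n' "$ids" | awk '/^\([[:alnum:]]/ { n++ } END { print n + 0 }')

echo "[:alnum:]   matched $loose of 3"
echo "[[:alnum:]] matched $strict of 3"
```

IDs that begin with a digit are silently skipped by the single-bracket form, which is consistent with the question's script finding far fewer transactions than expected.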

Source

2014-10-06 13:16:36

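So, the character class aside, if I understand correctly, you want to get the IDs of transactions of a certain type.

First assumption: the ID and the transaction type are on the same line, so something like this should do (mostly adapted from your code), based on your sample input: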

awk 'BEGIN { 
    matched=0 # more for clarity than really needed 
} 
/\([[:alnum:]]*\).*transaction type/ { # only lines that contain both an ID and the transaction type
    gsub(/[()]/,"",$4) # strip the () around the ID
    ++matched # count the matched lines, including repeated IDs
    if (!($4 in arr)) { # as in yours: if the ID is not in the array yet
        arr[$4]=1 # add the ID to the array so it is not included twice
        print $4 # print the ID (only once, since we are inside the if)
    }
} 
END { # nothing changed here, printing the stats... 
    print "Found :"length(arr) 
    print "Processed "NR 
    print matched" lines matched the filter" 
}'

The output is:

prompt=> awk 'BEGIN { matched=0};/\([a-z0-9]*\)/{ gsub(/[()]/,"",$4); ++matched; if (!($4 in arr)) { arr[$4]=1; print $4 }}; END { print "Found: "length(arr)"\nProcessed "NR"\n"matched" lines matched the filter" }' awkinput 
4aaaaaaaaabbbbbbcccb 
4xxbbbbbbbbbbbbbcccb 
Found: 2 
Processed 4 
4 lines matched the filter

I left the transaction-type match out of the test, since I have no idea what that string might be.
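Putting the pieces from this thread together, the flow described in the question could look roughly like the sketch below. It is an illustration under assumptions, not the asker's final script: the sample log, its time stamps, the array names first, last and lines, and the placeholder string "transaction type" are all hypothetical.

```shell
log='2014-10-06 05:24:40,035 INFO (4aaaaaaaaabbbbbbcccb) [somestring] body with transaction type A
2014-10-06 05:24:41,100 INFO (4aaaaaaaaabbbbbbcccb) [somestring] body with other information
2014-10-06 05:24:42,200 INFO (4xxbbbbbbbbbbbbbcccb) [somestring] this is a different transaction
a line without the expected format
2014-10-06 05:24:43,300 INFO (4aaaaaaaaabbbbbbcccb) [somestring] body with more information'

report=$(printf '%s\n' "$log" | awk '
$4 ~ /^\([[:alnum:]]/ {                # transaction line: field 4 starts with "(" plus a letter or digit
    name = $4
    gsub(/[()]|:.*/, "", name)         # strip parentheses and anything after a colon
    if (!(name in first) && $0 ~ /transaction type/)
        first[name] = $2               # register the ID with its first time stamp
    if (name in first) {               # later lines of a registered transaction
        last[name] = $2                # latest time stamp seen for this ID
        lines[name]++
    }
}
END {
    for (name in first)
        print name, first[name], last[name], lines[name]
}')

echo "$report"
```

Only the transaction whose first line contains the type string is registered; its later lines are then picked up by ID alone, and the other transaction and the malformed line are ignored.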

Source

2014-10-06 13:19:12 Tensibai
