2012-03-13
3

Searching for a specific field in the lines of a file

I have a file containing data like this:

0000380000000101 
0000650000000201 
0000650000000301 
0000650000000401 
0001000000000101 
0001000000000201 

...and so on. I want to process this data so that I get output like this:

000065 0000000201 0000000301 0000000401 
000100 0000000101 0000000201 

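Since 000065 is repeated 3 times in the input, I want 000065 to appear only once in the output, followed by the trailing bytes of every entry in which 000065 occurs. Since 000038 occurs only once, I don't want it in the output. In this example the leading data (i.e. 000065 or 000038) happens to be 3 bytes, although it can be any length, while the bytes after it, such as 0000000401, are of fixed length, i.e. 5 bytes. I would prefer a shell script or C. Please let me know how I can do this. Can awk help here? Any help would be appreciated. Below is data taken from the actual file that I want to process: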

0000000000000101 
0000000000000201 
0000000000000301 
0000000000000401 
0000380000000101 
0000650000000201 
0000650000000301 
0000650000000401 
0001000000000101 
0001000000000201 
0001000000000301 
0001000000000401 
0038d30000000101 
00652e0000000201 
00652e0000000301 
00652e0000000401 
008d750000000101 
008d750000000201 
008d750000000301 
008d750000000401 
0100010000000101 
0100010000000201 
0100010000000301 
0100010000000401 
01008d0000000101 
01008d0000000201 
01008d0000000301 
01008d0000000401 
01a8c00000000101 
01a8c00000000201 
01a8c00000000301 
01a8c00000000401 
0264010000000101 
0264010000000201 
0264010000000301 
0264010000000401 
0615df0000000101 
0615df0000000201 
0615df0000000301 
0615df0000000401 
07dd940000000101 
07dd940000000201 
07dd940000000301 
07dd940000000401 
0900000000000101 
0900000000000201 
0900000000000301 
0900000000000401 
15dfc70000000101 
15dfc70000000201 
15dfc70000000301 
15dfc70000000401 
1ecf090000000101 

Answers

1

This might work for you (is sed OK?):

sed ':a;$!N;s/^\(.*\)\(\( *.\{10\}\)*\)\n\1/\1\2 /;ta;/ /!D;s/.\{10\} / &/;P;D' file
000065 0000000201 0000000301 0000000401
000100 0000000101 0000000201 
4

Your data is fixed-width, so you can use gawk's FIELDWIDTHS (the trailing sed drops keys that occur only once):

$ gawk -v FIELDWIDTHS='6 10' 'NR!=1 && x==$1""{printf(" %s", $2); next}; {x=$1""; printf("%s%s %s", NR==1?"":"\n", $1, $2)}; END{print ""}' input.txt | sed '/^[0-9a-f]* [0-9a-f]*$/d' 
000000 0000000101 0000000201 0000000301 0000000401 
000065 0000000201 0000000301 0000000401 
000100 0000000101 0000000201 0000000301 0000000401 
00652e 0000000201 0000000301 0000000401 
008d75 0000000101 0000000201 0000000301 0000000401 
010001 0000000101 0000000201 0000000301 0000000401 
01008d 0000000101 0000000201 0000000301 0000000401 
01a8c0 0000000101 0000000201 0000000301 0000000401 
026401 0000000101 0000000201 0000000301 0000000401 
0615df 0000000101 0000000201 0000000301 0000000401 
07dd94 0000000101 0000000201 0000000301 0000000401 
090000 0000000101 0000000201 0000000301 0000000401 
15dfc7 0000000101 0000000201 0000000301 0000000401 

FIELDWIDTHS A white-space separated list of fieldwidths. When set, gawk parses the input into fields of fixed width, instead of using the value 
       of the FS variable as the field separator. 
+0

[UUOC](https://en.wikipedia.org/wiki/Cat_(Unix)#Useless_use_of_cat) alert! – 2012-03-13 12:50:08

+0

You are an awk master! – 2012-03-13 12:51:39

+0

Didn't work for me on Mac – anubhava 2012-03-13 12:55:52
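For systems without gawk (FIELDWIDTHS is a gawk extension, and the stock awk on a Mac is not gawk), the same fixed-width split can be approximated with `substr` in plain awk. The following is only a sketch under a few assumptions: the key is always the first 6 characters, lines sharing a key are adjacent (as in the sample data), and `group_fixed_width` is a made-up helper name, not something from the answer above.

```shell
# Hypothetical helper: merge consecutive lines that share the same 6-character key.
# Assumes the input is already grouped by key, as in the question's sample.
group_fixed_width() {
    awk '
    {
        sub(/[ \t\r]+$/, "")           # drop trailing whitespace, if any
        if ($0 == "") next             # ignore empty lines
        key = substr($0, 1, 6)         # first 6 characters (awk strings are 1-indexed)
        val = substr($0, 7)            # everything after the key
        if (n > 0 && key == prev) {
            line = line " " val        # same key as the previous line: append the value
            n++
        } else {
            if (n > 1) print line      # flush the previous group only if it had 2+ lines
            line = key " " val
            n = 1
        }
        prev = key                     # substr() returns a string, so == compares as text
    }
    END { if (n > 1) print line }
    ' "$@"
}

# Usage with the six-line sample from the question:
printf '%s\n' 0000380000000101 0000650000000201 0000650000000301 \
    0000650000000401 0001000000000101 0001000000000201 | group_fixed_width
```

On that sample it prints only the 000065 and 000100 lines, so no extra sed pass is needed to drop keys that occur once. Because it only remembers the current group, it does not merge lines with the same key that are not adjacent.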

1

awk with FIELDWIDTHS, as kev showed, is one way.

Here is another way (a one-liner) using only awk:

awk 'BEGIN{FS=""} 
    {for(i=1;i<=6;i++) x=x$i; y=$0; gsub("^"x,"",y);a[x]=a[x]?a[x]" "y:y; x="";} 
    END{for(t in a)print t" "a[t]}' yourFile 

Test with your small data sample:

kent$ echo "0000380000000101 
0000650000000201 
0000650000000301 
0000650000000401 
0001000000000101 
0001000000000201"|awk 'BEGIN{FS=""} {for(i=1;i<=6;i++) x=x$i; y=$0; gsub("^"x,"",y);a[x]=a[x]?a[x]" "y:y; x="";}END{for(t in a)print t" "a[t]}' 

000100 0000000101 0000000201 
000065 0000000201 0000000301 0000000401 
000038 0000000101 
2

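You can use the following awk command. It uses only POSIX awk features, so it should behave the same on Linux and Mac. The order in which for (k in arr) visits the keys is unspecified, so the lines may come out in a different order:

awk '{key=substr($0, 1, 6); val=substr($0, 7); cnt[key]++; arr[key]=(cnt[key]>1 ? arr[key] " " val : val)}
END{for (k in arr) if (cnt[k]>1) print k, arr[k]}' file

OUTPUT:

000065 0000000201 0000000301 0000000401
000100 0000000101 0000000201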
2

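First, pipe your data through this:

awk '{suffixLen = 10; print substr($0, 1, length($0) - suffixLen)" "substr($0, length($0) - suffixLen + 1, length($0))}'

The suffixLen variable is the (fixed) number of trailing characters: 2 characters for each of the 5 bytes = 10. This splits each input line into two fields separated by a space. Then pipe the result through this:

awk '{if ($1 in values) {values[$1] = values[$1]" "$2} else {values[$1] = $1" "$2}}END{for (v in values) print values[v]}'

Proper sorting is left as an exercise for the reader.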
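One possible take on the sorting step, as a rough sketch rather than the answerer's own solution: split each line by suffix length, sort, then merge adjacent lines in a second awk. The assumptions are mine: `group_by_prefix` is a hypothetical name, `suffix_len=10` reflects the 5 trailing bytes written as 2 hex characters each, and keys that occur only once are dropped, as the question asks. A side effect of sorting whole lines is that the values within each key come out sorted too.

```shell
# Hypothetical helper: works for keys of any length and for unsorted input.
group_by_prefix() {
    awk -v suffix_len=10 '
    {
        sub(/[ \t\r]+$/, "")                     # drop trailing whitespace, if any
        if (length($0) <= suffix_len) next       # no room for a key: skip the line
        cut = length($0) - suffix_len            # length of the variable-width key
        print substr($0, 1, cut), substr($0, cut + 1)
    }' "$@" |
    LC_ALL=C sort |                              # byte-wise sort puts equal keys next to each other
    awk '
    {
        key = $1 ""                              # force a string comparison, not a numeric one
        if (n > 0 && key == prev) {
            line = line " " $2                   # same key: append the value
            n++
        } else {
            if (n > 1) print line                # keep only keys seen at least twice
            line = $0
            n = 1
        }
        prev = key
    }
    END { if (n > 1) print line }'
}

# Usage: group_by_prefix file
```

The `key = $1 ""` concatenation matters here: without it awk may compare fields such as 0065 and 000065 as numbers and treat them as the same key.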