2016-10-04

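Formatting specific data using awk or sed

I'm currently working with a large dataset containing file information formatted as data chunks. I'm trying to take a piece of data from the file path line and add it as a new column on specific lines. The dataset contains file information formatted like this: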

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar 
Inode Num: 22525898 
Chunk Hash      Chunk Size (bytes)  Compression Ratio (tenth) 
45:97:2a:60:e3:69    3208     10 
7a:8b:8e:20:7b:38    1982     10 
b9:45:3d:f4:97:88    1849     10 
Whole File Hash: 865999b40fd9 

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c 
Inode Num: 31881221 
Chunk Hash      Chunk Size (bytes)  Compression Ratio (tenth) 
e8:b0:cb:6f:76:ff    1344     10 
19:c5:b2:aa:b3:60    613      10 
11:7c:7e:76:4b:d5    1272     10 
36:e0:59:49:b6:4a    581      10 
9c:31:bc:8a:39:94    3296     10 
01:f0:56:3a:e1:a9    1140     10 
Whole File Hash: 4b28b44ae03d 

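What I want to do is take the file type (.jar and .c in this example) and append it to each of the respective chunk hash lines, so that the final format looks like: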

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar 
Inode Num: 22525898 
Chunk Hash      Chunk Size (bytes)  Compression Ratio (tenth)  
45:97:2a:60:e3:69    3208     10        .jar 
7a:8b:8e:20:7b:38    1982     10        .jar 
b9:45:3d:f4:97:88    1849     10        .jar 
Whole File Hash: 865999b40fd9 

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c 
Inode Num: 31881221 
Chunk Hash      Chunk Size (bytes)  Compression Ratio (tenth)  
e8:b0:cb:6f:76:ff    1344     10        .c 
19:c5:b2:aa:b3:60    613      10        .c 
11:7c:7e:76:4b:d5    1272     10        .c 
36:e0:59:49:b6:4a    581      10        .c 
9c:31:bc:8a:39:94    3296     10        .c 
01:f0:56:3a:e1:a9    1140     10        .c 
Whole File Hash: 4b28b44ae03d 

I already have awk code to pull out the file type and the chunk hash lines:

awk 'match($0,/\..+/) {print substr($0,RSTART,RLENGTH)}' 

awk '/Chunk Hash/{flag=1;next}/Whole File Hash:/{flag=0}flag' 

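I just don't know how to connect these pieces using awk (or sed) so that the file type is appended as a new column to every line in its respective data chunk. Another thing to note is that I'm trying to do this in a bash script, if that makes a difference.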

Answers


Here is a (GNU) sed solution:

/File path:/ {   # If line matches "File path:" 
    h     # Copy pattern space to hold space 
    s/.*(\.[^.]*)$/\1/ # Remove everything but extension from pattern space 
    x     # Swap pattern space and hold space 
}      # Hold space now contains extension 
/Chunk Hash/ {   # If line matches "Chunk Hash" 
    n     # Get next line into pattern space 
    :loop    # Anchor for loop 
    /Whole File Hash/b # If line matches "Whole File Hash", jump out of loop 
    G     # Append extension from hold space to pattern space 
    s/\n/\t\t\t\t/  # Substitute newline with a bunch of tabs 
    n     # Get next line 
    b loop    # Jump back to ":loop" label 
} 

This can be stored in a separate file (say, so.sed) and has to be called like

sed -r -f so.sed infile 

resulting in

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar 
Inode Num: 22525898 
Chunk Hash      Chunk Size (bytes)  Compression Ratio (tenth) 
45:97:2a:60:e3:69    3208     10        .jar 
7a:8b:8e:20:7b:38    1982     10        .jar 
b9:45:3d:f4:97:88    1849     10        .jar 
Whole File Hash: 865999b40fd9 

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c 
Inode Num: 31881221 
Chunk Hash      Chunk Size (bytes)  Compression Ratio (tenth) 
e8:b0:cb:6f:76:ff    1344     10        .c 
19:c5:b2:aa:b3:60    613      10        .c 
11:7c:7e:76:4b:d5    1272     10        .c 
36:e0:59:49:b6:4a    581      10        .c 
9c:31:bc:8a:39:94    3296     10        .c 
01:f0:56:3a:e1:a9    1140     10        .c 
Whole File Hash: 4b28b44ae03d 

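Non-GNU seds have to jump through the usual hoops to insert tabs and can't use the -r option (but probably -E, which should be equivalent here; -r is only there for the convenience of not having to escape the ()).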
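As a rough sketch of what that could look like for a non-GNU sed: the same script with -E instead of -r, the comments moved onto their own lines, and literal tab characters spliced in from a shell variable in place of \t. The function name append_ext_sed and the variable tab are made up for this illustration; treat it as a sketch rather than a tested drop-in for every sed.

```shell
# Hypothetical wrapper; reads the files given as arguments, or stdin if none.
append_ext_sed() {
    # A literal tab character, since \t in a replacement is a GNU extension
    tab=$(printf '\t')
    sed -E '
    # On a "File path:" line, stash the extension in the hold space
    /File path:/ {
        h
        s/.*(\.[^.]*)$/\1/
        x
    }
    # After the "Chunk Hash" header, tag each line up to "Whole File Hash"
    /Chunk Hash/ {
        n
        :loop
        /Whole File Hash/b
        G
        s/\n/'"$tab$tab$tab$tab"'/
        n
        b loop
    }
    ' "$@"
}
```

Called as `append_ext_sed infile`, this should produce the same output as shown above.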

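Some lines are doubled; the 'p' command should be removed from the address range block. – SLePort

@Kenavoz Uh, yes, 'n' without the '-n' option prints... thanks! –

This works great, thanks! –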


Solution in the TXR language:

@(repeat) 
@ (cases) 
File path: @*[email protected] 
Inode Num: @inode 
@header 
@ (collect) 
@hashline 
@ (last) 
Whole File Hash: @wfh 
@ (end) 
@ (output) 
File path: @[email protected] 
Inode Num: @inode 
@header 
@  (repeat) 
@{hashline 88}[email protected] 
@  (end) 
Whole File Hash: @wfh 
@ (end) 
@ (or) 
@other 
@ (do (put-line other)) 
@ (end) 
@(end) 

Run:

$ txr suffixes.txr data 
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar 
Inode Num: 22525898 
Chunk Hash      Chunk Size (bytes)  Compression Ratio (tenth) 
45:97:2a:60:e3:69    3208     10        .jar 
7a:8b:8e:20:7b:38    1982     10        .jar 
b9:45:3d:f4:97:88    1849     10        .jar 
Whole File Hash: 865999b40fd9 

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c 
Inode Num: 31881221 
Chunk Hash      Chunk Size (bytes)  Compression Ratio (tenth) 
e8:b0:cb:6f:76:ff    1344     10        .c 
19:c5:b2:aa:b3:60    613      10        .c 
11:7c:7e:76:4b:d5    1272     10        .c 
36:e0:59:49:b6:4a    581      10        .c 
9c:31:bc:8a:39:94    3296     10        .c 
01:f0:56:3a:e1:a9    1140     10        .c 
Whole File Hash: 4b28b44ae03d 

In awk:

$ cat script.awk 
/File path/ { 
    match($0,/\..+/) 
    ext=substr($0,RSTART,RLENGTH) 
} 
/Chunk Hash/ { 
    flag=1   # flag on 
    print    # print here to... 
    next    # avoid printing ext 
} 
/Whole File Hash:/ { 
    flag=0   # flag off 
} 
flag==1 { 
    print $0, ext  # add space here to your liking, left it short... 
    next    # ... to show output on screen without sidescrolling 
} 1     # print non-flagged records 

Run it:

$ awk -f script.awk data.txt 
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar 
Inode Num: 22525898 
Chunk Hash      Chunk Size (bytes)  Compression Ratio (tenth) 
45:97:2a:60:e3:69    3208     10 .jar 
7a:8b:8e:20:7b:38    1982     10 .jar 
b9:45:3d:f4:97:88    1849     10 .jar 
Whole File Hash: 865999b40fd9 

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c 
Inode Num: 31881221 
Chunk Hash      Chunk Size (bytes)  Compression Ratio (tenth) 
e8:b0:cb:6f:76:ff    1344     10 .c 
19:c5:b2:aa:b3:60    613      10 .c 
11:7c:7e:76:4b:d5    1272     10 .c 
36:e0:59:49:b6:4a    581      10 .c 
9c:31:bc:8a:39:94    3296     10 .c 
01:f0:56:3a:e1:a9    1140     10 .c 
Whole File Hash: 4b28b44ae03d 
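
One caveat with match($0,/\..+/): it starts at the first dot on the line, so a directory name containing a dot would end up in ext as well. Below is a sketch of a variant that keeps only the part after the last dot of the final path component, assuming that is what counts as the file type here. The function name append_ext is hypothetical, and a single tab is used as the separator.

```shell
# Hypothetical wrapper; reads the files given as arguments, or stdin if none.
append_ext() {
    awk '
    /^File path:/ {
        ext = $0
        sub(/[ \t]+$/, "", ext)        # drop trailing blanks, if any
        if (ext ~ /\.[^.\/]*$/)        # a dot inside the last path component
            sub(/.*\./, ".", ext)      # keep only ".suffix"
        else
            ext = ""                   # no extension on this file
    }
    /Chunk Hash/       { flag = 1; print; next }   # header line: print as is
    /Whole File Hash:/ { flag = 0 }                # end of the chunk lines
    flag               { print $0 "\t" ext; next } # chunk line: append ext
    1                                              # everything else unchanged
    ' "$@"
}
```

Usage would be `append_ext data.txt`; widen the "\t" separator to taste.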

With GNU awk (gensub is a gawk extension):

awk --re-interval '
/^File/ {                                   # line starts with "File"
    s = gensub("[^.]+(.*)", "\\1", "1", $0) # keep everything from the first dot on, e.g. ".c" or ".jar"
}
/(..:){3,}/ {                               # line contains "xx:" three or more times in a row (a chunk hash line)
    gsub("[0-9]+$", "&\t\t\t\t\t" s, $0)    # append tabs and the extension after the trailing number
}
1' file                                     # print every line

Sorry, my English is not good and it is hard for me to express this. I'll try to write some explanation. – zxy