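awk is installed almost everywhere that bash is, and it avoids some of the pitfalls you may run into with sed (for example, XML attributes that are not in a consistent order).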
awk '
## set a variable to mark that we are in a mediaobject block
$1=="<mediaobject>" { object=1 }
## mark that we have exited the mediaobject block
$1=="</mediaobject>" { object=0 }
## if we are in a mediaobject block and we find an imageobject block
$1=="<imageobject" && object==1 {
    iobject=1 ## record that we are in an imageobject block
    id = substr($2, 5, length($2) - 6) ## the id; not needed for the output
}
## if we have a line with image data
$1~/<imagedata/ && iobject==1 {
    fileref=substr($2,9,length($2)-8) ## the path, including the quotation marks
    width=$3 ## the width attribute
}
## if we have a caption line
$1~/<caption>/ && iobject==1 {
    gsub("(</?caption>|^ *| *$)", "") ## remove the xml tags and leading/trailing whitespace
    caption=$0 ## record the modified line as the caption
}
## when we arrive at the end of an imageobject block
$1=="</imageobject>" && object==1 {
    iobject=0 ## record that we have left it
    printf("<img src=%s %s title=\"%s\" />\n", fileref, width, caption) ## print the record
}
' input
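For readers who have not seen awk written this way: a program is a list of pattern { action } pairs, and an action runs for every input line on which its pattern is true. The script above uses variables such as object and iobject as flags that remember which block the current line belongs to. Here is a minimal sketch of that flag technique on its own; the <block> tags, the sample lines, and the function name print_block_lines are made up for illustration and are not part of the original data.

```shell
# Hypothetical helper: print only the lines found between <block> and </block>.
print_block_lines() {
  awk '
  ## entering the block: set the flag and skip the tag line itself
  $1=="<block>"  { inblock=1; next }
  ## leaving the block: clear the flag and skip the tag line
  $1=="</block>" { inblock=0; next }
  ## any other line is printed only while the flag is set
  inblock        { print "kept: " $0 }
  '
}

# Made-up sample input; only the two lines inside the block are printed
printf '%s\n' 'before' '<block>' 'inside 1' 'inside 2' '</block>' 'after' | print_block_lines
```

An uninitialized awk variable counts as false, which is why the flag does not need to be set to 0 before the first line is read.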
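Although, as I said, this code should work fine regardless of how the attributes are ordered, it will fail if the order of the attributes within a line changes (which is unlikely). If that turns out to be a problem, you can do this instead: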
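## use match() to find the beginning of the attribute
## use a nested substr() to pull only the value of fileref (with quotation marks)
## the pattern accepts any characters between the quotation marks, so dots and digits in the path also match
fileref = substr(substr($0, match($0, /fileref="[^"]*"/), RLENGTH), 9)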
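The same idea can be tried in isolation. This is a minimal sketch, assuming the attribute value is always wrapped in double quotes; the function name get_fileref and the sample line are hypothetical. It calls match() first and then reads RSTART and RLENGTH, so it does not rely on the order in which awk evaluates function arguments.

```shell
# Hypothetical helper: print the quoted value of fileref, wherever it sits on the line.
get_fileref() {
  awk '
  ## match() sets RSTART and RLENGTH when the pattern is found
  match($0, /fileref="[^"]*"/) {
    ## skip the 8 characters of fileref= and keep the quoted value
    print substr($0, RSTART + 8, RLENGTH - 8)
  }
  '
}

# Made-up sample line with width placed before fileref
printf '%s\n' '<imagedata width="100" fileref="images/a.png" />' | get_fileref
```

For the sample line this prints "images/a.png", with the quotation marks included.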
Nice, but it has one small flaw. The result is ''. How can I get rid of these spaces? – user219882
Could you please briefly explain the code? I have never seen awk like this, so I don't understand it... – user219882
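It should work as written (so maybe it was a cut-and-paste error). Specifically, the regex on the gsub line contains the alternatives "^ *" and " *$", which match the leading and trailing spaces and replace them with "" (removing them). – worfly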
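The trimming step discussed in these comments can also be tested on its own. This is a minimal sketch under two assumptions that the thread does not confirm: that the stray whitespace may include tabs (which " *" does not match), and that some of it may sit just inside the caption tags, where it only becomes leading or trailing once the tags are gone. The function name trim_caption and the sample line are hypothetical.

```shell
# Hypothetical helper: strip the caption tags, then trim spaces and tabs at both ends.
trim_caption() {
  awk '{
    gsub(/<\/?caption>/, "")        ## first pass: drop the opening and closing tags
    gsub(/^[ \t]+|[ \t]+$/, "")     ## second pass: trim leading/trailing spaces and tabs
    print "[" $0 "]"                ## brackets make any leftover whitespace visible
  }'
}

# Made-up sample line with a tab, spaces outside the tags, and spaces inside them
printf '\t  <caption> A small image </caption>  \n' | trim_caption
```

For the sample line this prints [A small image]. Doing the work in two passes is a design choice: a single gsub with all three alternatives trims only the whitespace that was already at the very start or end of the line.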