2013-05-13 25 views
0

Formatting text into a separate file

I have a small problem and don't know where to start. I have a text file containing the following information:

MINI COOPER 2007, 30,000 miles, British Racing Green, full service history, metallic paint, alloys. Great condition. £5,995 ono Telephone xxxxx xxxxx 

I need to fill the above information into the following format:

<advert>
    <manufacturer></manufacturer>
    <make></make>
    <model></model>
    <price></price>
    <miles></miles>
    <image></image>
    <desc><![CDATA[desc]]></desc>
    <expiry></expiry> // Any point in the future
    <url></url> // Optional
</advert>

The output should be:

<advert>
    <manufacturer>MINI</manufacturer>
    <make></make>
    <model></model>
    <price>5,995</price> 
    <miles>30000</miles> 
    <image></image> 
    <desc><![CDATA[2007, British Racing Green, full service history, metallic paint, alloys. Great condition.Telephone xxxxxx xxxxxx]]></desc> 
    <expiry>Todays date 13/05/2013</expiry> 
    <url></url> 
</advert> 

Any help would be greatly appreciated.

+0

A 'python' script, or a 'gawk' script using '-F,', could help. What have you tried? Without showing the code you have tried, you won't get help... – 2013-05-13 11:00:41

+0

I was just after someone to point me in the right direction.. – 2013-05-13 11:11:40

+0

I did point you in some directions... but you have to learn enough to follow them. – 2013-05-13 11:14:35

Answers

1

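Since the commas are sometimes part of a field and sometimes not, you cannot use a comma (or anything else) as the field separator, so you need something like this in GNU awk (required for gensub() and strftime()):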

gawk '{
    print "<advert>"
    # The first whitespace-separated word is taken as the manufacturer.
    printf "\t<manufacturer>%s</manufacturer>\n", $1
    printf "\t<make></make>\n"
    printf "\t<model></model>\n"
    # Each regexp matches the whole line, so gensub() returns only the captured group.
    printf "\t<price>%s</price>\n", gensub(/.*£([[:digit:],]+).*/,"\\1",1)
    printf "\t<miles>%s</miles>\n", gensub(/.*[[:space:]]([[:digit:],]+)[[:space:]]+miles.*/,"\\1",1)
    printf "\t<image></image>\n"
    # The description is everything that follows "miles,".
    printf "\t<desc><![CDATA[%s]]></desc>\n", gensub(/.*[[:space:]]+miles[[:space:]]*,[[:space:]]*(.*)/,"\\1",1)
    printf "\t<expiry>Todays date %s</expiry>\n", strftime("%d/%m/%Y")
    printf "\t<url></url>\n"
    print "</advert>"
}' file
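To see why the comma cannot simply serve as the field separator, it helps to split the sample line naively. The snippet below is a small illustrative sketch in Python (the answer itself uses gawk); the variable names are made up for the example.

```python
line = ("MINI COOPER 2007, 30,000 miles, British Racing Green, "
        "full service history, metallic paint, alloys. Great condition. "
        "£5,995 ono Telephone xxxxx xxxxx")

# Splitting on every comma also cuts the mileage and the price in two,
# because their thousands separators are commas as well.
fields = [part.strip() for part in line.split(",")]
for index, part in enumerate(fields):
    print(index, repr(part))
```

The mileage ends up spread over two fields ("30" and "000 miles") and the price is cut after "£5", which is why the answer matches patterns instead of splitting on a separator.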

My editor seems to choke on the pound sign, so here is a run of the gawk script above using a # symbol instead:

$ cat file 
MINI COOPER 2007, 30,000 miles, British Racing Green, full service history, metallic paint, alloys. Great condition. #5,995 ono Telephone xxxxx xxxxx 

$ gawk '{
    print "<advert>"
    printf "\t<manufacturer>%s</manufacturer>\n", $1
    printf "\t<make></make>\n"
    printf "\t<model></model>\n"
    printf "\t<price>%s</price>\n", gensub(/.*#([[:digit:],]+).*/,"\\1",1)
    printf "\t<miles>%s</miles>\n", gensub(/.*[[:space:]]([[:digit:],]+)[[:space:]]+miles.*/,"\\1",1)
    printf "\t<image></image>\n"
    printf "\t<desc><![CDATA[%s]]></desc>\n", gensub(/.*[[:space:]]+miles[[:space:]]*,[[:space:]]*(.*)/,"\\1",1)
    printf "\t<expiry>Todays date %s</expiry>\n", strftime("%d/%m/%Y")
    printf "\t<url></url>\n"
    print "</advert>"
}' file
<advert>
     <manufacturer>MINI</manufacturer>
     <make></make>
     <model></model>
     <price>5,995</price>
     <miles>30,000</miles>
     <image></image>
     <desc><![CDATA[British Racing Green, full service history, metallic paint, alloys. Great condition. #5,995 ono Telephone xxxxx xxxxx]]></desc>
     <expiry>Todays date 13/05/2013</expiry>
     <url></url>
</advert>

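Since a Python script was suggested in the comments as well, the same pattern-matching idea can be sketched with the standard re module. This is only an illustration under assumptions: the function name parse_advert and the dictionary keys are invented here, and unlike the greedy gawk patterns, re.search picks the first match on the line rather than the last.

```python
import re
from datetime import date


def parse_advert(line):
    """Pull the advert fields out of one line with regular expressions."""
    words = line.split()
    price = re.search(r"£([\d,]+)", line)
    miles = re.search(r"\b([\d,]+)\s+miles\b", line)
    # Everything after "<number> miles," is treated as the description.
    desc = re.search(r"\bmiles\s*,\s*(.*)", line)
    return {
        "manufacturer": words[0] if words else "",
        "price": price.group(1) if price else "",
        "miles": miles.group(1) if miles else "",
        "desc": desc.group(1).strip() if desc else "",
        "expiry": date.today().strftime("%d/%m/%Y"),
    }


sample = ("MINI COOPER 2007, 30,000 miles, British Racing Green, "
          "full service history, metallic paint, alloys. Great condition. "
          "£5,995 ono Telephone xxxxx xxxxx")
for key, value in parse_advert(sample).items():
    print(f"{key}: {value}")
```

As in the gawk version, the description still contains the price and the mileage keeps its thousands separator; both would need extra handling to match the output requested in the question.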
0

Here is some example code that should at least get you going. Run it like this:

awk -f script.awk file.txt

Contents of script.awk:

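{
    # Reset per record so values do not carry over from the previous line.
    miles = price = ""

    for (i=1;i<=NF;i++) {

        # The word before "miles," is the mileage; blank out both fields.
        if ($i == "miles,") {
            miles = $(i - 1)

            $i = $(i - 1) = ""
        }

        # A field containing the pound sign is the price; drop it together
        # with the following word ("ono"). substr() assumes a UTF-8 locale,
        # where the pound sign counts as one character.
        if ($i ~ /£/) {
            price = substr($i, 2)

            $i = $(i + 1) = ""
        }
    }

    # Collapse the runs of spaces left behind by the blanked fields.
    gsub(/ +/, " ");

    print "<advert>"
    print "\t<manufacturer>" $1 "</manufacturer>"
    print "\t<make></make>"
    print "\t<model></model>"
    print "\t<price>" price "</price>"
    print "\t<miles>" miles "</miles>"
    print "\t<image></image>"
    print "\t<desc><![CDATA[" $0 "]]></desc>"
    # strftime() is not part of POSIX awk; use an awk that provides it, such as gawk.
    print "\t<expiry>" strftime("%d/%m/%Y") "</expiry>"
    print "\t<url></url>"
    print "</advert>"
}

Results:

<advert>
    <manufacturer>MINI</manufacturer>
    <make></make>
    <model></model>
    <price>5,995</price>
    <miles>30,000</miles>
    <image></image>
    <desc><![CDATA[MINI COOPER 2007, British Racing Green, full service history, metallic paint, alloys. Great condition. Telephone xxxxx xxxxx]]></desc>
    <expiry>13/05/2013</expiry>
    <url></url>
</advert>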
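For comparison, here is a rough Python sketch of the same word-by-word scan. It also strips the thousands separator from the mileage, as the output in the question asks for. The function name advert_xml and its expiry parameter are hypothetical, and treating "ono" as part of the price is an assumption based on the single sample line.

```python
from datetime import date
from xml.sax.saxutils import escape


def advert_xml(line, expiry=None):
    """Return one <advert> element built from a single advert line."""
    price = miles = ""
    kept = []            # words that stay in the description
    skip_next = False
    words = line.split()
    for i, word in enumerate(words):
        if skip_next:
            skip_next = False
            continue
        if word.startswith("£"):
            price = word[1:].rstrip(",.")
            # Assumption: an "ono" right after the price is dropped with it.
            skip_next = i + 1 < len(words) and words[i + 1].lower() == "ono"
            continue
        if (word.rstrip(",.") == "miles" and kept
                and kept[-1].replace(",", "").isdigit()):
            # The number before "miles" is the mileage; strip its commas.
            miles = kept.pop().replace(",", "")
            continue
        kept.append(word)

    manufacturer = kept[0] if kept else ""
    # A CDATA section cannot contain "]]>", so split it if it ever occurs.
    desc = " ".join(kept).replace("]]>", "]]]]><![CDATA[>")
    expiry = expiry or date.today()
    return "\n".join([
        "<advert>",
        f"\t<manufacturer>{escape(manufacturer)}</manufacturer>",
        "\t<make></make>",
        "\t<model></model>",
        f"\t<price>{escape(price)}</price>",
        f"\t<miles>{escape(miles)}</miles>",
        "\t<image></image>",
        f"\t<desc><![CDATA[{desc}]]></desc>",
        f"\t<expiry>{expiry.strftime('%d/%m/%Y')}</expiry>",
        "\t<url></url>",
        "</advert>",
    ])


sample = ("MINI COOPER 2007, 30,000 miles, British Racing Green, "
          "full service history, metallic paint, alloys. Great condition. "
          "£5,995 ono Telephone xxxxx xxxxx")
print(advert_xml(sample))
```

Passing a date explicitly, for example advert_xml(sample, expiry=date(2014, 1, 1)), covers the "any point in the future" requirement from the question; by default the sketch falls back to today's date, as both answers do.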
+0

Thank you all very much. At least I now have somewhere to start. – 2013-05-13 12:50:27

+0

Steve, thank you very much for the information. – 2013-05-13 13:08:47