2017-03-21

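Formatting XML as comma-separated using sed or awk

I need help formatting this XML file into comma-separated form so it can be imported into a table. I have played with sed and awk, but it has been an uphill battle.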

Example:

<requestID>224</requestID>, 
    <ErrorMessage>The following is required: PersonName </ErrorMessage>, 
    <?xml version="1.0" encoding="UTF-8"?><TCRMService xmlns="http://www.ibm.com/mdm/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ibm.com/mdm/schema MDMDomains.xsd"><RequestControl><requestID>224</requestID><DWLControl></TCRMService> 
<requestID>615</requestID>, 
    <ErrorMessage>The following is required: PersonName </ErrorMessage>, 
    <?xml version="1.0" encoding="UTF-8"?><TCRMService xmlns="http://www.ibm.com/mdm/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ibm.com/mdm/schema MDMDomains.xsd"><RequestControl><requestID>224</requestID><DWLControl></TCRMService> 

Desired result:

<requestID>224</requestID>,<ErrorMessage>The following is required: PersonName </ErrorMessage>,<?xml version="1.0" encoding="UTF-8"?><TCRMService xmlns="http://www.ibm.com/mdm/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ibm.com/mdm/schema MDMDomains.xsd"><RequestControl><requestID>224</requestID><DWLControl></TCRMService> 
<requestID>615</requestID>,<ErrorMessage>The following is required: PersonName </ErrorMessage>,<?xml version="1.0" encoding="UTF-8"?><TCRMService xmlns="http://www.ibm.com/mdm/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ibm.com/mdm/schema MDMDomains.xsd"><RequestControl><requestID>224</requestID><DWLControl></TCRMService> 

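I have been able to add the commas, I think, with:

sed 's/ErrorMessage>$/ErrorMessage>,/; s/requestID>$/requestID>,/'

I thought it would be best to remove the tabs as well, but the following also removes all the spaces: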

tr -d ' \t' <grep.xml > test.xml 

I don't know how to move a line to the end of the previous line...

So this partly works...

awk '{if ($0 ~ /<ErrorMessage>,*/) { printf "%s", $0; getline var; printf "%s\n", var} else {print $0}}' test.xml 


    <requestID>260</requestID>, 
      <ErrorMessage>The following is required: PersonName</ErrorMessage>,<?xml version="1.0" encoding="UTF-8"?><TCRMService xmlns="http://www.ibm.com/mdm/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ibm.com/mdm/schema MDMDomains.xsd"><RequestControl><requestID>260</requestID></TCRMService> 

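But now I am having trouble moving the ErrorMessage line up to the end of the requestID line...

Please note that the ErrorMessage line also contains a requestID on the same line. I think the key is to match on the pattern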

  </requestID>, 
Where does request ID 615 come from?

Sorry, it is supposed to be 615. Each requestID represents a unique record. – Janie

It still shows ID 224 inside RequestControl on both lines.

Answers

Try this -

awk -v FS="" '{gsub(/^[[:space:]]+/,"",$0);ORS=(NR%3==0?RS:FS)}1' f 

Wow. That works. Thank you. Off to study and understand what the syntax means. – Janie

You're welcome. You can start your research here - https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html
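
For readers studying the one-liner above: it strips leading whitespace from every line, then sets the output record separator (ORS) to a newline after every third line and to the empty string otherwise, so each group of three input lines is printed as one row. The trailing `1` is an always-true pattern that triggers the default print. A spelled-out sketch of the same idea follows. The function name `join_every_three` is hypothetical, `[ \t]` is used in place of `[[:space:]]`, and the trailing-blank trim is an addition that the original one-liner does not perform. It assumes every record spans exactly three lines.

```shell
#!/bin/sh
# join_every_three: hypothetical helper name, not part of the original answer.
# Reads from stdin and assumes exactly three input lines per record.
join_every_three() {
  awk '
    {
      sub(/^[ \t]+/, "")    # strip leading indentation
      sub(/[ \t]+$/, "")    # strip trailing blanks (an addition to the original)
      # After every third line, end the row with a newline;
      # otherwise print nothing, so the next line is glued on.
      ORS = (NR % 3 == 0) ? "\n" : ""
      print
    }
  '
}

# Usage with a small made-up sample:
printf '%s\n' '<requestID>1</requestID>, ' \
              '    <ErrorMessage>x</ErrorMessage>, ' \
              '    <?xml version="1.0"?><a/>' | join_every_three
```

Counting lines is simple but fragile: a blank or missing line shifts every later record, so it only suits input that is known to be regular.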

In awk, very quick and dirty (assumes only spaces, no tabs):

$ awk '{gsub(/^ +| +$|, *$/,"");printf "%s%s", ($0~/^ *<requestID>/?ORS:","), $0}END{print ""}' file 

<requestID>224</requestID>,<ErrorMessage>The following is required: PersonName </ErrorMessage>,<?xml version="1.0" encoding="UTF-8"?><TCRMService xmlns="http://www.ibm.com/mdm/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ibm.com/mdm/schema MDMDomains.xsd"><RequestControl><requestID>224</requestID><DWLControl></TCRMService> 
<requestID>224</requestID>,<ErrorMessage>The following is required: PersonName </ErrorMessage>,<?xml version="1.0" encoding="UTF-8"?><TCRMService xmlns="http://www.ibm.com/mdm/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ibm.com/mdm/schema MDMDomains.xsd"><RequestControl><requestID>224</requestID><DWLControl></TCRMService> 

Now only the leading newline needs to be removed, but I have to catch a bus.
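
One way the leading newline could be avoided is to choose the separator from whether anything has been printed yet. The sketch below is not the answerer's code: the function name `join_on_request_id` and the counter `seen` are hypothetical, and it additionally trims tabs and skips empty lines. It starts a new row only at lines that begin with `<requestID>` after trimming, so a `<requestID>` embedded later in the XML line does not split the record.

```shell
#!/bin/sh
# join_on_request_id: hypothetical helper name, not part of the original answer.
# Starts a new output row at each line that begins with <requestID>;
# every other line is appended to the current row after a comma.
join_on_request_id() {
  awk '
    {
      gsub(/^[ \t]+|[ \t]+$/, "")   # trim blanks and tabs at both ends
      sub(/,$/, "")                 # drop the trailing comma; it is re-added below
      if ($0 == "") next            # skip empty lines
      if (seen++ == 0)              sep = ""    # very first line: no separator
      else if ($0 ~ /^<requestID>/) sep = "\n"  # start of a new record
      else                          sep = ","   # continuation of the current record
      printf "%s%s", sep, $0
    }
    END { if (seen) print "" }      # terminate the last row
  '
}

# Usage: join_on_request_id < file
```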

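
So I tried this and got the error: awk: illegal primary in regular expression ^ +|? +$|, *$ at +$|, *$ source line number 1 context is {gsub(/^ +|? +$|, >>> *$/,"") <<< – Janie

Yes, a '?' as the first character of a regexp segment is ambiguous, so some awks will tell you so while others may assume you meant it literally. I haven't read the question closely, but whatever it is for, starting a regexp segment with a bare '?' is wrong.

It was a typo. It makes no sense in that context anyway (the trimming part: 'gsub(/...|? +$|.../)').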


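Why not a Perl snippet? With the one below, newlines are removed and runs of two or more whitespace characters are removed. Since the input file you posted in the main question already has the commas in place, no commas are added.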

$ cat file3 |nl 
    1 <requestID>224</requestID>, 
    2  <ErrorMessage>The following is required: PersonName </ErrorMessage>, 
    3  <?xml version="1.0" encoding="UTF-8"?><TCRMService xmlns="http://www.ibm.com/mdm/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ibm.com/mdm/schema MDMDomains.xsd"><RequestControl><requestID>224</requestID><DWLControl></TCRMService> 
    4 <requestID>615</requestID>, 
    5  <ErrorMessage>The following is required: PersonName </ErrorMessage>, 
    6  <?xml version="1.0" encoding="UTF-8"?><TCRMService xmlns="http://www.ibm.com/mdm/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ibm.com/mdm/schema MDMDomains.xsd"><RequestControl><requestID>224</requestID><DWLControl></TCRMService> 

$ perl -pe 's/\n//g; s/[[:space:]]{2,}//g; s/<\/TCRMService>/$&\n/g' file3 |nl 
    1 <requestID>224</requestID>,<ErrorMessage>The following is required: PersonName </ErrorMessage>,<?xml version="1.0" encoding="UTF-8"?><TCRMService xmlns="http://www.ibm.com/mdm/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ibm.com/mdm/schema MDMDomains.xsd"><RequestControl><requestID>224</requestID><DWLControl></TCRMService> 
    2 <requestID>615</requestID>,<ErrorMessage>The following is required: PersonName </ErrorMessage>,<?xml version="1.0" encoding="UTF-8"?><TCRMService xmlns="http://www.ibm.com/mdm/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ibm.com/mdm/schema MDMDomains.xsd"><RequestControl><requestID>224</requestID><DWLControl></TCRMService> 

You chose the awk solution, but just for my own information I would like to know whether this solution works on your real data.
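
Since the question also mentions sed, here is a comparable sketch in sed alone, offered for comparison with the awk and Perl approaches above rather than as anything proposed in the thread. The function name `join_with_sed` is hypothetical. It relies on the `N` command, which appends the next input line to the pattern space, and it assumes each record is exactly three consecutive lines. Behaviour when the line count is not a multiple of three differs between sed implementations.

```shell
#!/bin/sh
# join_with_sed: hypothetical helper name, not taken from the thread.
# Assumes each record is exactly three consecutive lines on stdin.
join_with_sed() {
  sed -e 's/^[[:space:]]*//' \
      -e 'N' -e 'N' \
      -e 's/[[:space:]]*\n[[:space:]]*//g' \
      -e 's/[[:space:]]*$//'
  # 1st expression: strip indentation from the first line of the record
  # N, N:           pull the next two lines into the pattern space
  # 3rd expression: delete each embedded newline plus surrounding whitespace
  # 4th expression: trim whitespace at the end of the joined row
}

# Usage: join_with_sed < file
```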