2016-11-24 54 views
4

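Converting data from XML to an R data frame

I'm trying to convert an XML file into a data frame, but the format seems to be off. I've looked at different tutorials, and although I've had moderate success getting the information I need by using for loops and navigating the parsed file, I've been told this solution isn't efficient.

The code I tried for this is: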

require(XML) 
parsed<-xmlParse("SEWL.xml") 
xmlToDataFrame(parsed) 

But it gives an error:

Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c("\"LL18179\"\"2016/08\"0.32485.43896.59801.2131\"OK\"", :
  duplicate subscripts for columns

This other code works, but the format is not what I need:

require(XML) 
require(plyr) 
pldf<-ldply(xmlToList("SEWL.xml"),data.frame) 

The resulting data frame is as follows:

  .id    X..i.. text .attrs test.code test.validuntil test.meas.text test.meas..attrs test.meas.text.1 
1 technician    "John" <NA> <NA>  <NA>   <NA>   <NA>    <NA>    <NA> 
2 location    "CO" <NA> <NA>  <NA>   <NA>   <NA>    <NA>    <NA> 
3  temp    <NA> 21.3 celsius  <NA>   <NA>   <NA>    <NA>    <NA> 
4  runtype   "routine" <NA> <NA>  <NA>   <NA>   <NA>    <NA>    <NA> 
5  sample    <NA> <NA> 2323 "LL18179"  "2016/08"   0.3248   baseline   5.4389 
6  sample    <NA> <NA> 2323 "LL18179"  "2016/08"   0.3248   baseline   5.4389 
7  sample    <NA> <NA> 8979237 "AA09453"  "2016/03"   0.0117   baseline   5.6012 
8  sample    <NA> <NA> 8979237 "AA09453"  "2016/03"   0.0117   baseline   5.6012 
9  .attrs 2015_07_31_11_33_22 <NA> <NA>  <NA>   <NA>   <NA>    <NA>    <NA> 
10  .attrs   20150731 <NA> <NA>  <NA>   <NA>   <NA>    <NA>    <NA> 
11  .attrs    113322 <NA> <NA>  <NA>   <NA>   <NA>    <NA>    <NA> 
    test.meas..attrs.1 test.meas.text.2 test.meas..attrs.2 test.calc test.result test..attrs test.code.1 test.validuntil.1 
1    <NA>    <NA>    <NA>  <NA>  <NA>  <NA>  <NA>    <NA> 
2    <NA>    <NA>    <NA>  <NA>  <NA>  <NA>  <NA>    <NA> 
3    <NA>    <NA>    <NA>  <NA>  <NA>  <NA>  <NA>    <NA> 
4    <NA>    <NA>    <NA>  <NA>  <NA>  <NA>  <NA>    <NA> 
5     std   6.5980    data 1.2131  "OK"  laslum "ATR150607"   "2017/05" 
6     std   6.5980    data 1.2131  "OK"   3 "ATR150607"   "2017/05" 
7     std   1.1431    data 0.2041  "FAIL"  absat  <NA>    <NA> 
8     std   1.1431    data 0.2041  "FAIL"   2  <NA>    <NA> 
9    <NA>    <NA>    <NA>  <NA>  <NA>  <NA>  <NA>    <NA> 
10    <NA>    <NA>    <NA>  <NA>  <NA>  <NA>  <NA>    <NA> 
11    <NA>    <NA>    <NA>  <NA>  <NA>  <NA>  <NA>    <NA> 
    test.meas.text.3 test.meas..attrs.3 test.meas.text.4 test.meas..attrs.4 test.meas.text.5 test.meas..attrs.5 
1    <NA>    <NA>    <NA>    <NA>    <NA>    <NA> 
2    <NA>    <NA>    <NA>    <NA>    <NA>    <NA> 
3    <NA>    <NA>    <NA>    <NA>    <NA>    <NA> 
4    <NA>    <NA>    <NA>    <NA>    <NA>    <NA> 
5   0.0673   baseline   4.9721    std   10.3851    data 
6   0.0673   baseline   4.9721    std   10.3851    data 
7    <NA>    <NA>    <NA>    <NA>    <NA>    <NA> 
8    <NA>    <NA>    <NA>    <NA>    <NA>    <NA> 
9    <NA>    <NA>    <NA>    <NA>    <NA>    <NA> 
10    <NA>    <NA>    <NA>    <NA>    <NA>    <NA> 
11    <NA>    <NA>    <NA>    <NA>    <NA>    <NA> 
    test.calc.1 test.result.1 test..attrs.1 
1   <NA>   <NA>   <NA> 
2   <NA>   <NA>   <NA> 
3   <NA>   <NA>   <NA> 
4   <NA>   <NA>   <NA> 
5  2.0886  "Warning"   atr 
6  2.0886  "Warning"    1 
7   <NA>   <NA>   <NA> 
8   <NA>   <NA>   <NA> 
9   <NA>   <NA>   <NA> 
10  <NA>   <NA>   <NA> 
11  <NA>   <NA>   <NA> 

Here is a sample of the XML file I'm using:

<?xml version="1.0" encoding="UTF-8"?> 
<experiment name="abc123" date="20150731" time="113322"> 
    <technician>"John"</technician> 
    <location>"CO"</location> 
    <temp scale="celsius">21.3</temp> 
    <runtype>"routine"</runtype> 
    <sample id="2323"> 
     <test name="laslum" order="3"> 
      <code>"LL18179"</code> 
      <validuntil>"2016/08"</validuntil> 
      <meas name="baseline">0.3248</meas> 
      <meas name="std">5.4389</meas> 
      <meas name="data">6.5980</meas> 
      <calc>1.2131</calc> 
      <result>"OK"</result> 
     </test> 
     <test name="atr" order="1"> 
      <code>"ATR150607"</code> 
      <validuntil>"2017/05"</validuntil> 
      <meas name="baseline">0.0673</meas> 
      <meas name="std">4.9721</meas> 
      <meas name="data">10.3851</meas> 
      <calc>2.0886</calc> 
      <result>"Warning"</result> 
     </test> 
    </sample> 
    <sample id="8979237"> 
     <test name="absat" order="2"> 
      <code>"AA09453"</code> 
      <validuntil>"2016/03"</validuntil> 
      <meas name="baseline">0.0117</meas> 
      <meas name="std">5.6012</meas> 
      <meas name="data">1.1431</meas> 
      <calc>0.2041</calc> 
      <result>"FAIL"</result> 
     </test> 
    </sample> 
</experiment> 

And I would like to get this data frame:

experiment technician location temp runtype sample test order  code validuntil baseline std data calc result  date time 
1  abc123  John  CO 21.3 routine 2323 laslum  3 LL18179 2016/08 0.3248 5.4389 6.5980 1.2131  OK 20150731 113322 
2  abc123  John  CO 21.3 routine 2323 atr  1 ATR150607 2017/05 0.0673 4.9721 10.3851 2.0886 Warning 20150731 113322 
3  abc123  John  CO 21.3 routine 8979237 absat  2 AA09453 2016/03 0.0117 5.6012 1.1431 0.2041 FAIL 20150731 113322 

I don't need exactly the same format, just something close enough that I can transform it into the example.


There is also the 'xml2' package, which may be worth a look. – lmo

Answers

6

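We show two ways of parsing the XML. The first (a triple iteration over experiment/sample/test) may run faster, but the second (a single loop over the test nodes, with each test node climbing the tree to fetch its ancestors) has simpler code.

1) Using Lines from Note 2, we perform a triple xpathApply/xpathSApply iteration over the experiment, sample and test nodes. e, s and t denote the current node at each of those levels.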

library(XML) 
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE) 

do.call("rbind", xpathApply(doc, "//experiment", function(e) { 
    data.frame(experiment = xmlAttrs(e)[["name"]], 
     technician = xmlValue(e[["technician"]]), 
     location = xmlValue(e[["location"]]), 
     temp = xmlValue(e[["temp"]]), 
     runtype = xmlValue(e[["runtype"]]), 
     t(do.call(cbind, xpathApply(e, "sample", function(s) { 
      sample <- xmlAttrs(s)[["id"]] 
      xpathSApply(s, "test", function(t) { 
        c(sample = sample, 
         test = xmlAttrs(t)[["name"]], 
         order = xmlAttrs(t)[["order"]], 
         code = xmlValue(t[["code"]]), 
         validuntil = xmlValue(t[["validuntil"]]), 
         baseline = xmlValue(t["meas"][[1]]), 
         std = xmlValue(t["meas"][[2]]), 
         data = xmlValue(t["meas"][[3]]), 
         calc = xmlValue(t[["calc"]]), 
         result = xmlValue(t[["result"]]) 
      )})}))), 
     date = xmlAttrs(e)[["date"]], 
     time = xmlAttrs(e)[["time"]] 
)})) 

giving:

experiment technician location temp runtype sample test order 
1  abc123  "John"  "CO" 21.3 "routine" 2323 laslum  3 
2  abc123  "John"  "CO" 21.3 "routine" 2323 atr  1 
3  abc123  "John"  "CO" 21.3 "routine" 8979237 absat  2 
     code validuntil baseline std data calc result  date 
1 "LL18179" "2016/08" 0.3248 5.4389 6.5980 1.2131  "OK" 20150731 
2 "ATR150607" "2017/05" 0.0673 4.9721 10.3851 2.0886 "Warning" 20150731 
3 "AA09453" "2016/03" 0.0117 5.6012 1.1431 0.2041 "FAIL" 20150731 
    time 
1 113322 
2 113322 
3 113322 
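
The code above picks the three meas values by position (first, second, third). As a hedged variation, not part of the original answer, the values could instead be keyed by each node's name attribute, which would still work if the meas nodes appeared in a different order. The fragment frag and the variable names tst and meas below are hypothetical and only serve the illustration.

```r
library(XML)

# Hypothetical fragment with the <meas> children in a shuffled order.
frag <- '<test name="laslum" order="3">
  <meas name="data">6.5980</meas>
  <meas name="baseline">0.3248</meas>
  <meas name="std">5.4389</meas>
</test>'
doc <- xmlParse(frag, asText = TRUE)
tst <- xmlRoot(doc)  # the single <test> node

# Character vector of measurement values, named by the name attribute
# of each <meas> node rather than by its position.
meas <- setNames(xpathSApply(tst, "meas", xmlValue),
                 xpathSApply(tst, "meas", xmlGetAttr, "name"))

meas[["baseline"]]
meas[["std"]]
meas[["data"]]
```

Inside either approach, meas[["baseline"]] and friends could then replace the positional t["meas"][[1]] style lookups, assuming every test node carries the same three names.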

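2) Here is another approach in which we loop only over the test nodes and then reach up to the parent and grandparent to get the corresponding sample and experiment information.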

library(XML) 
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE) 

do.call("rbind", xpathApply(doc, "//test", function(t) { # t is test node 
     s <- xmlParent(t) # s is sample node 
     e <- xmlParent(s) # e is experiment node 
     data.frame(experiment = xmlAttrs(e)[["name"]], 
      technician = xmlValue(e[["technician"]]), 
      location = xmlValue(e[["location"]]), 
      temp = xmlValue(e[["temp"]]), 
      runtype = xmlValue(e[["runtype"]]), 
      sample = xmlAttrs(s)[["id"]], 
      test = xmlAttrs(t)[["name"]], 
      order = xmlAttrs(t)[["order"]], 
      code = xmlValue(t[["code"]]), 
      validuntil = xmlValue(t[["validuntil"]]), 
      baseline = xmlValue(t["meas"][[1]]), 
      std = xmlValue(t["meas"][[2]]), 
      data = xmlValue(t["meas"][[3]]), 
      calc = xmlValue(t[["calc"]]), 
      result = xmlValue(t[["result"]]), 
      date = xmlAttrs(e)[["date"]], 
      time = xmlAttrs(e)[["time"]] 
     ) 
})) 

giving:

experiment technician location temp runtype sample test order 
1  abc123  "John"  "CO" 21.3 "routine" 2323 laslum  3 
2  abc123  "John"  "CO" 21.3 "routine" 2323 atr  1 
3  abc123  "John"  "CO" 21.3 "routine" 8979237 absat  2 
     code validuntil baseline std data calc result  date 
1 "LL18179" "2016/08" 0.3248 5.4389 6.5980 1.2131  "OK" 20150731 
2 "ATR150607" "2017/05" 0.0673 4.9721 10.3851 2.0886 "Warning" 20150731 
3 "AA09453" "2016/03" 0.0117 5.6012 1.1431 0.2041 "FAIL" 20150731 
    time 
1 113322 
2 113322 
3 113322 
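
Both results still carry the literal double quotes from the XML text and hold every column as text, whereas the table requested in the question has neither. A minimal base R sketch of one possible clean-up step follows; res is a hypothetical stand-in for the data frame returned above, reproducing only a few of its columns.

```r
# Hypothetical stand-in for the result of either approach; all columns are
# character strings, some with literal double quotes inside them.
res <- data.frame(
  technician = c('"John"', '"John"', '"John"'),
  temp       = c("21.3", "21.3", "21.3"),
  sample     = c("2323", "2323", "8979237"),
  code       = c('"LL18179"', '"ATR150607"', '"AA09453"'),
  calc       = c("1.2131", "2.0886", "0.2041"),
  result     = c('"OK"', '"Warning"', '"FAIL"'),
  stringsAsFactors = FALSE
)

# Remove the literal double quotes from every column ...
clean <- res
clean[] <- lapply(res, function(x) gsub('"', "", x, fixed = TRUE))

# ... then let R pick a suitable type for each column
# (numeric or integer where possible, character otherwise).
clean[] <- lapply(clean, type.convert, as.is = TRUE)

str(clean)
```

Assigning into clean[] keeps the data frame structure while replacing the columns. Whether automatic type conversion is wanted for identifiers such as sample is a judgment call; those columns could be left out of the second step.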

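Note 1:

By the way, if you read the input XML file, SEWL.xml, into Excel, it does a reasonable job of turning it into a tabular format, although further processing would be needed to convert it exactly into the table in the question.

Note 2:

The input Lines as an R object is: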

Lines <- '<?xml version="1.0" encoding="UTF-8"?> 
<experiment name="abc123" date="20150731" time="113322"> 
    <technician>"John"</technician> 
    <location>"CO"</location> 
    <temp scale="celsius">21.3</temp> 
    <runtype>"routine"</runtype> 
    <sample id="2323"> 
     <test name="laslum" order="3"> 
      <code>"LL18179"</code> 
      <validuntil>"2016/08"</validuntil> 
      <meas name="baseline">0.3248</meas> 
      <meas name="std">5.4389</meas> 
      <meas name="data">6.5980</meas> 
      <calc>1.2131</calc> 
      <result>"OK"</result> 
     </test> 
     <test name="atr" order="1"> 
      <code>"ATR150607"</code> 
      <validuntil>"2017/05"</validuntil> 
      <meas name="baseline">0.0673</meas> 
      <meas name="std">4.9721</meas> 
      <meas name="data">10.3851</meas> 
      <calc>2.0886</calc> 
      <result>"Warning"</result> 
     </test> 
    </sample> 
    <sample id="8979237"> 
     <test name="absat" order="2"> 
      <code>"AA09453"</code> 
      <validuntil>"2016/03"</validuntil> 
      <meas name="baseline">0.0117</meas> 
      <meas name="std">5.6012</meas> 
      <meas name="data">1.1431</meas> 
      <calc>0.2041</calc> 
      <result>"FAIL"</result> 
     </test> 
    </sample> 
</experiment>' 

This seems to be going in the right direction. How do I replace the Lines object with a call to the actual XML file? – Variax


Remove 'asText = TRUE' and use the file name in place of 'Lines'. For display on SO we used string input to keep the presentation self-contained. –


That did the trick. Thanks a lot – Variax
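
For reference, the change described in the comment above might look roughly like the sketch below. To keep it self-contained it first writes a small, hypothetical XML file to a temporary path; with the real data, path would simply be the file name, for example "SEWL.xml".

```r
library(XML)

# Hypothetical miniature file so the sketch runs on its own; in practice
# path would be the name of the real file, e.g. "SEWL.xml".
path <- tempfile(fileext = ".xml")
writeLines(c(
  '<experiment name="abc123" date="20150731" time="113322">',
  '  <sample id="2323">',
  '    <test name="laslum" order="3"><calc>1.2131</calc></test>',
  '  </sample>',
  '</experiment>'
), path)

# Same call as in the answer, but with a file name instead of Lines
# and without asText = TRUE.
doc <- xmlTreeParse(path, useInternalNodes = TRUE)

# Quick check that the tree was read: the name attribute of every test node.
tests <- xpathSApply(doc, "//test", xmlGetAttr, "name")
tests
```

The rest of either approach is unchanged, since it only operates on doc.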