2017-03-21

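Extracting a deeply nested XML structure

I have the following XML file that I want to analyze with R. The XML is deeply nested, and the number of child nodes varies from element to element.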

<?xml version="1.0" encoding="UTF-8"?> 

<Alert date="20161223_2" type="full"> 
<Records> 
<Person Id="100"> 
    <PersonNameDetails> 
    <PersonNames id="Name1"> 
     <ReferenceGroup ReferenceGroupCode="ABC"/> 
     <ReferenceGroup ReferenceGroupCode="DEF"/> 
     <PersonNameValue> 
     <FirstName>Carl Bangouvounda</FirstName> 
     <Surname>Toziz</Surname> 
     </PersonNameValue> 
    </PersonNames> 
    <PersonNames id="Name2"> 
     <ReferenceGroup ReferenceGroupCode="ABC"/> 
     <ReferenceGroup ReferenceGroupCode="GHI" ReferenceGroupLanguageCode="en"/> 
     <ReferenceGroup ReferenceGroupCode="JKL"/> 
     <ReferenceGroup ReferenceGroupCode="MNO"/> 
     <ReferenceGroup ReferenceGroupCode="DEF"/> 
     <PersonNameValue> 
     <FirstName>Tozize</FirstName> 
     <Surname>Bangouvonda</Surname> 
     </PersonNameValue> 
    </PersonNames> 
    <PersonNames id="Name3"> 
     <ReferenceGroup ReferenceGroupCode="MNO"/> 
     <PersonNameValue> 
     <FirstName>Carol</FirstName> 
     <Surname>Tozize</Surname> 
     </PersonNameValue> 
    </PersonNames> 
    <PersonNames id="Name4"> 
     <ReferenceGroup ReferenceGroupCode="PQR"/> 
     <ReferenceGroup ReferenceGroupCode="MNO"/> 
     <PersonNameValue> 
     <FirstName>Carol</FirstName> 
     <MiddleName>Bangouvonda</MiddleName> 
     <Surname>Tozize</Surname> 
     </PersonNameValue> 
    </PersonNames> 
    <PersonNames id="Name5"> 
     <ReferenceGroup ReferenceGroupCode="GHI" ReferenceGroupLanguageCode="en"/> 
     <ReferenceGroup ReferenceGroupCode="JKL"/> 
     <ReferenceGroup ReferenceGroupCode="DEF"/> 
     <PersonNameValue> 
     <FirstName>Carl Bangouvonda</FirstName> 
     <Surname>Toziz</Surname> 
     </PersonNameValue> 
    </PersonNames> 
    </PersonNameDetails> 
</Person> 
</Records> 
</Alert> 

The expected output is as follows:

----------------------------------------------------------- 
Id | id | ReferenceGroup | FirstName | MiddleName | Surname 
----------------------------------------------------------- 
100 | Name1 | ABC, DEF | Carl Bangouvounda | NA | Toziz 
----------------------------------------------------------- 
100 | Name2 | ABC, GHI, JKL, MNO, DEF | Tozize | NA | Bangouvonda 
----------------------------------------------------------- 
100 | Name3 | MNO | Carol | NA | Tozize 
----------------------------------------------------------- 
100 | Name4 | PQR, MNO | Carol | Bangouvonda | Tozize 
----------------------------------------------------------- 
100 | Name5 | GHI, JKL, DEF | Carl Bangouvonda | NA | Toziz 
----------------------------------------------------------- 

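Id is an attribute of the Person element, and everything else comes from PersonNameDetails. I also want to concatenate the ReferenceGroupCode values within the same PersonNames element into a single string.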

Following a suggestion, I transformed the XML with XSLT using the code below:

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"> 
<xsl:output version="1.0" encoding="UTF-8" indent="yes" method="xml"/> 
<xsl:strip-space elements="*"/> 

    <xsl:template match="/Alert">
    <xsl:copy> 
     <xsl:apply-templates select="Records"/> 
    </xsl:copy> 
    </xsl:template> 

    <xsl:template match="Records">  
    <xsl:apply-templates select="Person"/>  
    </xsl:template> 

    <xsl:template match="Person">  
    <xsl:apply-templates select="PersonNameDetails"/>  
    </xsl:template> 

    <xsl:template match="PersonNameDetails">  
    <xsl:apply-templates select="PersonNames"/>  
    </xsl:template> 

    <xsl:template match="PersonNames">  
    <xsl:apply-templates select="PersonNameValue"/>  
    </xsl:template> 

    <xsl:template match="PersonNameValue"> 
    <PersonNameValue> 
     <Id><xsl:value-of select="ancestor::Person/@Id"/></Id> 
     <id><xsl:value-of select="ancestor::PersonNames/@id"/></id> 
     <xsl:copy-of select="FirstName"/> 
     <MiddleName><xsl:value-of select="MiddleName"/></MiddleName> 
     <Surname><xsl:value-of select="Surname"/></Surname> 
     <ReferenceGroupCode><xsl:value-of select="ancestor::PersonNames/ReferenceGroup/@ReferenceGroupCode"/></ReferenceGroupCode> 
    </PersonNameValue> 
    </xsl:template> 

</xsl:transform> 

How can I change the XSLT code so that the ReferenceGroup output becomes:

<ReferenceGroupCode>ABC,DEF</ReferenceGroupCode> 

Any help is highly appreciated.


I would rather not transform the XML with XSLT. Can you tell me what information you need to solve this XML parsing problem?

Answer

I'm not sure about XSLT, but you can use XPath on the PersonNames nodes and write a small function that handles missing or multiple values.

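library(XML)

# Parse the file and select every PersonNames node.
doc <- xmlParse("<your XML file>")
x <- getNodeSet(doc, "//PersonNames")

# Run an XPath query relative to one node. Return NA when nothing matches,
# otherwise collapse all matches into one comma-separated string.
xpath2 <- function(x, ...) {
    y <- xpathSApply(x, ...)
    if (length(y) == 0) NA else paste(y, collapse = ", ")
}

# Build one row per PersonNames node.
y <- data.frame(
    id             = sapply(x, xpath2, ".", xmlGetAttr, "id"),
    ReferenceGroup = sapply(x, xpath2, ".//ReferenceGroup", xmlGetAttr, "ReferenceGroupCode"),
    FirstName      = sapply(x, xpath2, ".//FirstName", xmlValue),
    MiddleName     = sapply(x, xpath2, ".//MiddleName", xmlValue),
    Surname        = sapply(x, xpath2, ".//Surname", xmlValue)
)
y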
     id          ReferenceGroup         FirstName  MiddleName     Surname
1 Name1                ABC, DEF Carl Bangouvounda        <NA>       Toziz
2 Name2 ABC, GHI, JKL, MNO, DEF            Tozize        <NA> Bangouvonda
3 Name3                     MNO             Carol        <NA>      Tozize
4 Name4                PQR, MNO             Carol Bangouvonda      Tozize
5 Name5           GHI, JKL, DEF  Carl Bangouvonda        <NA>       Toziz

Perhaps add the person ID by counting the number of PersonNames nodes under each Person?

n <- xpathSApply(doc, "//Person/PersonNameDetails", xmlSize) 
y$ID <- rep(xpathSApply(doc, "//Person", xmlGetAttr, "Id"), n)
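
The count-based approach relies on every Person having exactly one PersonNameDetails element. As an alternative, here is a minimal sketch that reads the Id from each PersonNames node's own ancestor with the XPath `ancestor::Person` axis, so no counting is needed. The inline XML is a trimmed, hypothetical two-person sample modeled on the question's structure, and `collapse_xpath` is just an illustrative name for the same helper idea as `xpath2`.

```r
library(XML)

# Hypothetical sample data: two Person elements, trimmed from the
# structure shown in the question.
xml_text <- '<Alert>
  <Records>
    <Person Id="100">
      <PersonNameDetails>
        <PersonNames id="Name1">
          <ReferenceGroup ReferenceGroupCode="ABC"/>
          <ReferenceGroup ReferenceGroupCode="DEF"/>
          <PersonNameValue>
            <FirstName>Carl</FirstName>
            <Surname>Toziz</Surname>
          </PersonNameValue>
        </PersonNames>
      </PersonNameDetails>
    </Person>
    <Person Id="200">
      <PersonNameDetails>
        <PersonNames id="Name1">
          <ReferenceGroup ReferenceGroupCode="MNO"/>
          <PersonNameValue>
            <FirstName>Carol</FirstName>
            <MiddleName>Bangouvonda</MiddleName>
            <Surname>Tozize</Surname>
          </PersonNameValue>
        </PersonNames>
      </PersonNameDetails>
    </Person>
  </Records>
</Alert>'

doc <- xmlParse(xml_text, asText = TRUE)
nodes <- getNodeSet(doc, "//PersonNames")

# Evaluate an XPath query relative to one node. Return NA when nothing
# matches, otherwise join all matches with ", ".
collapse_xpath <- function(node, ...) {
    out <- xpathSApply(node, ...)
    if (length(out) == 0) NA_character_ else paste(out, collapse = ", ")
}

# One row per PersonNames node; Id comes from the enclosing Person.
result <- data.frame(
    Id             = sapply(nodes, collapse_xpath, "ancestor::Person", xmlGetAttr, "Id"),
    id             = sapply(nodes, collapse_xpath, ".", xmlGetAttr, "id"),
    ReferenceGroup = sapply(nodes, collapse_xpath, "ReferenceGroup", xmlGetAttr, "ReferenceGroupCode"),
    FirstName      = sapply(nodes, collapse_xpath, ".//FirstName", xmlValue),
    MiddleName     = sapply(nodes, collapse_xpath, ".//MiddleName", xmlValue),
    Surname        = sapply(nodes, collapse_xpath, ".//Surname", xmlValue),
    stringsAsFactors = FALSE
)
print(result)
```

Because the Id is looked up per row, the result stays aligned even if a Person has several PersonNameDetails elements or none at all.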