R: Extracting specific node content from XML data

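Using R and the XML package (xmlTreeParse etc.), I have tried my best to read specific nodes from an XML file, without success. The following dummy XML sample represents the data I am working with: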

<item> 
<title> Mickey Mouse </title> 
<description> Cartoon </description> 
<pubDate> 25 Apr 1965 </pubDate> 
<disney:Filing web="http://www.waltdisney.com/archives"> 
<disney:fileNumber>125364</disney:fileNumber> 
<disney:assignedID>7389</disney:assignedID> 
<disney:Files> 
    <disney:File disney:set="1" disney:file="abc.mov" disney:type="B&W"/> 
    <disney:File disney:set="2" disney:file="def.mov" disney:type="Col"/> 
    <disney:File disney:set="3" disney:file="wzt.mov" disney:type="B&W"/> 
</disney:Files> 
</disney:Filing> 
</item>

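I applied xpathApply successfully to extract the first three nodes, but I cannot reach the nodes labeled disney:File. For some reason, anything from disney:Filing onward is unreadable ("invisible").

My goal is either to extract all the disney:File rows into a data frame or, even nicer, to first search for a specific disney:set and extract all the information from that node alone into a data frame. Any help would be great. Thanks in advance!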

Source

2014-07-16 PBolbrinker

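You need to use the namespace in your XPath. See 'xmlNamespaces' for more details. Without the actual XML file and its namespace definitions we cannot help. For example, 'xpathSApply(doc, "//*/disney:File", xmlValue)' might work, but there may be other namespaces. – jdharrison

If all you really want is the 'disney:File' data, and you are fairly sure each element sits on a single line, 'readLines' + 'grep' + 'str_extract' may be enough. There is no need for slow, memory-hungry tree parsing just because the input is XML. Of course, for more complex extraction (for example, pulling several kinds of data from each file), XML parsing makes sense. – hrbrmstr

Thanks to both of you, @jdharrison and hrbrmstr. I went with readLines etc., since that seemed simpler and more direct for this task. Great help! – PBolbrinker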
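The line-based approach suggested in the comments could look roughly like the sketch below. It is only an illustration under several assumptions: the character vector `lines` stands in for the result of `readLines()` on the real file, every disney:File element sits on one line, the helper `get_attr` is a hypothetical name, and base R's `regmatches`/`regexpr` are used in place of stringr's `str_extract` so that no extra package is needed.

```r
# Stand-in for readLines() on the real file (assumption: one element per line).
lines <- c(
  '<disney:Files>',
  '    <disney:File disney:set="1" disney:file="abc.mov" disney:type="B&W"/>',
  '    <disney:File disney:set="2" disney:file="def.mov" disney:type="Col"/>',
  '    <disney:File disney:set="3" disney:file="wzt.mov" disney:type="B&W"/>',
  '</disney:Files>'
)

# Keep only the lines holding a disney:File element; the trailing space in the
# pattern keeps the enclosing <disney:Files> tag out of the result.
file_lines <- grep("<disney:File ", lines, fixed = TRUE, value = TRUE)

# Hypothetical helper: extract the value of one disney:* attribute per line.
# Lines that lack the attribute are silently dropped by regmatches().
get_attr <- function(x, attr) {
  pattern <- paste0('disney:', attr, '="[^"]*"')
  hit <- regmatches(x, regexpr(pattern, x))
  sub('^[^"]*"([^"]*)"$', "\\1", hit)
}

files_df <- data.frame(
  set  = get_attr(file_lines, "set"),
  file = get_attr(file_lines, "file"),
  type = get_attr(file_lines, "type"),
  stringsAsFactors = FALSE
)

# Rows for one specific disney:set value.
set2 <- files_df[files_df$set == "2", ]
```

This trades robustness for speed: it breaks as soon as an element spans several lines or the attribute order and quoting change, which is where a real XML parser becomes the safer choice.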

Some sample data:

'<?xml version="1.0"?> 
<aw:PurchaseOrder 
    aw:PurchaseOrderNumber="99503" 
aw:OrderDate="1999-10-20" 
xmlns:aw="http://www.adventure-works.com"> 
<aw:Address aw:Type="Shipping"> 
<aw:Name>Ellen Adams</aw:Name> 
<aw:Street>123 Maple Street</aw:Street> 
<aw:City>Mill Valley</aw:City> 
<aw:State>CA</aw:State> 
<aw:Zip>10999</aw:Zip> 
<aw:Country>USA</aw:Country> 
</aw:Address> 
<aw:Address aw:Type="Billing"> 
<aw:Name>Tai Yee</aw:Name> 
<aw:Street>8 Oak Avenue</aw:Street> 
<aw:City>Old Town</aw:City> 
<aw:State>PA</aw:State> 
<aw:Zip>95819</aw:Zip> 
<aw:Country>USA</aw:Country> 
</aw:Address> 
<aw:DeliveryNotes>Please leave packages in shed by driveway.</aw:DeliveryNotes> 
<aw:Items> 
<aw:Item aw:PartNumber="872-AA"> 
<aw:ProductName>Lawnmower</aw:ProductName> 
<aw:Quantity>1</aw:Quantity> 
<aw:USPrice>148.95</aw:USPrice> 
<aw:Comment>Confirm this is electric</aw:Comment> 
</aw:Item> 
<aw:Item aw:PartNumber="926-AA"> 
<aw:ProductName>Baby Monitor</aw:ProductName> 
<aw:Quantity>2</aw:Quantity> 
<aw:USPrice>39.98</aw:USPrice> 
<aw:ShipDate>1999-05-21</aw:ShipDate> 
</aw:Item> 
</aw:Items> 
</aw:PurchaseOrder>' -> xData

You can declare the namespace; here we use ns as its label. In this case we could simply use aw:Item, but we label the namespace ourselves as an example:

library(XML)
myData <- xmlParse(xData)
# Map the prefix "ns" to the document's namespace URI and use it in the XPath
xpathSApply(myData, "//*/ns:Item/ns:ProductName",
            namespaces = c(ns = "http://www.adventure-works.com"),
            xmlValue)
# [1] "Lawnmower" "Baby Monitor"
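Applied to the data in the question, the same idea might look like the sketch below. Several things here are assumptions rather than facts from the question: the sample is wrapped in a hypothetical `<rss>` root that declares the disney prefix with a made-up URI (`http://www.example.com/disney`), and the `&` in `B&W` is escaped as `&amp;` so that the text is well-formed XML. In real data, the URI has to be the one the file actually declares.

```r
library(XML)

# Well-formed stand-in for the question's data (root element and namespace
# URI are assumptions; the original sample does not show them).
xml_text <- '<?xml version="1.0"?>
<rss xmlns:disney="http://www.example.com/disney">
<item>
<title>Mickey Mouse</title>
<disney:Filing web="http://www.waltdisney.com/archives">
<disney:Files>
<disney:File disney:set="1" disney:file="abc.mov" disney:type="B&amp;W"/>
<disney:File disney:set="2" disney:file="def.mov" disney:type="Col"/>
<disney:File disney:set="3" disney:file="wzt.mov" disney:type="B&amp;W"/>
</disney:Files>
</disney:Filing>
</item>
</rss>'

doc <- xmlParse(xml_text, asText = TRUE)

# Our own label "d" is bound to the (assumed) namespace URI.
ns <- c(d = "http://www.example.com/disney")

# All disney:File nodes, with their attributes collected into a data frame.
file_nodes <- getNodeSet(doc, "//d:File", namespaces = ns)
files_df <- data.frame(
  set  = sapply(file_nodes, xmlGetAttr, "set"),
  file = sapply(file_nodes, xmlGetAttr, "file"),
  type = sapply(file_nodes, xmlGetAttr, "type"),
  stringsAsFactors = FALSE
)

# Only the node with a specific disney:set value, selected in the XPath itself.
set2_nodes <- getNodeSet(doc, "//d:File[@d:set='2']", namespaces = ns)
set2_file  <- sapply(set2_nodes, xmlGetAttr, "file")
```

Filtering inside the XPath predicate avoids building the full data frame when only one set is of interest; filtering `files_df` afterwards would work just as well for small files.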

Source

2014-07-16 16:21:20 jdharrison
