Parsing HTML elements by id and class with the XML package

Is it possible to extract elements from an HTMLInternalDocument object using their id and class information? For example, take this document:

<!DOCTYPE html> 
<html> 
<head> 
    <title>R XML test</title> 
</head> 
<body> 
<div id="obj1"> 
    <p id="txt1">quidquid</p> 
    <p id="txt2">Latine dictum</p> 
</div> 
<div class="mystuff"> 
    <p>sit altum</p> 
    <p>videtur</p> 
</div> 
</body> 
</html>

which is read into R as follows:

require(XML) 
file <- "C:/filepath/index.html" 
datain <- htmlTreeParse(readLines(file), useInternalNodes = TRUE)

I want to extract the content of the elements with id='txt2' and class='mystuff'.

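I have tried various approaches without success, and they all seem to involve painfully iterating over the tree. Is there a shortcut that uses class/id? My guess is that it involves calling getNodeSet first and then some apply method (e.g. xmlApply & xmlAttrs), but nothing I have tried works. Thanks for any pointers.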

Source

2015-01-06 geotheory

What do you mean by "content", the text? Try `cat(sapply(datain['//*[@id="txt2"] | //*[@class="mystuff"]'], xmlValue))`. – lukeA

Looks promising. Forgive my ignorance, but I haven't seen the expression `datain['//*[@id="txt2"]']` before. Is it a method from the XML library? – geotheory

For details, see the help page for `getNodeSet`: `getNodeSet(datain, '//*[@id="txt2"]')`. – lukeA
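To make the comments above concrete, here is a minimal sketch comparing the `[` shorthand with `getNodeSet` and `xpathSApply`. It assumes the XML package is installed; the compact `html` string and the variable names `xp`, `via_bracket`, `via_nodeset` and `via_sapply` are made up for this illustration and are not from the thread.

```r
library(XML)

# Compact copy of the question's document (illustrative, whitespace removed)
html <- '<html><body>
<div id="obj1"><p id="txt1">quidquid</p><p id="txt2">Latine dictum</p></div>
<div class="mystuff"><p>sit altum</p><p>videtur</p></div>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

xp <- '//*[@id="txt2"]'

# Indexing the parsed document with an XPath string returns a node set
via_bracket <- sapply(doc[xp], xmlValue)

# The same query spelled out with getNodeSet
via_nodeset <- sapply(getNodeSet(doc, xp), xmlValue)

# xpathSApply runs the query and applies the function in one call
via_sapply <- xpathSApply(doc, xp, xmlValue)

print(via_bracket)
```

All three forms should return the text of the `txt2` paragraph, so the choice between them is mostly a matter of taste.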

Try this, for example:

id_or_class_xp <- "//p[@id='txt2']//text() | //div[@class='mystuff']//text()" 
xpathSApply(doc,id_or_class_xp,xmlValue) 

[1] "Latine dictum" "\n "  "sit altum"  "\n "  "videtur"  "\n"

where doc is:

doc <- htmlParse('<!DOCTYPE html> 
<html> 
<head> 
    <title>R XML test</title> 
</head> 
<body> 
<div id="obj1"> 
    <p id="txt1">quidquid</p> 
    <p id="txt2">Latine dictum</p> 
</div> 
<div class="mystuff"> 
    <p>sit altum</p> 
    <p>videtur</p> 
</div> 
</body> 
</html>', asText = TRUE)
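The whitespace-only strings in the output above come from the text nodes between tags. One possible variation, sketched here rather than taken from the answer, is to select the `<p>` element nodes instead of `text()` nodes. The second half shows a common XPath 1.0 idiom for matching one class among several; `doc2` and its `class="mystuff extra"` markup are hypothetical and exist only to illustrate that case.

```r
library(XML)

html <- '<!DOCTYPE html>
<html>
<body>
<div id="obj1">
    <p id="txt1">quidquid</p>
    <p id="txt2">Latine dictum</p>
</div>
<div class="mystuff">
    <p>sit altum</p>
    <p>videtur</p>
</div>
</body>
</html>'
doc <- htmlParse(html, asText = TRUE)

# Select element nodes, so whitespace between tags is never returned
vals <- xpathSApply(doc, "//p[@id='txt2'] | //div[@class='mystuff']/p", xmlValue)
print(vals)
# [1] "Latine dictum" "sit altum"     "videtur"

# Hypothetical element carrying more than one class
doc2 <- htmlParse('<div class="mystuff extra"><p>x</p></div>', asText = TRUE)

# @class='mystuff' compares the whole attribute value, so it finds nothing here
exact <- xpathSApply(doc2, "//div[@class='mystuff']/p", xmlValue)

# Token match: pad the attribute with spaces and search for ' mystuff '
token_xp <- "//div[contains(concat(' ', normalize-space(@class), ' '), ' mystuff ')]/p"
token <- xpathSApply(doc2, token_xp, xmlValue)
print(token)
```

The exact comparison is enough for the document in the question. The token form only matters when an element lists several classes.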

Source

2015-01-06 12:56:32 agstudy

Thanks agstudy (and @lukeA), this is very helpful. – geotheory
