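I would usually inspect the page with Firebug in Firefox (or a similar tool) to find the relevant HTML.
The relevant HTML fragment is <div class="version-ESV result-text-style-normal text-html ">
So we can find the element using the version-ESV class: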
chapter.url <- "https://www.biblegateway.com/passage/?search=Genesis+50&version=ESV"
library(RSelenium)
RSelenium::startServer() # start a local Selenium server (defunct in newer RSelenium releases, which use rsDriver() instead)
remDr <- remoteDriver()
remDr$open()
remDr$navigate(chapter.url)
webElem <- remDr$findElement('class', 'version-ESV')
webElem$highlightElement() # check visually we have the right element
The highlightElement method gives us visual confirmation that we have the required block of HTML. Finally, we can use the getElementAttribute method to get this HTML:
appData <- webElem$getElementAttribute("outerHTML")[[1]]
This HTML can then be parsed for the verses using the XML package.
UPDATE:
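The individual verses are contained in span elements whose id begins with "en-ESV-". We can target these with the XPath '//span[contains(@id,"en-ESV-")]'. Within these blocks, however, we only want the child nodes that are text nodes. Once we have found these text nodes, we paste them together separated by a space: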
library(XML)
appXPATH <- '//span[contains(@id,"en-ESV-")]'
appFunc <- function(x){
  appChildren <- xmlChildren(x)
  # keep only the direct children that are text nodes
  out <- appChildren[names(appChildren) == "text"]
  paste(sapply(out, xmlValue), collapse = ' ')
}
doc <- htmlParse(appData, encoding = 'UTF8') # specify encoding
results <- xpathSApply(doc, appXPATH, appFunc)
The results are as follows:
> head(results)
[1] "Then Joseph fell on his father's face and wept over him and kissed him."
[2] "And Joseph commanded his servants the physicians to embalm his father. So the physicians embalmed Israel."
[3] "Forty days were required for it, for that is how many are required for embalming. And the Egyptians wept for him seventy days."
[4] "And when the days of weeping for him were past, Joseph spoke to the household of Pharaoh, saying, “If now I have found favor in your eyes, please speak in the ears of Pharaoh, saying,"
[5] "‘My father made me swear, saying, “I am about to die: in my tomb that I hewed out for myself in the land of Canaan, there shall you bury me.” Now therefore, let me please go up and bury my father. Then I will return.’”"
[6] "And Pharaoh answered, “Go up, and bury your father, as he made you swear.”"
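To try the text-node filtering without starting a browser, here is a minimal sketch that runs the same function on a small inline fragment. The fragment, its id value and the sup markup are made up for illustration and are only assumed to resemble the real page; the point is that verse numbers and footnote markers sit in child elements and are therefore dropped.

```r
library(XML)

# Hypothetical fragment imitating the page structure (not copied from the site)
sampleHTML <- paste0(
  '<div class="version-ESV">',
  '<span id="en-ESV-0001" class="text">',
  '<sup class="versenum">1 </sup>First part',
  '<sup class="footnote">[a]</sup> second part.',
  '</span>',
  '</div>'
)

# Same idea as above: keep only the direct children that are text nodes
appFunc <- function(x){
  appChildren <- xmlChildren(x)
  out <- appChildren[names(appChildren) == "text"]
  paste(sapply(out, xmlValue), collapse = ' ')
}

doc <- htmlParse(sampleHTML, asText = TRUE, encoding = "UTF-8")
res <- xpathSApply(doc, '//span[contains(@id,"en-ESV-")]', appFunc)
print(res)
```

The printed string should contain the verse text but neither the leading verse number nor the "[a]" footnote marker, since both live inside sup elements rather than in text nodes.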
You forgot to add 'remDr$open()'. – 2014-09-10 09:51:30
Ah, sorry... adding it now – 2014-09-10 10:05:46