2013-11-26 29 views
3

What is the right XPath to scrape this web page?

I am trying to get the list of selectors on this page:

$("#Lastname"),$(".intro"),.... 

Here is what I tried using xpathSApply:

library(XML) 
library(RCurl) 
a <- getURL('http://www.w3schools.com/jquery/trysel.asp') 
doc <- htmlParse(a) 
xpathSApply(doc,'//*[@id="selectorOptions"]') ## I can't get the right xpath 

I also tried this, without success:

xpathSApply(doc,'//*[@id="selectorOptions"]/div[i]') 

EDIT: I added the python tag, since I would also accept a Python solution.

+0

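JavaScript running on this page creates the content you are looking for. For example: 'var w3SelDescriptions = []; w3SelDescriptions.push('The element with id="Lastname"');'. You need to get the page after the JavaScript has run, from a browser or something similar. – jdharrison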

+0

@jdharrison I'm afraid I don't understand your point. Do you mean the selectors are created by this call: 'onload="w3jQuerySelectorLoad()"'? – agstudy

+0

The list of selectors is created by a piece of JavaScript code. – jdharrison
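To illustrate the point made in these comments, here is a minimal Python sketch (the question also accepts Python). The HTML string is a hypothetical, heavily simplified stand-in for what the server sends, not a copy of the real page; `ContainerProbe` is an illustrative helper name. It shows why an XPath on `#selectorOptions` finds nothing in the static source: the container is empty until the script runs in a browser.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for the served HTML: the container is
# empty and the selectors exist only as strings inside a <script> element.
STATIC_HTML = """
<html><body onload="w3jQuerySelectorLoad()">
<div id="selectorOptions"></div>
<script>
var w3Sels = [];
w3Sels.push("#Lastname");
w3Sels.push(".intro");
</script>
</body></html>
"""


class ContainerProbe(HTMLParser):
    """Collect text found inside #selectorOptions and inside <script> tags."""

    def __init__(self):
        super().__init__()
        self.stack = []           # currently open tags, each as (tag, id)
        self.container_text = []  # text nested under id="selectorOptions"
        self.script_text = []     # raw text of <script> elements

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, dict(attrs).get("id")))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if not data.strip():
            return
        if any(tag_id == "selectorOptions" for _, tag_id in self.stack):
            self.container_text.append(data.strip())
        elif self.stack and self.stack[-1][0] == "script":
            self.script_text.append(data.strip())


probe = ContainerProbe()
probe.feed(STATIC_HTML)
print(probe.container_text)                           # text inside the container
print("w3Sels.push" in "".join(probe.script_text))    # selectors live in the script
```

A static parser such as htmlParse sees the same thing: the selectors are present only as script source, so they have to be obtained either by running the script in a browser or by parsing the script text itself.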

Answers

4

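Here is an R way to get JavaScript-driven pages like this one. As @Peyton pointed out, you need to use a browser. A Selenium server is a good way to control a browser. I have written some R bindings for the Selenium server at https://github.com/johndharrison/RSelenium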

The following gives access to the page source after the JavaScript has run:

require(devtools) 
devtools::install_github("RSelenium", "johndharrison") 
library(RSelenium) 
library(RJSONIO) 

# one needs to have an active server running 
# the following commented out lines source the latest java binary 
# RSelenium::checkForServer() 
# RSelenium::startServer() 
# a selenium server is assumed to be running now 

remDR <- remoteDriver$new() 
remDR$open() # opens a browser usually firefox with default settings 
remDR$navigate('http://www.w3schools.com/jquery/trysel.asp') # navigate to your page 
webElem <- remDR$findElements(value = "//*[@id='selectorOptions']") # find your elements 

# display the appropriate quantities 
cat(fromJSON(webElem[[1]]$getElementText())$value) 
> cat(fromJSON(webElem[[1]]$getElementText())$value) 
$("#Lastname") 
$(".intro") 
$(".intro, #Lastname") 
$("h1") 
$("h1, p") 
$("p:first") 
$("p:last") 
$("tr:even") 
$("tr:odd") 
$("p:first-child") 
$("p:first-of-type") 
$("p:last-child") 
$("p:last-of-typ 
..................... 

UPDATE: accessing the information

A simpler way in this case is to use the executeScript method:

library(RSelenium) 
RSelenium::startServer() 
remDr <- remoteDriver$new() # create the driver object before opening a session 
remDr$open() 
remDr$navigate('http://www.w3schools.com/jquery/trysel.asp') 
remDr$executeScript("return w3Sels;")[[1]] 

> remDr$executeScript("return w3Sels;")[[1]] 
[1] "#Lastname"    ".intro"     
[3] ".intro, #Lastname"  "h1"      
[5] "h1, p"     "p:first"    
[7] "p:last"     "tr:even"    
[9] "tr:odd"     "p:first-child"   
[11] "p:first-of-type"  "p:last-child"   
[13] "p:last-of-type"   "li:nth-child(1)"  
[15] "li:nth-last-child(1)" "li:nth-of-type(2)"  
[17] "li:nth-last-of-type(2)" "b:only-child"   
[19] "h3:only-of-type"  "div > p"    
[21] "div p"     "ul + h3"    
[23] "ul ~ table"    "ul li:eq(0)"   
[25] "ul li:gt(0)"   "ul li:lt(2)"   
[27] ":header"    ":header:not(h1)"  
[29] ":animated"    ":focus"     
[31] ":contains(Duck)"  "div:has(p)"    
[33] ":empty"     ":parent"    
[35] "p:hidden"    "table:visible"   
[37] ":root"     "p:lang(it)"    
[39] "[id]"     "[id=my-Address]"  
[41] "p[id!=my-Address]"  "[id$=ess]"    
[43] "[id|=my]"    "[id^=L]"    
[45] "[title~=beautiful]"  "[id*=s]"    
[47] ":input"     ":text"     
[49] ":password"    ":radio"     
[51] ":checkbox"    ":submit"    
[53] ":reset"     ":button"    
[55] ":image"     ":file"     
[57] ":enabled"    ":disabled"    
[59] ":selected"    ":checked"    
[61] "*" 
+0

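Thanks! I had never heard of Selenium before! But I get an error: 'Error in function (type, msg, asError = TRUE) : couldn't connect to host'. Maybe it is because Firefox is not my default browser? – agstudy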

+0

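Are you running the server? You need to run 'RSelenium::checkForServer()' and 'RSelenium::startServer()'. I commented those lines out because many people, myself included, are not comfortable downloading and running external binaries from R. checkForServer downloads the binary from http://code.google.com/p/selenium/, and startServer runs it. If you do not want to use the package's built-in commands, you can go to that page yourself, download the server, and make sure it is running. – jdharrison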

+0

Python can run Selenium too, and I believe it is officially supported, so if you are comfortable with Python, that would be a good option. – jdharrison
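For the Python route mentioned in the comment above, a rough sketch of the same executeScript idea could look like the following. It assumes the selenium package, Firefox and its driver are installed, and that the page still defines a JavaScript variable named w3Sels as it did in this thread, which may no longer be true. `fetch_selectors` and `run_with_firefox` are illustrative names, not part of any library.

```python
def fetch_selectors(driver, url, js_variable="w3Sels"):
    """Load url in an already started WebDriver and return a page-level JS variable.

    driver is expected to provide get() and execute_script(), as Selenium's
    Python WebDriver objects do.
    """
    driver.get(url)
    return driver.execute_script(f"return {js_variable};")


def run_with_firefox():
    """Usage sketch, not run here: needs selenium, Firefox and geckodriver."""
    # Imported inside the function so fetch_selectors has no hard dependency.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        return fetch_selectors(driver, "http://www.w3schools.com/jquery/trysel.asp")
    finally:
        driver.quit()  # always close the browser session
```

Keeping the driver as a parameter means the helper works with any browser Selenium can control, and it can be exercised with a stand-in object when no browser is available.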

0

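Thanks to jdharrison's comment, I parsed the JavaScript code to extract all the selectors. As Peyton mentioned, this works in this particular case because all the selectors are in the code.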

capture.output(xpathSApply(doc,'//*/script')[[6]], 
       file='test.js') 
ll <- readLines('test.js') 
ll <- ll[grepl('w3Sels.push',ll)] 
ll <- unlist(regmatches(ll, gregexpr("(?<=\\().*?(?=\\))", ll, perl=TRUE))) # text between "(" and the first ")" 

cat(head(ll)) 
"#Lastname" ".intro" ".intro, #Lastname" "h1" "h1, p" "p:first" 
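Since Python answers were also welcome, the same script-parsing idea can be sketched in Python. The JavaScript excerpt below is a hypothetical sample modeled on the w3Sels.push(...) calls discussed in this thread, not text copied from the live page, and `extract_selectors` is an illustrative name. One deliberate difference from the regex above: this pattern matches up to the closing quote instead of the first ")", so a selector that itself contains parentheses should come through whole.

```python
import re

# Hypothetical excerpt of the page's inline script, modeled on the
# w3Sels.push(...) calls mentioned in the thread.
SAMPLE_JS = '''
var w3Sels = [];
w3Sels.push("#Lastname");
w3Sels.push(".intro");
w3Sels.push(".intro, #Lastname");
w3Sels.push("li:nth-child(1)");
var w3SelDescriptions = [];
w3SelDescriptions.push("The element with id=Lastname");
'''

# Match the quoted argument of each w3Sels.push(...) call. The backreference
# \1 requires the same quote character that opened the string.
PUSH_RE = re.compile(r'''\bw3Sels\.push\(\s*(["'])(.*?)\1\s*\)''')


def extract_selectors(js_source):
    """Return the string arguments of every w3Sels.push(...) call, in order."""
    return [m.group(2) for m in PUSH_RE.finditer(js_source)]


selectors = extract_selectors(SAMPLE_JS)
print(selectors)
```

Like the R version, this only works because every selector appears literally in the script source; content computed at run time would still need a browser.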