2014-02-28 72 views
1

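Scraping a table from an ASP.NET web page with buttons using R

I have looked through related questions to no avail. I need to scrape a table of price information from an ASP.NET web page (http://www.spp.org/LIP.asp) for a date and hour that I specify. I am comfortable with R and would like to use it. My basic stumbling block is that the URL is static and does not reflect the search parameters, and I do not know how to submit an HTML form that relies on JavaScript on an ASP.NET page.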

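I looked at the source of the URL above and found that an iframe links to a separate 'source data' page: http://www.spp.org/LIPPosting/LIP.aspx. I tried making a POST request from R, based on this Stack Overflow thread: What if I want to web scrape with R for a page with parameters?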

## ASP.NET site scrape 
library(RHTMLForms) # getHTMLFormDescription(), createFunction() 
library(XML)        # htmlParse(), getNodeSet(), readHTMLTable() 
forms <- getHTMLFormDescription("http://www.spp.org/LIPPosting/LIP.aspx") 
# Name the list for easy reference 
names(forms) <- 'spp' 
# Use the createFunction tool so I can submit a search 
fun <- createFunction(forms$spp, verbose = TRUE) 
# Submit an HTML form looking for data using all form defaults 
# Except change the hour to '03' 
results <- fun(ddlHour = '03') 
# Grab the table results from the HTML based on its id tag 
tableData <- getNodeSet(htmlParse(results), "//*/table[@id = 'dgLIP']") 
readHTMLTable(tableData[[1]]) 

The returned HTML shows that '03' was indeed selected in the 'Hour' form element.

  <td style="height: 42px; width: 77px;"> 
<span id="lblLIPHour">Hour</span><br><select name="ddlHour" id="ddlHour"><option value="1">01</option> 
<option value="2">02</option> 
<option selected value="3">03</option> 
<option value="4">04</option> 
<option value="5">05</option> 
<option value="6">06</option> 
<option value="7">07</option> 
<option value="8">08</option> 

However, the server does not act on that selection: the results table that comes back is for the current hour, not '03'.

> readHTMLTable(tableData[[1]]) 
    Publish Date Price Date    PNode Price  Parent PNode Settlement Location 
1 201402281552 201402281600     AECI 23.45    AECI    AECI 
2 201402281552 201402281600     AMRN 23.45    AMRN    AMRN 
3 201402281552 201402281600     BLKW 23.45    BLKW    BLKW 
4 201402281552 201402281600     CLEC 23.45    CLEC    CLEC 
5 201402281552 201402281600   CSWS_AECC_LA 23.45  CSWS_AECC_LA   AECC_CSWS 

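In addition, I only get the HTML of the single page the server returns, which does not contain all the results. On the actual page there are JavaScript arrow buttons at the bottom that let me page through all of the results.

On the web page itself, to see results after choosing from the drop-down menus I have to click the 'View' button. Is there a way to replicate this in R, so that my '03' parameter is sent to the server as a query and new HTML is returned?

If I can get that working, I can write something to 'push' the paging arrows.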

+0

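I hope someone else gives you a more optimistic answer, but my advice is not to do it this way. Using Python with the Selenium driver will be much easier, even if you do not already know Python. I say this as someone who loves R and tries to use it for everything, but in this case I do not think it is the right tool for the job. – Ista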

+0

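Thanks Ista... I had never heard of Selenium before getting into this little pickle. Do you think there is an advantage to using the Python driver over the R package suggested by jdharrison? – sclarky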

Answers

2

You can use Selenium. See http://johndharrison.github.io/RSelenium/. Disclaimer: I am the author of the RSelenium package. Basic vignettes on how it works can be found at RSelenium basics and RSelenium: Testing Shiny apps.

library(RSelenium) 
library(XML) # for htmlParse() and readHTMLTable() 
# RSelenium::startServer() # if needed 
remDr <- remoteDriver() 
remDr$open() 
remDr$setImplicitWaitTimeout(3000) 
remDr$navigate("http://www.spp.org/LIP.asp") 
remDr$switchToFrame("content_frame") 
dateElem <- remDr$findElement(using = "id", "txtLIPDate") # select the date 
dateRequired <- "01/14/2014" 
dateElem$clearElement() 
dateElem$sendKeysToElement(list(dateRequired, key = "enter")) # send a date to app 
hourElem <- remDr$findElement(using = "css selector", '#ddlHour [value="5"]') # select the 5th hour 
hourElem$clickElement() # select this hour 
buttonElem <- remDr$findElement(using = "id", "cmdView") 
buttonElem$clickElement() # click the view button 

#Sys.sleep(5) 
tableElem <- remDr$findElement(using = "id", "dgLIP") 
readHTMLTable(htmlParse(tableElem$getElementAttribute("outerHTML")[[1]])) 

[1] "tableElem$getElementAttribute(\"outerHTML\")" 
$dgLIP 
V1   V2     V3 V4     V5     V6 
1 Publish Date Price Date    PNode Price  Parent PNode Settlement Location 
2 201401132252 201401132300     AECI 19.14    AECI    AECI 
3 201401132252 201401132300     AMRN 18.87    AMRN    AMRN 
4 201401132252 201401132300     BLKW 20.28    BLKW    BLKW 
5 201401132252 201401132300     CLEC 18.99    CLEC    CLEC 
6 201401132252 201401132300   CSWS_AECC_LA 19.77  CSWS_AECC_LA   AECC_CSWS 
7 201401132252 201401132300 CSWS_GREEN_LIGHT_LA 18.5 CSWS_GREEN_LIGHT_LA  GSEC_GL_CSWS 
8 201401132252 201401132300    CSWS_LA 19.01    CSWS_LA   AEPM_CSWS 
9 201401132252 201401132300    CSWS_LA 19.01    CSWS_LA   AEP_LOSS 
10 201401132252 201401132300   CSWS_OMPA_LA 18.66  CSWS_OMPA_LA   OMPA_CSWS 
11 201401132252 201401132300  CSWS_TENASKA_LA 18.95  CSWS_TENASKA_LA  GATEWAY_LOAD 
12 201401132252 201401132300  CSWS112_WGORLD1 18.7    CSWS_LA   AEPM_CSWS 
13 201401132252 201401132300  CSWS112_WGORLD1 18.7    CSWS_LA   AEP_LOSS 
14 201401132252 201401132300  CSWS116PEORILD1 18.9    CSWS_LA   AEPM_CSWS 
15 201401132252 201401132300  CSWS116PEORILD1 18.9    CSWS_LA   AEP_LOSS 
16 201401132252 201401132300 CSWS121EASTLDXFL1 18.92    CSWS_LA   AEPM_CSWS 
17 201401132252 201401132300 CSWS121EASTLDXFL1 18.92    CSWS_LA   AEP_LOSS 
18 201401132252 201401132300  CSWS121LYNN4LD1 18.91    CSWS_LA   AEPM_CSWS 
19 201401132252 201401132300  CSWS121LYNN4LD1 18.91    CSWS_LA   AEP_LOSS 
20 201401132252 201401132300 CSWS12TH_STLD69_12 18.92    CSWS_LA   AEPM_CSWS 
21 201401132252 201401132300 CSWS12TH_STLD69_12 18.92    CSWS_LA   AEP_LOSS 
22 201401132252 201401132300 CSWS12TH_STLD69_12_2 18.92    CSWS_LA   AEPM_CSWS 
23 201401132252 201401132300 CSWS12TH_STLD69_12_2 18.92    CSWS_LA   AEP_LOSS 
24 201401132252 201401132300  CSWS136_YALELD1 18.9    CSWS_LA   AEPM_CSWS 
25 201401132252 201401132300  CSWS136_YALELD1 18.9    CSWS_LA   AEP_LOSS 
26 201401132252 201401132300 CSWS141_PINELDXFMR1 19.09    CSWS_LA   AEPM_CSWS 
27   < >   <NA>     <NA> <NA>    <NA>    <NA> 
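
In the output above the real column names arrive as the first data row (the columns are called V1 to V6) and the last row holds the pager arrows rather than data. A minimal base R sketch of tidying one such page follows. `clean_lip_page` is a hypothetical helper, not part of RSelenium or XML, and the UTC time zone is only an assumption, since the page does not state one.

```r
# Hypothetical helper: tidy one page as returned by readHTMLTable() above.
clean_lip_page <- function(raw, tz = "UTC") {   # tz is an assumption
  raw[] <- lapply(raw, as.character)            # factors -> character
  header <- trimws(unlist(raw[1, ], use.names = FALSE))
  body <- raw[-1, , drop = FALSE]               # drop the header row
  names(body) <- make.names(header)             # "Price Date" -> "Price.Date"
  # Rows without a numeric price (such as the "< >" pager row) are dropped
  price <- suppressWarnings(as.numeric(body$Price))
  body <- body[!is.na(price), , drop = FALSE]
  body$Price <- price[!is.na(price)]
  # Timestamps look like 201401132300, i.e. YYYYMMDDHHMM
  body$Price.Date <- as.POSIXct(body$Price.Date, format = "%Y%m%d%H%M", tz = tz)
  rownames(body) <- NULL
  body
}

# Small stand-in for one scraped page, using values from the output above
raw <- data.frame(
  V1 = c("Publish Date", "201401132252", "201401132252", "< >"),
  V2 = c("Price Date", "201401132300", "201401132300", NA),
  V3 = c("PNode", "AECI", "AMRN", NA),
  V4 = c("Price", "19.14", "18.87", NA),
  V5 = c("Parent PNode", "AECI", "AMRN", NA),
  V6 = c("Settlement Location", "AECI", "AMRN", NA),
  stringsAsFactors = FALSE
)
page <- clean_lip_page(raw)
```

With a live session, the same function could be applied to the first element of the list that `readHTMLTable()` returns.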
+0

OK, I'm intrigued! I'm going to give it a go on Monday. – sclarky

+0

I'm stuck at 'remDr$open()', which fails with 'Error in function (type, msg, asError = TRUE) : couldn't connect to host'. I installed the package in R from GitHub using devtools. – sclarky

+0

@sclarky You need to have the Selenium server running; see the RSelenium basics vignette. – jdharrison
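
A quick way to tell this failure apart from other problems is to check whether anything is listening where `remoteDriver()` will look before calling `remDr$open()`. The sketch below uses only base R. `selenium_reachable` is a hypothetical helper, and it assumes the default address of localhost and port 4444. A listening port does not prove the listener is Selenium, so treat it as a rough check.

```r
# Hypothetical helper: TRUE if something accepts TCP connections on host:port.
selenium_reachable <- function(host = "localhost", port = 4444L, timeout = 2) {
  con <- tryCatch(
    suppressWarnings(
      socketConnection(host = host, port = port, open = "r+",
                       blocking = TRUE, timeout = timeout)
    ),
    error = function(e) NULL   # connection refused or timed out
  )
  if (is.null(con)) {
    return(FALSE)
  }
  close(con)
  TRUE
}

if (!selenium_reachable()) {
  message("Nothing is listening on localhost:4444; start the Selenium server first.")
}
```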

0

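For posterity, I thought I would share the code I used to click through the result pages (there is no 'show all' option). I have RSelenium click through every page until there is no longer a 'click forward' option, scraping the HTML table on each page into a list: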

# Get the first page of results 
tableElem <- remDr$findElement(using = "id", "dgLIP") 
tmp <- readHTMLTable(htmlParse(tableElem$getElementAttribute("outerHTML")[[1]])) 
hourlyData <- list() 
# Save the first table without the last row (row 27), which holds the pager arrows, not data 
hourlyData[[1]] <- tmp[[1]][-27, ] 

# href of the 'greater than' (next page) arrow, a JavaScript postback link 
nextHref <- "javascript:__doPostBack('dgLIP$_ctl29$_ctl1','')" 
# Collect the href attribute of every element in a list of web elements 
getHrefs <- function(elems) unlist(lapply(elems, function(x) x$getElementAttribute("href"))) 

acc <- 2 
webElems <- remDr$findElements("css selector", "[href]") 
while (nextHref %in% getHrefs(webElems)) { 
    # Click the next-page arrow 
    pager <- webElems[[which(getHrefs(webElems) == nextHref)[1]]] 
    pager$clickElement() 
    Sys.sleep(3) # give the postback time to finish before reading the new table 
    tableElem <- remDr$findElement(using = "id", "dgLIP") 
    tmp <- readHTMLTable(htmlParse(tableElem$getElementAttribute("outerHTML")[[1]])) 
    hourlyData[[acc]] <- tmp[[1]] # the pager row is still present on these pages 
    acc <- acc + 1 
    # Re-read the links on the new page for the next check 
    webElems <- remDr$findElements("css selector", "[href]") 
}
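
Once the loop finishes, the list holds one table per page, and each table may still carry its header row and pager row. A minimal base R sketch of stacking them into a single data frame follows. `stack_pages` is a hypothetical helper, and it assumes every page has the same V1 to V6 columns with a 12-digit publish timestamp in the first column, as in the output shown earlier.

```r
# Hypothetical helper: stack per-page tables and keep only real data rows.
stack_pages <- function(pages) {
  # Convert every column to character so factor levels cannot clash in rbind
  pages <- lapply(pages, function(p) {
    p[] <- lapply(p, as.character)
    p
  })
  stacked <- do.call(rbind, pages)
  # Data rows start with a timestamp such as 201401132252;
  # header rows ("Publish Date") and pager rows ("< >") do not.
  is_data <- grepl("^[0-9]{12}$", trimws(stacked[[1]]))
  out <- stacked[is_data, , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Two small stand-in pages in the shape the loop produces
page1 <- data.frame(V1 = c("Publish Date", "201401132252", "201401132252"),
                    V3 = c("PNode", "AECI", "AMRN"),
                    V4 = c("Price", "19.14", "18.87"),
                    stringsAsFactors = FALSE)
page2 <- data.frame(V1 = c("Publish Date", "201401132252", "< >"),
                    V3 = c("PNode", "BLKW", NA),
                    V4 = c("Price", "20.28", NA),
                    stringsAsFactors = FALSE)
allRows <- stack_pages(list(page1, page2))
```

In a live session this would be called as `stack_pages(hourlyData)`.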