2014-06-05

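R - RCurl: scraping data from a password-protected website

I am trying to use R to scrape some table data from a password-protected website (I have a valid username and password), but I have not succeeded so far.

As an example, here is the login page for my dentist's website: http://www.deltadentalins.com/uc/index.html

I have tried the following: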

library(httr) 
download <- "https://www.deltadentalins.com/indService/faces/Home.jspx?_afrLoop=73359272573000&_afrWindowMode=0&_adf.ctrl-state=12pikd0f19_4" 
terms <- "http://www.deltadentalins.com/uc/index.html" 
values <- list(username = "username", password = "password", TARGET = "", SMAUTHREASON = "", POSTPRESERVATIONDATA = "", 
bundle = "all", dups = "yes") 
POST(terms, body = values) 
GET(download, query = values) 

I have also tried:

your.username <- 'username' 
your.password <- 'password' 

require(SAScii) 
require(RCurl) 
require(XML) 

agent="Firefox/23.0" 
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))) 
curl = getCurlHandle() 
curlSetOpt(
  cookiejar = 'cookies.txt',
  useragent = agent,
  followlocation = TRUE,
  autoreferer = TRUE,
  curl = curl
)

# list parameters to pass to the website (pulled from the source html)
params <-
  list(
    'lt' = "",
    '_eventID' = "",
    'TARGET' = "",
    'SMAUTHREASON' = "",
    'POSTPRESERVATIONDATA' = "",
    'SMAGENTNAME' = agent,
    'username' = your.username,
    'password' = your.password
  )

# submit the login form
html = postForm('https://www.deltadentalins.com/siteminderagent/forms/login.fcc', .params = params, curl = curl)

# print the response returned after the login attempt
html

I cannot get either approach to work. Are there any experts who can help?

+1

Try the 'relenium' package instead.

+0

Thanks! I have managed to get it working with this package. – kng229

+0

You should post an answer to help other users who want to do the same thing.

Answer

1

Updated 2016-03-05 to work with the package Relenium

#### FRONT MATTER #### 

library(devtools) 
library(RSelenium) 
library(XML) 
library(plyr) 

###################### 

## This block will open the Firefox browser, which is linked to R
RSelenium::checkForServer()
startServer()
remDr <- remoteDriver()
remDr$open()
url <- "yoururl" # placeholder: replace with the login page URL
remDr$navigate(url)

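This first part loads the required packages, sets the login URL, and then opens it in a Firefox instance. I type in my username and password by hand, and once I am logged in I can start scraping.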

infoTable <- readHTMLTable(htmlParse(remDr$getPageSource()[[1]], asText = TRUE), header = TRUE)
infoTable 
Table1 <- infoTable[[1]] 
Apps <- Table1[,1] # Application Numbers 

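For this example, the first page contains two tables. The first is the one I am interested in; it lists application numbers and names. I pull out the first column (the application numbers).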

Links2 <- paste("https://yourURL?ApplicantID=", Apps, sep="")

The data I want is stored in the individual applications, so this bit creates the links that I want to loop through.
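As a rough sketch of that link-building step on its own: the application numbers and the base URL below are made-up placeholders, not values from the real site. The numbers are converted with `as.character()` in case the scraped column arrives as a factor, and each one is URL-encoded before being appended.

```r
# Hypothetical application numbers standing in for the first column of the scraped table
apps <- factor(c("1001", "1002", "A 7"))

# Assumed placeholder base URL; replace with the real application page
base_url <- "https://example.com/application"

# Encode each ID individually so spaces or reserved characters cannot break the query string
encoded <- vapply(as.character(apps), URLencode, character(1),
                  reserved = TRUE, USE.NAMES = FALSE)

# One link per application number
links <- paste0(base_url, "?ApplicantID=", encoded)
print(links)
```

If the IDs are plain digits the encoding step changes nothing, but it keeps the links valid if an ID ever contains a space or an ampersand.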

### Grabs contact info table from each page 

LL <- lapply(seq_along(Links2),
  function(i) {
    url <- Links2[i]
    remDr$navigate(url)
    infoTable <- readHTMLTable(htmlParse(remDr$getPageSource()[[1]], asText = TRUE), header = TRUE)

    # The contact table is either table 2 or table 3, so check table 2 first
    if ("First Name" %in% colnames(infoTable[[2]])) {
      infoTable2 <- cbind(infoTable[[1]][1,], infoTable[[2]][1,])
    } else {
      infoTable2 <- cbind(infoTable[[1]][1,], infoTable[[3]][1,])
    }

    print(infoTable2)
    infoTable2 # return the combined row so lapply collects it
  }
)

results <- do.call(rbind.fill, LL) 
results 
write.csv(results, "C:/pathway/results2.csv") 

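This final part loops through the links for each application, then grabs the table with their contact info (which is either table 2 or table 3, so R has to check first). Thanks again to Chinmay Patil for the tip!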
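The "table 2 or table 3" check can be tried without a browser. The sketch below is only an illustration: `pick_contact_row` is a hypothetical helper name, and the two small lists are invented stand-ins for what `readHTMLTable()` might return on two different pages. It uses base `rbind` rather than `rbind.fill`, which assumes every page yields the same columns.

```r
# Hypothetical helper: use table 2 if it has the key column, otherwise table 3,
# and join its first row to the first row of table 1
pick_contact_row <- function(tables, key = "First Name") {
  contact <- if (key %in% colnames(tables[[2]])) tables[[2]] else tables[[3]]
  cbind(tables[[1]][1, , drop = FALSE], contact[1, , drop = FALSE])
}

# Invented stand-ins for the list of tables parsed from two application pages
page_a <- list(
  data.frame(App = "1001"),
  data.frame(`First Name` = "Ann", check.names = FALSE),
  data.frame(Other = "x")
)
page_b <- list(
  data.frame(App = "1002"),
  data.frame(Other = "y"),
  data.frame(`First Name` = "Bob", check.names = FALSE)
)

# Same shape as the loop above: one combined row per page, then stack the rows
rows <- lapply(list(page_a, page_b), pick_contact_row)
results <- do.call(rbind, rows)
print(results)
```

Keeping `drop = FALSE` matters when a table has a single column; without it the row would collapse to a bare vector and lose its column name.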