Scraping a password-protected website in R

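I am trying to scrape data from a password-protected website in R. From what I have read, the httr and RCurl packages seem to be the best options for scraping with password authentication (I have also looked into the XML package).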

The website I am trying to scrape is below (you need a free account in order to access the full page): http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2

Here are my two attempts (replacing "username" with my username and "password" with my password):

#This returns "Status: 200" without the data from the page: 
library(httr) 
GET("http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2", authenticate("username", "password")) 

#This returns the non-password protected preview (i.e., not the full page): 
library(XML) 
library(RCurl) 
readHTMLTable(getURL("http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2", userpwd = "username:password"))

I have looked at other related posts (links below), but I can't figure out how to apply their answers to my case.

How to use R to download a zipped file from a SSL page that requires cookies

How to webscrape secured pages in R (https links) (using readHTMLTable from XML package)?

Reading information from a password protected site

R - RCurl scrape data from a password-protected site

http://www.inside-r.org/questions/how-scrape-data-password-protected-https-website-using-r-hold

Source

2014-07-13 dadrivr

I don't have an account to test with, but maybe this will work:

library(httr) 
library(XML) 

handle <- handle("http://subscribers.footballguys.com") 
path <- "amember/login.php" 

# fields found in the login form. 
login <- list(
    amember_login = "username",
    amember_pass = "password",
    amember_redirect_url =
        "http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2"
)

response <- POST(handle = handle, path = path, body = login)

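Now the response object might hold what you need (or you may be able to query the page of interest directly after the login request; I am not sure whether the redirect will work, but it is a field in the web form), and the handle can be reused for subsequent requests. I can't test it, but this approach has worked for me in many situations.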
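As a rough sketch of what reusing the handle could look like (untested against the real site), the login and the follow-up request can be wrapped in one function. It assumes the form fields shown above are still current and that the login stores a session cookie on the handle. The names build_login_fields and fetch_projections are made up for illustration:

```r
# Hypothetical helper: assemble the login form fields used in the answer above.
build_login_fields <- function(username, password, redirect_url) {
  list(
    amember_login = username,
    amember_pass = password,
    amember_redirect_url = redirect_url
  )
}

# Hypothetical wrapper: log in, then reuse the same handle (and its cookies)
# to request the protected page explicitly instead of relying on the redirect.
fetch_projections <- function(username, password) {
  base_url <- "http://subscribers.footballguys.com"
  page_path <- "myfbg/myviewprojections.php"
  h <- httr::handle(base_url)

  login <- build_login_fields(
    username, password,
    paste0(base_url, "/", page_path, "?projector=2")
  )
  login_response <- httr::POST(handle = h, path = "amember/login.php", body = login)
  httr::stop_for_status(login_response)

  # Same handle, so any cookies set during login are sent along.
  page <- httr::GET(handle = h, path = page_path, query = list(projector = 2))
  httr::stop_for_status(page)

  doc <- XML::htmlParse(httr::content(page, as = "text"), asText = TRUE)
  XML::readHTMLTable(doc, stringsAsFactors = FALSE)
}
```

Calling fetch_projections("username", "password") should then return a list of data frames, one per table on the page, provided the site still behaves as described here.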

You can output the table using the XML package:

> readHTMLTable(content(response))[[1]][1:5,]
  Rank             Name Tm/Bye Age Exp Cmp Att  Cm%  PYd Y/Att PTD Int Rsh  Yd TD FantPt
1    1   Peyton Manning  DEN/4  38  17 415 620 66.9 4929  7.95  43  12  24   7  0 407.15
2    2       Drew Brees   NO/6  35  14 404 615 65.7 4859  7.90  37  16  22  44  1 385.35
3    3    Aaron Rodgers   GB/9  31  10 364 560 65.0 4446  7.94  33  13  52 224  3 381.70
4    4      Andrew Luck IND/10  25   3 366 610 60.0 4423  7.25  27  13  62 338  2 361.95
5    5 Matthew Stafford  DET/9  26   6 377 643 58.6 4668  7.26  32  19  34 102  1 358.60
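To try the parsing step without an account, here is a small offline sketch. The inline HTML string is made up for illustration (it reuses values from the first two rows above, with most columns dropped), and it assumes the XML package is installed. Note that readHTMLTable returns text columns unless told otherwise, so numeric columns need converting:

```r
# Made-up HTML fragment standing in for the downloaded page.
html <- paste0(
  "<table>",
  "<tr><th>Rank</th><th>Name</th><th>FantPt</th></tr>",
  "<tr><td>1</td><td>Peyton Manning</td><td>407.15</td></tr>",
  "<tr><td>2</td><td>Drew Brees</td><td>385.35</td></tr>",
  "</table>"
)

# Parse the string as HTML, then pull every <table> into a data frame.
doc <- XML::htmlParse(html, asText = TRUE)
tables <- XML::readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)
projections <- tables[[1]]

# Columns come back as character; convert the ones that should be numeric.
projections$FantPt <- as.numeric(projections$FantPt)
```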

Source

2014-07-13 17:54:20 Stefan

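This worked for me. I edited in the content output. – jdharrison

Cool! I don't think it gets any easier than that... – Stefan

I tested both answers and they both work well. I selected this one for its simplicity. – dadrivr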

You can use RSelenium. I have used the dev version, as it lets you run phantomjs without a Selenium Server.

# Install RSelenium if required. You will need phantomjs in your path or follow instructions 
# in package vignettes 
# devtools::install_github("ropensci/RSelenium") 
# login first 
appURL <- 'http://subscribers.footballguys.com/amember/login.php' 
library(RSelenium)
library(XML) # provides readHTMLTable, used below
pJS <- phantom() # start phantomjs 
remDr <- remoteDriver(browserName = "phantomjs") 
remDr$open() 
remDr$navigate(appURL) 
remDr$findElement("id", "login")$sendKeysToElement(list("myusername")) 
remDr$findElement("id", "pass")$sendKeysToElement(list("mypass")) 
remDr$findElement("css", ".am-login-form input[type='submit']")$clickElement() 

appURL <- 'http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2' 
remDr$navigate(appURL) 
tableElem<- remDr$findElement("css", "table.datamedium") 
res <- readHTMLTable(header = TRUE, tableElem$getElementAttribute("outerHTML")[[1]]) 
> res[[1]][1:5, ]
  Rank             Name Tm/Bye Age Exp Cmp Att  Cm%  PYd Y/Att PTD Int Rsh  Yd TD FantPt
1    1   Peyton Manning  DEN/4  38  17 415 620 66.9 4929  7.95  43  12  24   7  0 407.15
2    2       Drew Brees   NO/6  35  14 404 615 65.7 4859  7.90  37  16  22  44  1 385.35
3    3    Aaron Rodgers   GB/9  31  10 364 560 65.0 4446  7.94  33  13  52 224  3 381.70
4    4      Andrew Luck IND/10  25   3 366 610 60.0 4423  7.25  27  13  62 338  2 361.95
5    5 Matthew Stafford  DET/9  26   6 377 643 58.6 4668  7.26  32  19  34 102  1 358.60

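Finally, when you are finished, close phantomjs:

pJS$stop()

If you want to use a traditional browser such as Firefox (for example, if you want to stick to the CRAN version), you would use: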

RSelenium::startServer() 
remDr <- remoteDriver() 
........ 
........ 
remDr$closeServer()

in place of the related phantomjs calls.

Source

2014-07-13 16:22:59 jdharrison

Thanks, this is a very general approach to solving this. –

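While this is a very useful answer overall, it may be worth noting that the package has since moved on a bit and now allows more convenient browsing via Chrome, Firefox, or IE without phantomjs, e.g. using rD <- RSelenium::rsDriver(port = 5555L, 'firefox'); remDr <- rD[["client"]] and then following the original answer. – Nutle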

@Nutle Good point, and the phantom function has been deprecated in favor of wdman::phantomjs, so perhaps this answer needs updating. – jdharrison
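Following up on the two comments above, a version of the answer based on rsDriver might look roughly like the sketch below. It is untested and makes several assumptions: a recent RSelenium release, a local Firefox installation, and element ids and CSS selectors copied unchanged from the original answer (the site may have changed since). The function name scrape_with_rsdriver is made up for illustration:

```r
# Sketch only: rsDriver() starts a Selenium server and opens the browser,
# so no separate phantom() or remDr$open() call is needed.
scrape_with_rsdriver <- function(username, password, port = 5555L) {
  rD <- RSelenium::rsDriver(port = port, browser = "firefox")
  remDr <- rD[["client"]]
  # Shut down the browser and the server even if a step below fails.
  on.exit({
    remDr$close()
    rD[["server"]]$stop()
  })

  # Log in through the form, as in the original answer.
  remDr$navigate("http://subscribers.footballguys.com/amember/login.php")
  remDr$findElement("id", "login")$sendKeysToElement(list(username))
  remDr$findElement("id", "pass")$sendKeysToElement(list(password))
  remDr$findElement("css selector", ".am-login-form input[type='submit']")$clickElement()

  # Visit the protected page and extract the projections table.
  remDr$navigate("http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2")
  tableElem <- remDr$findElement("css selector", "table.datamedium")
  tableHtml <- tableElem$getElementAttribute("outerHTML")[[1]]
  XML::readHTMLTable(XML::htmlParse(tableHtml, asText = TRUE), header = TRUE)
}
```

The on.exit call is there so that a failed login or a missing table does not leave a browser and server process running.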
