2017-04-19

The following code goes to the R Journal's Accepted articles page and downloads the first article, following multiple links with rvest::follow_link():

library(rvest) 
library(magrittr) 
url_stem <- html_session("https://journal.r-project.org/archive/accepted/") 
url_paper <- follow_link(url_stem, "package") %>% 
    follow_link("package") -> url_article 
download.file(url_article$url, destfile = "article.pdf") 

What I want is to download all articles that match one or more words from a given set of search terms.

Since follow_link() takes an expression, I tried looping over the search terms, bearing in mind that the function returns an error when no matching link is found.

library(rvest) 
library(magrittr) 
url_stem <- html_session("https://journal.r-project.org/archive/accepted/") 
search_terms <- c("package", "model", "linear") 
tryCatch( 
    for (i in search_terms) { 
        url_paper <- follow_link(url_stem, search_terms[i]) %>% 
            follow_link(search_terms[i]) -> url_article 
        download.file(url_article$url, destfile = "article.pdf") # Don't know how I would write this as article1.pdf, article2.pdf, ... 
    } 
) 

I get the following error:

Error in if (!any(match)) { : missing value where TRUE/FALSE needed 

This thread wasn't helpful, as it addresses the case of tags. The problem seems simple and may have an even simpler solution, but that is probably because the R Journal site is very tidy. Some sites are rather messy.
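A likely cause of the error: for(i in search_terms) iterates over the term strings themselves, so search_terms[i] indexes an unnamed character vector with a character value and yields NA, which follow_link() cannot test. A minimal base-R sketch of the intended pattern, with fetch_pdf() a hypothetical stand-in for the network-dependent follow_link()/download.file() steps:

```r
search_terms <- c("package", "model", "linear")

# Hypothetical stand-in: returns a URL, or errors like follow_link()
# does when no link matches the term.
fetch_pdf <- function(term) {
  if (term == "linear") stop("No links matched 'linear'")
  paste0("https://example.org/", term, ".pdf")
}

downloaded <- character(0)
for (i in seq_along(search_terms)) {          # loop by position, not by value
  url <- tryCatch(fetch_pdf(search_terms[i]), # wrap each call, not the whole loop
                  error = function(e) NULL)
  if (!is.null(url)) {
    # paste0() builds a distinct name per iteration: article1.pdf, article2.pdf, ...
    destfile <- paste0("article", i, ".pdf")
    downloaded <- c(downloaded, destfile)
  }
}
downloaded  # "article1.pdf" "article2.pdf"
```

Wrapping tryCatch() around each call (rather than the whole for loop) lets the loop skip a failed term and continue with the rest.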

Answer


If this is the actual problem you're trying to solve (finding R Journal entries that contain 'package'), rather than a smaller example of a larger scraping task on another site, then you can do this:

library(xml2) 
library(stringi) 
library(tidyverse) 

doc <- xml_ns_strip(read_xml("https://journal.r-project.org/rss.atom")) 

xml_find_all(doc, "//entry[contains(., 'ackage')]") %>% 
    map_chr(~{ 
        xml_find_first(.x, ".//link") %>% 
            xml_attr("href") %>% 
            stri_replace_last_fixed("/index.html", "") %>% 
            stri_replace_last_regex("/(RJ-.*)$", "/$1/$1.pdf") 
    }) 

## [1] "https://journal.r-project.org/archive/2017/RJ-2017-003/RJ-2017-003.pdf" 
## [2] "https://journal.r-project.org/archive/2017/RJ-2017-005/RJ-2017-005.pdf" 
## [3] "https://journal.r-project.org/archive/2017/RJ-2017-006/RJ-2017-006.pdf" 
## [4] "https://journal.r-project.org/archive/2017/RJ-2017-008/RJ-2017-008.pdf" 
## [5] "https://journal.r-project.org/archive/2017/RJ-2017-010/RJ-2017-010.pdf" 
## [6] "https://journal.r-project.org/archive/2017/RJ-2017-011/RJ-2017-011.pdf" 
## [7] "https://journal.r-project.org/archive/2017/RJ-2017-015/RJ-2017-015.pdf" 
## [8] "https://journal.r-project.org/archive/2017/RJ-2017-012/RJ-2017-012.pdf" 
## [9] "https://journal.r-project.org/archive/2017/RJ-2017-016/RJ-2017-016.pdf" 
## [10] "https://journal.r-project.org/archive/2017/RJ-2017-014/RJ-2017-014.pdf" 
## [11] "https://journal.r-project.org/archive/2017/RJ-2017-018/RJ-2017-018.pdf" 
## [12] "https://journal.r-project.org/archive/2017/RJ-2017-019/RJ-2017-019.pdf" 
## [13] "https://journal.r-project.org/archive/2017/RJ-2017-021/RJ-2017-021.pdf" 
## [14] "https://journal.r-project.org/archive/2017/RJ-2017-022/RJ-2017-022.pdf" 
## [15] "https://journal.r-project.org/archive/2016/RJ-2016-031/RJ-2016-031.pdf" 
## [16] "https://journal.r-project.org/archive/2016/RJ-2016-032/RJ-2016-032.pdf" 
## [17] "https://journal.r-project.org/archive/2016/RJ-2016-033/RJ-2016-033.pdf" 
## [18] "https://journal.r-project.org/archive/2016/RJ-2016-034/RJ-2016-034.pdf" 
## [19] "https://journal.r-project.org/archive/2016/RJ-2016-036/RJ-2016-036.pdf" 
## [20] "https://journal.r-project.org/archive/2016/RJ-2016-041/RJ-2016-041.pdf" 
## [21] "https://journal.r-project.org/archive/2016/RJ-2016-043/RJ-2016-043.pdf" 
## [22] "https://journal.r-project.org/archive/2016/RJ-2016-045/RJ-2016-045.pdf" 
## [23] "https://journal.r-project.org/archive/2016/RJ-2016-046/RJ-2016-046.pdf" 
## [24] "https://journal.r-project.org/archive/2016/RJ-2016-047/RJ-2016-047.pdf" 
## [25] "https://journal.r-project.org/archive/2016/RJ-2016-048/RJ-2016-048.pdf" 
## [26] "https://journal.r-project.org/archive/2016/RJ-2016-050/RJ-2016-050.pdf" 
## [27] "https://journal.r-project.org/archive/2016/RJ-2016-052/RJ-2016-052.pdf" 
## [28] "https://journal.r-project.org/archive/2016/RJ-2016-054/RJ-2016-054.pdf" 
## [29] "https://journal.r-project.org/archive/2016/RJ-2016-055/RJ-2016-055.pdf" 
## [30] "https://journal.r-project.org/archive/2016/RJ-2016-056/RJ-2016-056.pdf" 
## [31] "https://journal.r-project.org/archive/2016/RJ-2016-057/RJ-2016-057.pdf" 
## [32] "https://journal.r-project.org/archive/2016/RJ-2016-058/RJ-2016-058.pdf" 
## [33] "https://journal.r-project.org/archive/2016/RJ-2016-059/RJ-2016-059.pdf" 
## [34] "https://journal.r-project.org/archive/2016/RJ-2016-060/RJ-2016-060.pdf" 
## [35] "https://journal.r-project.org/archive/2016/RJ-2016-062/RJ-2016-062.pdf" 

The RSS feed is a much better source to scrape.

Even if this isn't your specific task, I feel this line:

xml_find_all(doc, "//entry[contains(., 'ackage')]") 

is ultimately what you're after. It finds all entry tags that contain that string anywhere in their descendants. You can use XPath boolean logic inside the [] (i.e. chain multiple conditions).
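For instance, chaining conditions with "or" covers several search terms in one query. A small sketch on a hypothetical sample feed (the trailing "ackage"/"inear" fragments dodge upper/lower-case first letters, as in the answer above):

```r
library(xml2)

# Hypothetical sample feed, standing in for the real Atom document.
doc <- read_xml('
<feed>
  <entry><title>A new package for models</title></entry>
  <entry><title>Linear regression notes</title></entry>
  <entry><title>Editorial</title></entry>
</feed>')

# "or" chains the contains() conditions inside one predicate.
hits <- xml_find_all(doc, "//entry[contains(., 'ackage') or contains(., 'inear')]")
length(hits)  # 2
```

The same predicate with "and" would instead require every term to appear in a single entry.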