2017-07-04 10 views

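I'm mining species data from this web page, which offers no API or downloadable list, and I'm having trouble with the strings in the rvest results: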

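library(rvest)

moltres <- 1:30
pidgey <- NULL  # must exist before the first rbind() call
for (i in moltres) {
    # fetch one taxon page and keep the text of its first <section> node
    temphtml <- read_html(paste0("http://checklist.aou.org/taxa/", i)) %>%
        html_node("section") %>%
        html_text()
    pidgey <- rbind(pidgey, temphtml)
}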

This produces a result like the following for each item in the list:

"\n \n  species: \n  Chen canagica (Emperor Goose, Oie empereur)\n \n\n\n\nNOTE: This is an invalidated taxon. It is a 'synonym' for 12681, which has superseded it.\n\n\n\n\t\n Compare AOU treatments of \n \n  Chen canagica,\n in Avibase\n  (1886 to present).\n \n\n\tSearch for \n \n  Chen canagica\n at Cornell Birds of North America.\n \n\n\n\n\n Annotation: Monotypic.\n\n\n\n\n\n\n\n\n\t" 

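From each entry I want to extract the code, here 12681, from the sentence "It is a 'synonym' for 12681" (these codes point to the up-to-date names of the species).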

I tried:

pidgey$sub<-sub(".*synonim (.*?)\\,.*", "\\1", pidgey) 

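But it makes a big mess of my original rvested list, and at the end there is a column that does not contain what I want. I think this has to do with the text format. I greatly appreciate your help.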


You misspelled "synonym" in your regex, and you didn't account for the ' after synonym. Try 'synonym'.*?([0-9]*), or something close to it? – Isaac
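A minimal offline sketch of why the original attempt looked like a mess, and of one pattern "close to" the comment's suggestion. The two strings in `txt` are made-up stand-ins for scraped pages (an assumption, not real output), and the corrected pattern is one possible variant. `sub()` returns its input unchanged when the pattern does not match, so the misspelled pattern simply copies the whole page text. A lazy `.*?` followed by `[0-9]*` can also match zero digits, which is why this sketch uses the literal ` for ` and `[0-9]+` instead.

```r
# Hypothetical sample text: one superseded taxon, one ordinary taxon.
txt <- c(
  "species:\nChen canagica\nNOTE: It is a 'synonym' for 12681, which has superseded it.\n",
  "species:\nSome valid taxon\nAnnotation: Monotypic.\n"
)

# Misspelled pattern: nothing matches, so sub() returns the input untouched.
bad <- sub(".*synonim (.*?)\\,.*", "\\1", txt)

# Corrected pattern. (?s) lets "." cross newlines under perl = TRUE.
pat <- "(?s).*'synonym' for ([0-9]+).*"

# sub() still returns non-matching entries unchanged, so guard with grepl()
# and put NA where there is no synonym code.
codes <- ifelse(grepl(pat, txt, perl = TRUE),
                sub(pat, "\\1", txt, perl = TRUE),
                NA_character_)
print(codes)
```

Separately, `pidgey` in the question is a matrix built by `rbind()`, so it may be simpler to store the result in its own vector rather than assigning to `pidgey$sub`.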

Answer


I don't know whether the text changes with the locale, but this will match either "synonym" or "synonim" and get the # you desire:

library(rvest) 
library(dplyr) 
library(purrr) 
library(stringi) 

moltres <- 1:30 

pb <- progress_estimated(length(moltres)) 
map_df(moltres, ~{ 

    pb$tick()$print() 

    Sys.sleep(sample(1:5, 1)) # be kind, you have time and the resource is free 

    pg <- read_html(sprintf("http://checklist.aou.org/taxa/%s", .x)) 

    data_frame(
        res = .x,
        txt = html_node(pg, "section") %>% html_text()
    )

}) -> xdf 

xdf$synon <- stri_match_first_regex(xdf$txt, "'synon[yi]m' for ([[:digit:]]+)")[,2] 

select(xdf, synon) %>% 
    print(n=30) 
## # A tibble: 30 x 1 
## synon 
## <chr> 
## 1 <NA> 
## 2 <NA> 
## 3 <NA> 
## 4 <NA> 
## 5 <NA> 
## 6 <NA> 
## 7 <NA> 
## 8 <NA> 
## 9 <NA> 
## 10 <NA> 
## 11 <NA> 
## 12 <NA> 
## 13 <NA> 
## 14 <NA> 
## 15 <NA> 
## 16 12681 
## 17 12691 
## 18 12701 
## 19 <NA> 
## 20 <NA> 
## 21 <NA> 
## 22 <NA> 
## 23 <NA> 
## 24 <NA> 
## 25 <NA> 
## 26 <NA> 
## 27 <NA> 
## 28 <NA> 
## 29 <NA> 
## 30 <NA> 
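For readers without stringi, the extraction step can be approximated in base R. This is a sketch under assumptions, not a drop-in replacement for the code above: the `txt` vector is invented sample text, and `regexec()` plus `regmatches()` stand in for `stri_match_first_regex()`. The `[,2]` in the stringi call selects the first capture group (column 1 holds the full match); here the same role is played by the second element of each match vector.

```r
# Hypothetical sample text standing in for xdf$txt (no network access needed).
txt <- c(
  "NOTE: It is a 'synonym' for 12681, which has superseded it.",
  "Annotation: Monotypic.",
  "NOTE: It is a 'synonim' for 12691, which has superseded it."
)

# regexec() records the full match and each capture group per element;
# regmatches() turns that into a list of character vectors,
# with character(0) for elements that did not match.
m <- regmatches(txt, regexec("'synon[yi]m' for ([[:digit:]]+)", txt))

# Take the first capture group, or NA when the element had no match.
synon <- vapply(
  m,
  function(x) if (length(x) >= 2) x[2] else NA_character_,
  character(1),
  USE.NAMES = FALSE
)
print(synon)
```

Unlike `sub()`, this leaves `NA` for pages that carry no synonym note, which matches the shape of the tibble column shown above.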

Thanks a lot! Despite my spelling mistake, it still works! Awesome!