2016-09-16

I'm trying to retrieve some information from r-users.com. When I run the code below, I get the warning message:

XML content does not seem to be XML 

Any help would be appreciated.

library(data.table) 
library(XML) 

pages <- c(1:10) 

urls <- rbindlist(lapply(pages, function(x) { 
    url <- paste("https://www.r-users.com/jobs/page/", x, "/", sep = "") 
    data.frame(url) 
}), fill = TRUE) 

jobLocations <- rbindlist(apply(urls, 1, function(url) { 
    doc1 <- htmlParse(url) 
    locations <- getNodeSet(doc1, '//*[@id="mainContent"]/div[2]/ol/li/dl/dd[3]/span') 
    data.frame(sapply(locations, xmlValue)) 
}), fill = TRUE)

If I visit one of the URLs and view the source, e.g. https://www.r-users.com/jobs/page/1/, there is no XML on the page (though it may be loading XML in the background to fetch results). I suspect your warning is telling the truth: you are parsing HTML, not XML. –
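The comment above matches a likely cause of the warning: the XML package's `htmlParse()` cannot fetch `https://` URLs on its own, so it falls back to treating the URL string itself as document content and warns. A minimal sketch of one workaround (assuming RCurl is installed): download the page first, then parse the downloaded text.

```r
library(XML)
library(RCurl)

# XML::htmlParse() cannot fetch https:// URLs itself; download the page
# first with RCurl, then parse the downloaded string with asText = TRUE
url  <- "https://www.r-users.com/jobs/page/1/"
page <- getURL(url)                      # raw HTML as a character string
doc  <- htmlParse(page, asText = TRUE)   # parse the string, not a URL

locations <- getNodeSet(doc, '//*[@id="mainContent"]/div[2]/ol/li/dl/dd[3]/span')
sapply(locations, xmlValue)
```

The key change is `asText = TRUE`, which tells `htmlParse()` that its first argument is the document itself rather than a file name or URL.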

Answer

rvest and purrr are a powerful combination for web scraping:

library(rvest) 
library(purrr) 

# make URLs 
locations <- 1:10 %>% paste0("https://www.r-users.com/jobs/page/", .) %>% 
    # pull and parse HTML for each URL 
    map(read_html) %>% 
    # select nodes for each page's HTML 
    map(html_nodes, xpath = '//*[@id="mainContent"]/div[2]/ol/li/dl/dd[3]/span') %>% 
    # return text inside of each node 
    map(html_text) %>% 
    # simplify list to vector 
    simplify() 

head(locations) 
## [1] "Massachusetts, United States" "New York, United States"  "England, United Kingdom"  
## [4] "California, United States" "Ontario, Canada"    "Indiana, United States"
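To check the pipeline without hitting the live site, the same rvest/purrr steps can be run against an inline HTML snippet (the snippet below is made up to mimic one job listing, with the location in the third `dd`):

```r
library(rvest)
library(purrr)

# a fabricated fragment standing in for one page of listings
snippet <- '<ol><li><dl><dd></dd><dd></dd><dd><span>Ontario, Canada</span></dd></dl></li></ol>'

locs <- list(snippet) %>% 
    # read_html() accepts an HTML string as well as a URL
    map(read_html) %>% 
    # same idea as the live XPath: the span inside the third dd
    map(html_nodes, xpath = "//dd[3]/span") %>% 
    # extract the text inside each node
    map(html_text) %>% 
    # flatten the one-element list to a character vector
    simplify()

locs
## [1] "Ontario, Canada"
```

Swapping the list of snippets for the vector of live URLs recovers the answer's pipeline unchanged, which is what makes this a convenient offline test.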