Creating a list of html() elements
I'm sure the problem is simple, but I can't figure out how to make it work. I have four web pages like this:
require(xml2)
require(rvest)
html1 <- html("http://academic.research.microsoft.com/RankList?entitytype=4&topdomainid=2&subdomainid=6&last=0&orderby=6")
html2 <- html("http://academic.research.microsoft.com/RankList?entitytype=3&topdomainid=2&subdomainid=6&last=0&orderby=6")
html3 <- html("http://academic.research.microsoft.com/RankList?entitytype=3&topdomainid=2&subdomainid=7&last=0&orderby=6")
html4 <- html("http://academic.research.microsoft.com/RankList?entitytype=4&topDomainID=2&subDomainID=7&last=0&orderby=6")
htmlPages <- c(html1,html2,html3,html4)
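I'm trying to put them all in a list so I can use them conveniently inside for loops or something similar. Putting them in the list is no problem. The problem is accessing them later. I have this function to get the text from the nodes: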
getCSSElementText <- function(htmlpage, CSSElement)
{
#Return a vector of the text values of the CSS element the function is looking for
cssNodes <- html_nodes(htmlpage, CSSElement)
cssValues <- html_text(cssNodes)
return(cssValues)
}
When I call it like this:
getCSSElementText(htmlPages[1], #properCSSTag#)
I get this error:
Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "list"
Here is my full code, in case something is going wrong elsewhere:
library(rvest)
library(xml2)
html1 <- html("http://academic.research.microsoft.com/RankList?entitytype=4&topdomainid=2&subdomainid=6&last=0&orderby=6")
html2 <- html("http://academic.research.microsoft.com/RankList?entitytype=3&topdomainid=2&subdomainid=6&last=0&orderby=6")
html3 <- html("http://academic.research.microsoft.com/RankList?entitytype=3&topdomainid=2&subdomainid=7&last=0&orderby=6")
html4 <- html("http://academic.research.microsoft.com/RankList?entitytype=4&topDomainID=2&subDomainID=7&last=0&orderby=6")
htmlPages <- c(html1,html2,html3,html4)
CSSElementIDs <- c("#ctl00_MainContent_divRankList a", ".staticOrderCol:nth-child(3)", ".staticOrderCol:nth-child(4)")
getCSSElementText <- function(htmlpage, CSSElement)
{
#Return a vector of the text values of the CSS element the function is looking for
cssNodes <- html_nodes(htmlpage, CSSElement)
cssValues <- html_text(cssNodes)
return(cssValues)
}
getCSSElementNumber <- function(htmlpage, CSSElement)
{
#Return a vector of numbers with proper formatting etc from the CSS element the function is looking for
cssNodes <- html_nodes(htmlpage, CSSElement)
cssValues <- html_text(cssNodes)
parsedCssValues <- as.numeric(gsub("\\D", "", cssValues))
return(parsedCssValues)
}
addToDataFrame <- function(df, vector)
{
df[deparse(substitute(vector))] <- vector
return(df)
}
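Thank you very much for your time!
Try using `htmlPages <- list(html1, html2, html3, html4)` and then `getCSSElementText(htmlPages[[1]], #properCSSTag#)` (two square brackets).
`html()` is deprecated; use `read_html()` instead.
Generally, the simplest way to handle this kind of situation is to make a vector of URLs (or whatever you need) and then iterate over it with `lapply` or `purrr::map` or a variant. Work in parallel from the start rather than halfway through. – alistaire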
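The difference between `c()` and `list()`, and between `[ ]` and `[[ ]]`, can be sketched as follows. This is only an illustration: the inline HTML strings and the `.name` selector are made-up stand-ins for the real pages, and `read_html()` is used in place of the deprecated `html()`.

```r
library(xml2)
library(rvest)

# Hypothetical inline HTML standing in for two downloaded pages.
page1 <- read_html("<html><body><p class='name'>Alpha</p></body></html>")
page2 <- read_html("<html><body><p class='name'>Beta</p></body></html>")

# c() concatenates the internals of the documents into one plain list,
# so its elements are no longer parsed documents. Single-bracket
# subsetting then returns a list, which matches the class named in the
# error message from the question.
flattened <- c(page1, page2)
class(flattened[1])

# list() keeps each parsed document intact as one element.
pages <- list(page1, page2)

# [[ ]] extracts the document itself; [ ] would return a one-element list.
first_names  <- html_text(html_nodes(pages[[1]], ".name"))
second_names <- html_text(html_nodes(pages[[2]], ".name"))
```

With the pages stored via `list()`, a loop such as `for (page in pages)` hands each parsed document to `html_nodes()` directly.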
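A rough sketch of that vector-first approach, under some assumptions: `sources` would normally hold the four URLs, but here it holds made-up inline HTML strings (which `read_html()` also accepts) so the example runs offline. The `.cites` selector and the `get_numbers` helper are hypothetical names, modelled on `getCSSElementNumber` from the question.

```r
library(xml2)
library(rvest)

# Hypothetical stand-ins for the URL vector; in real use these would be
# the page addresses passed to read_html().
sources <- c(
  "<html><body><span class='cites'>1,234</span><span class='cites'>56</span></body></html>",
  "<html><body><span class='cites'>7,890</span></body></html>"
)

# Parse every source in one step; the result is a list of documents.
pages <- lapply(sources, read_html)

# Extract the text of the matching nodes and strip non-digit characters,
# as getCSSElementNumber does in the question.
get_numbers <- function(page, css) {
  values <- html_text(html_nodes(page, css))
  as.numeric(gsub("\\D", "", values))
}

# Apply the extractor to each page; extra arguments are passed through.
cites <- lapply(pages, get_numbers, css = ".cites")
```

Because everything stays in a list from the start, there is no need for separate `html1` to `html4` variables, and `purrr::map(sources, read_html)` could replace the first `lapply` call if purrr is preferred.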