2016-08-12 111 views
0

Creating a list of html() elements

The problem is, I'm sure, quite simple, but I can't figure out how to make it work. I have four web pages like this:

require(xml2) 
require(rvest) 
html1 <- html("http://academic.research.microsoft.com/RankList?entitytype=4&topdomainid=2&subdomainid=6&last=0&orderby=6") 

html2 <- html("http://academic.research.microsoft.com/RankList?entitytype=3&topdomainid=2&subdomainid=6&last=0&orderby=6") 

html3 <- html("http://academic.research.microsoft.com/RankList?entitytype=3&topdomainid=2&subdomainid=7&last=0&orderby=6") 

html4 <- html("http://academic.research.microsoft.com/RankList?entitytype=4&topDomainID=2&subDomainID=7&last=0&orderby=6") 

htmlPages <- c(html1,html2,html3,html4) 

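I'm trying to put them all in a list so they are convenient to use inside a for loop or something similar. Putting them in the list is no problem. The problem is accessing them later, by which I mean getting the text out of the nodes with this function: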

getCSSElementText <- function(htmlpage, CSSElement) 
{ 
    #Return a vector of the text values of the CSS element the function is looking for 

    cssNodes <- html_nodes(htmlpage, CSSElement) 
    cssValues <- html_text(cssNodes) 
    return(cssValues) 
} 

When I call

getCSSElementText(htmlPages[1], #properCSSTag#)

I get this error:

Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "list"

Here is my full code, in case something is wrong elsewhere:

library(rvest) 
library(xml2) 
html1 <- html("http://academic.research.microsoft.com/RankList?entitytype=4&topdomainid=2&subdomainid=6&last=0&orderby=6") 
html2 <- html("http://academic.research.microsoft.com/RankList?entitytype=3&topdomainid=2&subdomainid=6&last=0&orderby=6") 
html3 <- html("http://academic.research.microsoft.com/RankList?entitytype=3&topdomainid=2&subdomainid=7&last=0&orderby=6") 
html4 <- html("http://academic.research.microsoft.com/RankList?entitytype=4&topDomainID=2&subDomainID=7&last=0&orderby=6") 
htmlPages <- c(html1,html2,html3,html4) 

CSSElementIDs <- c("#ctl00_MainContent_divRankList a", ".staticOrderCol:nth-child(3)", ".staticOrderCol:nth-child(4)") 

getCSSElementText <- function(htmlpage, CSSElement) 
{ 
    #Return a vector of the text values of the CSS element the function is looking for 

    cssNodes <- html_nodes(htmlpage, CSSElement) 
    cssValues <- html_text(cssNodes) 
    return(cssValues) 
} 

getCSSElementNumber <- function(htmlpage, CSSElement) 
{ 
    #Return a vector of numbers with proper formatting etc from the CSS element the function is looking for 
    cssNodes <- html_nodes(htmlpage, CSSElement) 
    cssValues <- html_text(cssNodes) 
    parsedCssValues <- as.numeric(gsub("\\D", "", cssValues)) 
    return(parsedCssValues) 
} 

addToDataFrame <- function(df, vector) 
{ 
    df[deparse(substitute(vector))] <- vector 
    return(df) 
} 

Thank you very much for your time!

+1

Try 'htmlPages <- list(html1, html2, html3, html4)' and then 'getCSSElementText(htmlPages[[1]], #properCSSTag#)' (two square brackets).

+1

'html()' is deprecated; use 'read_html()' instead.

+1

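Usually the simplest way to handle this is to make a vector of URLs (or whatever you need) and then iterate over it with 'lapply' or 'purrr::map' or a variant. Work in parallel over all pages from the start, rather than switching halfway through. – alistaire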
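
A minimal sketch of what that comment suggests, not the commenter's own code. The original site may no longer be reachable, so the 'sources' vector below holds made-up inline HTML strings and a made-up '.name' selector; with live pages it would hold the URLs instead, and 'read_html()' would be called the same way.

```r
library(xml2)
library(rvest)

# Hypothetical stand-ins for the pages; with live pages this vector
# would contain the URLs rather than literal HTML strings.
sources <- c(
  "<html><body><p class='name'>Alice</p></body></html>",
  "<html><body><p class='name'>Bob</p></body></html>"
)

# lapply() returns a list with one parsed document per source,
# so no manual c() or list() call is needed.
htmlPages <- lapply(sources, read_html)

# Each list element is a complete xml_document and can be queried directly.
names_found <- vapply(
  htmlPages,
  function(page) html_text(html_nodes(page, ".name")),
  character(1)
)
```

Because the pages never exist as separate html1 ... html4 variables, there is no step at which they could be flattened by accident.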

Answers

2

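When you concatenate your html* objects with c() (each of them is a list of length 2), they are flattened into a single list of length 8: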

htmlPages <- c(html1,html2,html3,html4) 
str(htmlPages) 
# List of 8 
# $ node:<externalptr> 
# $ doc :<externalptr> 
# $ node:<externalptr> 
# $ doc :<externalptr> 
# $ node:<externalptr> 
# $ doc :<externalptr> 
# $ node:<externalptr> 
# $ doc :<externalptr> 

Instead, put the html* objects into a list with list():

htmlPages <- list(html1,html2,html3,html4) 
str(htmlPages) 
# List of 4 
# $ :List of 2 
# ..$ node:<externalptr> 
# ..$ doc :<externalptr> 
# ..- attr(*, "class")= chr [1:2] "xml_document" "xml_node" 
# $ :List of 2 
# ..$ node:<externalptr> 
# ..$ doc :<externalptr> 
# ..- attr(*, "class")= chr [1:2] "xml_document" "xml_node" 
# $ :List of 2 
# ..$ node:<externalptr> 
# ..$ doc :<externalptr> 
# ..- attr(*, "class")= chr [1:2] "xml_document" "xml_node" 
# $ :List of 2 
# ..$ node:<externalptr> 
# ..$ doc :<externalptr> 
# ..- attr(*, "class")= chr [1:2] "xml_document" "xml_node" 
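
The same difference can be reproduced without any web pages. This is only an illustration: 'doc1', 'doc2' and the class name "fake_document" are invented here, and the string values are placeholders for the external pointers shown above.

```r
# Two stand-in "documents": classed lists of length 2, shaped like the
# node/doc pairs in the str() output (the values are placeholders).
doc1 <- structure(list(node = "n1", doc = "d1"), class = "fake_document")
doc2 <- structure(list(node = "n2", doc = "d2"), class = "fake_document")

flattened <- c(doc1, doc2)     # elements are spliced together, class is dropped
nested    <- list(doc1, doc2)  # each object is kept whole

length(flattened)   # 4
length(nested)      # 2
class(nested[[1]])  # "fake_document"
```

With c(), nothing marks where one document ends and the next begins, which is why the flattened version cannot be passed to functions that expect a document.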

Then extract an individual page with [[ (double brackets):

htmlPages[[1]] 
# {xml_document} 
# <html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml"> 
# [1] <head id="Head1">\n <meta http-equiv="Content-Type" content="text/html; ... 
# [2] <body onpageshow="document.forms['aspnetForm'].reset();">&#13;\n <form ... 
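
To tie this back to the question, here is a small sketch of looping over such a list. It assumes two invented inline pages and an invented '.rank' selector in place of the real URLs and CSS selectors, and it uses read_html() rather than the deprecated html().

```r
library(xml2)
library(rvest)

# Same helper as in the question, written compactly
getCSSElementText <- function(htmlpage, CSSElement) {
  html_text(html_nodes(htmlpage, CSSElement))
}

# Hypothetical inline pages standing in for html1 ... html4
htmlPages <- list(
  read_html("<html><body><a class='rank'>First</a></body></html>"),
  read_html("<html><body><a class='rank'>Second</a></body></html>")
)

class(htmlPages[1])    # "list": single brackets return a sub-list
class(htmlPages[[1]])  # "xml_document" "xml_node"

# Apply the helper to every page; the result is a list of character vectors
allText <- lapply(htmlPages, getCSSElementText, CSSElement = ".rank")
```

Passing htmlPages[1] to the helper would reproduce the "no applicable method" error from the question, because a plain list has no xml_find_all method.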
+0

It worked! Thank you very much! I'll look into using read_html to get rid of the deprecation warning. Should it work the same way? – ChowderII
