2015-01-08 88 views
4

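I want to use a loop to scrape data from multiple web pages in R. I can scrape the data from a single page, but when I try to loop over several pages I get a frustrating error. I have spent hours tinkering, to no avail. Any help would be greatly appreciated! How can I use a loop to scrape website data from multiple web pages in R?

This works: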

########################### 
# GET COUNTRY DATA 
########################### 

library("rvest") 

site <- paste("http://www.countryreports.org/country/","Norway",".htm", sep="") 
site <- html(site) 

stats<- 
    data.frame(names =site %>% html_nodes(xpath="//*/td[1]") %>% html_text() , 
     facts =site %>% html_nodes(xpath="//*/td[2]") %>% html_text() , 
     stringsAsFactors=FALSE) 

stats$country <- "Norway" 
stats$names <- gsub('[\r\n\t]', '', stats$names) 
stats$facts <- gsub('[\r\n\t]', '', stats$facts) 
View(stats) 

However, when I try to write this as a loop, I get an error:

########################### 
# ATTEMPT IN A LOOP 
########################### 

country<-c("Norway","Sweden","Finland","France","Greece","Italy","Spain") 

for(i in country){ 

site <- paste("http://www.countryreports.org/country/",country,".htm", sep="") 
site <- html(site) 

stats<- 
data.frame(names =site %>% html_nodes(xpath="//*/td[1]") %>% html_text() , 
     facts =site %>% html_nodes(xpath="//*/td[2]") %>% html_text() , 
     stringsAsFactors=FALSE) 

stats$country <- country 
stats$names <- gsub('[\r\n\t]', '', stats$names) 
stats$facts <- gsub('[\r\n\t]', '', stats$facts) 

stats<-rbind(stats,stats) 
stats<-stats[!duplicated(stats),] 
} 

Error:

Error: length(url) == 1 is not TRUE 
In addition: Warning message: 
In if (grepl("^http", x)) { : 
    the condition has length > 1 and only the first element will be used 
+0

Same result here. I tried this code and got the same error message, even with the non-loop version! > length(site) [1] 7 > stopifnot(length(site) == 1) Error: length(site) == 1 is not TRUE – lawyeR

+1

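On this line: 'site <- paste("http://www.countryreports.org/country/",country,".htm", sep="")' you are using 'country', which in the loop version is the character vector of all your countries. You probably want 'i', which is a single element of your country vector. – zelite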
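To illustrate the comment above: paste() is vectorized, so passing the whole country vector produces one URL per country, while passing the loop variable produces a single URL. A minimal base-R sketch that needs no network access (the names urls_all and url_one are made up for illustration):

```r
country <- c("Norway", "Sweden", "Finland")

# Passing the whole vector: paste() recycles the fixed parts and
# returns one URL per country, which a single-URL reader rejects.
urls_all <- paste("http://www.countryreports.org/country/", country, ".htm", sep = "")
length(urls_all)

# Passing one element (as the loop variable i would be) gives one URL.
i <- country[1]
url_one <- paste("http://www.countryreports.org/country/", i, ".htm", sep = "")
length(url_one)
```

Checking length() on the URL before fetching is a quick way to catch this kind of mistake.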

+0

zelite - that gets me closer - thanks. –

Answers

5

The code that finally worked:

########################### 
# THIS WORKS!!!! 
########################### 

country<-c("Norway","Sweden","Finland","France","Greece","Italy","Spain") 

for(i in country){ 

site <- paste("http://www.countryreports.org/country/",i,".htm", sep="") 
site <- html(site) 

stats<- 
data.frame(names =site %>% html_nodes(xpath="//*/td[1]") %>% html_text() , 
    facts =site %>% html_nodes(xpath="//*/td[2]") %>% html_text() , 
     stringsAsFactors=FALSE) 

stats$nm <- i 
stats$names <- gsub('[\r\n\t]', '', stats$names) 
stats$facts <- gsub('[\r\n\t]', '', stats$facts) 
#stats<-stats[!duplicated(stats),] 
all<-rbind(all,stats) 

} 
View(all) 
+1

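Does this really work for you? I am trying to do something similar, so I ran your code and got the following error: Error in rep(xi, length.out = nvar) : attempt to replicate an object of type 'builtin'. Did you initialize 'all' somewhere earlier? –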
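A likely explanation for the error in the comment above is that, in a fresh session, all refers to the base R function rather than a data frame. A minimal sketch of the point, assuming nothing has been assigned to all beforehand (the name results and the row values are made up for illustration and are not scraped data):

```r
# `all` is a base R function unless you assign something to that name.
is.function(all)

# Safer: start from an explicitly created empty data frame with its own name.
results <- data.frame(names = character(0), facts = character(0),
                      nm = character(0), stringsAsFactors = FALSE)

# Stand-in for one scraped page (made-up values, not real data).
stats <- data.frame(names = "Capital", facts = "example", nm = "Norway",
                    stringsAsFactors = FALSE)

results <- rbind(results, stats)
nrow(results)
```

Using a name that does not shadow a base function also makes an uninitialized accumulator fail with a clearer "object not found" message.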

0

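This is what I did. It is not the best solution, but you will get an output. It is only a workaround, and in general I do not recommend writing table output to a file while a loop is running. Here you go. After the output is generated from stats,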

output<-rbind(stats,i) 

then write the table,

write.table(output, file = "D:\\Documents\\HTML\\Test of loop.csv", row.names = FALSE, append = TRUE, sep = ",") 

#then close the loop 
} 

Good luck
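If you do go the append-to-file route, one thing to watch is that the column header is written only once. A minimal sketch of that idea, assuming a temporary file instead of the path above and made-up rows instead of scraped data (out_file and first_write are hypothetical names; the country is stored as a column rather than as an extra row):

```r
out_file <- tempfile(fileext = ".csv")  # hypothetical path for illustration

for (i in c("Norway", "Sweden")) {
  # Stand-in for the scraped `stats` data frame (made-up values).
  stats <- data.frame(names = "Capital", facts = "example", nm = i,
                      stringsAsFactors = FALSE)

  first_write <- !file.exists(out_file)
  write.table(stats, file = out_file, sep = ",", row.names = FALSE,
              col.names = first_write,   # header only on the first write
              append = !first_write)     # append on later iterations
}

# Read the file back to check that both iterations were saved.
written <- read.csv(out_file, stringsAsFactors = FALSE)
nrow(written)
```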

1

Just initialize an empty data frame before the loop. I worked through this problem, and the code below works for me.

country<-c("Norway","Sweden","Finland","France","Greece","Italy","Spain") 
df <- data.frame(names = character(0),facts = character(0),nm = character(0)) 

for(i in country){ 

    site <- paste("http://www.countryreports.org/country/",i,".htm", sep="") 
    site <- html(site) 

    stats<- 
    data.frame(names =site %>% html_nodes(xpath="//*/td[1]") %>% html_text() , 
       facts =site %>% html_nodes(xpath="//*/td[2]") %>% html_text() , 
       stringsAsFactors=FALSE) 

    stats$nm <- i 
    stats$names <- gsub('[\r\n\t]', '', stats$names) 
    stats$facts <- gsub('[\r\n\t]', '', stats$facts) 
    #stats<-stats[!duplicated(stats),] 
    #all<-rbind(all,stats) 
    df <- rbind(df, stats) 
    #all <- merge(Output,stats) 

} 
View(df)
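As a variation on the loop above, the per-country data frames can be collected in a list and bound once at the end, which avoids needing a pre-initialized accumulator. This is only a sketch: scrape_one is a hypothetical helper that returns made-up rows so the example runs offline. In real use its body would hold the rvest calls from the loop (and in newer rvest releases read_html() takes the place of html()).

```r
country <- c("Norway", "Sweden", "Finland")

# Hypothetical helper: in real use its body would hold the rvest calls
# from the loop above; here it returns made-up rows so the sketch runs offline.
scrape_one <- function(i) {
  data.frame(names = c("Capital", "Population"),
             facts = c("example", "example"),
             nm = i,
             stringsAsFactors = FALSE)
}

pages <- lapply(country, scrape_one)  # one data frame per country
df <- do.call(rbind, pages)           # bind them in a single step
nrow(df)
```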