How do I use a loop to scrape website data from multiple web pages in R?

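I would like to use a loop to scrape data from multiple web pages in R. I am able to scrape the data for a single page, but when I try to loop over several pages I get a frustrating error. I have spent hours tinkering, to no avail. Any help would be much appreciated!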

This works:

########################### 
# GET COUNTRY DATA 
########################### 

library("rvest") 

site <- paste("http://www.countryreports.org/country/","Norway",".htm", sep="") 
site <- html(site) 

stats<- 
    data.frame(names =site %>% html_nodes(xpath="//*/td[1]") %>% html_text() , 
     facts =site %>% html_nodes(xpath="//*/td[2]") %>% html_text() , 
     stringsAsFactors=FALSE) 

stats$country <- "Norway" 
stats$names <- gsub('[\r\n\t]', '', stats$names) 
stats$facts <- gsub('[\r\n\t]', '', stats$facts) 
View(stats)

However, when I try to write this as a loop, I get an error:

########################### 
# ATTEMPT IN A LOOP 
########################### 

country<-c("Norway","Sweden","Finland","France","Greece","Italy","Spain") 

for(i in country){ 

site <- paste("http://www.countryreports.org/country/",country,".htm", sep="") 
site <- html(site) 

stats<- 
data.frame(names =site %>% html_nodes(xpath="//*/td[1]") %>% html_text() , 
     facts =site %>% html_nodes(xpath="//*/td[2]") %>% html_text() , 
     stringsAsFactors=FALSE) 

stats$country <- country 
stats$names <- gsub('[\r\n\t]', '', stats$names) 
stats$facts <- gsub('[\r\n\t]', '', stats$facts) 

stats<-rbind(stats,stats) 
stats<-stats[!duplicated(stats),] 
}

Error:

Error: length(url) == 1 is not TRUE 
In addition: Warning message: 
In if (grepl("^http", x)) { : 
    the condition has length > 1 and only the first element will be used

Source

2015-01-08 Chris L

Same result here. I tried this code and got the same error message even with the non-loop version! > length(site) [1] 7 > stopifnot(length(site) == 1) Error: length(site) == 1 is not TRUE – lawyeR

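On this line: 'site <- paste("http://www.countryreports.org/country/",country,".htm", sep="")' you are using 'country', which in the loop version is the character vector holding all of your countries. You probably want 'i', which is a single element of your country vector. – zelite

zelite, that got me a lot closer. Thanks.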
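The point in zelite's comment can be checked without touching the network. The sketch below is only an illustration (the shortened country list and the variable names all_urls and one_url are made up here): paste() is vectorized, so passing the whole vector yields one URL per country, while passing the loop variable yields exactly one URL per iteration, which is what a single page fetch expects.

```r
country <- c("Norway", "Sweden", "Finland")

# Pasting the whole vector builds one URL per country (a length-3 vector),
# which is what triggers "length(url) == 1 is not TRUE" in the question.
all_urls <- paste("http://www.countryreports.org/country/", country, ".htm", sep = "")
length(all_urls)

for (i in country) {
  # Pasting the loop variable builds a single URL on each pass.
  one_url <- paste("http://www.countryreports.org/country/", i, ".htm", sep = "")
  stopifnot(length(one_url) == 1)
}
```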

The code that finally worked:

########################### 
# THIS WORKS!!!! 
########################### 

country<-c("Norway","Sweden","Finland","France","Greece","Italy","Spain") 

for(i in country){ 

site <- paste("http://www.countryreports.org/country/",i,".htm", sep="") 
site <- html(site) 

stats<- 
data.frame(names =site %>% html_nodes(xpath="//*/td[1]") %>% html_text() , 
    facts =site %>% html_nodes(xpath="//*/td[2]") %>% html_text() , 
     stringsAsFactors=FALSE) 

stats$nm <- i 
stats$names <- gsub('[\r\n\t]', '', stats$names) 
stats$facts <- gsub('[\r\n\t]', '', stats$facts) 
#stats<-stats[!duplicated(stats),] 
all<-rbind(all,stats) 

} 
View(all)

Source

2015-01-09 03:15:46

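Does this really work for you? I am trying to do something similar, so I ran your code and got the following error: Error in rep(xi, length.out = nvar) : attempt to replicate an object of type 'builtin'. Did you initialize 'all' somewhere earlier?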
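A likely explanation for that error: unless something has been assigned to the name, 'all' refers to the base R function all(), so the first rbind(all, stats) tries to bind a function to a data frame. The sketch below uses made-up rows in place of scraped pages and a hypothetical accumulator named combined; it starts from NULL, which rbind() simply ignores.

```r
# `all` is a base R function until you assign something else to that name.
is.function(all)

# Made-up rows standing in for the result of scraping two pages.
pages <- list(
  data.frame(names = "Capital", facts = "Oslo", nm = "Norway",
             stringsAsFactors = FALSE),
  data.frame(names = "Capital", facts = "Stockholm", nm = "Sweden",
             stringsAsFactors = FALSE)
)

combined <- NULL  # explicit starting value; rbind() drops NULL arguments
for (p in pages) {
  combined <- rbind(combined, p)
}
nrow(combined)
```

Picking a name other than 'all' for the accumulator also avoids masking the built-in function.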

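This is what I did. It is not the best solution, but you will get an output. It is also only a workaround: I would not normally recommend writing table output to a file while a loop is running. Here you go. After the output is generated from stats,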

output<-rbind(stats,i)

then write the table,

write.table(output, file = "D:\\Documents\\HTML\\Test of loop.csv", row.names = FALSE, append = TRUE, sep = ",") 

#then close the loop 
}

Good luck
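One detail to watch with this workaround: if write.table() appends on every pass and also writes column names, the header row is repeated for each country. The sketch below is an assumption-laden illustration (the temporary file path, the chunks list and its contents are all made up); it writes the header only on the first pass and appends plain rows afterwards.

```r
out_file <- tempfile(fileext = ".csv")  # hypothetical output path

# Made-up per-country results standing in for the scraped `stats`.
chunks <- list(
  data.frame(names = "Capital", facts = "Oslo", nm = "Norway",
             stringsAsFactors = FALSE),
  data.frame(names = "Capital", facts = "Stockholm", nm = "Sweden",
             stringsAsFactors = FALSE)
)

for (chunk in chunks) {
  first_write <- !file.exists(out_file)
  write.table(chunk, file = out_file, sep = ",", row.names = FALSE,
              col.names = first_write,  # header only once
              append = !first_write)    # later passes add rows
}

result <- read.csv(out_file, stringsAsFactors = FALSE)
```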

Source

2016-09-20 12:58:59

Just initialize an empty data frame before the loop. I worked through this problem myself, and the code below works for me.

country<-c("Norway","Sweden","Finland","France","Greece","Italy","Spain") 
df <- data.frame(names = character(0),facts = character(0),nm = character(0)) 

for(i in country){ 

    site <- paste("http://www.countryreports.org/country/",i,".htm", sep="") 
    site <- html(site) 

    stats<- 
    data.frame(names =site %>% html_nodes(xpath="//*/td[1]") %>% html_text() , 
       facts =site %>% html_nodes(xpath="//*/td[2]") %>% html_text() , 
       stringsAsFactors=FALSE) 

    stats$nm <- i 
    stats$names <- gsub('[\r\n\t]', '', stats$names) 
    stats$facts <- gsub('[\r\n\t]', '', stats$facts) 
    #stats<-stats[!duplicated(stats),] 
    #all<-rbind(all,stats) 
    df <- rbind(df, stats) 
    #all <- merge(Output,stats) 

} 
View(df)
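An alternative to growing df inside the loop is to collect one data frame per country in a list and bind them once at the end. This is only a sketch: scrape_country() is a hypothetical helper, and its body fakes the scraped values so the example runs offline. In real use it would fetch and parse the page as in the loop above (in more recent rvest releases, read_html() takes the place of html()).

```r
country <- c("Norway", "Sweden", "Finland")

# Hypothetical stand-in for the scraping step; the values are made up.
scrape_country <- function(i) {
  stats <- data.frame(names = "Capital\r\n\t",
                      facts = paste0(i, " fact\n"),
                      stringsAsFactors = FALSE)
  stats$nm <- i
  # Same whitespace clean-up as in the original code.
  stats$names <- gsub("[\r\n\t]", "", stats$names)
  stats$facts <- gsub("[\r\n\t]", "", stats$facts)
  stats
}

pieces <- lapply(country, scrape_country)  # one data frame per country
df <- do.call(rbind, pieces)               # bind everything in one call
```

With this shape no accumulator needs to be initialized before the loop, so the 'all' problem cannot occur.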

Source

2018-01-08 05:44:18 Premal
