2016-12-21 57 views

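How to scrape all movie reviews from IMDB using rvest

I am new to web scraping and want to use it for sentiment analysis. I have successfully scraped the first 10 reviews. For the other 280 reviews, I am reluctant to repeat the following process more than 20 times. Is there a package or function that would let me scrape all the reviews in a simpler way? Thanks a lot!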

library(rvest) 
library(XML) 
library(plyr) 
HouseofCards_IMDb <- read_html("http://www.imdb.com/title/tt1856010/reviews?ref_=tt_urv") 

# Used SelectorGadget to find the CSS selector
reviews <- HouseofCards_IMDb %>%
    html_nodes("#pagecontent") %>%
    html_nodes("div+p") %>%
    html_text()

# Perform data cleaning on user reviews:
# flatten line breaks, replace punctuation with spaces, lower-case
reviews <- gsub("\r?\n|\r", " ", reviews)
reviews <- tolower(gsub("[^[:alnum:] ]", " ", reviews))
print(reviews)

Answer

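Welcome to SO.

If you go to the second page of reviews, you'll notice that the URL changes from http://www.imdb.com/title/tt1856010/reviews to http://www.imdb.com/title/tt1856010/reviews?start=10

The last page is: http://www.imdb.com/title/tt1856010/reviews?start=290

All you need to do is loop over the pages: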

library(rvest)

result <- c()
for (i in c(1, seq(10, 290, 10))) {
    link <- paste0("http://www.imdb.com/title/tt1856010/reviews?start=", i)
    HouseofCards_IMDb <- read_html(link)

    # Used SelectorGadget to find the CSS selector
    reviews <- HouseofCards_IMDb %>%
        html_nodes("#pagecontent") %>%
        html_nodes("div+p") %>%
        html_text()

    # Perform data cleaning on user reviews
    reviews <- gsub("\r?\n|\r", " ", reviews)
    reviews <- tolower(gsub("[^[:alnum:] ]", " ", reviews))

    # Append this page's reviews to the running result
    result <- c(result, reviews)
}

Note that we start from http://www.imdb.com/title/tt1856010/reviews?start=1, which is similar to http://www.imdb.com/title/tt1856010/reviews
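
If you prefer to avoid growing a vector inside a loop, the same idea can be sketched with small helper functions and lapply. The function names below (review_urls, clean_review, scrape_page) are hypothetical and only for illustration. The sketch assumes rvest is installed and that the "#pagecontent" and "div+p" selectors still match the page, which may no longer hold if IMDb has changed its layout.

```r
# Hypothetical helpers; the names are illustrative, not part of rvest.

# Build one URL per review page: start = 1, 10, 20, ..., last_start.
review_urls <- function(base_url, last_start = 290, step = 10) {
  starts <- c(1, seq(step, last_start, by = step))
  paste0(base_url, "?start=", starts)
}

# Same cleaning as in the loop: flatten line breaks, replace every
# character that is not alphanumeric or a space with a space, lower-case.
clean_review <- function(x) {
  x <- gsub("\r?\n|\r", " ", x)
  tolower(gsub("[^[:alnum:] ]", " ", x))
}

# Download one page and return its cleaned review texts.
# Assumption: the selectors from the question still match the page.
scrape_page <- function(url) {
  page  <- rvest::read_html(url)
  nodes <- rvest::html_nodes(rvest::html_nodes(page, "#pagecontent"), "div+p")
  clean_review(rvest::html_text(nodes))
}

urls <- review_urls("http://www.imdb.com/title/tt1856010/reviews")

# Not run here because it needs network access:
# result <- unlist(lapply(urls, scrape_page))
```

Keeping the URL construction and the cleaning step in separate functions lets you check them on a small input before sending any requests, and lapply collects the pages without repeatedly copying result.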