2015-04-24

Extracting web links from a pop-up window

I need to get the web links of all the followers listed on the following page:

https://www.researchgate.net/topic/biotechnology

At the moment this topic has 206,770 followers. When I click the "see all" button, a pop-up window appears with a list that keeps growing as I scroll down.

https://www.researchgate.net/profile/Kestutis_Sasnauskas ...

The above is the link of the first follower in the list. Is there a way to get the web links of all 206,770 followers?

Answers


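This can be done with rvest and RSelenium. The latter is the one you really need; the former just makes your life easier. Install RSelenium from GitHub with devtools::install_github("ropensci/RSelenium"), and rvest from CRAN.

Here is the code you need to accomplish what you are looking for.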

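library(rvest)
library(RSelenium)

siteUrl <- "http://www.researchgate.net/"
GateUrl <- "http://www.researchgate.net/publictopics.KeywordFollowersPeopleList.html?view=dialog&showFollowButton=1&followEvent=tp_followers_xflw&keywordId=4f15497280e582373c000000&offset="

# Start a local Selenium server and open a browser session.
# Note: checkForServer() and startServer() are defunct in recent RSelenium
# releases; there the server has to be started another way, e.g. rsDriver().
checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open(silent = FALSE)

i <- 0              # offset appended to the follower-list URL
profileUrls <- c()

for (j in 1:3) {    # only three requests, as a demonstration
    print(j)
    remDrv$navigate(paste0(GateUrl, i))
    # Parse the rendered page source (read_html() replaces the deprecated html())
    l <- read_html(remDrv$getPageSource()[[1]])
    # Collect the href of every follower name and prepend the site URL
    profileUrls <- c(profileUrls,
        paste0(siteUrl, l %>% html_nodes(".display-name") %>% html_attr("href")))
    i <- length(profileUrls) + 1   # offset for the next request
}

remDrv$close()
profileUrls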

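A few things to note. First, you need to work out the range of the j loop. I think each URL returns 38 profiles, so j should be something like for(j in 1:ceiling(followers/38)). The sample run in the result section, however, returned 56 links from three requests, so check the actual page size first.

Second, the code is not efficient in how it stores the links, because it appends to the vector on every iteration. A better solution is to use lapply and unlist.

Finally, you need Mozilla Firefox on your machine, because that is what RSelenium uses by default, although you can configure it to use any of the popular browsers.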
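The first two points can be sketched together as follows. This is only an illustration under stated assumptions: fetch_profile_links() is a hypothetical stand-in for one navigate-and-parse step (it fabricates placeholder links rather than contacting the site), the page size of 38 is the guess made above, and fixed offsets are assumed to be valid for the server.

```r
# Hypothetical stand-in for one page request. A real version would call
# remDrv$navigate() and parse the page source, as in the loop above.
fetch_profile_links <- function(offset, per_page) {
  paste0("profile/user_", offset + seq_len(per_page))
}

followers <- 206770   # follower count quoted in the question
per_page  <- 38       # assumed page size; verify against a real response

# Number of requests needed to cover every follower, rounded up
n_requests <- ceiling(followers / per_page)

# Offsets 0, 38, 76, ... computed up front instead of inside the loop
offsets <- seq(from = 0, by = per_page, length.out = n_requests)

# Collect each page as a list element and flatten once at the end,
# instead of growing the vector on every iteration
pages <- lapply(head(offsets, 3), fetch_profile_links, per_page = per_page)
profileUrls <- unlist(pages)
```

Only the first three offsets are used here to keep the sketch small; dropping head() would iterate over all of them.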

Result for the first 56 profiles:

> profileUrls 
[1] "http://www.researchgate.net/profile/Jose_Carbajo2"   
[2] "http://www.researchgate.net/profile/Daniele_Riccio"   
[3] "http://www.researchgate.net/profile/Fiona_Togneri2"   
[4] "http://www.researchgate.net/profile/Sukanya_Patel"   
[5] "http://www.researchgate.net/profile/Neri_Fattorini"   
[6] "http://www.researchgate.net/profile/Pham_Thi_Thuy_Van"  
[7] "http://www.researchgate.net/profile/Kestutis_Sasnauskas"  
[8] "http://www.researchgate.net/profile/Iris_Weintal"    
[9] "http://www.researchgate.net/profile/Godelieve_Verhaegen"  
[10] "http://www.researchgate.net/profile/Janani_Venkatraman2"  
[11] "http://www.researchgate.net/profile/Kai_Wang126"    
[12] "http://www.researchgate.net/profile/Irine_Ronin"    
[13] "http://www.researchgate.net/profile/Natasha_Ikhsan"   
[14] "http://www.researchgate.net/profile/Nadya_Hajar"    
[15] "http://www.researchgate.net/profile/Gayatr_Venkataraman2"  
[16] "http://www.researchgate.net/profile/Amsha_Viraragavan"  
[17] "http://www.researchgate.net/profile/Wei_Leiyan"    
[18] "http://www.researchgate.net/profile/Yosuke_Inada"    
[19] "http://www.researchgate.net/profile/Nadya_Hajar"    
[20] "http://www.researchgate.net/profile/Gayatr_Venkataraman2"  
[21] "http://www.researchgate.net/profile/Amsha_Viraragavan"  
[22] "http://www.researchgate.net/profile/Wei_Leiyan"    
[23] "http://www.researchgate.net/profile/Yosuke_Inada"    
[24] "http://www.researchgate.net/profile/Yongning_You"    
[25] "http://www.researchgate.net/profile/Susan_Hu6"    
[26] "http://www.researchgate.net/profile/Matt_Evans11"    
[27] "http://www.researchgate.net/profile/Nam_Kieu"     
[28] "http://www.researchgate.net/profile/Nur_Musa3"    
[29] "http://www.researchgate.net/profile/Varaporn_S"    
[30] "http://www.researchgate.net/profile/Askar_Begzat3"   
[31] "http://www.researchgate.net/profile/Bing_Wang63"    
[32] "http://www.researchgate.net/profile/Xuebin_Yan"    
[33] "http://www.researchgate.net/profile/Roberto_Sibaja_Hernandez" 
[34] "http://www.researchgate.net/profile/Stephen_Heimann"   
[35] "http://www.researchgate.net/profile/Hanina_Hanifa"   
[36] "http://www.researchgate.net/profile/Bo_Wang143"    
[37] "http://www.researchgate.net/profile/Xuebin_Yan"    
[38] "http://www.researchgate.net/profile/Roberto_Sibaja_Hernandez" 
[39] "http://www.researchgate.net/profile/Stephen_Heimann"   
[40] "http://www.researchgate.net/profile/Hanina_Hanifa"   
[41] "http://www.researchgate.net/profile/Bo_Wang143"    
[42] "http://www.researchgate.net/profile/Huili_Li5"    
[43] "http://www.researchgate.net/profile/Giuseppe_Infusini"  
[44] "http://www.researchgate.net/profile/Carmen_Wacher"   
[45] "http://www.researchgate.net/profile/Linyn_Linyn"    
[46] "http://www.researchgate.net/profile/Dan_Youel"    
[47] "http://www.researchgate.net/profile/Catherine_Williams16"  
[48] "http://www.researchgate.net/profile/Nichole_Macaraeg"   
[49] "http://www.researchgate.net/profile/Peter_Oroszlan"   
[50] "http://www.researchgate.net/profile/Eduard_Karamov"   
[51] "http://www.researchgate.net/profile/Mauricio_Franco3"   
[52] "http://www.researchgate.net/profile/Patricia_Zancan"   
[53] "http://www.researchgate.net/profile/Rohana_Dassanayake"  
[54] "http://www.researchgate.net/profile/Khadija_Khataby"   
[55] "http://www.researchgate.net/profile/Imane_Moest"    
[56] "http://www.researchgate.net/profile/Rory_Adey" 
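The output above contains repeated entries (for example, Nadya_Hajar appears at positions 14 and 19), which suggests that consecutive requests overlapped. A minimal sketch of removing such repeats with base R, using three of the links from the output as sample data:

```r
# Sample data taken from the output above, including one repeated link
profileUrls <- c(
  "http://www.researchgate.net/profile/Nadya_Hajar",
  "http://www.researchgate.net/profile/Wei_Leiyan",
  "http://www.researchgate.net/profile/Nadya_Hajar"
)

# Count the repeats, then keep only the first occurrence of each link
n_repeats  <- sum(duplicated(profileUrls))
uniqueUrls <- unique(profileUrls)
```

This only cleans up the result; it does not explain why the pages overlap, so the offset arithmetic may still need adjusting.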

As an alternative to RSelenium, you can try something like this (using the first 56 followers as an example):

library(XML)
library(jsonlite)

# Offsets 1, 19 and 37, i.e. three requests
offsets <- seq(from = 1, to = 50, by = 18)
urls <- sprintf("http://www.researchgate.net/publictopics.KeywordFollowersPeopleList.html?view=dialog&showFollowButton=1&followEvent=tp_followers_xflw&keywordId=4f15497280e582373c000000&offset=%d", offsets)

df <- data.frame()
for (x in seq_along(urls)) {
    doc <- htmlParse(urls[x])
    # The fifth <script> element holds the widget definitions
    script <- as(doc[['//script[5]']], "character")
    # Split the script into one chunk per createInitialWidget() call
    splits <- strsplit(script, '\\(function\\(\\)\\{Y\\.rg\\.createInitialWidget\\("[^\"]+",')[[1]][-1]
    res <- lapply(splits, function(split) {
        # Strip the trailing call wrapper, then parse what is left as JSON
        split <- sub(");})();\n", "", split, fixed = TRUE)
        res <- try(as.data.frame(t(unlist(fromJSON(gsub("\\\\", "", split))))), silent = TRUE)
        if (!inherits(res, "try-error")) return(res) else return(NULL)
    })
    # Drop the last two parsed widgets and stack the rest row-wise
    df <- rbind(df, do.call(rbind, res[1:(length(res)-2)]))
}
dplyr::glimpse(df)
# Observations: 56 
# Variables: 
# $ _isReact               (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F... 
# $ data.displayName             (fctr) Jose Maria Carbajo, Daniele Riccio, Fiona S Togneri, Sukanya Paramashivaiah Patel, Neri Fattorini, Pham thi thuy van, Kestutis Sasnauskas, Iris Weintal, Godelieve Verhaegen, Ja... 
# $ data.profile.professionalInstitution.professionalInstitutionName (fctr) Instituto Nacional de Investigaciu00f3n y Tecnologu00eda Agraria y Alimentaria, University of Milan, Birmingham Women's NHS Foundation Trust, Himalya drug company, University o... 
# $ data.profile.professionalInstitution.professionalInstitutionUrl (fctr) institution/Instituto_Nacional_de_Investigaciones_y_Experiencias_Agronomicas_y_Forestales, institution/University_of_Milan, institution/Birmingham_Womens_NHS_Foundation_Trust, ... 
# $ data.professionalInstitutionName         (fctr) Instituto Nacional de Investigaciu00f3n y Tecnologu00eda Agraria y Alimentaria, University of Milan, Birmingham Women's NHS Foundation Trust, Himalya drug company, University o... 
# $ data.professionalInstitutionUrl         (fctr) institution/Instituto_Nacional_de_Investigaciones_y_Experiencias_Agronomicas_y_Forestales, institution/University_of_Milan, institution/Birmingham_Womens_NHS_Foundation_Trust, ... 
# $ data.url               (fctr) profile/Jose_Carbajo2, profile/Daniele_Riccio, profile/Fiona_Togneri2, profile/Sukanya_Patel, profile/Neri_Fattorini, profile/Pham_Thi_Thuy_Van, profile/Kestutis_Sasnauskas, pr... 
# $ data.imageUrl             (fctr) http://c1.rgstatic.net/m/797670414832/images/template/default/profile/profile_default_m.jpg, http://i1.rgstatic.net/i/profile/54a1a5539f8e2f289f_m_25d91.jpg, http://i1.rgstatic... 
# $ data.imageSize             (fctr) m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m 
# $ data.imageHeight             (fctr) 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, ... 
# $ data.imageWidth             (fctr) 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, ... 
# $ data.enableFollowButton           (fctr) TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR... 
# $ data.enableHideButton           (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F... 
# $ data.enableConnectionButton          (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F... 
# $ data.isClaimedAuthor            (fctr) TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR... 
# $ data.hasExtraContainer           (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F... 
# $ data.showStatsWidgets           (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F... 
# $ data.showHideButton            (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F... 
# $ data.accountKey             (fctr) Jose_Carbajo2, Daniele_Riccio, Fiona_Togneri2, Sukanya_Patel, Neri_Fattorini, Pham_Thi_Thuy_Van, Kestutis_Sasnauskas, Iris_Weintal, Godelieve_Verhaegen, Janani_Venkatraman2, Ka... 
# $ data.hasInfoPopup            (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F... 
# $ data.hasTeaserPopup            (fctr) TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR... 
# $ data.widgetId             (fctr) rgw3_5539fc8299ef4, rgw4_5539fc8299ef4, rgw5_5539fc8299ef4, rgw6_5539fc8299ef4, rgw7_5539fc8299ef4, rgw8_5539fc8299ef4, rgw9_5539fc8299ef4, rgw10_5539fc8299ef4, rgw11_5539fc829... 
# $ id                (fctr) rgw3_5539fc8299ef4, rgw4_5539fc8299ef4, rgw5_5539fc8299ef4, rgw6_5539fc8299ef4, rgw7_5539fc8299ef4, rgw8_5539fc8299ef4, rgw9_5539fc8299ef4, rgw10_5539fc8299ef4, rgw11_5539fc829... 
# $ templateName              (fctr) application/stubs/PeopleItem.html, application/stubs/PeopleItem.html, application/stubs/PeopleItem.html, application/stubs/PeopleItem.html, application/stubs/PeopleItem.html, a... 
# $ templateExtensions            (fctr) generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, ... 
# $ widgetUrl              (fctr) http://www.researchgate.net/application.PeopleAccountItem.html?entityId=7508014&imageSize=m&enableFollowButton=1&showHideButton=0&showConnectionButton=0&event=tp_followers_xflw... 
# $ viewClass              (fctr) views.application.PeopleItemView, views.application.PeopleItemView, views.application.PeopleItemView, views.application.PeopleItemView, views.application.PeopleItemView, views.... 
# $ yuiModules              (fctr) rg.views.application.PeopleItemView, rg.views.application.PeopleItemView, rg.views.application.PeopleItemView, rg.views.application.PeopleItemView, rg.views.application.PeopleI... 
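The data.url column holds relative paths such as profile/Jose_Carbajo2, so the full links still have to be built. A minimal sketch, assuming the same site prefix as in the first answer and a small stand-in data frame (two rows copied from the output above) in place of the real df:

```r
siteUrl <- "http://www.researchgate.net/"

# Stand-in for the scraped data frame; the real df has one row per follower
df <- data.frame(
  data.url = c("profile/Jose_Carbajo2", "profile/Daniele_Riccio"),
  stringsAsFactors = TRUE   # the glimpse() output shows factor columns
)

# Convert the factor to character before pasting, then prepend the site URL
profileUrls <- paste0(siteUrl, as.character(df$data.url))
```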

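The server returns the data as JSON if you ask for it. Each subsequent call uses the offset parameter supplied by the previous JSON response. In the example below I only request the first 10 offsets, which is equivalent to scrolling down 10 times. The responses contain more data than just the profile links: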

library(RCurl)
library(XML)
library(jsonlite)

# Get the initial topic page and read the follower count from the "see all" link
initURL <- "http://www.researchgate.net/topic/biotechnology"
doc <- htmlParse(initURL)
noFollowers <- doc["//*/strong/*/a[@class='js-see-all']", fun = xmlValue][[1]]
noFollowers <- as.integer(gsub("[^0-9]", "", noFollowers))

appURL <- "http://www.researchgate.net/publictopics.KeywordFollowersPeopleList.html?view=dialog&showFollowButton=1&followEvent=tp_followers_xflw&keywordId=4f15497280e582373c000000"
# Request JSON instead of HTML through the Accept header
appData <- getURL(appURL,
    httpheader = c(accept = "application/json"))
follData <- list(fromJSON(appData)$result$data$content$data$listItems)
for (i in 1:10) {
    # Each response carries the offset to use for the next request
    nextURL <- fromJSON(appData)$result$data$nextOffset
    appData <- getURL(paste0(appURL, "&offset=", nextURL),
        httpheader = c(accept = "application/json"))
    follData[[i+1]] <- fromJSON(appData)$result$data$content$data$listItems
}
# Pull the relative profile URL out of every page and drop missing values
followers <- na.omit(do.call(c, lapply(follData, function(x) x$data$url)))
> head(followers) 
[1] "profile/Subhashish_Dutta" "profile/Jerome_Wang3"  "profile/Jose_Carbajo2" 
[4] "profile/Daniele_Riccio" "profile/Fiona_Togneri2" "profile/Sukanya_Patel"
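To go beyond ten pages, the same idea can be wrapped in a loop that follows the returned offset until no further offset is given. The sketch below shows only the control flow: fake_fetch() is a hypothetical stand-in for the getURL() and fromJSON() step (it generates placeholder links locally), and the assumption that the last response has no next offset has not been checked against the real server.

```r
# Hypothetical stand-in for one JSON request: returns a page of links
# plus the offset for the next request (NULL on the last page)
fake_fetch <- function(offset, total = 7, page_size = 3) {
  idx <- seq(offset + 1, min(offset + page_size, total))
  next_offset <- if (max(idx) < total) max(idx) else NULL
  list(urls = paste0("profile/user_", idx), nextOffset = next_offset)
}

# Follow nextOffset from page to page; max_pages guards against endless loops
collect_all <- function(fetch, max_pages = 100) {
  pages <- list()
  offset <- 0
  for (k in seq_len(max_pages)) {
    page <- fetch(offset)
    pages[[k]] <- page$urls
    if (is.null(page$nextOffset)) break
    offset <- page$nextOffset
  }
  unlist(pages)
}

followers <- collect_all(fake_fetch)
```

With more than 200,000 followers a real run would need several thousand requests, so a pause between requests (for example Sys.sleep()) is worth considering.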