RSelenium web scraping: applying it as a function

I have been trying to solve this all day and cannot find a solution. Please help! To learn web scraping, I have been practicing on this website:

https://www.net-a-porter.com/fr/fr/Shop/Designers/Fendi

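The goal is to scrape the price of every product. Thanks to resources on this site and from other internet users, I wrote this code, which works perfectly: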

option <- remDr$findElement(using = 'xpath', "//*/option[@value = 'view_all']") 
option$clickElement() 
priceNodes <- remDr$findElements(using = 'css selector', ".price") 
price<-unlist(lapply(priceNodes, function(x){x$getElementText()})) 
price<-gsub("€","",price) 
price<-gsub(",","",price) 
price <- as.numeric(price)


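So I get the result I want, which is a list of 204 prices. Now I want to turn the whole process into a function so that I can apply it to a list of addresses (in this case, other brands). Obviously it does not work: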

FPrice <- function(x) { 
    url1 <- x 
    remDr <- rD$client 
    remDr$navigate(url1) 
    iframe <- remDr$findElement("css", value=".view-more-less") 
    option <- remDr$findElement(using = 'xpath', "//*/option[@value = 'view_all']") 
    option$clickElement() 
    priceNodes <- remDr$findElements(using = 'css selector', ".price") 
    price<-unlist(lapply(priceNodes, function(x){x$getElementText()})) 
    }

When I call it like this:

FPrice("https://www.net-a-porter.com/fr/fr/Shop/Designers/Fendi")

I get an error message that I do not understand instead of the data I am looking for:

Selenium message:stale element reference: element is not attached to the page document 
     (Session info: chrome=61.0.3163.100) 
     (Driver info: chromedriver=2.33.506106 (8a06c39c4582fbfbab6966dbb1c38a9173bfb1a2),platform=Mac OS X 10.12.6 x86_64)

I think it is because there is a function inside the function... Can anyone help me solve this? Thanks.

P.S. I also wrote another version with rvest:

Price <- function(x) {
    url1 <- x
    webpage <- read_html(url1)
    price_data_html <- html_nodes(webpage, ".price")
    price_data <- html_text(price_data_html)
    price_data <- gsub("€", "", price_data)
    price_data <- gsub(",", "", price_data)
    price_data <- as.numeric(price_data)
    return(price_data)
}

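It works fine. I even applied it to a vector containing a list of addresses. However, with rvest I cannot drive the browser to select the "show all" option, so I only get 60 observations, while some brands offer more than 200 products, as Fendi does.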

Thank you very much for your patience. I hope to hear from you soon!

Source

2017-10-21 Mouloune van Muzha

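Surprisingly (I verified this), the site does not explicitly prohibit scraping in its Terms & Conditions, and they left the /fr/fr path out of their robots.txt exclusions. In other words, you got lucky. It is probably an oversight on their part.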

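However, there is a non-Selenium approach. The main page loads the product <div>s via XHR calls, so by inspecting the "Network" tab of the browser developer tools you can scrape either page by page or all at once. Here are the packages we need: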

library(httr)
library(rvest)
library(xml2) # xml_integer(); recent rvest versions no longer attach xml2
library(purrr)

For the paginated approach, we build a function:

get_prices_on_page <- function(pg_num = 1) {

  Sys.sleep(5) # be kind

  # Request one page of the product listing (60 items per page)
  GET(
    url = "https://www.net-a-porter.com/fr/fr/Shop/Designers/Fendi",
    query = list(
      view = "jsp",
      sale = "0",
      exclude = TRUE,
      pn = pg_num,
      npp = 60,
      image_view = "product",
      dScroll = "0"
    )
  ) -> res

  pg <- content(res, as = "parsed")

  # Return the paging metadata along with the cleaned prices
  list(
    total_pgs = html_node(pg, "div.data_totalPages") %>% xml_integer(),
    total_items = html_node(pg, "div.data_totalItems") %>% xml_integer(),
    prices_on_page = html_nodes(pg, "span.price") %>%
      html_text() %>%
      gsub("[^[:digit:]]", "", .) %>%
      as.numeric()
  )

}
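The price-cleaning step in the function above can be tried on its own. The following is a minimal sketch: `clean_price` is a hypothetical helper name, and the sample strings are made up rather than taken from the site. Note that because every non-digit character is dropped, a price written with cents (for example "12.50") would come out as 1250, so this only suits whole-number prices.

```r
# Strip everything that is not a digit, then convert to numeric.
# clean_price is a hypothetical helper, not part of any package.
clean_price <- function(x) {
  as.numeric(gsub("[^[:digit:]]", "", x))
}

# Made-up sample strings (currency sign before or after, thousands separator)
samples <- c("\u20ac1,190", "590 \u20ac", "\u20ac 18,500")
clean_price(samples)
```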

Then get the first page:

prices <- get_prices_on_page(1)

And then iterate until we are done, putting everything together:

c(prices$prices_on_page, map(2:prices$total_pgs, get_prices_on_page) %>% 
    map("prices_on_page") %>% 
    flatten_dbl()) -> all_prices 

all_prices 
## [1] 601 1190 1700 1480 1300 590 950 1590 3200 410 950 595 1100 690 
## [15] 900 780 2200 790 1300 410 1000 1480 750 495 850 850 900 450 
## [29] 1600 1750 2200 750 750 1550 750 850 1900 1190 1200 1650 2500 580 
## [43] 2000 2700 3900 1900 600 1200 650 950 600 800 1100 1200 1000 1100 
## [57] 2500 1000 500 1645 550 1505 850 1505 850 2000 400 790 950 800 
## [71] 500 2000 500 1300 350 550 290 550 450 2700 2200 650 250 200 
## [85] 1700 250 250 300 450 800 800 800 900 600 900 375 5500 6400 
## [99] 1450 3300 2350 1390 2700 1500 1790 2200 3500 3100 1390 1850 5000 1690 
## [113] 2700 4800 3500 6200 3100 1850 1950 3500 1780 2000 1550 1280 3200 1350 
## [127] 2700 1350 1980 3900 1580 18500 1850 1550 1450 1600 1780 1300 1980 1450 
## [141] 1320 1460 850 1650 290 190 520 190 1350 290 850 900 480 450 
## [155] 850 780 1850 750 450 1100 1550 550 495 850 890 850 590 595 
## [169] 650 650 495 595 330 480 400 220 130 130 290 130 250 230 
## [183] 210 900 380 340 430 380 370 390 460 255 300 480 550 410 
## [197] 350 350 280 190 350 550 450 430
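One thing to watch with `2:prices$total_pgs`: if a brand has only one page, `2:1` evaluates to `c(2, 1)` and pages would be requested that should not be. A possible guard is sketched below, assuming a page function that returns a list with `total_pgs` and `prices_on_page` as above. `collect_prices` is a hypothetical name, and the page function is passed in as an argument so the logic can be tried without touching the network.

```r
library(purrr)

# Collect prices from all pages, given a function that fetches one page.
# collect_prices is a hypothetical helper; get_page must return a list
# with elements total_pgs and prices_on_page.
collect_prices <- function(get_page) {
  first <- get_page(1)
  # seq_len(n)[-1] is 2..n, and empty when there is only one page
  rest_pages <- seq_len(first$total_pgs)[-1]
  rest <- map(rest_pages, get_page) %>%
    map("prices_on_page") %>%
    unlist()
  c(first$prices_on_page, rest)
}

# Stand-in page function for illustration only (no HTTP request is made)
fake_single_page <- function(pg) {
  list(total_pgs = 1L, prices_on_page = c(10, 20))
}
collect_prices(fake_single_page)
```

With the real function this would be called as `collect_prices(get_prices_on_page)`.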

Alternatively, we can get them all in one fell swoop by using the site's "view all on one page" feature:

pg <- read_html("https://www.net-a-porter.com/fr/fr/Shop/Designers/Fendi?view=jsp&sale=0&exclude=true&pn=1&npp=view_all&image_view=product&dScroll=0") 
html_nodes(pg, "span.price") %>% 
    html_text() %>% 
    gsub("[^[:digit:]]", "", .) %>% 
    as.numeric() -> all_prices 

all_prices 
# same result as above
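Since the question was about applying this to a list of addresses, the one-shot version could be wrapped in a function along these lines. This is only a sketch under assumptions: `view_all_url` and `get_all_prices` are hypothetical names, it assumes other designer pages accept the same query parameters as the Fendi page (not verified here), and the `fetch` and `delay` arguments exist only so the parsing can be tried on a local HTML string.

```r
library(httr)
library(rvest)
library(purrr)

# Append the "view all" query parameters to a listing URL.
view_all_url <- function(url) {
  modify_url(url, query = list(
    view = "jsp", sale = "0", exclude = "true", pn = "1",
    npp = "view_all", image_view = "product", dScroll = "0"
  ))
}

# Fetch one listing page and return its prices as numbers.
# fetch defaults to read_html; delay is the pause before each request.
get_all_prices <- function(url, fetch = read_html, delay = 5) {
  Sys.sleep(delay) # be kind
  pg <- fetch(view_all_url(url))
  html_nodes(pg, "span.price") %>%
    html_text() %>%
    gsub("[^[:digit:]]", "", .) %>%
    as.numeric()
}

# Only the Fendi address comes from the question; add other brands here.
designer_urls <- c(
  Fendi = "https://www.net-a-porter.com/fr/fr/Shop/Designers/Fendi"
)

# Not run here, since it hits the site:
# prices_by_brand <- map(designer_urls, get_all_prices)
```

`map()` keeps the names of `designer_urls`, so the result would be a named list with one numeric vector per brand.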

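Please keep the crawl delay if you use the paginated approach, and please do not abuse the content. Although they do not disallow scraping, the content is for personal product-selection use only.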

Source

2017-10-21 17:03:49 hrbrmstr
