How to get a table using rvest()

I want to scrape some data from the Pro Football Reference website using the rvest package. First, let's grab the results of all games played in 2015 from this URL: http://www.pro-football-reference.com/years/2015/games.htm

library("rvest") 
library("dplyr") 

#grab table info 
url <- "http://www.pro-football-reference.com/years/2015/games.htm" 
urlHtml <- url %>% read_html() 
dat <- urlHtml %>% html_table(header=TRUE) %>% .[[1]] %>% as_tibble()

That gives us the results of every game played in 2015.

dat could use a little cleaning up. Two of the variables seem to have blanks for names. Also, the header row is repeated between each week.

colnames(dat) <- c("week", "day", "date", "winner", "at", "loser", 
        "box", "ptsW", "ptsL", "ydsW", "toW", "ydsL", "toL") 

dat2 <- dat %>% filter(!(box == "")) 
head(dat2)

Looks good!
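
As an aside on the cleanup step, here is a minimal sketch of an alternative: drop the repeated header rows by matching the header label itself, then convert the score column to integers. The miniature table, the team names, the scores, and the shortened column list are all made up for illustration (the real page has 13 columns), so treat this as an assumption-laden sketch rather than a drop-in replacement.

```r
library(rvest)
library(dplyr)

# Hypothetical miniature of the games table (made-up rows, not real results):
# one header row, one game, the header repeated, then another game.
page <- read_html('
  <table>
    <tr><th>Week</th><th>Winner</th><th></th><th>Loser</th><th>PtsW</th></tr>
    <tr><td>1</td><td>Team A</td><td></td><td>Team B</td><td>28</td></tr>
    <tr><td>Week</td><td>Winner</td><td></td><td>Loser</td><td>PtsW</td></tr>
    <tr><td>2</td><td>Team C</td><td>@</td><td>Team D</td><td>31</td></tr>
  </table>')

raw <- page %>% html_node("table") %>% html_table()

# Name every column first, since one header cell is blank.
colnames(raw) <- c("week", "winner", "at", "loser", "ptsW")

games <- raw %>%
  filter(week != "Week") %>%                       # drop the repeated header row
  mutate(week = as.integer(week),                  # columns were read as text because
         ptsW = as.integer(ptsW))                  # the header row was mixed in
```

Filtering on the header label does not depend on any particular column being blank in the repeated rows, which is the assumption behind filtering on box == "".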

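Now let's look at an individual game. On the page above, click "Boxscore" in the first row of the table: the September 10 game between New England and Pittsburgh. That takes us here: http://www.pro-football-reference.com/boxscores/201509100nwe.htm.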

I want to grab the individual snap counts for each player (about halfway down the page). Pretty sure these will be our first two lines of code:

gameUrl <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm" 
gameHtml <- gameUrl %>% read_html()

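But now I can't figure out how to grab the specific table I want. I used Selector Gadget to highlight the Patriots snap counts table. I did this by clicking on the table in several places and then "unclicking" the other tables that were highlighted. I ended up with this path: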

#home_snap_counts .right , #home_snap_counts .left, #home_snap_counts .left, #home_snap_counts .tooltip, #home_snap_counts .left

Each of these attempts returns {xml_nodeset (0)}:

gameHtml %>% html_nodes("#home_snap_counts .right , #home_snap_counts .left, #home_snap_counts .left, #home_snap_counts .tooltip, #home_snap_counts .left") 
gameHtml %>% html_nodes("#home_snap_counts .right , #home_snap_counts .left") 
gameHtml %>% html_nodes("#home_snap_counts .right") 
gameHtml %>% html_nodes("#home_snap_counts")

Maybe let's try using xpath. All of these attempts also return {xml_nodeset (0)}:

gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]//*[contains(concat(" ", @class, " "), concat(" ", "right", " "))] | //*[(@id = "home_snap_counts")]//*[contains(concat(" ", @class, " "), concat(" ", "left", " "))]//*[(@id = "home_snap_counts")]//*[contains(concat(" ", @class, " "), concat(" ", "left", " "))]//*[(@id = "home_snap_counts")]//*[contains(concat(" ", @class, " "), concat(" ", "tooltip", " "))]//*[(@id = "home_snap_counts")]//*[contains(concat(" ", @class, " "), concat(" ", "left", " "))]') 
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]//*[contains(concat(" ", @class, " "))]') 
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]')

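How can I grab this table? I'll also point out that when I view the page source in Google Chrome, the tables I want almost seem to be commented out. That is, they are printed in green rather than the usual red/black/blue color scheme. That is not the case for the game results table we pulled first: its "View Page Source" shows the usual red/black/blue scheme. Does the green mean something that is preventing me from grabbing this snap count table?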

Thanks!

Source

2016-08-30 hossibley

url <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm#all_vis_snap_counts"
snap.counts <- url %>% read_html() %>% html_nodes(xpath = '//*[contains(concat(" ", @class, " "), concat(" ", "table_container", " "))]')

returns a list with one element (i.e. {xml_nodeset (1)}), but I can't seem to convert it into a table using html_table(fill = TRUE) –

'http://www.pro-football-reference.com/boxscores/201509100nwe.htm' %>% read_html() %>% html_nodes(xpath = '//comment()') %>% html_text() %>% paste(collapse = '') %>% read_html() %>% html_node('table#home_snap_counts') %>% html_table() %>% {setNames(.[-1, ], paste0(names(.), .[1, ]))} %>% readr::type_convert()

– alistaire
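
Unpacked into separate steps, the idea in the one-liner above might look like the following sketch. It runs on a small inline page instead of the live site. The page, the player names, and the numbers are hypothetical stand-ins, and the two-row header that the setNames() step merges on the real table is left out to keep the example short.

```r
library(rvest)

# Hypothetical stand-in for the box score page: the table is wrapped in an
# HTML comment, so the parser stores it as comment text, not as table nodes.
page <- read_html('
<html><body>
  <div id="all_home_snap_counts">
    <!--
    <table id="home_snap_counts">
      <tr><th>Player</th><th>Pos</th><th>Num</th><th>Pct</th></tr>
      <tr><td>Player A</td><td>QB</td><td>60</td><td>100%</td></tr>
      <tr><td>Player B</td><td>WR</td><td>45</td><td>75%</td></tr>
    </table>
    -->
  </div>
</body></html>')

# The CSS selector matches nothing, as in the question.
n_direct <- length(html_nodes(page, "#home_snap_counts"))

# Step 1: collect every comment node and take its text content.
comment_text <- page %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "")

# Step 2: parse that text as HTML in its own right, then select the table.
snaps <- comment_text %>%
  read_html() %>%
  html_node("table#home_snap_counts") %>%
  html_table()
```

Here n_direct is 0 while snaps holds the two player rows, which matches the symptom described in the question: the table is present in the source but invisible to selectors until the comment text is parsed on its own.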

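The information you are looking for is displayed programmatically at run time. One solution is to use RSelenium.
When you view the page source, the table data is present in the code but hidden, because the tables are stored as comments. Here is my solution, in which I remove the comment markers and reprocess the page normally.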

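I save the file to the working directory and then read it back in with the readLines function. Next I search for the HTML begin and end comment markers and remove them. I then save the file again (minus the comment markers) so it can be re-read and processed for the selected tables.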

gameUrl <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm" 
gameHtml <- gameUrl %>% read_html() 
gameHtml %>% html_nodes("tbody") 

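#Only save and work with the body
body<-html_node(gameHtml,"body")
#write_xml comes from xml2, which newer rvest versions no longer attach
xml2::write_xml(body, "nfl.xml")

#Find and remove the lines holding the comment markers
#(grepl keeps every line when no marker is found; lines[-grep()] would drop them all)
lines<-readLines("nfl.xml")
lines<-lines[!grepl("<!--", lines, fixed=TRUE)]
lines<-lines[!grepl("-->", lines, fixed=TRUE)]
writeLines(lines, "nfl2.xml")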

#Read the file back in and process normally 
body<-read_html("nfl2.xml") 
html_table(html_nodes(body, "table")[29]) 

#extract each table's id attribute (looked up by name, not by position)
tables<-html_nodes(body, "table")
ids<-html_attr(tables, "id")

#find the tables of interest.
homesnap<-which(ids=="home_snap_counts")
html_table(tables[homesnap])

visitsnap<-which(ids=="vis_snap_counts")
html_table(tables[visitsnap])
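
The same idea can probably be done without writing temporary files. This is a minimal sketch that assumes it is acceptable to strip every comment marker in the document, which would also expose any genuine comments as plain text. The inline page, player names, and numbers are hypothetical stand-ins for the real box score.

```r
library(rvest)

# Hypothetical stand-in for the box score page, with both snap count tables
# wrapped in HTML comments.
page <- read_html('
<html><body>
  <!--
  <table id="home_snap_counts">
    <tr><th>Player</th><th>Num</th></tr>
    <tr><td>Player A</td><td>60</td></tr>
  </table>
  -->
  <!--
  <table id="vis_snap_counts">
    <tr><th>Player</th><th>Num</th></tr>
    <tr><td>Player B</td><td>55</td></tr>
  </table>
  -->
</body></html>')

# Serialize the parsed page to one string and delete the comment markers.
uncommented <- gsub("<!--|-->", "", as.character(page))

# Parse the cleaned string; the tables are now ordinary nodes.
body <- read_html(uncommented)
tables <- html_nodes(body, "table")
ids <- html_attr(tables, "id")

home <- html_table(tables[which(ids == "home_snap_counts")])[[1]]
visit <- html_table(tables[which(ids == "vis_snap_counts")])[[1]]
```

Deleting only the markers, rather than whole lines, does not rely on each marker sitting on a line of its own.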

Source

2016-08-30 23:22:35 Dave2e

Thanks Dave! Nice solution. – hossibley
