Loading a table from Wikipedia into R

I want to load the table of Supreme Court justices from the following URL into R: https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States

I am using the following code:

scotusURL <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States" 
scotusData <- getURL(scotusURL, ssl.verifypeer = FALSE) 
scotusDoc <- htmlParse(scotusData) 
scotusData <- scotusDoc['//table[@class="wikitable"]'] 
scotusTable <- readHTMLTable(scotusData[[1]], stringsAsFactors = FALSE)

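R returns scotusTable as NULL. The goal is to get a data.frame in R that I can use to build a ggplot of the SCOTUS justices' tenure on the court. I previously had this script working and it produced a nice plot, but after the recent decisions something changed on the page and now the script no longer runs. I went through the HTML on Wikipedia looking for changes, but I'm not a web developer, so whatever broke my script isn't immediately apparent to me.

Also, is there a way in R to cache the data from this page so that I'm not constantly hitting the URL? That seems like the ideal way to avoid this problem in the future. I appreciate the help.

As an aside, SCOTUS is an ongoing hobby/side project of mine, so if there is a better data source than Wikipedia, I'm all ears.

Edit: Sorry, I should have listed my dependencies. I'm using the XML, plyr, RCurl, data.table, and ggplot2 libraries.

Source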

2015-07-02 Benjamin Scott

Where does the 'getURL' function come from? – Frash

http://stackoverflow.com/questions/27843659/scraping-a-complex-html-table-into-a-data-frame-in-r – Khashaa

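As for your question, you could consider asking on the Open Data Stack Exchange site. – Frank

If you don't mind using a different package, you can try the "rvest" package.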

library(rvest)  
scotusURL <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States"

Option 1: Grab the tables from the page and use the html_table function to extract the one you're interested in.

temp <- scotusURL %>% 
    read_html %>%   # older rvest versions use html() here
    html_nodes("table")

html_table(temp[1]) ## Just the "legend" table
html_table(temp[2]) ## The table you're interested in

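Option 2: Inspect the table element and copy its XPath to read that table directly (right-click, Inspect Element, scroll to the relevant "table" tag, right-click it, and choose "Copy XPath").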
```
scotusURL %>% 
    read_html %>%   # older rvest versions use html() here
    html_nodes(xpath = '//*[@id="mw-content-text"]/table[2]') %>% 
    html_table
```

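Another option I like is to load the data into a Google spreadsheet and read it with the "googlesheets" package.

In Google Drive, create a new spreadsheet named "Supreme Court". In the first worksheet, enter: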

=importhtml("https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States", "table", 2)

This automatically pulls the table into the Google spreadsheet.

From there, in R, you can do:

library(googlesheets) 
SC <- gs_title("Supreme Court") 
gs_read(SC)

Source

2015-07-02 06:32:21 A5C1D2H2I1M1N2O1R2T1

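'temp = tempfile(); httr::GET(wurl, user_agent("Dogzilla"), write_disk(temp)); tables <- XML::readHTMLTable(temp); tables[[2]];' gives me the same table as the code above, but how do you clean up the years and so on? They're a mess. For example, Born/Died in the first row comes out as 174512121745-1829 when it should be 1745-1829. I don't know where the extra characters come from. – Frash

@Frash, I don't know how that happens, but it seems to be embedding the exact date (12/12/1745) in front of the years. – A5C1D2H2I1M1N2O1R2T1

You're right, the wiki page shows that date in edit mode. – Frash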
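
If the hidden date has already leaked into the text, one possible cleanup is a regular expression that keeps only the trailing year range. This is a minimal sketch in base R, assuming the stray digits always sit in front of a final four-digit year range, as in the 174512121745-1829 example above. `clean_years` is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: keep only a trailing "YYYY<separator>YYYY" range.
# Values that do not end in such a range are returned unchanged.
clean_years <- function(x) {
  sub("^.*([0-9]{4}[^0-9]+[0-9]{4})$", "\\1", x)
}

clean_years("174512121745-1829")  # "1745-1829"
```

Matching the separator as "anything that is not a digit" avoids having to care whether the page uses a hyphen or an en dash. Columns with a different layout, such as the active-service column, would need their own pattern.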

You can try this:

url <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States" 
library(rvest) #v 0.2.0.9000 
the_table <- read_html(url) %>% html_node("table.wikitable:nth-child(11)") %>% html_table()

Source

2015-07-02 06:33:41 RHertel

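If you have an older version of the 'rvest' package, you may need to replace 'read_html(url)' with 'html(url)'. – RHertel

For some reason the googlesheets dependency wouldn't work, so I just went through Google Sheets anyway.

I ran: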

=importhtml("https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States", "table", 2)

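and then downloaded the file as a .csv.

Not sure why I didn't think of that before. I'll have to rewrite my string-cleaning script, but this ended up being the best way to 1) solve the initial problem I had and 2) download the file so that I don't have to keep referencing the URL.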
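
Keeping a local copy can also be done from R itself. The sketch below is one possible pattern in base R, not the only way to do it. It assumes a hypothetical helper `cached_read()` and a cache file path of your choosing. The helper calls the supplied fetch function only when the cache file is missing, and otherwise reads the saved copy with `readRDS()`.

```r
# Hypothetical helper: run `fetch` once, save its result, and reuse the
# saved copy on later calls so the remote page is not requested again.
cached_read <- function(cache_file, fetch) {
  if (!file.exists(cache_file)) {
    saveRDS(fetch(), cache_file)
  }
  readRDS(cache_file)
}

# Stand-in for the real download step; in practice this could be a function
# that reads the exported .csv or scrapes the page and returns a data.frame.
fake_fetch <- function() {
  data.frame(Judge = c("John Jay", "John Rutledge"),
             State = c("NY", "SC"),
             stringsAsFactors = FALSE)
}

cache_file <- tempfile(fileext = ".rds")
first  <- cached_read(cache_file, fake_fetch)  # fetches and writes the cache
second <- cached_read(cache_file, stop)        # served from the cache; stop() is never called
identical(first, second)
```

Deleting the cache file forces a fresh fetch the next time the helper is called.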

Thanks for the help.

Source

2015-07-02 08:47:01

I would remove all the <span style="display:none"> nodes and read the table from scotusDoc, rather than trying to select on the table class value, which has changed.

scotusDoc <- htmlParse(scotusData, encoding="UTF-8") 
xpathSApply(scotusDoc, "//span[@style='display:none']", removeNodes) 
x <- readHTMLTable(scotusDoc, which=2,stringsAsFactors=FALSE) 

head(x) 
  #           Judge State Born/Died         Active service Chief Justice Retirement Appointed by Reason for\ntermination
1 1       John Jay†    NY 1745–1829   1789–1795(5–6 years)     1789–1795          —   Washington             Resignation
2 2   John Rutledge    SC 1739–1800   1789–1791(1–2 years)             —          —   Washington        Resignation[n 1]
3 3 William Cushing    MA 1732–1810 1789–1810(20–21 years)             —          —   Washington                   Death
4 4    James Wilson    PA 1742–1798   1789–1798(8–9 years)             —          —   Washington                   Death
5 5 John Blair, Jr.    VA 1732–1800   1789–1795(5–6 years)             —          —   Washington             Resignation
6 6   James Iredell    NC 1751–1799   1790–1799(8–9 years)             —          —   Washington                   Death

Here are the table classes, so the second table is now a "wikitable sortable":

xpathSApply(scotusDoc, "//table", xmlGetAttr, "class") 
[1] "wikitable"    "wikitable sortable"
[3] "navbox"       "nowraplinks collapsible autocollapse navbox-inner"
[5] "navbox"       "nowraplinks collapsible collapsed navbox-inner"
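
Because the class attribute is now "wikitable sortable" rather than "wikitable", an exact match on `@class` is fragile. As an alternative to selecting the table by position, a substring match with XPath's `contains()` is one option. The sketch below uses the XML package on a small inline HTML string, which is a made-up stand-in for the page and not the real Wikipedia markup, so it runs without a network connection.

```r
library(XML)

# Made-up miniature page: a legend table plus a sortable data table.
html_text <- '<html><body>
<table class="wikitable"><tr><th>Key</th></tr><tr><td>legend</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Judge</th><th>State</th></tr>
  <tr><td>John Jay</td><td>NY</td></tr>
  <tr><td>John Rutledge</td><td>SC</td></tr>
</table>
</body></html>'

doc <- htmlParse(html_text, asText = TRUE)

# An exact match on the class finds only the legend table ...
exact <- getNodeSet(doc, '//table[@class="wikitable"]')

# ... while contains() matches on part of the class attribute.
sortable <- getNodeSet(doc, '//table[contains(@class, "sortable")]')

justices <- readHTMLTable(sortable[[1]], stringsAsFactors = FALSE)
justices
```

On the real page the same XPath expression could be applied to scotusDoc, assuming the justices table is still the first one whose class contains "sortable".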

Source

2015-07-02 16:26:32
