2017-09-07

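How can I get the "Categories" from Wikipedia using rvest in R?

I am trying to get the categories (the box at the very bottom of a Wikipedia page) using rvest in R. I used SelectorGadget to identify the HTML node that holds the categories. The code I am using is below.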

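library(rvest)

# Download and parse the article
thepage <- read_html("https://en.wikipedia.org/wiki/San_Diego")
# Text of the whole category box, returned as a single string
Categories <- thepage %>%
  html_nodes("#mw-normal-catlinks") %>%
  html_text()
Categories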

The result is as follows:

"Categories: San Diego1769 establishments in California1850 establishments in CaliforniaCities in San Diego County, CaliforniaCounty seats in CaliforniaIncorporated cities and towns in CaliforniaPopulated coastal places in CaliforniaPopulated places established in 1769San Antonio-San Diego Mail LineSan Diego County, CaliforniaSan Diego metropolitan areaSpanish mission settlements in North AmericaSpecial economic zones of the United StatesStagecoach stops in the United States" 

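As you can see, there is no separator between the categories. The first category is "San Diego", the second is "1769 establishments in California", and so on. How can I get these categories as a list, or separate them in some other way?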

Answer

Each category is a list item, so you need to select the items inside the list rather than the box that contains them:

thepage %>% 
    html_nodes(".mw-normal-catlinks ul li") %>% 
    html_text() 

[1] "San Diego"         "1769 establishments in California"   
[3] "1850 establishments in California"   "Cities in San Diego County, California"  
[5] "County seats in California"     "Incorporated cities and towns in California" 
[7] "Populated coastal places in California"  "Populated places established in 1769"   
[9] "San Antonio-San Diego Mail Line"    "San Diego County, California"     
[11] "San Diego metropolitan area"     "Spanish mission settlements in North America" 
[13] "Special economic zones of the United States" "Stagecoach stops in the United States" 
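
To see why selecting the list items helps, here is a minimal offline sketch. The HTML in `snippet` is a hypothetical, trimmed-down stand-in for Wikipedia's category box (the real markup may differ in detail), and the variable names are only illustrative. `html_text()` returns one string per matched node, so matching each `li` yields one string per category. The same idea with `html_attr("href")` on the links should give the category URLs.

```r
library(rvest)

# Hypothetical stand-in for the category box; not copied from the live page
snippet <- '<div id="mw-normal-catlinks" class="mw-normal-catlinks">
  <a href="/wiki/Help:Category">Categories</a>:
  <ul>
    <li><a href="/wiki/Category:San_Diego">San Diego</a></li>
    <li><a href="/wiki/Category:County_seats_in_California">County seats in California</a></li>
  </ul>
</div>'
page <- read_html(snippet)

# One node per <li>, so html_text() returns one string per category
cats <- page %>%
  html_nodes("#mw-normal-catlinks ul li") %>%
  html_text()

# Link targets of the same list items
links <- page %>%
  html_nodes("#mw-normal-catlinks ul li a") %>%
  html_attr("href")

print(cats)
print(links)
```

The links are relative here, so they would need the site's base URL prepended before being fetched.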