Extracting a URL and title in R

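I'm having trouble extracting a specific piece of text from a website's source code. I can extract the whole list, but I only need one country, for example Argentina.

The source code is: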

<div class="article-content"> 
            <div class="RichTextElement"> 
             <div><h3 style="background-color: transparent; color: rgb(51, 51, 51);"><span style="font-weight: normal; font-family: Verdana;">Afghanistan - </span><span style="background-color: transparent; font-weight: normal; font-family: Verdana;"><a title="Tax Authority in Afganistan" href="http://mof.gov.af/en" style="background-color: transparent; color: rgb(51, 51, 51);">Ministry of Finance</a><br />Argentina - <a title="Tax Authority in Argentina" href="http://www.afip.gob.ar/english/" style="background-color: transparent; color: rgb(51, 51, 51);">Federal Administration of Public Revenues</a><br />

I only need "Federal Administration of Public Revenues" and "http://www.afip.gob.ar/english/".

So far I have:

argurl <- readLines("http://oceantax.co.uk/links/tax-authorities-worldwide.html") 

strong <-as.matrix(grep("<br//>",argurl)) 
strong1starts <- grep("<br //>Argentina",argurl) 
rowst1st <- which(grepl(strong1starts, strong)) 
strong1ends <- strong[rowst1st + 1 ,]-1 
data1 <- as.matrix(argurl[strong1starts:strong1ends])

Source

2015-02-24 grant macdonald

[Don't use regular expressions to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). Instead, look at the [rvest](https://github.com/hadley/rvest) package for parsing HTML in R. – 2015-02-24 18:34:39

library(rvest) 

url <- "http://oceantax.co.uk/links/tax-authorities-worldwide.html" 
pg <- read_html(url)  # read_html() replaces html(), which newer rvest releases deprecated

# get the country node 

# XPath version 
country <- pg %>% html_nodes(xpath="//a[contains(@title, 'Argentina')]") 

# CSS Selector version 
country <- pg %>% html_nodes("a[title~=Argentina]") 

# use one of the above then: 

country %>% html_text()  # get the text of the anchor 
country %>% html_attr("href") # get the URL of the anchor
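
To try the same approach without a network request, here is a minimal sketch that parses the fragment quoted in the question. The fragment is trimmed of its inline styles and given closing tags so it parses cleanly, and `snippet`, `country_name`, and `result` are names introduced here for illustration only. It assumes a reasonably recent rvest in which `read_html()` accepts a literal HTML string.

```r
library(rvest)

# Fragment adapted from the question (styles removed, tags closed);
# parsing a string avoids depending on the live page.
snippet <- '<div class="RichTextElement"><div><h3>
  <span>Afghanistan - </span>
  <span><a title="Tax Authority in Afganistan" href="http://mof.gov.af/en">Ministry of Finance</a><br />
  Argentina - <a title="Tax Authority in Argentina" href="http://www.afip.gob.ar/english/">Federal Administration of Public Revenues</a><br />
  </span></h3></div></div>'

pg <- read_html(snippet)

# Build the XPath from a variable so the same code works for other countries
# (country_name is a hypothetical parameter for this sketch).
country_name <- "Argentina"
xp <- sprintf("//a[contains(@title, '%s')]", country_name)
anchor <- html_nodes(pg, xpath = xp)

# Collect the link text and its href side by side.
result <- data.frame(
  title = html_text(anchor),
  url   = html_attr(anchor, "href"),
  stringsAsFactors = FALSE
)
print(result)
```

Building the XPath from a variable keeps the selector in one place. Matching on the `title` attribute relies on the page spelling the country name consistently there, which the Afghanistan entry ("Afganistan") shows is not guaranteed.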

Source

2015-02-24 18:38:20 hrbrmstr
