2014-10-08 32 views

R: Atom feed to data frame

I'm working on a scraper for an Atom feed in R, and I'm having trouble getting the link for each article. Here is my code:

library(RCurl)  # getURL
library(XML)    # htmlParse, xpathSApply, xmlValue

url <- "http://www.stwnewspress.com/search/?mode=article&q=&nsa=eedition&t=article&l=1000&s=&sd=desc&f=atom&d=&d1=&d2="
pageSource <- getURL(url, encoding = "UTF-8")
parsed <- htmlParse(pageSource)
titles <- xpathSApply(parsed, '//entry/title', xmlValue)
authors <- xpathSApply(parsed, '//entry/author', xmlValue)
links <- xpathSApply(parsed, '//entry/link/@href')
# pubDates is never defined in the snippet as posted, so this line fails as written
dataFrame <- data.frame(pubDates, titles, authors)

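My problem is that I get 18 titles, 18 authors, and 20 links. I think I'm picking up the first two links from the feed page itself, but I don't know how to stop picking them up.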

Thanks for your help!

You could try [R does RSS](https://github.com/noahhl/r-does-rss) in addition to @jdharrison's answer – hrbrmstr 2014-10-08 15:39:30

Answers

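You can iterate over "//entry" instead of querying the individual child nodes separately. Some entry nodes have, for example, more than one link: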

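out <- xpathApply(parsed, "//entry", function(x){
    # named list of the child nodes of this entry
    children <- xmlChildren(x)
    title <- xmlValue(children$title)
    author <- xmlValue(children$author)
    # keep every child named "link"; an entry may have more than one
    links <- children[names(children) %in% "link"]
    links <- sapply(links, function(y){xmlGetAttr(y, "href")})
    # title and author are recycled, giving one row per link
    data.frame(title, author, links, stringsAsFactors = FALSE)
})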

> out[[1]] 
              title   author 
1 Soap opera star in serious injury crash in Ohio CNHI News Service 
2 Soap opera star in serious injury crash in Ohio CNHI News Service 
                                               links 
1                      http://www.stwnewspress.com/cnhi_network/article_71fb99db-0d47-5ead-9276-cae9c947babc.html 
2 http://bloximages.chicago2.vip.townnews.com/stwnewspress.com/content/tncms/assets/v3/editorial/d/97/d97a9815-29c8-5b90-be11-41a3a8b12e9f/54354a7b66bd9.image.jpg?resize=300%2C450 
> out[[2]] 
            title         author 
link Q5: Voter registration deadline nears By Michelle Charles/Stillwater News Press 
                          links 
link http://www.stwnewspress.com/news/local_news/article_ba35bd60-4ea4-11e4-8da8-93d495865336.html 
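The first entry above carries two links (the article and an image), which would explain a link count higher than the title count. As a rough check, the sketch below counts the links per entry and keeps only the first link of each one. The feed text is a made-up stand-in, not the real feed, and it omits the Atom namespace so that plain paths such as "//entry" match under xmlParse. Whether the first link is always the article link in the real feed is an assumption.

```r
library(XML)

# Hypothetical two-entry feed, invented for illustration only.
feedText <- '<feed>
  <link href="http://example.com/feed"/>
  <entry>
    <title>First story</title>
    <author><name>Reporter A</name></author>
    <link href="http://example.com/first.html"/>
    <link href="http://example.com/first.jpg"/>
  </entry>
  <entry>
    <title>Second story</title>
    <author><name>Reporter B</name></author>
    <link href="http://example.com/second.html"/>
  </entry>
</feed>'
doc <- xmlParse(feedText, asText = TRUE)

# Count the children named "link" in each entry.
entries <- getNodeSet(doc, "//entry")
linksPerEntry <- sapply(entries, function(e) sum(names(xmlChildren(e)) == "link"))

# link[1] selects the first link child of each entry, so this yields
# one href per entry (assuming every entry has at least one link).
firstLinks <- xpathSApply(doc, "//entry/link[1]", xmlGetAttr, "href")

print(linksPerEntry)
print(firstLinks)
```

Entries whose count is above 1 are the ones that inflate the total when "//entry/link/@href" is used on its own.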

You can then rbind the individual entries together:

res <- do.call(rbind.data.frame, out) 
> str(res) 
'data.frame': 147 obs. of 3 variables: 
$ title : chr "Soap opera star in serious injury crash in Ohio" "Soap opera star in serious injury crash in Ohio" "Q5: Voter registration deadline nears" "Oklahoma State assault under investigation" ... 
$ author: chr "CNHI News Service" "CNHI News Service" "By Michelle Charles/Stillwater News Press" "By Megan Sando/Stillwater News Press" ... 
$ links : chr "http://www.stwnewspress.com/cnhi_network/article_71fb99db-0d47-5ead-9276-cae9c947babc.html" "http://bloximages.chicago2.vip.townnews.com/stwnewspress.com/content/tncms/assets/v3/editorial/d/97/d97a9815-29c8-5b90-be11-41a"| __truncated__ "http://www.stwnewspress.com/news/local_news/article_ba35bd60-4ea4-11e4-8da8-93d495865336.html" "http://www.stwnewspress.com/news/local_news/article_7023a110-4ea4-11e4-82dd-f735d5c5ed44.html" ... 
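The result has one row per link, so an entry with two links appears twice. If one row per entry is wanted instead, a variant of the function can keep only the first link. This is a minimal sketch on a made-up feed (no Atom namespace, parsed with xmlParse); it assumes every entry has at least one link and that the first one is the link of interest.

```r
library(XML)

# Hypothetical two-entry feed, invented for illustration only.
feedText <- '<feed>
  <entry>
    <title>First story</title>
    <author><name>Reporter A</name></author>
    <link href="http://example.com/first.html"/>
    <link href="http://example.com/first.jpg"/>
  </entry>
  <entry>
    <title>Second story</title>
    <author><name>Reporter B</name></author>
    <link href="http://example.com/second.html"/>
  </entry>
</feed>'
doc <- xmlParse(feedText, asText = TRUE)

rows <- xpathApply(doc, "//entry", function(x) {
  children <- xmlChildren(x)
  links <- children[names(children) == "link"]
  data.frame(
    title  = xmlValue(children$title),
    author = xmlValue(children$author),
    # first link only; errors if the entry has no link at all
    link   = xmlGetAttr(links[[1]], "href"),
    stringsAsFactors = FALSE
  )
})

# Stack the one-row data frames into a single data frame.
res <- do.call(rbind, rows)
print(res)
```

Each element of rows is a one-row data frame, so the combined result has exactly as many rows as there are entries.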

To understand how the function works, look at the first entry, called x here:

url <- "http://www.stwnewspress.com/search/?mode=article&q=&nsa=eedition&t=article&l=1000&s=&sd=desc&f=atom&d=&d1=&d2=" 
pageSource <- getURL(url, encoding = "UTF-8") 
parsed <- htmlParse(pageSource) 
x <- parsed["//entry"][[1]] 
children <- xmlChildren(x) 

> names(children) 
[1] "title" "author" "link"  "id"  "content" "category" 
[7] "updated" 

> children$title 
<title>BYRON YORK: Jindal a GOP darkhorse in 2016 race</title> 

> xmlValue(children$title) 
[1] "BYRON YORK: Jindal a GOP darkhorse in 2016 race" 
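The link step works the same way as the title step, except that several children can share the name "link". The sketch below walks through it on a single made-up entry (hypothetical content, parsed with xmlParse rather than taken from the feed).

```r
library(XML)

# Hypothetical single entry, invented for illustration only.
entryText <- '<entry>
  <title>Sample story</title>
  <author><name>Reporter A</name></author>
  <link href="http://example.com/story.html"/>
  <link href="http://example.com/story.jpg"/>
</entry>'
x <- xmlRoot(xmlParse(entryText, asText = TRUE))
children <- xmlChildren(x)

# Logical vector marking the children that are named "link".
isLink <- names(children) == "link"

# Subset the child list, then read the href attribute of each link node.
linkNodes <- children[isLink]
hrefs <- unname(sapply(linkNodes, xmlGetAttr, "href"))

print(sum(isLink))
print(hrefs)
```

Because hrefs can hold more than one value, data.frame() recycles the single title and author across the rows, which is why an entry with two links produces two rows.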

Thanks, this is useful. Unfortunately, I don't fully understand how it works. Could you elaborate on how the function works? – 2014-10-08 16:08:17

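I've added an explanation of how it works. Each node returned by the XPath '//entry' is processed by the function. You can see how the contents of the first node are handled. – jdharrison 2014-10-08 16:18:23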