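Extracting text from search result URLs using R

I know a bit of R but am not a professional. I am working on a text mining project using R.

I searched the Federal Reserve Board's website for a keyword, say 'inflation'. The second page of the search results has the URL: (https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation).

This page has 10 search results (10 URLs). I want to write code in R that reads the page behind each of those 10 URLs and extracts the text of those web pages into .txt files. My only input is the URL mentioned above.

I appreciate your help. If there is a similar older post, please point me to it as well. Thank you.

Source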

2017-08-27 SBAG009

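Here is the basic idea of how to go about scraping this page, although it may be slow if there are many pages to scrape. Your question is a bit ambiguous: you want the end results to be .txt files, but what about the results that are PDFs? You can still use this approach and change the file extension to .pdf for those, though a PDF has to be downloaded as a binary file rather than parsed with read_html().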

library(xml2) 
library(rvest) 

urll <- "https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"

# Collect the result links, parse each page, and write its <body> to a temp .txt file
urll %>%
  read_html() %>%
  html_nodes("div#results a") %>%
  html_attr("href") %>%
  .[!duplicated(.)] %>%
  lapply(function(x) read_html(x) %>% html_nodes("body")) %>%
  Map(function(x, y) write_html(x, tempfile(y, fileext = ".txt"), options = "format"),
      ., paste("tmp", seq_along(.)))

Here is a breakdown of the code above. The URL to be scraped:

urll="https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"

To get all the result URLs, you need:

allurls <- urll%>%read_html()%>%html_nodes("div#results a")%>%html_attr("href")%>%.[!duplicated(.)]

Where do you want to save your texts? Create temporary files:

tmps <- tempfile(c(paste("tmp",1:length(allurls))),fileext=".txt")
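If the files should end up somewhere other than the top of the session's temporary directory, tempfile() also takes a tmpdir argument. The following is only a sketch: the folder name scraped_pages and the example.org links are placeholders, not part of the original answer.

```r
# Hypothetical output folder; it sits under tempdir() only so the sketch runs anywhere.
out_dir <- file.path(tempdir(), "scraped_pages")
dir.create(out_dir, showWarnings = FALSE)

# Placeholder links standing in for the real allurls vector.
allurls <- c("https://www.example.org/a.htm",
             "https://www.example.org/b.htm",
             "https://www.example.org/c.htm")

# One unique .txt path per URL, all inside out_dir.
# tempfile() only generates names; the files are created when something is written.
tmps <- tempfile(pattern = paste0("tmp", seq_along(allurls), "_"),
                 tmpdir = out_dir, fileext = ".txt")
print(basename(tmps))
```

Replacing out_dir with any existing directory should keep the files after the R session ends.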

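At this point allurls is a character vector. Each URL has to be read into an XML document with read_html() before it can be scraped. Finally, write the results to the temp files created above: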

allurls%>%lapply(function(x) read_html(x)%>%html_nodes("body"))%>% 
     Map(function(x,y) write_html(x,y,options="format"),.,tmps)

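Please do not leave anything out. For example, the period (.) that follows ..."format"), is part of the code: it stands for the list coming in from the pipe. Your files have now been written to the session's temporary directory. To find out where that is, type tempdir() in the console and it will give you the location of your files. You can also change where the scraped files are saved through the tempfile() call.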
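Note that write_html() saves the page's HTML markup, so the .txt files will still contain tags. If plain text is what is wanted, one possible variation is to pull out the text nodes and write those instead. The sketch below uses a tiny made-up page in place of a downloaded result, and the choice to drop script and style nodes is an assumption about what counts as text.

```r
library(xml2)
library(rvest)

# Made-up inline page standing in for one downloaded search result.
page <- read_html(paste0(
  "<html><head><title>Demo</title></head><body>",
  "<h1>Inflation</h1><p>Prices rose.</p>",
  "<script>var y = 2;</script></body></html>"
))

body <- html_node(page, "body")

# Drop script/style nodes so their source code is not mixed into the text.
xml_remove(xml_find_all(body, ".//script | .//style"))

# Keep every non-blank text node, one per output line.
pieces <- xml_text(xml_find_all(body, ".//text()[normalize-space()]"))

out <- tempfile("page_text_", fileext = ".txt")
writeLines(pieces, out)
print(pieces)
```

In the pipeline above, the same steps could replace the write_html() call inside Map().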

Hope this helps.

Source

2017-08-28 00:34:48 Onyambu

Thank you very much, Onyambu! A very useful answer! Thanks again! – SBAG009

Here you go. For the main search page you can use a regular expression, since the URLs are easy to identify in the page source.

(With help from https://statistics.berkeley.edu/computing/r-reading-webpages)

library('RCurl')
library('stringr')
library('XML')

# Read the raw HTML of the search page, one element per line
pageToRead <- readLines('https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation')
urlPattern <- 'URL: <a href="(.+)">'
urlLines <- grep(urlPattern, pageToRead, value=TRUE)

# Pull out the matched substrings, then keep only the captured URL
getexpr <- function(s,g) substring(s, g, g + attr(g, 'match.length') - 1)
gg <- gregexpr(urlPattern, urlLines)
matches <- mapply(getexpr, urlLines, gg)
result <- gsub(urlPattern,'\\1', matches)
names(result) <- NULL

for (i in seq_along(result)) {
    subURL <- result[i]

    # Only links ending in .htm are parsed here
    if (str_sub(subURL, -4, -1) == ".htm") {
        content <- readLines(subURL)
        doc <- htmlParse(content, asText=TRUE)
        # Keep text nodes that are not inside script, style, noscript or form elements
        doc <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
        writeLines(doc, paste("inflationText_", i, ".txt", sep=""))
    }
}
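The URL-extraction step can be tried offline on a few sample lines. This sketch swaps the getexpr/gregexpr helper for base R's regmatches(), which is a different route to the same result. The sample lines are invented: only the 'URL: <a href="...">' shape comes from the pattern above, and the example.org addresses are placeholders.

```r
# Made-up lines imitating the search page source.
pageToRead <- c(
  '<p>Some result title</p>',
  'URL: <a href="https://www.example.org/first.htm">https://www.example.org/first.htm</a>',
  'URL: <a href="https://www.example.org/second.pdf">https://www.example.org/second.pdf</a>'
)

# [^"]+ stops at the first closing quote, so a match cannot run on into a later attribute.
urlPattern <- 'URL: <a href="([^"]+)">'

# regmatches() returns the matched substring for each line that matches and skips the rest.
hits <- regmatches(pageToRead, regexpr(urlPattern, pageToRead))

# Keep only the captured group, i.e. the URL itself.
result <- sub(urlPattern, "\\1", hits)
print(result)
```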

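However, as you may have noticed, the loop above only parses pages ending in .htm. For the .pdf documents linked in the search results, I would advise you to have a look here: http://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/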
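As a rough sketch of how the PDF links might be handled, the links can be split by extension and the PDFs passed to a small helper. This assumes the pdftools package, which is one option and not necessarily what the linked article uses. The helper names is_pdf and pdf_to_txt and the example.org links are made up for illustration.

```r
# TRUE for links whose path ends in .pdf (case-insensitive).
is_pdf <- function(u) grepl("\\.pdf$", u, ignore.case = TRUE)

# Hypothetical helper: download one PDF and save its text as a .txt file.
# Assumption: pdftools is installed; it is only needed when this function is called.
pdf_to_txt <- function(url, out_file) {
  tmp <- tempfile(fileext = ".pdf")
  on.exit(unlink(tmp))
  download.file(url, tmp, mode = "wb", quiet = TRUE)  # binary mode keeps the PDF intact
  writeLines(pdftools::pdf_text(tmp), out_file)       # pdf_text() returns one string per page
  invisible(out_file)
}

# Placeholder links standing in for `result`.
links <- c("https://www.example.org/a.htm", "https://www.example.org/b.PDF")
print(is_pdf(links))
```

Inside the loop, an else-if branch on is_pdf(subURL) could then call pdf_to_txt(subURL, paste0("inflationText_", i, ".txt")).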

Source

2017-08-27 22:24:51

Thank you very much, Vincent. This is very useful and helped me a lot! – SBAG009
