2017-03-06

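I don't understand why this error occurs. I am trying to use the xmlParse function to parse news content (title, link, description, date) and save it in a data frame, but it throws the errors shown below and I cannot parse the feed.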

site = "http://www.federalreserve.gov/feeds/prates.xml" 
doc <- tryCatch(xmlParse(site), error=function(e) e);  
Unknown IO errorfailed to load external entity  
"http://www.federalreserve.gov/feeds/prates.xml" 
src <- xpathApply(xmlRoot(doc), "//item") 
Error in UseMethod("xmlRoot") : 
  no applicable method for 'xmlRoot' applied to an object of class "c('XMLParserErrorList', 'simpleError', 'error', 'condition')" 
for (i in 1:length(src)) { 
  if (i==1) { 
    foo<-xmlSApply(src[[i]], xmlValue) 
    temp<-data.frame(t(foo), stringsAsFactors=FALSE) 
    DATA=data.frame(title=temp$title,link=temp$link,description=temp$description,pubDate=temp$pubDate) 
  } 
  else { 
    foo<-xmlSApply(src[[i]], xmlValue) 
    temp<-data.frame(t(foo), stringsAsFactors=FALSE) 
    temp1=data.frame(title=temp$title,link=temp$link,description=temp$description,pubDate=temp$pubDate) 
    DATA<-rbind(DATA, temp1) 
  } 
} 
Error: object 'src' not found 

You should pass an XML object to 'xmlParse', not a URL.

The site is now served over https://

@chris That doesn't matter... I am parsing an XML file.

Answer

The error means the URL redirects to HTTPS, as I mentioned in my comment...

site   <- "http://www.federalreserve.gov/feeds/prates.xml" 
correct_site <- "https://www.federalreserve.gov/feeds/prates.xml" 

curlGetHeaders(site) 
[1] "HTTP/1.1 301 Moved Permanently\r\n"                           
[2] "Location: https://www.federalreserve.gov/feeds/prates.xml\r\n"                    
...  
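
The redirect check above can also be scripted. This is only a minimal sketch in base R: redirect_target is a hypothetical helper name, and the sample header lines are copied from the output shown above so that it runs without network access.

```r
# Hypothetical helper: pull the redirect target out of a vector of raw
# header lines such as the one returned by curlGetHeaders().
redirect_target <- function(headers) {
  # Header names are case-insensitive, so match "Location:" loosely.
  hit <- grep("^location:", headers, ignore.case = TRUE, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  # Drop the header name and the trailing CRLF; keep the last redirect seen.
  trimws(sub("^location:", "", hit[length(hit)], ignore.case = TRUE))
}

# Header lines copied from the output above (no network access needed).
sample_headers <- c(
  "HTTP/1.1 301 Moved Permanently\r\n",
  "Location: https://www.federalreserve.gov/feeds/prates.xml\r\n"
)
new_url <- redirect_target(sample_headers)
print(new_url)
```

With a live connection, the same helper could presumably be applied to curlGetHeaders(site) directly, and the result passed to whichever reader supports https.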

xmlParse(site) 
Unknown IO errorfailed to load external entity "http://www.federalreserve.gov/feeds/prates.xml" 

xmlParse cannot read from https, so use readLines (ignore the warning), the xml2 package, or any of many other options to read over secure HTTP.

xmlParse(correct_site) 
Error: XML content does not seem to be XML: 'https://www.federalreserve.gov/feeds/prates.xml' 

x <- readLines(correct_site) 
Warning message: 
In readLines(correct_site) : 
    incomplete final line found on 'https://www.federalreserve.gov/feeds/prates.xml' 


xmlParse(x) 
<?xml version="1.0" encoding="utf-8"?> 
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:cb="http://www.cbwiki.net/wiki/index.php/Specification_1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/1999/02/22-rdf-syntax-ns# rdf.xsd"> 
    <channel rdf:about="http://www.federalreserve.gov/feeds/"> 
    <title>FRB: DDP: Policy Rates</title> 
... 
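
The second error in the question comes from storing the error object in doc and then carrying on as if parsing had succeeded. A hedged sketch of one way to guard against that with the XML package follows. parse_xml_text is a hypothetical helper, and the inline sample is a made-up stand-in for the feed, not real feed data. Note that the feed declares a default namespace, so a plain "//item" XPath would likely match nothing; the sketch binds that namespace to the arbitrary prefix r.

```r
library(XML)

# Hypothetical helper: parse XML held in a character vector of lines and
# return NULL (with a message) instead of an error object on failure.
parse_xml_text <- function(lines) {
  tryCatch(
    xmlParse(paste(lines, collapse = "\n"), asText = TRUE),
    error = function(e) {
      message("XML parsing failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Small inline stand-in for the feed (made-up item, not real feed data).
sample_lines <- c(
  '<?xml version="1.0" encoding="utf-8"?>',
  '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"',
  '         xmlns="http://purl.org/rss/1.0/">',
  '  <item rdf:about="http://example.com/1">',
  '    <title>Example title</title>',
  '    <link>http://example.com/1</link>',
  '  </item>',
  '</rdf:RDF>'
)

doc <- parse_xml_text(sample_lines)
if (!is.null(doc)) {
  # The default namespace must be bound to a prefix for XPath to match.
  ns <- c(r = "http://purl.org/rss/1.0/")
  titles <- xpathSApply(doc, "//r:item/r:title", xmlValue, namespaces = ns)
  print(titles)
}
```

For the real feed, sample_lines would be replaced by the result of readLines(correct_site).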

library(xml2) 
read_xml(correct_site) 

{xml_document} 
<RDF schemaLocation="http://www.w3.org/1999/02/22-rdf-syntax-ns# rdf.xsd" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:cb="http://www.cbwiki.net/wiki/index.php/Specification_1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> 
[1] <channel rdf:about="http://www.federalreserve.gov/feeds/">\n <title>FRB: DDP: Policy Rates</title>\n ... 
[2] <item rdf:about="http://www.federalreserve.gov/feeds/PRATES.html#1765">\n <title>Change to the Publica ... 
[3] <item rdf:about="http://www.federalreserve.gov/feeds/PRATES.html#953">\n <title>Change to the Payment . 
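
To get from the xml2 document to the data frame the question asks for, something along these lines may work. It is a sketch under assumptions: items_to_df is a hypothetical helper, the field names are guesses based on the question's code (the feed's date element may not be called pubDate, so it is left out here), and the inline sample is made up rather than taken from the feed.

```r
library(xml2)

# Hypothetical helper: turn every <item> of an RSS 1.0 document into one
# data frame row. `fields` are assumed child element names.
items_to_df <- function(doc, fields = c("title", "link", "description")) {
  # Removing the default namespace lets plain XPath such as "//item" match.
  # Note: xml_ns_strip() modifies the document in place.
  xml_ns_strip(doc)
  items <- xml_find_all(doc, "//item")
  cols <- lapply(fields, function(f) {
    # xml_find_first() yields a missing node, hence NA text, when absent.
    xml_text(xml_find_first(items, paste0("./", f)))
  })
  names(cols) <- fields
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Made-up sample with the same namespace layout as the feed.
sample_xml <- '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="http://example.com/1"><title>First</title><link>http://example.com/1</link><description>One</description></item>
  <item rdf:about="http://example.com/2"><title>Second</title><link>http://example.com/2</link></item>
</rdf:RDF>'

feed_df <- items_to_df(read_xml(sample_xml))
print(feed_df)
```

Building all columns at once avoids the row-by-row rbind loop from the question, and items that lack a field get NA instead of breaking the data frame.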

Wow! That's a cool trick: readLines works with https. Thank you very much @Chris