2013-10-29 69 views

Parsing XML: no string returned if CDATA does not contain HTML tags

I am using a DOM parser on Android to read an RSS feed like this:

... 
<item cbc:type="story" cbc:deptid="2.663" cbc:syndicate="true"> 
<title> 
<![CDATA[ 
Asian carp have reproduced in Great Lakes watershed 
]]> 
</title> 
<link> 
http://www.cbc.ca/news/canada/windsor/asian-carp-have-reproduced-in-great-lakes-watershed-1.2286554?cmp=rss 
</link> 
<guid isPermaLink="false">1.2286554</guid> 
<pubDate>Tue, 29 Oct 2013 08:06:48 EDT</pubDate> 
<description> 
<![CDATA[ 
<img title='Fisheries and Oceans Canada and the Ontario Ministry of Natural Resources confirmed one grass carp was caught in the Grand River near Lake Erie. ' height='259' alt='hi-20130502-grass_carp-dfo-852' width='460' src='http://i.cbc.ca/1.1663916.1379078358!/httpImage/image.jpg_gen/derivatives/16x9_460/hi-20130502-grass-carp-dfo-852.jpg' /> <p>Scientists said Monday they have documented for the first time that an Asian carp species has successfully reproduced within the Great Lakes watershed, an ominous development in the struggle to slam the door on the hungry invaders that could threaten native fish.</p> 
]]> 
</description> 
</item> 
... 

xmlParser.class:

public class xmlParser { 

public Document getDomElement(String rssFilePath, String fileName){ 
    Log.d("GET", ""+rssFilePath+fileName); 
    Document doc = null; 
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance(); 
    dbf.setCoalescing(true); 
    FileInputStream fis; 
    try { 

     DocumentBuilder db = dbf.newDocumentBuilder(); 

     File tmp2 = new File (rssFilePath,"/"+ fileName); 
     fis = new FileInputStream(tmp2); 

     InputSource is = new InputSource(); 
      is.setByteStream(fis); 
      doc = db.parse(is); 
     } catch (ParserConfigurationException e) { 
      Log.e("Error: ", e.getMessage()); 
      return null; 
     } catch (SAXException e) { 
      Log.e("Error: ", e.getMessage()); 
      return null; 
     } catch (IOException e) { 
      Log.e("Error: ", e.getMessage()); 
      return null; 
     } 
      // return DOM 
    // Log.d("DOM", doc.toString()); 
     return doc; 

} 

public String getValue(Element item, String str) { 
    NodeList n = item.getElementsByTagName(str);   
    return this.getElementValue(n.item(0)); 
} 

public final String getElementValue(Node elem) { 
     Node child; 
     if(elem != null){ 
      if (elem.hasChildNodes()){ 
       for(child = elem.getFirstChild(); child != null; child = child.getNextSibling()){ 
        if(child.getNodeType() == Node.TEXT_NODE ){ 
         return child.getNodeValue(); 
        } 
       } 
      } 
     } 
     return ""; 
    } 
} 

From my main activity:

//Parse the XML content 
      xmlParser parser = new xmlParser(); 
      Log.d(TAG, "1"); 
      Document rssDoc = parser.getDomElement(rssFilePath, rssFileName); 
      Log.d(TAG, "2"); 
      final NodeList nl = rssDoc.getElementsByTagName(KEY_ITEM); 
      Log.d(TAG, "3"); 

      //Make it all look nice and strip HTML 
      for (int i = 0; i < nl.getLength(); i++){ 

       Element e = (Element) nl.item(i); 

       String noHtmlTitle = parser.getValue(e, KEY_TITLE).toString().replaceAll("\\<.*?>", ""); 
       // Strip newlines ("/n" would only match a literal slash followed by n) 
       noHtmlTitle = noHtmlTitle.replaceAll("\n", ""); 

       noHtmlTitle = noHtmlTitle.trim(); 

       titles.add(noHtmlTitle); 

       String noHtmlDesc = parser.getValue(e, KEY_DESC).toString().replaceAll("\\<.*?>", ""); 
       noHtmlDesc = noHtmlDesc.trim(); 
       descs.add("\n" + noHtmlDesc); 

      } 

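However, when this code processes the <title>...</title> element shown above, it returns a blank string. This seems to be related to the fact that the title element does not contain any HTML tags.

How can I retrieve a usable string from the title tag?

Let me know if I have left out any required data.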

Edit:

As per blahdiblah's comment, the node type being returned is CDATA_SECTION_NODE. I modified the getElementValue method to handle this node type as well as text nodes (child.getNodeType() == Node.TEXT_NODE):

public final String getElementValue(Node elem) { 
     Node child; 
     if(elem != null){ 
      if (elem.hasChildNodes()){ 
       for(child = elem.getFirstChild(); child != null; child = child.getNextSibling()){ 
        if(child.getNodeType() == Node.TEXT_NODE ){ 
         return child.getNodeValue(); 
        }else if (child.getNodeType() == Node.CDATA_SECTION_NODE){ 
         return child.getNodeValue(); 
        } 
       } 
      } 
     } 
     return ""; 
    } 
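
One caveat with returning at the first matching child: if the element contains a whitespace-only text node before the CDATA section (as in the pretty-printed feed above), the first match may be that whitespace. A minimal sketch of one alternative, assuming it is acceptable to concatenate every text and CDATA child and trim the result, is shown here. The class `CdataText` and its methods `textOf` and `firstText` are hypothetical names for illustration and are not part of the original xmlParser class.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of the original xmlParser class.
final class CdataText {

    // Concatenates the values of all direct TEXT_NODE and CDATA_SECTION_NODE
    // children instead of returning at the first match, then trims the result.
    static String textOf(Node elem) {
        if (elem == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (Node child = elem.getFirstChild(); child != null; child = child.getNextSibling()) {
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString().trim();
    }

    // Parses an XML string and returns the text of the first element with the given tag name.
    static String firstText(String xml, String tag) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xml)));
            return textOf(doc.getElementsByTagName(tag).item(0));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Mirrors the layout of the feed: newline, CDATA section, newline.
        String xml = "<item><title>\n<![CDATA[\nAsian carp have reproduced\n]]>\n</title></item>";
        System.out.println(firstText(xml, "title"));
    }
}
```

In the activity code, `parser.getValue(e, KEY_TITLE)` could delegate to a method like `textOf` so that the title and description are read the same way.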
Comments:

What is the value of 'parser.getValue(e, KEY_TITLE).toString()' before the 'replaceAll'? – blahdiblah

It returns blank. – keag

It looks like 'getElementValue' only returns a value for nodes of type 'TEXT_NODE', and I'd guess the content of '<title>' has type 'CDATA_SECTION_NODE'. Try making your XMLParser more liberal about what it returns data for. – blahdiblah

Answer (2 votes, accepted):

Your XMLParser only returns content for text nodes (child.getNodeType() == Node.TEXT_NODE), but the content of <title> is of type CDATA_SECTION_NODE.

Note that the title is almost certainly sent as CDATA rather than plain text so that it can contain HTML formatting and other unusual characters. Be sure to test with a variety of inputs to make sure you are parsing it correctly.

Source: https://stackoverflow.com/q/19667527 (answered 2013-10-29 19:33:24 by blahdiblah)
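
The advice to test with a variety of inputs can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the standard DOM Level 3 method `Node.getTextContent()`, which returns the combined text of an element's text and CDATA children, and the class `TitleInputs`, the method `titleOf`, and the sample strings are hypothetical and not taken from the original post.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical test harness; the class and method names are illustrative only.
final class TitleInputs {

    // Returns the trimmed text content of the first <title> element, or "" if none exists.
    static String titleOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node title = doc.getElementsByTagName("title").item(0);
            return title == null ? "" : title.getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse test input", e);
        }
    }

    public static void main(String[] args) {
        // Sample inputs (assumptions): plain text, an escaped entity,
        // CDATA containing markup, and text mixed with CDATA.
        String[] inputs = {
            "<item><title>Plain title</title></item>",
            "<item><title>Fish &amp; chips</title></item>",
            "<item><title><![CDATA[<b>Bold</b> & more]]></title></item>",
            "<item><title>a<![CDATA[b]]>c</title></item>"
        };
        for (String xml : inputs) {
            System.out.println("[" + titleOf(xml) + "]");
        }
    }
}
```

Printing each result in brackets makes stray whitespace or an unexpectedly empty string easy to spot.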