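Parsing XML: no string returned if CDATA does not contain HTML tags
I am using a DOM parser on Android to read an RSS feed that contains HTML tags, like this: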
...
<item cbc:type="story" cbc:deptid="2.663" cbc:syndicate="true">
<title>
<![CDATA[
Asian carp have reproduced in Great Lakes watershed
]]>
</title>
<link>
http://www.cbc.ca/news/canada/windsor/asian-carp-have-reproduced-in-great-lakes-watershed-1.2286554?cmp=rss
</link>
<guid isPermaLink="false">1.2286554</guid>
<pubDate>Tue, 29 Oct 2013 08:06:48 EDT</pubDate>
<description>
<![CDATA[
<img title='Fisheries and Oceans Canada and the Ontario Ministry of Natural Resources confirmed one grass carp was caught in the Grand River near Lake Erie. ' height='259' alt='hi-20130502-grass_carp-dfo-852' width='460' src='http://i.cbc.ca/1.1663916.1379078358!/httpImage/image.jpg_gen/derivatives/16x9_460/hi-20130502-grass-carp-dfo-852.jpg' /> <p>Scientists said Monday they have documented for the first time that an Asian carp species has successfully reproduced within the Great Lakes watershed, an ominous development in the struggle to slam the door on the hungry invaders that could threaten native fish.</p>
]]>
</description>
</item>
...
xmlParser.class:
public class xmlParser {

    public Document getDomElement(String rssFilePath, String fileName){
        Log.d("GET", ""+rssFilePath+fileName);
        Document doc = null;
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setCoalescing(true);
        FileInputStream fis;
        try {
            DocumentBuilder db = dbf.newDocumentBuilder();
            File tmp2 = new File (rssFilePath,"/"+ fileName);
            fis = new FileInputStream(tmp2);
            InputSource is = new InputSource();
            is.setByteStream(fis);
            doc = db.parse(is);
        } catch (ParserConfigurationException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        } catch (SAXException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        } catch (IOException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        }
        // return DOM
        // Log.d("DOM", doc.toString());
        return doc;
    }

    public String getValue(Element item, String str) {
        NodeList n = item.getElementsByTagName(str);
        return this.getElementValue(n.item(0));
    }

    public final String getElementValue(Node elem) {
        Node child;
        if(elem != null){
            if (elem.hasChildNodes()){
                for(child = elem.getFirstChild(); child != null; child = child.getNextSibling()){
                    if(child.getNodeType() == Node.TEXT_NODE ){
                        return child.getNodeValue();
                    }
                }
            }
        }
        return "";
    }
}
From my main activity:
//Parse the XML content
xmlParser parser = new xmlParser();
Log.d(TAG, "1");
Document rssDoc = parser.getDomElement(rssFilePath, rssFileName);
Log.d(TAG, "2");
final NodeList nl = rssDoc.getElementsByTagName(KEY_ITEM);
Log.d(TAG, "3");
//Make it all look nice and strip HTML
for (int i = 0; i < nl.getLength(); i++){
    Element e = (Element) nl.item(i);
    String noHtmlTitle = parser.getValue(e, KEY_TITLE).toString().replaceAll("\\<.*?>", "");
    noHtmlTitle = noHtmlTitle.replaceAll("/n", "");
    noHtmlTitle = noHtmlTitle.trim();
    titles.add(noHtmlTitle);
    String noHtmlDesc = parser.getValue(e, KEY_DESC).toString().replaceAll("\\<.*?>", "");
    noHtmlDesc = noHtmlDesc.trim();
    descs.add("\n" + noHtmlDesc);
}
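However, when this code reaches the <title>...</title> element shown above, it returns a blank string. This seems to be related to the fact that the <title> element does not contain any HTML tags.
How can I retrieve a usable string from the title tag?
Let me know if I have left out any information you need.
Edit: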
As per blahdiblah, the node type being returned is CDATA_SECTION_NODE. I modified the getElementValue method to handle this node type in addition to text nodes (child.getNodeType() == Node.TEXT_NODE):
public final String getElementValue(Node elem) {
    Node child;
    if(elem != null){
        if (elem.hasChildNodes()){
            for(child = elem.getFirstChild(); child != null; child = child.getNextSibling()){
                if(child.getNodeType() == Node.TEXT_NODE ){
                    return child.getNodeValue();
                }else if (child.getNodeType() == Node.CDATA_SECTION_NODE){
                    return child.getNodeValue();
                }
            }
        }
    }
    return "";
}
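
One thing to watch with the method above: in the sample feed the <title> element has a line break before the CDATA section, so depending on the parser the first child may be a whitespace-only text node, and returning the first match could still give a string that is blank after trim(). Below is a minimal sketch of a variant that concatenates every text and CDATA child instead. It runs on a plain JVM rather than Android, and the names CdataTextDemo, collectText and firstValue are hypothetical helpers made up for this illustration; they are not part of the original code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class CdataTextDemo {

    // Hypothetical helper: joins the values of all TEXT and CDATA children
    // of elem, then trims surrounding whitespace. Returns "" for null.
    static String collectText(Node elem) {
        if (elem == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (Node child = elem.getFirstChild(); child != null; child = child.getNextSibling()) {
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString().trim();
    }

    // Hypothetical helper: parses an XML string and returns the collected
    // text of the first element with the given tag name.
    static String firstValue(String xml, String tag) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xml)));
            return collectText(doc.getElementsByTagName(tag).item(0));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Same shape as the <title> element in the feed: newline, CDATA, newline.
        String xml = "<item><title>\n<![CDATA[\nAsian carp have reproduced in Great Lakes watershed\n]]>\n</title></item>";
        System.out.println(firstValue(xml, "title"));
    }
}
```

Because the loop appends rather than returns, the result does not depend on whether the parser reports the CDATA section as its own node or merges it into the neighbouring text.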
What is the value of 'parser.getValue(e, KEY_TITLE).toString()' before the 'replaceAll'? – blahdiblah
It returns blank. – keag
It looks like 'getElementValue' only returns a value for a node of type 'TEXT_NODE'; I guess ...