2011-04-27

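Java XPath API: extracting selective text

I am using the Java XPath API to extract content from an XHTML file. I walk through the HTML and try to extract specific content, which contains text and a few nested elements. Strangely, when I use XPath it ignores all the HTML tags and extracts only the text content. Here is an HTML snippet.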

<html> 
<body> 
<div class="content"> 
    <div class="content_wrapper"> 
     <table border="0" cellspacing="0" cellpadding="0" class="test_class"> 
      <tr> 
       <td> 
        <p> 
         Reading and looking at images or movies is one thing. Experiencing it in 3D the other. If you like to figure out more about what Showcase is, I would really encourage you to 
         download Showcase Viewer and have a look at the demo files also available on this site. Interact with the models and see how real it looks. 
        </p> 
        <p style="text-align: center;"> 
         <img src="/testsource/fckdata/208123/image/showcarswatch.jpg" alt="" /> 
         <img src="/testsource/fckdata/208123/image/engineswatch.jpg" alt="" /> 
         <img src="/th.gen/?:760x0:/userdata/fckdata/208123/image/toasterswatch.jpg" alt="" /> 
         <img src="/testsource/fckdata/208123/image/smartphoneswatch.jpg" alt="" /> 
        </p> 
        <p> 
         <br /> 
         Showcase Viewer is actually a full Showcase install, except data processing and creation tools. This means that you can look at any data created with a regular Showcase you 
         just can´t add any information. But you may join a collaboration session hosed by a Showcase Professional user. Here is where you can get it:<br /> 
        </p> 
        <p> 
         <strong>Operating System</strong><br /> 
         • Microsoft® Windows® XP Professional (SP 2 or higher)<br /> 
         • Windows XP Professional x64 Edition (Autodesk® Showcase® software runs as a 32-bit application on 64-bit operating system)<br /> 
         • Microsoft Windows Vista® 32-bit or 64-bit, including Business, Enterprise or Ultimate (SP 1) 
        </p> 
       </td> 
      </tr> 
     </table> 
    </div> 
</div> 
</body> 
</html> 

Now, here is the code I am using. I need to do some cleanup before applying XPath.

Here is the output.


Reading and looking at images or movies is one thing. Experiencing it in 3D the other. If you like to figure out more about what Showcase is, I would really encourage you to 
download Showcase Viewer and have a look at the demo files also available on this site. Interact with the models and see how real it looks. 

Showcase Viewer is actually a full Showcase install, except data processing and creation tools. This means that you can look at any data created with a regular Showcase you 
just can´t add any information. But you may join a collaboration session hosed by a Showcase Professional user. Here is where you can get it 

Operating System 
• Microsoft® Windows® XP Professional (SP 2 or higher)<br /> 
• Windows XP Professional x64 Edition (Autodesk® Showcase® software runs as a 32-bit application on 64-bit operating system)<br /> 
• Microsoft Windows Vista® 32-bit or 64-bit, including Business, Enterprise or Ultimate (SP 1) 

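All I need is the complete content inside the content_wrapper div, markup included.

Any pointers would be appreciated.

Thanks

EDIT in response to yamburg's solution

Sample code.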

XPathFactory factory = XPathFactory.newInstance(); 
XPath xpathCompiled = factory.newXPath(); 
XPathExpression expr = xpathCompiled.compile(contentPath); 
NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET); 


for (int i = 0; i < nodes.getLength(); i++) { 
    Node n = (Node)nodes.item(i); 
    traverseNodes(n); 
} 

public static void traverseNodes(Node n) { 
    NodeList children = n.getChildNodes(); 
    if(children != null) { 
     for(int i = 0; i < children.getLength(); i++) { 
      Node childNode = children.item(i); 
      System.out.println("node name = " + childNode.getNodeName()); 
      System.out.println("node value = " + childNode.getNodeValue()); 
      System.out.println("node type = " + childNode.getNodeType()); 
      traverseNodes(childNode); 
     } 
    } 
} 
This is not about XPath expressions, but about DOM methods applied to the XPath result. Retagged. – 2011-04-28 00:00:04

Answers

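XPath matches a node set. In your case that is an element node containing text nodes and child element nodes. toString() gets the text representation of that node, and that is exactly what it says: text, without element names or attributes.

You should get the nodes instead: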
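To make the difference concrete, here is a minimal sketch using plain JAXP. The class name `StringValueDemo`, the helper names and the tiny sample markup are made up for illustration; they are not from the question. Asking for `XPathConstants.STRING` returns only the concatenated text, while asking for `XPathConstants.NODE` returns the element with its children still attached.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class StringValueDemo {
    // Hypothetical sample markup, much smaller than the page in the question.
    static final String XHTML =
        "<div class=\"content_wrapper\"><p>Hello <strong>world</strong></p></div>";

    /** Parses a well-formed XML/XHTML string into a DOM document. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    /** Evaluates the expression as a string: markup is dropped, only text remains. */
    static String asString(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (String) xpath.evaluate(expression, doc, XPathConstants.STRING);
    }

    /** Evaluates the expression as a node: the element and its subtree stay available. */
    static Node asNode(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (Node) xpath.evaluate(expression, doc, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XHTML);
        String expression = "//div[@class='content_wrapper']";

        // Text only, no tags.
        System.out.println(asString(doc, expression));

        // The element itself, which can be walked or serialized later.
        Node div = asNode(doc, expression);
        System.out.println(div.getNodeName() + " with "
            + div.getChildNodes().getLength() + " child node(s)");
    }
}
```

So the expression itself can stay the same; what matters is the return type you request and what you do with the node afterwards.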

NodeSequence nodes = (NodeSequence)XPathAPI.eval(); 

Then walk through the nodes and dump whatever you want from them, or convert them into a new DOM document, for example.
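Walking the nodes could look roughly like the sketch below. `NodeWalker`, `walk` and `outline` are hypothetical names, and the output format (one indented line per element or non-blank text node) is just one way to dump the tree. Note that the loop condition is `i < children.getLength()`; with `>` the loop body never runs.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NodeWalker {
    /** Appends one line per element or non-blank text node, indented by depth. */
    static void walk(Node node, int depth, StringBuilder out) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append("  ");
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append(indent).append('<').append(node.getNodeName()).append(">\n");
        } else if (node.getNodeType() == Node.TEXT_NODE
                && !node.getNodeValue().trim().isEmpty()) {
            out.append(indent).append("text: ")
               .append(node.getNodeValue().trim()).append('\n');
        }
        // Recurse into the children; text nodes simply have none.
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), depth + 1, out);
        }
    }

    /** Parses the given XML string and returns the outline of its root element. */
    static String outline(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        walk(doc.getDocumentElement(), 0, out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample input, not the page from the question.
        System.out.print(outline("<p>Hello <strong>world</strong></p>"));
    }
}
```

Element names come from `getNodeName()`, while the text lives in separate text nodes whose value comes from `getNodeValue()`; for element nodes `getNodeValue()` is null.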

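P.S. Xalan is fine, but modern Java ships with JAXP. For the sake of portable code and knowledge, I would suggest using JAXP (unless Xalan's extensions are required or useful):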

XPathFactory factory = XPathFactory.newInstance(); 
XPath xpathCompiled = factory.newXPath(); 
XPathExpression expr = xpathCompiled.compile(xpath); 

NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET); 

Then, to convert the result into a string (which is apparently what you want):

StringWriter sw = new StringWriter(); 
Transformer serializer = TransformerFactory.newInstance().newTransformer(); 
serializer.transform(new DOMSource(nodes.item(0)), new StreamResult(sw)); 
String result = sw.toString(); 

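Note that this serializes only the first node in the NodeList, because an XML document must have a single root element. In your case that is fine, if I understand you correctly; otherwise you would need to add a top-level element around the node set.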
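If the expression can match several nodes, one way to add that top-level element is to import the matches into a fresh document under a wrapper. This is only a sketch: `NodeSetSerializer`, its helper methods and the wrapper name are hypothetical, and it sets `OMIT_XML_DECLARATION` so the result is a bare fragment rather than a full document with an XML header.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NodeSetSerializer {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    /** Evaluates the XPath expression and returns all matching nodes. */
    static NodeList select(Document doc, String expression) throws Exception {
        return (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(expression, doc, XPathConstants.NODESET);
    }

    /** Serializes one node, markup included, without the XML declaration. */
    static String serialize(Node node) throws Exception {
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter sw = new StringWriter();
        serializer.transform(new DOMSource(node), new StreamResult(sw));
        return sw.toString();
    }

    /** Copies every matched node under a new root element and serializes the result. */
    static String wrapAndSerialize(NodeList nodes, String rootName) throws Exception {
        Document out = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = out.createElement(rootName);
        out.appendChild(root);
        for (int i = 0; i < nodes.getLength(); i++) {
            // importNode makes a deep copy owned by the new document.
            root.appendChild(out.importNode(nodes.item(i), true));
        }
        return serialize(out);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample input, not the page from the question.
        Document doc = parse("<div><p>one</p><p>two</p></div>");
        System.out.println(serialize(select(doc, "/div").item(0)));
        System.out.println(wrapAndSerialize(select(doc, "//p"), "wrapper"));
    }
}
```

For the question as asked, selecting the content_wrapper div and passing that single node to `serialize` should be enough; the wrapper is only needed when the match is a list of siblings.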

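@yamburg .. thanks for the suggestion. Walking through the node list gives me node names and their corresponding values. The node name is usually td, which is not what I expect. Rebuilding the content in its exact format would become a bit tedious. Maybe I am missing something here. I have added sample code to the question. – Shamik 2011-04-27 20:48:35

Updated. Please formulate your wishes more precisely. ;) – 2011-04-28 02:32:03

@yamburg ... thanks a lot, man, I see the problem now. I appreciate your help. – Shamik 2011-04-28 18:04:35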