How do I get the HTML DOM path from text content?

Given an HTML file:

<html> 
    <body> 
     <div class="main"> 
      <p id="tID">content</p> 
     </div> 
    </body> 
</html>

I have the string "content",

and I want to use "content" to get the HTML DOM path of the element that contains it:

html body div.main p#tID

Chrome Developer Tools has this feature (Elements tab, bottom bar). How can I do the same thing in Java?

Thanks for your help :)

Source

2010-09-04 Koerr

Do you mean Java or JavaScript? – aularon 2010-09-04 01:14:33

Java, not JavaScript – Koerr 2010-09-04 01:36:18

Have fun :)

Java code

import java.io.File; 

import javax.xml.xpath.XPath; 
import javax.xml.xpath.XPathConstants; 
import javax.xml.xpath.XPathFactory; 

import org.htmlcleaner.CleanerProperties; 
import org.htmlcleaner.DomSerializer; 
import org.htmlcleaner.HtmlCleaner; 
import org.htmlcleaner.TagNode; 
import org.w3c.dom.Document; 
import org.w3c.dom.NamedNodeMap; 
import org.w3c.dom.Node; 



public class Teste { 

    public static void main(String[] args) { 
     try { 
      // read and clean document 
      TagNode tagNode = new HtmlCleaner().clean(new File("test.xml")); 
      Document document = new DomSerializer(new CleanerProperties()).createDOM(tagNode); 

      // use XPath to find target node 
      XPath xpath = XPathFactory.newInstance().newXPath(); 
      Node node = (Node) xpath.evaluate("//*[text()='content']", document, XPathConstants.NODE); 

      // assemble a jQuery/CSS-style selector by walking from the match up to the root element 
      String result = ""; 
      while (node != null && node.getParentNode() != null) { 
       result = readPath(node) + " " + result; 
       node = node.getParentNode(); 
      } 
      System.out.println(result); 
      // prints: html body div#myDiv.foo.bar p#tID 

     } catch (Exception e) { 
      e.printStackTrace(); 
     } 
    } 

    // Gets id and class attributes of this node 
    private static String readPath(Node node) { 
     NamedNodeMap attributes = node.getAttributes(); 
     String id = readAttribute(attributes.getNamedItem("id"), "#"); 
     String clazz = readAttribute(attributes.getNamedItem("class"), "."); 
     return node.getNodeName() + id + clazz; 
    } 

    // Read attribute 
    private static String readAttribute(Node node, String token) { 
     String result = ""; 
     if(node != null) { 
      // collapse any run of whitespace so "foo  bar" becomes ".foo.bar", not ".foo..bar" 
      result = token + node.getTextContent().trim().replaceAll("\\s+", token); 
     } 
     return result; 
    } 

}

Example XML (test.xml)

<html> 
    <body> 
     <br> 
     <div id="myDiv" class="foo bar"> 
      <p id="tID">content</p> 
     </div> 
    </body> 
</html>


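Explanation

The document object holds the parsed XML.
The XPath expression //*[text()='content'] matches every element that has a text node exactly equal to 'content'; with XPathConstants.NODE, the first match is returned.
The while loop walks from that node up to the root element, reading the id and class of each element along the way.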
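
The steps above can be sketched with the JDK alone. This is a minimal sketch, not the answer's code: it swaps HtmlCleaner for the JDK's strict XML parser, so it only accepts well-formed markup. The class name DomPathSketch and the methods describe and pathFor are hypothetical names chosen for illustration, and the search text is bound as an XPath variable (here called $needle) so that quotes in it cannot break the expression.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; assumes the input is already well-formed XML/XHTML.
class DomPathSketch {

    // Builds "tag#id.class1.class2" for a single element.
    static String describe(Element element) {
        StringBuilder part = new StringBuilder(element.getNodeName());
        String id = element.getAttribute("id").trim();
        if (!id.isEmpty()) {
            part.append('#').append(id);
        }
        String classes = element.getAttribute("class").trim();
        if (!classes.isEmpty()) {
            // Split on any run of whitespace so extra spaces do not produce "..".
            for (String name : classes.split("\\s+")) {
                part.append('.').append(name);
            }
        }
        return part.toString();
    }

    // Returns the path of the first element with a text node equal to `text`,
    // or null when nothing matches.
    static String pathFor(String wellFormedMarkup, String text) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedMarkup)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Every XPath variable resolves to the search text; only $needle is used.
        xpath.setXPathVariableResolver(name -> text);
        Node match = (Node) xpath.evaluate("//*[text()=$needle]", document, XPathConstants.NODE);
        if (match == null) {
            return null;
        }

        // Walk upwards; the loop stops at the Document node, which is not an Element.
        Deque<String> parts = new ArrayDeque<>();
        for (Node current = match; current instanceof Element; current = current.getParentNode()) {
            parts.addFirst(describe((Element) current));
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) throws Exception {
        String markup = "<html><body><div class=\"main\"><p id=\"tID\">content</p></div></body></html>";
        System.out.println(pathFor(markup, "content")); // prints: html body div.main p#tID
    }
}
```

Collecting the parts in a Deque and joining them avoids the trailing space that string concatenation in a loop leaves behind.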

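More explanation

This new version of the solution uses HtmlCleaner. So, for example, the input can contain <br>, and the cleaner will replace it with <br/>.
To use HtmlCleaner, just download the latest jar.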
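
To see why a cleaning step is needed at all, here is a minimal sketch that feeds the same markup to the JDK's strict XML parser with and without the unclosed <br>. The class name StrictParserCheck and the method isWellFormed are hypothetical names for illustration; only standard JAXP calls are used.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class showing how a strict XML parser treats HTML-style markup.
class StrictParserCheck {

    // Returns true if the JDK's XML parser accepts the markup as well-formed.
    static boolean isWellFormed(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXParseException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // An unclosed <br> is valid HTML but not well-formed XML.
        System.out.println(isWellFormed("<html><body><br><p>content</p></body></html>"));  // prints: false
        // The self-closed form, which a cleaner would produce, parses fine.
        System.out.println(isWellFormed("<html><body><br/><p>content</p></body></html>")); // prints: true
    }
}
```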

Source

2010-09-04 04:21:24 Topera

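But it is not an XML document. If there is a '<br>' or any other tag without an end tag, it cannot be parsed: 'org.xml.sax.SAXParseException: The element type "br" must be terminated by the matching end-tag "</br>".' – Koerr 2010-09-04 12:08:17

Finding the node and then looping over its parents: this solution works well. Thanks :) – Koerr 2010-09-04 12:13:28

I edited my answer so that it handles malformed XML. Take a look. – Topera 2010-09-04 14:12:30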
