Checking for the presence of HTML tags using Jsoup

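With Jsoup it is easy to count how many times a particular tag occurs in a text. For example, I am trying to see how many times the anchor tag appears in a given text.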

String content = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(content);
Elements links = doc.select("a[href]"); // a with href
System.out.println(links.size());

This gives me a count of 4. If I have a sentence and want to know whether it contains any HTML tags at all, is that possible with Jsoup? Thanks.


2013-02-15 Rushdi Shams

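You are probably better off with a regular expression, but if you really want to use JSoup, you can try matching all elements and then subtracting 4, because JSoup automatically adds four elements: the root (document) element, followed by the <html>, <head> and <body> elements.

This might loosely look like: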

// attempt to count html elements in string - incorrect code, see below 
public static int countHtmlElements(String content) { 
    Document doc = Jsoup.parse(content); 
    Elements elements = doc.select("*"); 
    return elements.size()-4; 
}

However, this gives a wrong result if the text itself contains <html>, <head> or <body>; compare the results of:

// gives a correct count of 2 html elements 
System.out.println(countHtmlElements("some <b>text</b> with <i>markup</i>")); 
// incorrectly counts 0 elements, as the body is subtracted 
System.out.println(countHtmlElements("<body>this gives a wrong result</body>"));

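So to get this right, you would have to check for the "magic" tags separately; that is why I feel a regular expression might be simpler.

More failed attempts to make this work: using parseBodyFragment instead of parse does not help, because JSoup sanitizes the input in the same way. Likewise, counting with doc.select("body *"); saves you the trouble of subtracting 4, but it still yields a wrong count if a <body> tag is involved. This approach can only work under the restriction that your application is certain the strings being checked contain no <html>, <head> or <body> elements.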
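The regular-expression alternative mentioned above could look roughly like the following. This is a minimal sketch, not a full HTML parser: the class name `HtmlTagCounter`, its method names and the pattern itself are assumptions made for illustration. It counts only opening (or self-closing) tags, so `<body>` is counted like any other tag, and it can be fooled by text such as `x <y and z> w` or by attribute values that contain `>`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Jsoup): regex-based tag detection using only java.util.regex.
final class HtmlTagCounter {
    // '<', then an ASCII letter, then any characters except angle brackets, then '>'.
    // Closing tags such as </b> are skipped because '/' is not a letter.
    private static final Pattern OPENING_TAG = Pattern.compile("<[A-Za-z][^<>]*>");

    // Counts opening or self-closing tags, i.e. roughly one per element.
    static int countOpeningTags(String text) {
        Matcher matcher = OPENING_TAG.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // True if the text contains at least one opening or self-closing tag.
    static boolean containsTag(String text) {
        return OPENING_TAG.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(countOpeningTags("some <b>text</b> with <i>markup</i>"));    // 2
        System.out.println(countOpeningTags("<body>this gives a wrong result</body>")); // 1
        System.out.println(containsTag("plain text, 1 < 2"));                           // false
    }
}
```

Because no document skeleton is added, there is nothing to subtract, which is what makes this simpler than the Jsoup-based count for a plain yes/no check.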


2013-02-15 22:30:53

Thanks. doc.select("*") worked for me, because my HTML does not contain the tags you mentioned. But yes, I realize a regular expression would solve this problem better. – 2013-02-18 18:26:47
