Jsoup tagName() returns the wrong tag

I have the following piece of HTML:

<p>       
    <a href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959" rel="nofollow"> Jill Martin rescues Savannah Guthrie from her guest room mess </a> 
    <a href="http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678" rel="nofollow"> 4 simple ways to clear your clutter this year </a> 
    <a href="http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814" rel="nofollow"> Staying home on New Year's Eve? Great ideas to celebrate at home </a> 
    <a href="http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749" rel="nofollow"> Here's how to set a functional Christmas table </a>  
    </p>

This snippet comes from the web page http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861

And here is the code:

Document document = Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get();
String tag = null;
for (Element element : document.select("*")) {
    tag = element.tagName();

    if ("a".equalsIgnoreCase(tag)) {
        LOGGER.info("element : {}; nextElementSibling: {}", element.ownText(), element.nextElementSibling());
    }

    if (StringUtils.containsIgnoreCase(element.ownText(), "Jill Martin rescues Savannah")) {
        LOGGER.info("element : {}; nextElementSibling: {}", element.ownText(), element.nextElementSibling());
        LOGGER.info("tag : {}; nextNodeSibling: {}", tag, element.nextSibling());
        LOGGER.info("element : {}; previousElementSibling: {}", element.ownText(), element.previousElementSibling());
    }
}

The output I get:

element : Jill Martin rescues Savannah Guthrie from her guest room mess; nextElementSibling: null 
    tag : h2; nextNodeSibling: 
    element : Jill Martin rescues Savannah Guthrie from her guest room mess; previousElementSibling: null

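There are several problems:

The main HTML source has many elements tagged a, but none of them come from the small piece of HTML I checked.
It looks like the <a> is captured as <h2>.
element.nextElementSibling() is null in most cases.

However, if I test against the small piece on its own, the problems disappear. So it looks like Jsoup fails to recognize tags correctly when they appear inside a larger HTML source.

Any ideas why?

Thanks.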

EDIT 2

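The purpose of the exercise is to clean up the web page. That is why I traverse the whole HTML rather than a specific part, as @Stephan suggested. I only picked out one specific part that looked problematic.

But after checking @luksch's response, I went back to the original HTML and found the anomaly that the snippet came from. The code looks at all tags but makes an exception for a. In the main source we have article followed by a, figure (containing i, img, img, small, small), h2. The problem seems to be that all tags (except a) are removed (working as required), but their text is left behind. That is why I ended up with something that is not the original HTML source.

"Jill Martin rescues Savannah Guthrie from her guest room mess" is the <h2> text, but the <h2> is removed, leaving its text behind. Interestingly, Jsoup still considers the text to come from h2, even though the final output has no h2.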
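The effect described above can be sketched as follows. This is only an illustration under assumptions: the actual cleanup code is not shown in this thread, so the sketch assumes it behaves like Jsoup's `remove()` for the figure and `unwrap()` for the h2. The class name `UnwrapDemo` and the shortened markup are made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class UnwrapDemo {
    // Shortened version of the markup on the page (most of the figure content omitted).
    static final String HTML =
          "<a class=\"player-tease-link\" href=\"http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959\">"
        + "<figure class=\"player-tease\"><small class=\"tease-playing\">Now Playing</small></figure>"
        + "<h2 class=\"player-tease-headline\">Jill Martin rescues Savannah Guthrie from her guest room mess</h2>"
        + "</a>";

    // Assumed cleanup: drop the figure entirely, strip the h2 tag but keep its text.
    static Element cleanedLink() {
        Document doc = Jsoup.parse(HTML);
        doc.select("figure").remove();
        doc.select("h2").unwrap();
        return doc.select("a").first();
    }

    public static void main(String[] args) {
        Element before = Jsoup.parse(HTML).select("a").first();
        Element after = cleanedLink();
        // Before the cleanup the headline belongs to the h2, not to the a.
        System.out.println("before: a.ownText() = '" + before.ownText() + "'");
        // After the cleanup the headline is a direct text child of the a.
        System.out.println("after : a.ownText() = '" + after.ownText() + "'");
        System.out.println("after : " + after.outerHtml());
    }
}
```

If the loop from the question runs over the freshly parsed document while the inspected output is the cleaned one, the logged tag would still be h2 even though the cleaned HTML shows the text directly inside the a.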


2016-02-05 Mugoma J. Okomba

The snippet is part of a larger piece of code. The original link is "http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861". So the larger document should be 'Document doc = Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares- tips-living-luxury-less-t70861").get();'

The URL gives me a 404 – luksch

@luksch, something went wrong when I copy-pasted. This is the call: Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get();. The word after 'living' is 'luxury', but the copy-paste got it wrong.

I think the selector needs to be more specific.

Instead of document.select("*"), try document.select("a").


2016-02-05 06:42:03 Stephan

I cannot reproduce this. The following program prints exactly what you would expect:

String html = "" 
     +"<p>" 
     +" <a href=\"http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959\" rel=\"nofollow\"> Jill Martin rescues Savannah Guthrie from her guest room mess </a> " 
     +" <a href=\"http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678\" rel=\"nofollow\"> 4 simple ways to clear your clutter this year </a> " 
     +" <a href=\"http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814\" rel=\"nofollow\"> Staying home on New Year's Eve? Great ideas to celebrate at home </a> " 
     +" <a href=\"http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749\" rel=\"nofollow\"> Here's how to set a functional Christmas table </a> " 
     +"</p>"; 

Document doc = Jsoup.parse(html); 

String tag = null;
for (Element element : doc.select("*")) {
    tag = element.tagName();

    if ("a".equalsIgnoreCase(tag)) {
        System.out.println("element : " + element.ownText() + "; nextElementSibling: " + element.nextElementSibling());
    }
    if (StringUtils.containsIgnoreCase(element.ownText(), "Jill Martin rescues Savannah")) {
        System.out.println("element : " + element.ownText() + "; nextElementSibling: " + element.nextElementSibling());
        System.out.println("tag : " + tag + "; nextNodeSibling: " + element.nextSibling());
        System.out.println("element : " + element.ownText() + "; previousElementSibling: " + element.previousElementSibling());
    }
}

The result is:

element : Jill Martin rescues Savannah Guthrie from her guest room mess; nextElementSibling: <a href="http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678" rel="nofollow"> 4 simple ways to clear your clutter this year </a> 
tag : a; nextNodeSibling: 
element : Jill Martin rescues Savannah Guthrie from her guest room mess; previousElementSibling: null 
element : 4 simple ways to clear your clutter this year; nextElementSibling: <a href="http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814" rel="nofollow"> Staying home on New Year's Eve? Great ideas to celebrate at home </a> 
element : Staying home on New Year's Eve? Great ideas to celebrate at home; nextElementSibling: <a href="http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749" rel="nofollow"> Here's how to set a functional Christmas table </a> 
element : Here's how to set a functional Christmas table; nextElementSibling: null

Maybe you are using the wrong JSoup version? The above was run with version 1.8.3.


2016-02-05 10:38:15 luksch

This code is part of a larger piece of code. I just extracted the part that I think does not work. In general, I am trying to parse the content at 'http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861' (which contains the snippet I posted). Instead of 'Document doc = Jsoup.parse(html);' try 'Document doc = Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living- luxury-less-t70861").get();'

The previous copy-paste was broken. The correct call is 'Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get();'

The URL you gave contains this element:

<a class="player-tease-link" href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959"> 
<figure class="player-tease"> 
    <i class="player-tease-icon icon-video-play"></i> 
    <img class="tease-icon-play" src="http://nodeassets.today.com/img/svg/641a740d.video-play-white.svg" alt="Play"> 
    <img class="tease-image" src="http://media1.s-nbcnews.com/j/MSNBC/Components/Video/__NEW/tdy_guth_clutter_160120.today-vid-post-small-desktop.jpg" title="Jill Martin rescues Savannah Guthrie from her guest room mess" alt="Jill Martin rescues Savannah Guthrie from her guest room mess"> 
    <small class="tease-sponsored">Sponsored Content</small> 
    <small class="tease-playing">Now Playing</small> 
</figure> 
<h2 class="player-tease-headline">Jill Martin rescues Savannah Guthrie from her guest room mess</h2> 
</a>

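So it seems you are comparing apples and oranges: the HTML fragment you posted is not part of the original HTML. I guess you used some tool for the extraction that altered the HTML. Note that the a element does not contain any text of its own!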
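To make the point about own text concrete, here is a minimal sketch. It uses a shortened copy of the element above; the class name `OwnTextDemo` and the helper `tagOwning` are made up for illustration. It repeats the loop from the question and reports which tag actually owns the headline text.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class OwnTextDemo {
    // Shortened copy of the element from the page: the headline lives in the h2, not in the a.
    static final String PAGE =
          "<a class=\"player-tease-link\" href=\"http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959\">"
        + "<figure class=\"player-tease\"><small class=\"tease-playing\">Now Playing</small></figure>"
        + "<h2 class=\"player-tease-headline\">Jill Martin rescues Savannah Guthrie from her guest room mess</h2>"
        + "</a>";

    // Shape of the fragment posted in the question: the text sits directly in the a.
    static final String SNIPPET =
          "<p><a href=\"http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959\" rel=\"nofollow\">"
        + " Jill Martin rescues Savannah Guthrie from her guest room mess </a></p>";

    // Returns the tag name of the first element whose own text contains the needle, or null.
    static String tagOwning(String html, String needle) {
        for (Element element : Jsoup.parse(html).select("*")) {
            if (element.ownText().contains(needle)) {
                return element.tagName();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Element a = Jsoup.parse(PAGE).select("a").first();
        // ownText() covers only direct text children; text() also includes descendants.
        System.out.println("a.ownText() = '" + a.ownText() + "'");
        System.out.println("a.text()    = '" + a.text() + "'");
        System.out.println("page owner    = " + tagOwning(PAGE, "Jill Martin rescues Savannah"));
        System.out.println("snippet owner = " + tagOwning(SNIPPET, "Jill Martin rescues Savannah"));
    }
}
```

With the page markup the owner should come out as h2, and with the posted fragment as a, which is consistent with the two different outputs seen in this thread.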

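It would be a good idea to follow @Stephan's advice and learn how to use CSS selectors properly. That should be more efficient than selecting everything and then filtering manually in program code. Here is an example of what you can do: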

Elements interestingAs = document.select("a:matches(^Jill Martin)");

This selects all a elements whose text starts with "Jill Martin".
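One thing worth checking when applying this to the full page: :matches tests the element's combined text, descendants included, so an a that wraps a figure caption before the headline may not start with "Jill Martin". A minimal sketch with shortened markup follows; the class name `SelectorDemo` and the helper `count` are made up, and the `a:has(h2:matches(...))` variant is an addition for illustration rather than part of the suggestion above.

```java
import org.jsoup.Jsoup;

public class SelectorDemo {
    // Shape of the fragment from the question: text directly inside each a.
    static final String SNIPPET =
          "<p>"
        + "<a href=\"http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959\"> Jill Martin rescues Savannah Guthrie from her guest room mess </a> "
        + "<a href=\"http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678\"> 4 simple ways to clear your clutter this year </a>"
        + "</p>";

    // Shape of the element on the real page: caption first, headline inside an h2.
    static final String PAGE =
          "<a class=\"player-tease-link\" href=\"http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959\">"
        + "<figure class=\"player-tease\"><small class=\"tease-playing\">Now Playing</small></figure>"
        + "<h2 class=\"player-tease-headline\">Jill Martin rescues Savannah Guthrie from her guest room mess</h2>"
        + "</a>";

    // Number of elements in the parsed html that match the CSS query.
    static int count(String html, String css) {
        return Jsoup.parse(html).select(css).size();
    }

    public static void main(String[] args) {
        String byText = "a:matches(^Jill Martin)";
        String byHeadline = "a:has(h2:matches(^Jill Martin))";
        System.out.println("snippet, " + byText + " -> " + count(SNIPPET, byText));
        System.out.println("page,    " + byText + " -> " + count(PAGE, byText));
        System.out.println("page,    " + byHeadline + " -> " + count(PAGE, byHeadline));
    }
}
```

On the fragment the plain :matches query should find the first link, while on the page-shaped markup the combined text begins with the caption, so anchoring the match on the h2 via :has is one way to reach the same link.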


2016-02-06 13:18:39 luksch

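I reviewed the HTML source, compared it with the final output, and found the anomaly. In short, some tags were removed but their "text" was left behind. If the parent was not removed, the leftover text gets assigned to that tag (the parent). We end up with output in which the text is attributed to the wrong tag.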
