2016-02-05 27 views
1

Jsoup tagName() gives the wrong tag

I have the following piece of HTML:

<p>       
    <a href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959" rel="nofollow"> Jill Martin rescues Savannah Guthrie from her guest room mess </a> 
    <a href="http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678" rel="nofollow"> 4 simple ways to clear your clutter this year </a> 
    <a href="http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814" rel="nofollow"> Staying home on New Year's Eve? Great ideas to celebrate at home </a> 
    <a href="http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749" rel="nofollow"> Here's how to set a functional Christmas table </a>  
    </p>       

This piece is taken from the web page http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861

And here is a piece of code:

Document document = Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get();
String tag = null;
for (Element element : document.select("*")) {
    tag = element.tagName();

    if ("a".equalsIgnoreCase(tag)) {
        LOGGER.info("element : {}; nextElementSibling: {}", element.ownText(), element.nextElementSibling());
    }

    if (StringUtils.containsIgnoreCase(element.ownText(), "Jill Martin rescues Savannah")) {
        LOGGER.info("element : {}; nextElementSibling: {}", element.ownText(), element.nextElementSibling());
        LOGGER.info("tag : {}; nextNodeSibling: {}", tag, element.nextSibling());
        LOGGER.info("element : {}; previousElementSibling: {}", element.ownText(), element.previousElementSibling());
    }
}

The output I get:

element : Jill Martin rescues Savannah Guthrie from her guest room mess; nextElementSibling: null 
    tag : h2; nextNodeSibling: 
    element : Jill Martin rescues Savannah Guthrie from her guest room mess; previousElementSibling: null 

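There are a number of problems:

  1. The main HTML source yields many elements tagged a, but none of them come from the small piece of HTML I am checking.
  2. It looks like <a> is captured as <h2>.
  3. element.nextElementSibling() is null in most cases.

However, if I test against the small piece on its own, the problems disappear. So it looks like Jsoup cannot identify tags correctly when they appear inside a larger HTML source.

Any idea why?

Thanks.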

EDIT 2

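The intention behind the exercise is to clean up a web page. That is why I traverse the entire HTML rather than a specific part, as @Stephan suggested. I only picked out one specific part that seemed to be problematic.

But after examining @luksch's response, I went back to the original HTML and found the anomaly in the place the snippet was taken from. The code looks at all tags across the board but makes an exception for a. In the main source we have article, followed by a, figure (containing i, img, img, small, small) and h2. The problem seems to be that all tags (except a) are removed (which works as required), but their text is left behind. That is why I ended up with HTML that is not in the original source.

"Jill Martin rescues Savannah Guthrie from her guest room mess" is the <h2> text, but the <h2> is removed, leaving its text behind. Interestingly, Jsoup still thinks the text comes from h2, even though the final output has no h2.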
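The effect described here can be reproduced in isolation. The following is only a sketch: it assumes the clean-up step behaves like Jsoup's unwrap() (the actual clean-up code is not shown in the question), and the class name UnwrapDemo, the helper linkOwnText and the shortened href are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class UnwrapDemo {
    // Hypothetical, shortened markup: an <a> whose only text lives in a nested <h2>.
    static final String HTML =
        "<a href=\"/video/1\"><h2>Jill Martin rescues Savannah Guthrie from her guest room mess</h2></a>";

    /** Returns the link's own text, optionally after unwrapping the nested <h2>. */
    static String linkOwnText(boolean unwrapHeadline) {
        Document doc = Jsoup.parse(HTML);
        Element link = doc.select("a").first();
        if (unwrapHeadline) {
            // unwrap() removes the <h2> tag itself but moves its children
            // (here: a single text node) up into the parent <a>.
            doc.select("h2").first().unwrap();
        }
        return link.ownText();
    }

    public static void main(String[] args) {
        System.out.println("before: '" + linkOwnText(false) + "'");
        System.out.println("after : '" + linkOwnText(true) + "'");
    }
}
```

Before the unwrap the a has no text of its own; afterwards the headline has become the a's own text, which would explain snippets that do not exist in the original source.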

+0

The snippet is part of a larger program. The original link is "http:// www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861", so the larger document should be 'Document doc = Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares- tips-living-luxury-less-t70861").get();'

+0

The URL gives me a 404 – luksch

+0

@luksch, it got garbled when I copy-pasted it. This is the call: Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get(); The word after 'living' is 'luxury', but the copy-paste mangled it.

Answers

0

I think the selector needs to be more specific.

Instead of document.select("*"), try document.select("a").

0

I cannot reproduce this. The following program prints exactly what you would expect:

String html = "" 
     +"<p>" 
     +" <a href=\"http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959\" rel=\"nofollow\"> Jill Martin rescues Savannah Guthrie from her guest room mess </a> " 
     +" <a href=\"http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678\" rel=\"nofollow\"> 4 simple ways to clear your clutter this year </a> " 
     +" <a href=\"http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814\" rel=\"nofollow\"> Staying home on New Year's Eve? Great ideas to celebrate at home </a> " 
     +" <a href=\"http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749\" rel=\"nofollow\"> Here's how to set a functional Christmas table </a> " 
     +"</p>"; 

Document doc = Jsoup.parse(html); 

String tag = null; 
for (Element element : doc.select("*")) { 
    tag = element.tagName(); 

    if ("a".equalsIgnoreCase(tag)) { 
     System.out.println("element : "+element.ownText()+"; nextElementSibling: "+element.nextElementSibling()+""); 

    } 
    if (StringUtils.containsIgnoreCase(element.ownText(), "Jill Martin rescues Savannah")) { 
     System.out.println("element : "+element.ownText()+"; nextElementSibling: "+element.nextElementSibling()+""); 
     System.out.println("tag : "+tag+"; nextNodeSibling: "+element.nextSibling()+""); 
     System.out.println("element : "+element.ownText()+"; previousElementSibling: "+element.previousElementSibling()+""); 
    } 
} 

The result is:

element : Jill Martin rescues Savannah Guthrie from her guest room mess; nextElementSibling: <a href="http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678" rel="nofollow"> 4 simple ways to clear your clutter this year </a> 
tag : a; nextNodeSibling: 
element : Jill Martin rescues Savannah Guthrie from her guest room mess; previousElementSibling: null 
element : 4 simple ways to clear your clutter this year; nextElementSibling: <a href="http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814" rel="nofollow"> Staying home on New Year's Eve? Great ideas to celebrate at home </a> 
element : Staying home on New Year's Eve? Great ideas to celebrate at home; nextElementSibling: <a href="http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749" rel="nofollow"> Here's how to set a functional Christmas table </a> 
element : Here's how to set a functional Christmas table; nextElementSibling: null 

Maybe you are using a buggy Jsoup version? The above was run with version 1.8.3.

+0

This code is part of a larger program. I only extracted the part that I think is not working. In general, I am trying to parse the content at 'http:// www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861' (which contains the snippet I posted). Instead of 'Document doc = Jsoup.parse(html);' try 'Document doc = Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living- luxury-less-t70861").get();'

+0

The previous copy-paste was faulty. The correct call is 'Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get();'

1

The page at the URL you gave contains this element:

<a class="player-tease-link" href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959"> 
<figure class="player-tease"> 
    <i class="player-tease-icon icon-video-play"></i> 
    <img class="tease-icon-play" src="http://nodeassets.today.com/img/svg/641a740d.video-play-white.svg" alt="Play"> 
    <img class="tease-image" src="http://media1.s-nbcnews.com/j/MSNBC/Components/Video/__NEW/tdy_guth_clutter_160120.today-vid-post-small-desktop.jpg" title="Jill Martin rescues Savannah Guthrie from her guest room mess" alt="Jill Martin rescues Savannah Guthrie from her guest room mess"> 
    <small class="tease-sponsored">Sponsored Content</small> 
    <small class="tease-playing">Now Playing</small> 
</figure> 
<h2 class="player-tease-headline">Jill Martin rescues Savannah Guthrie from her guest room mess</h2> 
</a> 

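So it seems you are comparing apples with oranges: the HTML snippet you gave is not part of the original HTML. I guess you extracted it with some tool that altered the HTML. Note that the a element does not contain any text of its own!

A good idea would be to follow @Stephan's advice and learn how to use CSS selectors properly. That should be more efficient than selecting everything and then filtering manually in program code. Here is an example of what you can do: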
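To make the last point concrete, here is a minimal sketch, assuming Jsoup is on the classpath, that parses a trimmed-down copy of the element above and compares ownText() with text(). The class name OwnTextDemo and its helper first are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class OwnTextDemo {
    // Trimmed-down copy of the element above: the headline sits in the <h2>,
    // not directly inside the <a>.
    static final String HTML =
          "<a class=\"player-tease-link\" href=\"http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959\">"
        + "<figure class=\"player-tease\"><small class=\"tease-playing\">Now Playing</small></figure>"
        + "<h2 class=\"player-tease-headline\">Jill Martin rescues Savannah Guthrie from her guest room mess</h2>"
        + "</a>";

    /** Parses the snippet and returns the first element matching the CSS query. */
    static Element first(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).first();
    }

    public static void main(String[] args) {
        Element link = first("a.player-tease-link");
        Element headline = first("h2.player-tease-headline");

        // ownText() only covers text nodes that are direct children of the element.
        System.out.println("a  ownText: '" + link.ownText() + "'");
        System.out.println("h2 ownText: '" + headline.ownText() + "'");
        // text() also includes the text of all descendants.
        System.out.println("a  text   : '" + link.text() + "'");
    }
}
```

A loop over select("*") that filters on ownText() therefore reports the headline for the h2, not for the a, which matches the "tag : h2" line in the question's output.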

Elements interestingAs = document.select("a:matches(^Jill Martin)"); 

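This will select all a elements whose text starts with "Jill Martin". Note that :matches tests the element's combined text, including the text of its descendants.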
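As a rough illustration of how such selectors behave, the sketch below runs them against two kinds of markup: a flat link like the ones in the question's snippet, and a nested link like the one on the real page. The hrefs, the class name SelectorDemo and the helper count are placeholders, and the a:has(h2:matchesOwn(...)) variant is only one possible way to target the nested case, not something suggested in the thread.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectorDemo {
    // Hypothetical markup: two flat links in a <p>, plus one link whose
    // headline is nested in an <h2> after some other text.
    static final String HTML =
          "<p>"
        + "<a href=\"/video/1\"> Jill Martin rescues Savannah Guthrie from her guest room mess </a>"
        + "<a href=\"/video/2\"> 4 simple ways to clear your clutter this year </a>"
        + "</p>"
        + "<a href=\"/video/3\"><figure><small>Now Playing</small></figure>"
        + "<h2>Jill Martin rescues Savannah Guthrie from her guest room mess</h2></a>";

    /** Returns how many elements in the sample markup match the CSS query. */
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Matches only the flat link: the nested link's combined text
        // starts with "Now Playing", so the ^ anchor fails there.
        System.out.println(count("a:matches(^Jill Martin)"));
        // Matches only the nested link, via the own text of its <h2> child.
        System.out.println(count("a:has(h2:matchesOwn(^Jill Martin))"));
        // Every link, regardless of text.
        System.out.println(count("a"));
    }
}
```

Which variant is appropriate depends on whether the headline is the link's own text or belongs to a descendant, so it is worth checking the fetched markup first.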

+0

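I reviewed the HTML source, compared it with the final output, and found the anomaly. In short, some tags are removed but their "text" is left behind. If the parent is not removed, the leftover text is assigned to that tag (the parent). We end up with the wrong tag text in the output.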