2013-04-25 21 views
0

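Retrieve the first innermost table inside a table using a jsoup CSS selector

I was working with HTML tables when I came across a page that contains a table inside a table. I have extracted the first table on the page as follows: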

final Document document = Jsoup.connect("http://www.webdesign.org/html-and-css/tutorials/table-examples.6139.html").get(); 
final Elements tables = document.select("table");  
final Element table = tables.get(0); 

Now I want to use a jsoup CSS selector to extract the first innermost table from the HTML below:

<table cellspacing="0" cellpadding="0"> 
<tbody> 
    <tr> 
    <td id="header_left"><a href="/"> 
    <div id="logo"></div></a> 
    <!-- end logo --></td> 
    <td id="header_center"> 
    <div id="header_menu"> 
    <h2><a href="http://www.templatemonster.com" target="_blank">WEB DESIGN TEMPLATES</a></h2> 
    <p><a href="http://www.templatemonster.com/website-templates.php/?aff=wdl">HTML &amp; CSS Templates</a></p> 
    <p><a href="http://www.templatemonster.com/wordpress-themes.php/?aff=wdl">Wordpress Themes</a></p> 
    <p><a href="http://www.templatemonster.com/prestashop-themes.php/?aff=wdl">PrestaShop Themes</a></p> 
    </div> 
    <!-- end header_nemu --> 
    <div id="header_books"></div> 
    <!-- end header_books --> </td> 
    <td id="header_right"> 
    <div id="search_pic"></div> 
    <!-- end search_pic --> 
    <div id="header_search_div"> 
    <div class="block-search-heading"> 
     SEARCH 
    </div> 
    <form method="get" action="/search.html"> 
     <table> 
     <tbody> 
     <tr> 
     <td colspan="2" class="keyword"><input type="text" id="search-keyword" name="keywords" value="" title=" - Any Keyword(s) - " /></td> 
     </tr> 
     <tr> 
     <td class="category"><select id="category" name="category"> <option value="0" style="font-weight:bold;">- All categories -</option> <option value="-1" style="font-weight:bold;">Website Templates</option><option value="1" style="font-weight: bold; ">Web Design Basics</option><option value="26">&nbsp;&nbsp;Web Design Showcase</option><option value="2">&nbsp;&nbsp;Design Principles</option><option value="108">&nbsp;&nbsp;Typography</option><option value="111">&nbsp;&nbsp;Responsive Design</option><option value="99" style="font-weight: bold; ">CMS</option><option value="102">&nbsp;&nbsp;Drupal</option><option value="103">&nbsp;&nbsp;Joomla</option><option value="100">&nbsp;&nbsp;Wordpress</option><option value="109" style="font-weight: bold; ">Tutorials</option><option value="7">&nbsp;&nbsp;Photoshop</option><option value="97">&nbsp;&nbsp;&nbsp;&nbsp;Editor's Pick</option><option value="60">&nbsp;&nbsp;&nbsp;&nbsp;Photoshop Basics</option><option value="61">&nbsp;&nbsp;&nbsp;&nbsp;Special Effects</option><option value="62">&nbsp;&nbsp;&nbsp;&nbsp;Text Effects</option><option value="63">&nbsp;&nbsp;&nbsp;&nbsp;3D Effects</option><option value="64">&nbsp;&nbsp;&nbsp;&nbsp;Textures &amp; Patterns</option><option value="65">&nbsp;&nbsp;&nbsp;&nbsp;Web Layout</option><option value="66">&nbsp;&nbsp;&nbsp;&nbsp;Drawing Techniques</option><option value="67">&nbsp;&nbsp;&nbsp;&nbsp;Color Management</option><option value="68">&nbsp;&nbsp;&nbsp;&nbsp;Photo Editing</option><option value="69">&nbsp;&nbsp;&nbsp;&nbsp;ImageReady Animation</option><option value="72">&nbsp;&nbsp;&nbsp;&nbsp;Miscellaneous</option><option value="81">&nbsp;&nbsp;&nbsp;&nbsp;Photoshop CS4 Tutorials</option><option value="98">&nbsp;&nbsp;&nbsp;&nbsp;Photoshop CS5 Tutorials</option><option value="105">&nbsp;&nbsp;&nbsp;&nbsp;Photoshop CS6 Tutorials</option><option value="53">&nbsp;&nbsp;Vector Graphics</option><option value="21">&nbsp;&nbsp;HTML and CSS</option><option value="30" style="font-weight: 
bold; ">Miscellaneous</option><option value="50">&nbsp;&nbsp;Interviews</option><option value="104">&nbsp;&nbsp;Inspiration</option><option value="110">&nbsp;&nbsp;Freebies</option></select></td> 
     <td class="submit"><input type="submit" value="" /></td> 
     </tr> 
     </tbody> 
     </table> 
    </form> 
    </div> 
    <!-- end header_search_div --></td> 
    </tr> 
</tbody> 
</table> 

The table I want to get, that is, the first innermost table nested inside the table above, is this one:

<table> 
     <tbody> 
     <tr> 
     <td colspan="2" class="keyword"><input type="text" id="search-keyword" name="keywords" value="" title=" - Any Keyword(s) - " /></td> 
     </tr> 
     <tr> 
     <td class="category"><select id="category" name="category"> <option value="0" style="font-weight:bold;">- All categories -</option> <option value="-1" style="font-weight:bold;">Website Templates</option><option value="1" style="font-weight: bold; ">Web Design Basics</option><option value="26">&nbsp;&nbsp;Web Design Showcase</option><option value="2">&nbsp;&nbsp;Design Principles</option><option value="108">&nbsp;&nbsp;Typography</option><option value="111">&nbsp;&nbsp;Responsive Design</option><option value="99" style="font-weight: bold; ">CMS</option><option value="102">&nbsp;&nbsp;Drupal</option><option value="103">&nbsp;&nbsp;Joomla</option><option value="100">&nbsp;&nbsp;Wordpress</option><option value="109" style="font-weight: bold; ">Tutorials</option><option value="7">&nbsp;&nbsp;Photoshop</option><option value="97">&nbsp;&nbsp;&nbsp;&nbsp;Editor's Pick</option><option value="60">&nbsp;&nbsp;&nbsp;&nbsp;Photoshop Basics</option><option value="61">&nbsp;&nbsp;&nbsp;&nbsp;Special Effects</option><option value="62">&nbsp;&nbsp;&nbsp;&nbsp;Text Effects</option><option value="63">&nbsp;&nbsp;&nbsp;&nbsp;3D Effects</option><option value="64">&nbsp;&nbsp;&nbsp;&nbsp;Textures &amp; Patterns</option><option value="65">&nbsp;&nbsp;&nbsp;&nbsp;Web Layout</option><option value="66">&nbsp;&nbsp;&nbsp;&nbsp;Drawing Techniques</option><option value="67">&nbsp;&nbsp;&nbsp;&nbsp;Color Management</option><option value="68">&nbsp;&nbsp;&nbsp;&nbsp;Photo Editing</option><option value="69">&nbsp;&nbsp;&nbsp;&nbsp;ImageReady Animation</option><option value="72">&nbsp;&nbsp;&nbsp;&nbsp;Miscellaneous</option><option value="81">&nbsp;&nbsp;&nbsp;&nbsp;Photoshop CS4 Tutorials</option><option value="98">&nbsp;&nbsp;&nbsp;&nbsp;Photoshop CS5 Tutorials</option><option value="105">&nbsp;&nbsp;&nbsp;&nbsp;Photoshop CS6 Tutorials</option><option value="53">&nbsp;&nbsp;Vector Graphics</option><option value="21">&nbsp;&nbsp;HTML and CSS</option><option value="30" style="font-weight: 
bold; ">Miscellaneous</option><option value="50">&nbsp;&nbsp;Interviews</option><option value="104">&nbsp;&nbsp;Inspiration</option><option value="110">&nbsp;&nbsp;Freebies</option></select></td> 
     <td class="submit"><input type="submit" value="" /></td> 
     </tr> 
     </tbody> 
     </table> 

I am really stuck on how to do this. Any pointers would be very helpful.

Answers

3

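As far as I know, you cannot select the innermost element with CSS or jsoup selector syntax. Nor can you select this element if the first one does not exist. The jsoup selector syntax is described here: http://jsoup.org/cookbook/extracting-data/selector-syntax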

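jsoup selectors mostly work like CSS, and jsoup adds a set of special pseudo-classes of its own (its documentation calls them pseudo selectors).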

To find the tables with the CSS class "block-search":

Elements elements = doc.select("table.block-search"); 

To find the tables with the CSS class "block-search" that are inside the <table cellspacing="0" cellpadding="0" id="header_tab"> table:

Elements elements = doc.select("table#header_tab table.block-search"); 

To find the first nested table with the "block-search" class inside <table cellspacing="0" cellpadding="0" id="header_tab">:

Element element = doc.select("table#header_tab table.block-search").first(); 
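To try the three selectors above without fetching the live page, a minimal sketch can parse an inline snippet with Jsoup.parse. The markup, the class name SelectorDemo and the title values are made up for illustration; they only mirror the id and class names used in this answer.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectorDemo {

    // Hypothetical markup: an outer table with id "header_tab" wrapping two
    // nested tables with the class "block-search", plus one table with the
    // same class that sits outside of it.
    static final String HTML =
          "<table id=\"header_tab\"><tr><td>"
        + "<table class=\"block-search\" title=\"first\"><tr><td>a</td></tr></table>"
        + "<table class=\"block-search\" title=\"second\"><tr><td>b</td></tr></table>"
        + "</td></tr></table>"
        + "<table class=\"block-search\" title=\"outside\"><tr><td>c</td></tr></table>";

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);

        // Every table carrying the class, wherever it is in the document.
        Elements all = doc.select("table.block-search");

        // Only those nested somewhere below table#header_tab.
        Elements nested = doc.select("table#header_tab table.block-search");

        // The first of the nested matches in document order.
        Element first = nested.first();

        System.out.println(all.size());
        System.out.println(nested.size());
        System.out.println(first.attr("title"));
    }
}
```

The space in "table#header_tab table.block-search" is the descendant combinator, so the nested tables do not have to be direct children of the outer table.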

UPD

Hope this works for you. The key line is current = current.children().select("table").first();

import java.io.IOException; 

import org.jsoup.Jsoup; 
import org.jsoup.nodes.Document; 
import org.jsoup.nodes.Element; 
import org.jsoup.select.Elements; 

public class AppJsoap { 

    public static void main(String... args) throws IOException { 

     Document document = Jsoup 
       .connect(
         "http://www.webdesign.org/html-and-css/tutorials/table-examples.6139.html") 
       .get(); 
     Elements tables = document.select("table table"); 

     System.out.println(tables.size()); 
     for (Element el : tables) { 
      System.out.println(path(el)); 
     } 

     { 
      System.out.println("------"); 
      Element found = null; 
      Element current = tables.get(0); 
      while (current != null) { 
       System.out.println("current = " + path(current)); 
       found = current; 
       current = current.children().select("table").first(); 
      } 
      System.out.println("found = " + path(found)); 
     } 
    } 

    public static String path(Element el) { 
     String path = el.parent() != null ? path(el.parent()) : ""; 
     path += el.nodeName() + "[" + el.siblingIndex() + "] "; 
     return path; 
    } 
} 

Output

31 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[0] td[1] table[1] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[0] td[1] table[1] tbody[1] tr[0] td[5] div[4] form[3] table[1] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[3] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[7] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[11] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[15] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[19] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[23] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[27] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[31] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[35] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[39] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[43] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[47] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[51] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[55] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[59] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[63] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[67] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[71] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[75] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[79] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[83] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[87] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[14] table[1] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[22] table[1] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[5] div[1] div[1] div[3] form[1] table[1] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[7] div[2] div[2] div[2] div[3] table[1] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[4] td[3] table[25] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[4] td[3] table[29] 
------ 
current = #document[0] html[1] body[2] div[7] table[1] tbody[1] tr[0] td[1] table[1] 
current = #document[0] html[1] body[2] div[7] table[1] tbody[1] tr[0] td[1] table[1] tbody[1] tr[0] td[5] div[4] form[3] table[1] 
found = #document[0] html[1] body[2] div[7] table[1] tbody[1] tr[0] td[1] table[1] tbody[1] tr[0] td[5] div[4] form[3] table[1] 
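The loop above searches from current.children() rather than from current itself. As far as I understand jsoup's Element.select, the matches may include the starting element, so selecting "table" directly on a table would hand back that same table and the loop would never move down. A small sketch on made-up markup (the class name SelfMatchDemo and the id values are hypothetical) shows the difference:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelfMatchDemo {

    // Hypothetical two-level nesting; the ids exist only so the two
    // results are easy to tell apart.
    static final String HTML =
          "<table id=\"outer\"><tr><td>"
        + "<table id=\"inner\"><tr><td>x</td></tr></table>"
        + "</td></tr></table>";

    // Selecting on the element itself: the element is a candidate match.
    static String viaSelect(Element table) {
        return table.select("table").first().id();
    }

    // Selecting on the children: the element itself is skipped.
    static String viaChildren(Element table) {
        return table.children().select("table").first().id();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        Element outer = doc.getElementById("outer");
        System.out.println(viaSelect(outer));
        System.out.println(viaChildren(outer));
    }
}
```

On this snippet the first call reports the outer table and the second reports the inner one, which is why the descent uses children().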
+0

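I don't want to search by the table's class or id, because those can differ from URL to URL. I want it to be generic, so that it gives me the innermost table. – 2013-04-25 06:07:46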

+0

Edited the code. The tables no longer contain any class. – 2013-04-25 06:13:23

+0

Then I think you need a 'while' loop with 'currentTable = currentTable.select("table").first()' – Vitaly 2013-04-25 06:16:40

0

After a while of trial and error, I finally found the answer. Here is the code:

Document document = Jsoup.connect("http://www.webdesign.org/html-and-css/tutorials/table-examples.6139.html").get(); 
Elements tables = document.select("table");  
Element table = tables.get(0); 

// Checks if a table contains table inside it 
while(! table.select(":has(table)").isEmpty()){ 
    table = table.select("table table").first(); 
} 

It retrieves the first innermost table inside the table.
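For experimenting offline, the same idea can be wrapped in a small method and run against an inline snippet. The sketch below is not the code from this answer: it descends with children().select("table"), and it also tries a single selector, "table:not(:has(table))", which should match tables that contain no other table (jsoup's selector documentation lists both :not(selector) and :has(selector)). The markup, the class name InnermostTableDemo and the method names are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class InnermostTableDemo {

    // Hypothetical markup: three levels of nesting, followed by a sibling
    // table that is nested less deeply.
    static final String HTML =
          "<table id=\"t0\"><tr><td>"
        +   "<table id=\"t1\"><tr><td>"
        +     "<table id=\"t2\"><tr><td>deep</td></tr></table>"
        +   "</td></tr></table>"
        +   "<table id=\"t3\"><tr><td>sibling</td></tr></table>"
        + "</td></tr></table>";

    // Keep stepping into the first nested table until the current table
    // contains no further table; first() returns null on an empty result.
    static Element descend(Element table) {
        Element current = table;
        Element next = current.children().select("table").first();
        while (next != null) {
            current = next;
            next = current.children().select("table").first();
        }
        return current;
    }

    // Alternative: tables that contain no other table, first in document order.
    static Element bySelector(Document doc) {
        return doc.select("table:not(:has(table))").first();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        Element viaLoop = descend(doc.select("table").first());
        Element viaSelector = bySelector(doc);
        System.out.println(viaLoop.id());
        System.out.println(viaSelector.id());
    }
}
```

Both approaches land on the same table here. Note that the selector variant looks at the whole document, so it matches the loop only when the descent starts from the first table on the page.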