2012-11-11

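Retrieving all HTML between p tags using JSoup

So here is the scenario. I have a large HTML file that I want to modify with JSoup. I'm new to it and have been going through some tutorials and the API reference. I have the following block of HTML.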

<p><a name="bob"></a> 
<table class='schedules'> 
<tr><td align='center' colspan="5"><b>Bob the Builder</b><br> 
<a href="blah blah" class='tiny'>Blah Blah Blah</a></td></tr> 
<tr><td class='bk'><a href="random/randomUrl.htm">Blah</a></td><td class='bm'><a href="random/randomUrl.htm">Blah</a></td><td class='nm'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr> 
<tr><td class='bk'><a href="random/randomUrl.htm">Blah</a></td><td class='bk'><a href="random/randomUrl.htm">Blah</a></td><td class='nm'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr> 
<tr><td class='bk'><a href="random/randomUrl.htm">Blah</a></td><td class='bm'><a href="random/randomUrl.htm">Blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr> 
<tr><td class='bm'><a href="random/randomUrl.htm">Blah</a></td><!--<td class='whoohaa'><a href="random/randomUrl.htm">Blah</a></td>--><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='cc'><a href="random/randomUrl.htm">blah</a></td><td class='cc'><a href="random/randomUrl.htm">Blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr> 
<tr></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td><td class='nm'><a href="random/randomUrl.htm">Blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr> 
<tr><td class='sk'><a href="random/randomUrl.htm">Blah</a></td><td class='nm'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td></tr> 
</table> 
</p> 

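There are more of these blocks following a similar pattern, i.e. the name attribute on the first line changes (from "bob" to something else). What I want to do is first select the "bob" p block and then retrieve all of the HTML up to the closing p tag on the last line.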

I have tried the following:

Elements innerStuff = doc.select("a:contains(bob) ~ *"); 

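But that only gives me the href attributes, which I guess is what you would expect from links. However, I'm struggling to see how else I could approach this.

Any help with this would be greatly appreciated.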

Answer

A more straightforward way to select a tag based on its name attribute is:

doc.select("a[name=bob]") 

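From there, you should be able to navigate to the tag you want, e.g. the p tag, by using parent() to get the element that contains the link. You need to call first() beforehand to get the first (and only) element matched by the selection: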

doc.select("a[name=bob]").first().parent() 

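One catch, though: the parsed HTML document differs from the original HTML. A p element is not allowed to contain a table, so the parser ends the paragraph before the table starts, and the leftover closing p tag becomes an empty paragraph. This is the original HTML structure: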

p
    a[name=bob]
    table
        ...

Here is what the parsed HTML looks like:

p 
    a[name=bob] 
table 
    ... 
p 

So, starting from the link tag, to get the table element you need to go up one level (to the paragraph) and grab the next element:

doc.select("a[name=bob]").first().parent().nextElementSibling() 
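Putting these steps together, here is a minimal sketch rather than a definitive implementation. It assumes jsoup is on the classpath in a reasonably recent version (selectFirst is not available in very old releases). The class name BobBlock and the method tableHtmlFor are made up for illustration. Whether the table ends up inside the p or next to it can depend on the document (for example, whether it has a doctype) and on the jsoup version, so the sketch checks both places instead of relying on one layout.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BobBlock {

    /**
     * Returns the outer HTML of the table that belongs to the named anchor,
     * or null if the anchor or its table cannot be found.
     * Hypothetical helper; intended for simple names such as "bob" only,
     * because the name is concatenated into the selector unescaped.
     */
    public static String tableHtmlFor(String html, String anchorName) {
        Document doc = Jsoup.parse(html);

        // selectFirst returns null when nothing matches
        Element anchor = doc.selectFirst("a[name=" + anchorName + "]");
        if (anchor == null) {
            return null;
        }

        // Go up one level, to the paragraph that contains the anchor
        Element paragraph = anchor.parent();
        if (paragraph == null) {
            return null;
        }

        // Case 1: the parser kept the table inside the paragraph
        Element table = paragraph.selectFirst("table");

        // Case 2: the parser closed the paragraph early,
        // so the table is the paragraph's next sibling
        if (table == null) {
            Element next = paragraph.nextElementSibling();
            if (next != null && next.tagName().equals("table")) {
                table = next;
            }
        }

        return table == null ? null : table.outerHtml();
    }

    public static void main(String[] args) {
        String html = "<!DOCTYPE html><html><body>"
                + "<p><a name=\"bob\"></a>"
                + "<table class='schedules'>"
                + "<tr><td><b>Bob the Builder</b></td></tr>"
                + "</table></p></body></html>";
        System.out.println(tableHtmlFor(html, "bob"));
    }
}
```

Calling tableHtmlFor with a different name attribute value retrieves the corresponding block, and the null checks keep a missing anchor from causing a NullPointerException.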

What a legend! :) Appreciate it. – rameezk