Nokogiri next_element with filter

Let's say I have an ill-formed HTML page:

<table> 
<thead> 
    <th class="what_I_need">Super sweet text<th> 
</thead> 
<tr> 
    <td> 
    I also need this 
    </td> 
    <td> 
    and this (all td's in this and subsequent tr's) 
    </td> 
</tr> 
<tr> 
    ...all td's here too 
</tr> 
<tr> 
    ...all td's here too 
</tr> 
</table>

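In BeautifulSoup, I was able to get the <th> and then call findNext("td"). Nokogiri has a next_element call, but that might not return what I want (in this case, it would return the tr element).

Is there a way to filter Nokogiri's next_element call? For example, next_element("td")?

EDIT

To clarify, I'll be looking at many sites, most of them ill-formed in different ways.

For instance, the next site might be: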

<table> 
<th class="what_I_need">Super sweet text<th> 
<tr> 
    <td> 
    I also need this 
    </td> 
    <td> 
    and this (all td's in this and subsequent tr's) 
    </td> 
</tr> 
<tr> 
    ...all td's here too 
</tr> 
<tr> 
    ...all td's here too 
</tr> 
</table>

I can't assume any structure other than that there will be trs below the item that has class what_I_need.

Source

2012-07-12 Tyler DeWitt

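First off, note that your closing th tag is malformed: <th>. It should be </th>. Fixing that helps.

One way to do it is to use XPath to navigate to the td once you've found the th node: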

require 'nokogiri' 

html = ' 
<table> 
<thead> 
    <th class="what_I_need">Super sweet text</th> 
</thead> 
<tr> 
    <td> 
    I also need this 
    </td> 
</tr> 
</table> 
' 

doc = Nokogiri::HTML(html) 

th = doc.at('th.what_I_need') 
th.text # => "Super sweet text" 
td = th.at('../../tr/td') 
td.text # => "\n I also need this\n "

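This takes advantage of Nokogiri's ability to use either CSS accessors or XPath, and to do it pretty transparently.

Once you have the <th> node, you can also navigate using some of Node's methods: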

th.parent.next_element.at('td').text # => "\n I also need this\n "

Another way to go about it is to start at the top of the table and look down:

table = doc.at('table') 
th = table.at('th') 
th.text # => "Super sweet text" 
td = table.at('td') 
td.text # => "\n I also need this\n "

If you need to access all the <td> tags within a table, you can iterate over them easily:

table.search('td').each do |td| 
    # do something with the td... 
    puts td.text 
end

If you want all the <td> contents on a row-by-row basis, iterate over the containing <tr> elements:

table.search('tr').each do |tr| 
    cells = tr.search('td').map(&:text) 
    # do something with all the cells 
end

Source

2012-07-12 21:56:30

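Thanks for pointing that out. My original question wasn't clear: I can't be sure of the page structure beyond a 'table' with 'tr's. I've updated the question to reflect that. – 2012-07-12 22:17:32

It's important to be clear. There are lots of different ways to get to the location you want. If you don't know in advance what the page layout is, you can write several different attempts, run each one, and see which returns a value for the lookup. If you get a value, you're good. If not, try another. – 2012-07-12 22:22:59

I was hoping to find a more "generic" search option. I'd like to be able to say: I found the tag I'm looking for, now return all the td elements that come after it. Maybe that's not possible and I just need to write a search function for each type of page I encounter? I've never done scraping before, so I may be naive. – 2012-07-12 22:45:46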
