How to use Mechanize and Nokogiri in Ruby

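Given the example below, can anyone tell me how to use Nokogiri and Mechanize to get all the links under each <h4> tag as separate groups, i.e. all the links under:

"Some text"
"Some more text"
"Some additional text"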

<div id="right_holder"> 
    <h3><a href="#"><img src="http://example.com" width="11" height="11"></a></h3> 
    <br /> 
    <br /> 
    <h4><a href="#">Some text</a></h4> 
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a> 
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a> 
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a> 
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a> 
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a> 
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a> 
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a> 
    <br /> 
    <br /> 
    <h4><a href="#">Some more text</a></h4> 
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a> 
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a> 
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a> 
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a> 
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a> 
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a> 
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a> 
    <br /> 
    <br /> 
    <h4><a href="#">Some additional text</a></h4> 
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a> 
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a> 
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a> 
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a> 
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a> 
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a> 
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a> 
</div>


2015-04-17 akhanaton

You can walk through the nodes and partition the data, as in "How to split a HTML document using Nokogiri?", but if you know what the tag is, you can simply split on it:

# html is the raw html string 
html.split('<h4').map{|g| Nokogiri::HTML::DocumentFragment.parse(g).css('a') }

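require 'nokogiri'

page = Nokogiri::HTML(html).css("#right_holder")
# Walk the children in document order: each h4 opens a new group,
# and every node from that h4 onward is appended to the group's markup.
links = page.children.inject([]) do |link_hash, child|
    if child.name == 'h4'
        name = child.text
        link_hash << { :name => name, :content => "" }
    end

    next link_hash if link_hash.empty? # skip nodes before the first h4
    link_hash.last[:content] << child.to_xhtml
    link_hash
end

# Re-parse each group's markup and collect its links, keyed by the h4 text.
grouped_hsh = links.inject({}) do |hsh, link|
    hsh[link[:name]] = Nokogiri::HTML::DocumentFragment.parse(link[:content]).css('a')
    hsh
end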

# {"Some text"=>[#<Nokogiri::XML::Element:0x3ff4860d6c30, 
# "Some more text"=>[#<Nokogiri::XML::Element:0x3ff486096c20..., 
# "Some additional text"=>[#<Nokogiri::XML::Element:0x3ff486f2de78...}


2015-04-17 22:05:09 Ebtoulson

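This gets all the links, but it doesn't separate them by <h4> tag; I need to know which <h4> tag each link comes from. Thanks – akhanaton

I updated my solution to follow the strategy I linked to. My original solution had the 'h4 a' link as the first link in each array, but it also included any links before the 'h4'. – Ebtoulson

Thanks, that seems to work. – akhanaton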

In general, you would do:

page.search('h4 a').each do |a| 
    puts a[:href] 
end

But I'm sure you've noticed that none of those links actually go anywhere.

Update:

To group them, how about some node-set math:

page.search('h4').each do |h4|
    puts h4.text
    (h4.search('~ a') - h4.search('~ h4 ~ a')).each do |a|
        puts a.text
    end
end

This means every a that follows the h4 but does not also follow another, later h4.


2015-04-17 22:53:06 pguardiario

I think @akhanaton wants the links under each 'h4 a', not the actual 'h4 a' links. – Ebtoulson

@akhanton, in that case it's: 'h4 ~ a' – pguardiario
