2017-10-28

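Nokogiri iterates over tr tags too many times

I am scraping this page, https://www.library.uq.edu.au/uqlsm/availablepcsembed.php?branch=Duhig, and for each tr I collect and return the name and the number of available computers.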

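The problem is that it iterates too many times. There are only 4 tr tags, but the loop runs 5 times, which appends an extra nil to the returned array. Why does this happen?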

The section being scraped:

<table class="chart"> 
    <tr valign="middle"> 
     <td class="left"><a href="availablepcsembed.php?branch=Duhig&room=Lvl1">Level 1</a></td> 
     <td class="middle"><div style="width:68%;"><strong>68%</strong></div></td> 
     <td class="right">23 Free of 34 PC's</td> 
    </tr> 

    <tr valign="middle"> 
     <td class="left"><a href="availablepcsembed.php?branch=Duhig&room=Lvl2">Level 2</a></td> 
     <td class="middle"><div style="width:78%;"><strong>78%</strong></div></td> 
     <td class="right">83 Free of 107 PC's</td> 
    </tr> 

    <tr valign="middle"> 
     <td class="left"><a href="availablepcsembed.php?branch=Duhig&room=Lvl4">Level 4</a></td> 
     <td class="middle"><div style="width:64%;"><strong>64%</strong></div></td> 
     <td class="right">9 Free of 14 PC's</td> 
    </tr> 

    <tr valign="middle"> 
     <td class="left"><a href="availablepcsembed.php?branch=Duhig&room=Lvl5">Level 5</a></td> 
     <td class="middle"><div style="width:97%;"><strong>97%</strong></div></td> 
     <td class="right">28 Free of 29 PC's</td> 
    </tr> 
</table> 

The method, shortened:

def self.scrape_details_page(library_url) 
    details_page = Nokogiri::HTML(open(library_url)) 

    library_name = details_page.css("h3") 

    details_page.css("table tr").collect do |level| 
     case level.css("a[href]").text.downcase 
      when "level 1" 
       name = level.css("a[href]").text 
       total_available = level.css(".right").text.split(" ")[0] 
       out_of_available = level.css(".right").text.split(" ")[3] 
       level = {name: name, total_available: total_available, out_of_available: out_of_available} 
      when "level 2" 
       name = level.css("a[href]").text 
       total_available = level.css(".right").text.split(" ")[0] 
       out_of_available = level.css(".right").text.split(" ")[3] 
       level = {name: name, total_available: total_available, out_of_available: out_of_available} 
     end 
    end 
end 

Answer

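You can specify the table's class attribute in the selector and then access only the tr tags inside that table. That way you avoid the "additional" tr, for example: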

details_page.css("table.chart tr").map do |level| 
    ... 
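As for where the nil itself comes from: in Ruby, a `case` with no matching `when` and no `else` evaluates to nil, and `collect`/`map` keeps whatever the block returns. So any row matched by the broad `table tr` selector that does not hit a `when` branch ends up as nil in the result. A minimal sketch of that behaviour, using plain strings in place of Nokogiri nodes (the labels are made up for illustration and do not come from the real page):

```ruby
# Hypothetical labels standing in for level.css("a[href]").text.downcase;
# the empty string models a row that contains no link at all.
labels = ["level 1", "level 2", ""]

# A `case` with no matching `when` and no `else` evaluates to nil,
# so map keeps a nil for the unmatched row.
with_nil = labels.map do |label|
  case label
  when "level 1", "level 2"
    { name: label }
  end
end

p with_nil          # the last element is nil

# If the selector cannot be narrowed, compact drops the nils afterwards.
p with_nil.compact
```

Narrowing the selector is usually the cleaner fix, since `compact` hides the extra row rather than avoiding it.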

And the scrape_details_page method can be simplified a little:

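require 'nokogiri'
require 'open-uri'

def scrape_details_page(library_url)
    # URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
    details_page = Nokogiri::HTML(URI.open(library_url))
    # Visit only the rows of the table with class "chart"
    details_page.css('table.chart tr').map do |level|
     right = level.css('.right').text.split
     { name: level.css('a[href]').text, total_available: right[0], out_of_available: right[3] }
    end
end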

p scrape_details_page('https://www.library.uq.edu.au/uqlsm/availablepcsembed.php?branch=Duhig') 

# [{:name=>"Level 1", :total_available=>"22", :out_of_available=>"34"}, 
# {:name=>"Level 2", :total_available=>"98", :out_of_available=>"107"}, 
# {:name=>"Level 4", :total_available=>"12", :out_of_available=>"14"}, 
# {:name=>"Level 5", :total_available=>"26", :out_of_available=>"29"}]
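The counts above come back as strings because they are picked out of the `.right` text by position (`right[0]` and `right[3]`). If integers are preferable, or if a row might not follow the "N Free of M PC's" pattern, a regular expression is one option. The sketch below assumes that text format; `parse_availability` is a hypothetical helper name, not part of Nokogiri or of the original answer:

```ruby
# Hypothetical helper: pulls the two counts out of text such as
# "23 Free of 34 PC's" and returns them as Integers, or nil when the
# text does not have that shape (for example an empty or header row).
def parse_availability(text)
  match = text.match(/(\d+)\s+Free\s+of\s+(\d+)/i)
  return nil unless match

  { total_available: match[1].to_i, out_of_available: match[2].to_i }
end

p parse_availability("23 Free of 34 PC's")
p parse_availability("")
```

Inside the `map` block it could be called as `parse_availability(level.css('.right').text)` and merged with the `name` key.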