You can do it as follows:
require 'nokogiri'
require 'open-uri'
url = "http://www.tv.com/shows/game-of-thrones/episodes/"
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer handles URLs
doc = Nokogiri::HTML(URI.open(url))
# under season4 currently 7 episodes present, which may change later.
doc.css('#season-4-eps > li').size # => 7
# collect season4 episodes and then their dates and titles
doc.css('#season-4-eps > li').collect { |node| [node.css('.title').text, node.css('.date').text] }
# => [["Mockingbird", "5/18/14"],
# ["The Laws of God and Men", "5/11/14"],
# ["First of His Name", "5/4/14"],
# ["Oathkeeper", "4/27/14"],
# ["Breaker of Chains", "4/20/14"],
# ["The Lion and the Rose", "4/13/14"],
# ["Two Swords", "4/6/14"]]
Looking at the web page again, I can see that it always opens with the latest season's data. So the above code can be modified as follows:
# how many seasons are present
latest_season = doc.css(".filters > li[data-season]").size # => 4
# print the title and date of each episode in the latest season
doc.css("#season-#{latest_season}-eps > li").each do |node|
  p [node.css('.title').text, node.css('.date').text]
end
# >> ["The Mountain and the Viper", "6/1/14"]
# >> ["Mockingbird", "5/18/14"]
# >> ["The Laws of God and Men", "5/11/14"]
# >> ["First of His Name", "5/4/14"]
# >> ["Oathkeeper", "4/27/14"]
# >> ["Breaker of Chains", "4/20/14"]
# >> ["The Lion and the Rose", "4/13/14"]
# >> ["Two Swords", "4/6/14"]
As per the comments, it seems the OP is interested in the data shown in the "next episode" box of the web page. Here is one way to get it:
require 'nokogiri'
require 'open-uri'
url = "http://www.tv.com/shows/game-of-thrones/episodes/"
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer handles URLs
doc = Nokogiri::HTML(URI.open(url))
hash = {}
doc.css('div[class ~= next_episode] div.highlight_info').tap do |node|
hash['date'] = node.css('p.highlight_date > span').text[/\d{1,2}\/\d{1,2}\/\d{4}/]
hash['title'] = node.css('div.highlight_name > a').text
end
hash # => {"date"=>"5/18/2014", "title"=>"Mockingbird"}
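It is worth reading about tap{|x|...} → obj: it yields x to the block and then returns x. The primary purpose of this method is to "tap into" a method chain, in order to perform operations on intermediate results within the chain. Also see str[regexp] → new_str or nil.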
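To see those two methods in isolation, here is a minimal sketch that needs no network access. The helper name `extract_date` and the sample string are made up for illustration; only `tap` and `String#[]` with a regexp come from the answer above.

```ruby
# Hypothetical helper: String#[] with a Regexp returns the first match,
# or nil when nothing in the string matches.
def extract_date(text)
  text[/\d{1,2}\/\d{1,2}\/\d{4}/]
end

# Object#tap yields the receiver to the block and then returns the receiver,
# so the hash itself comes back no matter what the block evaluates to.
info = {}.tap do |h|
  h['date']  = extract_date('Airs Sunday 5/18/2014 on HBO') # sample text, assumed
  h['title'] = 'Mockingbird'
end

p info
p extract_date('no date here')
```

The second call shows why it is worth guarding against nil when the page layout changes and the date pattern no longer matches.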
Also read up on CSS selectors to understand how selectors work with the #css method.
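Thanks for your answer! But how would I change this so that it keeps collecting the latest episodes indefinitely (so that it moves from season 4 to 5 automatically, without having to change the code)? – HarryLucas
I am not sure how the above code works, but you can use the .next_episode class to get the HTML related to the latest episode, and inside it you will find the relevant information. – amitamb
@HarryLucas Okay. The data on the web page is being loaded through *AJAX*, which *Nokogiri* does not do. For this, you can use *selenium-webdriver* to perform the *AJAX* calls, and then *Nokogiri* to parse the HTML page and get the data from there. –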