2014-05-18 90 views

Nokogiri scraping problem

I want to write a simple script that tells me when the next episode of show X will be released.

Here is what I have so far:

require 'rubygems' 
require 'nokogiri' 
require 'open-uri' 

url = "http://www.tv.com/shows/game-of-thrones/episodes/" 
doc = Nokogiri::HTML(open(url)) 

puts doc.at_css('h1').text 
airdate = doc.at_css('.highlight_date span , h1').text 
date = /\W/.match(airdate) 
puts date 

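When I run this, all it returns is: Game of Thrones

The CSS selector I used also matches the line with the air date, which looks like xx/xx/xx. I only want the date, which is why I used /\W/, although I could be completely wrong here.

So basically I just want to print the show title and the date of the next episode.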

Answers


You can do it as follows:

require 'nokogiri' 
require 'open-uri' 

url = "http://www.tv.com/shows/game-of-thrones/episodes/" 
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs 
doc = Nokogiri::HTML(URI.open(url)) 

# season 4 currently lists 7 episodes; this number may change later 
doc.css('#season-4-eps > li').size # => 7 

# collect the title and air date of each season 4 episode 
doc.css('#season-4-eps > li').collect { |node| [node.css('.title').text,node.css('.date').text] } 
# => [["Mockingbird", "5/18/14"], 
#  ["The Laws of God and Men", "5/11/14"], 
#  ["First of His Name", "5/4/14"], 
#  ["Oathkeeper", "4/27/14"], 
#  ["Breaker of Chains", "4/20/14"], 
#  ["The Lion and the Rose", "4/13/14"], 
#  ["Two Swords", "4/6/14"]] 

Looking at the page again, I can see that it always opens with the data for the latest season. So the code above can be modified as follows:

# how many seasons are listed 
latest_season = doc.css(".filters > li[data-season]").size # => 4 

# print the title and air date of each episode in the latest season 
doc.css("#season-#{latest_season}-eps > li").collect do |node| 
    p [node.css('.title').text,node.css('.date').text] 
end 
# >> ["The Mountain and the Viper", "6/1/14"] 
# >> ["Mockingbird", "5/18/14"] 
# >> ["The Laws of God and Men", "5/11/14"] 
# >> ["First of His Name", "5/4/14"] 
# >> ["Oathkeeper", "4/27/14"] 
# >> ["Breaker of Chains", "4/20/14"] 
# >> ["The Lion and the Rose", "4/13/14"] 
# >> ["Two Swords", "4/6/14"] 
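The [title, date] pairs above are already enough to answer the original question about the next air date. Here is a minimal sketch, assuming the dates keep the m/d/yy form shown in the output. The `episodes` array and the fixed `today` value are stand-ins for illustration, used instead of a live request:

```ruby
require 'date'

# Stand-in data copied from the output above; in the real script this would be
# the array returned by the collect call.
episodes = [
  ["The Mountain and the Viper", "6/1/14"],
  ["Mockingbird", "5/18/14"],
  ["The Laws of God and Men", "5/11/14"]
]

# Fixed date so the example is reproducible; use Date.today in practice.
today = Date.new(2014, 5, 20)

# Parse each date string, keep episodes airing today or later, take the earliest.
upcoming = episodes
  .map { |title, date| [title, Date.strptime(date, "%m/%d/%y")] }
  .select { |_title, aired_on| aired_on >= today }
next_episode = upcoming.min_by { |_title, aired_on| aired_on }

if next_episode
  puts "#{next_episode[0]} airs on #{next_episode[1]}"
else
  puts "No upcoming episode listed"
end
```

Comparing parsed Date objects rather than the raw strings avoids ordering mistakes such as "6/1/14" sorting before "5/18/14" lexically being treated as meaningful.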

As per the comments, it seems the OP may want to pull the data from the next-episode box on the page. Here is one way to do that:

require 'nokogiri' 
require 'open-uri' 

url = "http://www.tv.com/shows/game-of-thrones/episodes/" 
doc = Nokogiri::HTML(URI.open(url)) # URI.open is needed on Ruby 3.0+ 

hash = {} 
doc.css('div[class ~= next_episode] div.highlight_info').tap do |node| 
    hash['date'] = node.css('p.highlight_date > span').text[/\d{1,2}\/\d{1,2}\/\d{4}/] 
    hash['title'] = node.css('div.highlight_name > a').text 
end 

hash # => {"date"=>"5/18/2014", "title"=>"Mockingbird"} 

Worth reading: tap{|x|...} → obj

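Yields x to the block, and then returns x. The primary purpose of this method is to "tap into" a method chain, in order to perform operations on intermediate results within the chain.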
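As a small illustration of that description, here is a minimal sketch of tap on plain Ruby values. The array and the variable names are made up for the example and are not taken from the page:

```ruby
# tap yields the receiver to the block and then returns the receiver itself,
# so the block's own return value is ignored.
seen = []
result = [3, 1, 2].sort.tap { |sorted| seen << sorted.first }.map { |n| n * 10 }

p seen   # the intermediate value captured inside the tap block
p result # the chain continued with the sorted array, not with the block's value
```

This is also why the answer's code fills `hash` as a side effect inside the block: whatever the block returns is discarded.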

str[regexp] → new_str or nil
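To make the difference from the /\W/ attempt in the question concrete, here is a minimal sketch on a hard-coded string. The sample text is an assumption about what the highlighted date line might contain, not output captured from the page:

```ruby
# Hypothetical sample text; the real page content may differ.
airdate = "Sunday 5/18/2014"

# /\W/ matches a single non-word character, so the first match here is the space.
first_non_word = airdate[/\W/]

# String#[] with a regexp returns the matched substring, or nil when nothing matches.
date = airdate[/\d{1,2}\/\d{1,2}\/\d{4}/]
missing = "no date here"[/\d{1,2}\/\d{1,2}\/\d{4}/]

p first_non_word
p date
p missing
```

Because a failed match gives nil rather than raising, it may be worth checking the result before printing it.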

Also read up on CSS selectors to understand how selectors work with the #css method.


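Thank you for your answer! But how would I change this so that it keeps picking up the latest episode indefinitely (so that it moves from season 4 to season 5 automatically, without changing the code)? – HarryLucas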


I don't know how the code above works, but you can use the .next_episode class to get the HTML for the latest episode, and the relevant information is inside it. – amitamb


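@HarryLucas OK. The data on the page is loaded via *AJAX*, which *Nokogiri* does not execute. For that you can use *selenium-webdriver* to run the *AJAX* calls, and then *Nokogiri* to scrape the resulting HTML page and get the data from there. –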