Nokogiri scraping question

I want to write a simple script that tells me when the next episode of show X will be released.

Here is what I have so far:

require 'rubygems' 
require 'nokogiri' 
require 'open-uri' 

url = "http://www.tv.com/shows/game-of-thrones/episodes/" 
doc = Nokogiri::HTML(open(url)) 

puts doc.at_css('h1').text 
airdate = doc.at_css('.highlight_date span , h1').text 
date = /\W/.match(airdate) 
puts date

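When I run all of this, the only thing it returns is: Game of Thrones

The CSS selector I am using also gives me the airdate line as / XX/xx/xx, but I only want the date, which is why I used /\W/, although I may be completely wrong here.

So basically I just want to print the show title and the date of the next episode.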

Source

2014-05-18 HarryLucas

You can do it as follows:

require 'nokogiri' 
require 'open-uri' 

url = "http://www.tv.com/shows/game-of-thrones/episodes/" 
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs.
doc = Nokogiri::HTML(URI.open(url))

# Season 4 currently lists 7 episodes; that number may change later.
doc.css('#season-4-eps > li').size # => 7

# Collect the season 4 episodes as [title, date] pairs.
doc.css('#season-4-eps > li').collect { |node| [node.css('.title').text, node.css('.date').text] }
# => [["Mockingbird", "5/18/14"], 
#  ["The Laws of God and Men", "5/11/14"], 
#  ["First of His Name", "5/4/14"], 
#  ["Oathkeeper", "4/27/14"], 
#  ["Breaker of Chains", "4/20/14"], 
#  ["The Lion and the Rose", "4/13/14"], 
#  ["Two Swords", "4/6/14"]]

Looking at the web page again, I can see that it always opens with the latest season's data. So the code above can be modified as follows:

# How many seasons are present; the highest number is the latest season.
latest_season = doc.css(".filters > li[data-season]").size # => 4

# Print the title and date of each episode in the latest season.
doc.css("#season-#{latest_season}-eps > li").each do |node|
    p [node.css('.title').text, node.css('.date').text]
end
# >> ["The Mountain and the Viper", "6/1/14"] 
# >> ["Mockingbird", "5/18/14"] 
# >> ["The Laws of God and Men", "5/11/14"] 
# >> ["First of His Name", "5/4/14"] 
# >> ["Oathkeeper", "4/27/14"] 
# >> ["Breaker of Chains", "4/20/14"] 
# >> ["The Lion and the Rose", "4/13/14"] 
# >> ["Two Swords", "4/6/14"]
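
Once the [title, date] pairs are collected, picking the next episode is a matter of parsing the dates and comparing them with today. The following is only a sketch under assumptions: the sample pairs are copied from the output above, the dates are taken to be month/day/two-digit-year, and `next_episode` is a hypothetical helper name that is not part of Nokogiri.

```ruby
require 'date'

# Sample data copied from the output above; in a real script this would be
# the array built from the scraped page.
episodes = [
  ["The Mountain and the Viper", "6/1/14"],
  ["Mockingbird", "5/18/14"],
  ["The Laws of God and Men", "5/11/14"],
  ["First of His Name", "5/4/14"]
]

# Hypothetical helper: returns the [title, Date] pair of the earliest episode
# airing on or after `today`, or nil when no such episode is listed.
def next_episode(episodes, today)
  episodes
    .map { |title, date| [title, Date.strptime(date, '%m/%d/%y')] }
    .select { |_title, aired| aired >= today }
    .min_by { |_title, aired| aired }
end

title, aired = next_episode(episodes, Date.new(2014, 5, 20))
puts "#{title}: #{aired}" # prints "The Mountain and the Viper: 2014-06-01"
```

Passing `Date.today` instead of a fixed date gives the live answer; when nothing upcoming is listed the helper returns nil, so the caller should check for that.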

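Going by the comments, it seems the OP may be interested in the data shown in the "next episode" box on the page. Here is one way to get it: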

require 'nokogiri' 
require 'open-uri' 

url = "http://www.tv.com/shows/game-of-thrones/episodes/" 
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs.
doc = Nokogiri::HTML(URI.open(url))

hash = {} 
doc.css('div[class ~= next_episode] div.highlight_info').tap do |node| 
    hash['date'] = node.css('p.highlight_date > span').text[/\d{1,2}\/\d{1,2}\/\d{4}/] 
    hash['title'] = node.css('div.highlight_name > a').text 
end 

hash # => {"date"=>"5/18/2014", "title"=>"Mockingbird"}

Worth reading: tap{|x|...} → obj

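Yields x to the block, and then returns x. The primary purpose of this method is to "tap into" a method chain, in order to perform operations on intermediate results within the chain.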
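
As a small illustration of that description, here is a sketch with made-up sample values (the arrays and the `episode` hash are only for demonstration):

```ruby
# tap yields the receiver to the block and then returns the receiver,
# ignoring whatever the block itself returns.
seen = []
result = [3, 1, 2].sort
                  .tap { |sorted| seen << sorted.first }
                  .map { |n| n * 10 }

p seen   # => [1]
p result # => [10, 20, 30]

# The same idea can fill a hash from inside the block, much as the code
# above does with the Nokogiri node set.
episode = {}
returned = 'Mockingbird'.tap { |title| episode['title'] = title }
p returned # => "Mockingbird"
p episode  # => {"title"=>"Mockingbird"}
```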

Also worth reading: str[regexp] → new_str or nil.
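
To show how `str[regexp]` differs from the `/\W/.match` attempt in the question, here is a minimal sketch. The `airdate` string is an assumed sample, not the page's actual text.

```ruby
# Assumed sample of the text found in the date element.
airdate = "Sunday 5/18/2014"

# /\W/ matches one single non-word character, so match stops at the first
# one it finds (here the space), which is why the question's script
# appeared to print nothing useful.
first_non_word = /\W/.match(airdate)[0]

# str[regexp] returns the matched substring, or nil when nothing matches.
date    = airdate[/\d{1,2}\/\d{1,2}\/\d{4}/]
missing = "TBA"[/\d{1,2}\/\d{1,2}\/\d{4}/]

p first_non_word # => " "
p date           # => "5/18/2014"
p missing        # => nil
```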

Also read up on CSS selectors to understand how selectors work with the #css method.

Source

2014-05-18 05:04:51

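Thanks for your answer! But how would I change this so that it keeps picking up the latest episode indefinitely (that is, so it moves from season 4 to season 5 automatically, without my having to change the code)? – HarryLucas

I don't know how the code above works, but you can use the .next_episode class to get the HTML for the latest episode, and inside it you will find the relevant information. – amitamb

@HarryLucas OK. The data on the page is loaded via AJAX, which Nokogiri does not execute. For that you can use selenium-webdriver to perform the AJAX calls, and then Nokogiri to parse the resulting HTML page and pull the data from there.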
