2013-07-30 68 views
0

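Nokogiri/Mechanize: how to extract div content?

I am parsing a review site to find out which ratings a given company has received.

Ratings range from 1 to 5, and they can all be extracted with this code: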

require 'mechanize'

a = Mechanize.new
page = a.get(url) 
reviews = page.search(".reviewcontent") 
reviews.each do |r| 
    rating = r.at_css(".s1, .s2, .s3, .s4, .s5") 
    puts rating   # => <span class="s5" itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
                  #      <meta itemprop="worstRating" content="1">
                  #      <meta itemprop="bestRating" content="5">
                  #      <meta itemprop="ratingValue" content="5"></span>
    puts rating.inspect # => #<Nokogiri::XML::Element:0x3fe0e108783c name="span" attributes=[#<Nokogiri::XML::Attr:0x3fe0e1087440 name="class" value="s5">, #<Nokogiri::XML::Attr:0x3fe0e108742c name="itemprop" value="reviewRating">, #<Nokogiri::XML::Attr:0x3fe0e1087404 name="itemscope">, #<Nokogiri::XML::Attr:0x3fe0e10873dc name="itemtype" value="http://schema.org/Rating">] children=[#<Nokogiri::XML::Text:0x3fe0e108648c "\r\n   ">, #<Nokogiri::XML::Element:0x3fe0e108634c name="meta" attributes=[#<Nokogiri::XML::Attr:0x3fe0e108625c name="itemprop" value="worstRating">, #<Nokogiri::XML::Attr:0x3fe0e1086248 name="content" value="1">]>, #<Nokogiri::XML::Element:0x3fe0e10898bc name="meta" attributes=[#<Nokogiri::XML::Attr:0x3fe0e10897cc name="itemprop" value="bestRating">, #<Nokogiri::XML::Attr:0x3fe0e10897b8 name="content" value="5">]>, #<Nokogiri::XML::Element:0x3fe0e1088b10 name="meta" attributes=[#<Nokogiri::XML::Attr:0x3fe0e1088994 name="itemprop" value="ratingValue">, #<Nokogiri::XML::Attr:0x3fe0e1088980 name="content" value="5">]>]> 
end 

I am interested in this line: <meta itemprop="ratingValue" content="5">, specifically the value of content, which in this case is 5.

How do I extract this value?

Edit:

puts reviews.to_html gives this result:

<div class="reviewcontent"> 
    <p class="r-m "> 
     <span class="s5" itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating"> 
      <meta itemprop="worstRating" content="1"> 
<meta itemprop="bestRating" content="5"> 
<meta itemprop="ratingValue" content="5"></span> 
    </p> 


<time datetime="2011-09-15T18:16:10.0000000+02:00" class="ndate strong" title="15. september 2011 - 18:16:10" pubdate> 
    15. september 2011 
    <span title="2011-09-15T18:16:10.0000000+02:00"></span> 
</time><meta itemprop="dateCreated" content="2011-09-15T18:16:10.0000000+02:00"> 
<h3 itemprop="headline" class="summary da"> 
      <a href="http://www.trustpilot.dk/review/scandicfly.dk/4e7240ea00006400020e3b0e" class="showReview">Tip Top</a> 
     </h3> 
     <p itemprop="reviewBody"> 
      Bestilte en del fluer, en krogskærper og andre småting.<br>Kom 3 dage efter bestilling og alt var, som det skulle. 
     </p> 
     <span class="imagezoom"> 

     </span> 
     <div class="actions"> 

      <input type="hidden" name="ReviewId" value="4e7240ea00006400020e3b0e"><input type="hidden" name="UserName" value="Strit"><a href="http://www.trustpilot.dk/review/scandicfly.dk/4e7240ea00006400020e3b0e#allcomments" class="comments fb-comments-label" id="FB-comment-box-0"> 
         <span></span> 
         Kommentar (<comments-count  href="http://trustpilot.com/review/scandicfly.dk#4e7240ea00006400020e3b0e">?</comments-count>) 
       </a> 
       <a class="useful" data-reviewid="4e7240ea00006400020e3b0e" href="#"><span> </span> 
        Find nyttig 
       </a> 

       <a class="replyAsCompany" href="#"><span></span> 
        Svar som firma 
       </a> 

       <a class="report" data-reviewid="932622" href="#"><span></span> 
        Rapportér 
       </a> 

     </div> 
     <div class="fb-comments-wrapper"> 
      <div class="social-guidelines"><a href="/social">Sociale retningslinjer</a></div> 
     </div> 
      <div class="companyComments" id="CompanyComments_932622"> 
      <div class="companyComments" id="CompanyComments_4e7240ea00006400020e3b0e">  
      </div> 
     </div> 

    </div><div class="reviewcontent"> 
    <p class="r-m "> 
     <span class="s5" itemprop="reviewRating" itemscope  itemtype="http://schema.org/Rating"> 
      <meta itemprop="worstRating" content="1"> 
<meta itemprop="bestRating" content="5"> 
<meta itemprop="ratingValue" content="5"></span> 
    </p> 


<time datetime="2011-04-05T16:05:06.0000000+02:00" class="ndate" title="5. april 2011 - 16:05:06" pubdate> 
    5. april 2011 
    <span title="2011-04-05T16:05:06.0000000+02:00"></span> 
</time><meta itemprop="dateCreated" content="2011-04-05T16:05:06.0000000+02:00"> 
<h3 itemprop="headline" class="summary da"> 
      <a  href="http://www.trustpilot.dk/review/scandicfly.dk/4d9b3db2000064000209035f"  class="showReview">en god og flot oplevelse</a> 
     </h3> 
     <p itemprop="reviewBody"> 
      Købte en fiskestang hos ScandicFly. Faktra ordrebekræftigelse og det hele  præsenteret meget flot. Der kom desuden et notis om min fiskestang var afsendt.<br>Et par dage efter kom min fiskestang med posten forsvarligt pakket ind. 
     </p> 
     <span class="imagezoom"> 

     </span> 
     <div class="actions"> 

      <input type="hidden" name="ReviewId" value="4d9b3db2000064000209035f"><input type="hidden" name="UserName" value="Peter Leter"><a href="http://www.trustpilot.dk/review/scandicfly.dk/4d9b3db2000064000209035f#allcomments" class="comments fb-comments-label" id="FB-comment-box-1"> 
        <span></span> 
        Kommentar (<comments-count  href="http://trustpilot.com/review/scandicfly.dk#4d9b3db2000064000209035f">?</comments-count>) 
       </a> 
       <a class="useful" data-reviewid="4d9b3db2000064000209035f" href="#"><span></span> 
        Find nyttig 
       </a> 

       <a class="replyAsCompany" href="#"><span></span> 
        Svar som firma 
       </a> 

       <a class="report" data-reviewid="590687" href="#"><span></span> 
        Rapportér 
       </a> 

     </div> 
     <div class="fb-comments-wrapper"> 
      <div class="social-guidelines"><a href="/social">Sociale retningslinjer</a></div> 
     </div> 
     <div class="companyComments" id="CompanyComments_590687"> 
      <div class="companyComments" id="CompanyComments_4d9b3db2000064000209035f">  
      </div> 
     </div> 

Can you post the HTML content inside '.reviewcontent'? Just run 'puts reviews.to_html' and paste the output here.. –


Yes Babai, I have added it now. – ChristofferJoergensen

Answers

3

You can use the following XPath:

require 'nokogiri' 

doc = Nokogiri::HTML::Document.parse <<-_HTML_ 
<span class="s5" itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating"> 
     <meta itemprop="worstRating" content="1"> 
    <meta itemprop="bestRating" content="5"> 
    <meta itemprop="ratingValue" content="5"> 
</span> 
_HTML_ 

doc.at("//meta[@itemprop = 'bestRating']/@content").to_s 
# => "5" 

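In your case, write it as follows (note the leading dot, which keeps the search inside the current review instead of starting from the document root):

r.at_css(".s1, .s2, .s3, .s4, .s5").at(".//meta[@itemprop = 'bestRating']/@content").to_s 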

Perfect. One small thing: what I want to extract is 'ratingValue'. I replaced 'bestRating' with 'ratingValue', but somehow it still gives me the value of 'bestRating'. – ChristofferJoergensen


@ChristofferJoergensen They both have the same value, '5', which is why you see the same result.. :) –


Sorry, then I gave a bad example :-) On other pages I am looking at, the rating values vary between reviews (e.g. 'http://www.trustpilot.dk/review/www.fona.dk'). – ChristofferJoergensen

0

Just to clean up Babai's answer a bit, how about:

doc.at('meta[itemprop="bestRating"]')[:content] 

Actually, you could probably just do:

rating[:class][/\d/] 

Can you see why?
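
A short sketch of why that works, assuming the class attribute is always one of s1 to s5 as in the question: the class name itself encodes the rating, and `String#[]` with a regexp returns the first match. The helper name `rating_from_class` is hypothetical and only used here for illustration.

```ruby
# Hypothetical helper: pulls the rating digit out of a class name such as "s5".
# Assumes the class attribute holds a single s1..s5 class and nothing else
# containing a digit; returns nil when no digit is present.
def rating_from_class(class_name)
  digit = class_name[/\d/]   # first digit in the string, or nil
  digit && digit.to_i
end

%w[s1 s2 s3 s4 s5].each do |class_name|
  puts "#{class_name} -> #{rating_from_class(class_name)}"
end
```

In the question's loop, `rating[:class]` returns the string "s5", so `rating[:class][/\d/]` yields "5" without touching the meta tags at all. If the span ever carried additional classes containing digits, the meta-based lookup would be the safer choice.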