2015-11-03 26 views
4

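Get data from a <script> tag in HTML using Scrapy

I have been trying to use Scrapy (XPath) to extract data from a script tag in KBB's HTML. My main problem is identifying the correct div and script tags. I am new to XPath and would appreciate any help!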

HTML(http://www.kbb.com/nissan/altima/2014/25-s-sedan-4d/?vehicleid=392396&intent=buy-used&mileage=10000&condition=fair&pricetype=retail):

<script type="text/javascript" src="http://s1.kbb.com/combine/IncentivesPilotJs/949332058"></script> 
     <input type="hidden" id="ResaleValueUrl" value="/ymmt/resalevalue/?vehicleid=392396" /> 
     <input type="hidden" id="Intent" value="buy-used" /> 
     <!--[if lt IE 9]> 
      <script> 
      window.FlashCanvasOptions = { 
       swfPath: "/js/canvas/FlashCanvas/UCMarketMeter/" 
      }; 
      </script> 
      <script type="text/javascript" src="http://s1.kbb.com/combine/YmmtMarketMeterFlashCanvasJs/795892638"></script> 
     <![endif]--> 
     <script type="text/javascript" src="http://s1.kbb.com/combine/YMMTOverview/1527402533"></script> 
     <script type="text/javascript" src="http://s1.kbb.com/combine/YmmtPricingOverviewBuyUsedJs/-1416499456"></script> 

     <script language="javascript" type="text/javascript"> 
      $(document).ready(function() { 
       KBB.Vehicle.Pages.PricingOverview.Buyers.setup({ 
        //Workaround until we get cross domain working for Flash 
        imageDir: window.FlashCanvasOptions ? "/Content/images" : "http://file.kelleybluebookimages.com/kbb/images/marketmeter", 
        vehicleId: "392396", 
        zipCode: "78701", 
        mileage: "10000", 
        intent: "buy-used", 
        priceType: "retail", 
        condition: "good", 
        options: "392396|53635|78701|100|10|", 
        price: "17074", 
        manufacturer: "Nissan", 
        model: "Altima", 
        year: "2014", 
        style: "2.5 S Sedan 4D", 
        category: "", 
        hasCpo: true, 
        meetsCpoReq: true, 
        showOthersPaid: false, 
        data: { 
    "values": { 
    "cpo": { 
     "priceMin": 17335.0, 
     "price": 18275.0, 
     "priceMax": 19214.0 
    }, 
    "fpp": { 
     "priceMin": 15286.0, 
     "price": 17074.0, 
     "priceMax": 18861.0 
    }, 
    "privatepartyexcellent": { 
     "priceMin": 0.0, 
     "price": 16064.0, 
     "priceMax": 0.0 
    }, 
    "privatepartyfair": { 
     "priceMin": 0.0, 
     "price": 14081.0, 
     "priceMax": 0.0 
    }, 
    "privatepartygood": { 
     "priceMin": 0.0, 
     "price": 15454.0, 
     "priceMax": 0.0 
    }, 
    "privatepartyverygood": { 
     "priceMin": 0.0, 
     "price": 15715.0, 
     "priceMax": 0.0 
    }, 
    "retail": { 
     "priceMin": 0.0, 
     "price": 17875.0, 
     "priceMax": 0.0 
    } 
    }, 
    "timAmount": 0.0, 
    "monthlyPayments": { 
    "cpo": { 
     "vehiclePrice": 18275.0, 
     "rate": 2.9, 
     "terms": 60.0, 
     "taxAndTitle": 6.5, 
     "downPay": 0.0, 
     "amount": 348.0 
    }, 
    "fpp": { 
     "vehiclePrice": 17074.0, 
     "rate": 4.9, 
     "terms": 60.0, 
     "taxAndTitle": 6.5, 
     "downPay": 0.0, 
     "amount": 342.0 
    }, 
    "privatepartyexcellent": { 
     "vehiclePrice": 16064.0, 
     "rate": 4.9, 
     "terms": 60.0, 
     "taxAndTitle": 6.5, 
     "downPay": 0.0, 
     "amount": 322.0 
    }, 
    "privatepartyfair": { 
     "vehiclePrice": 14081.0, 
     "rate": 4.9, 
     "terms": 60.0, 
     "taxAndTitle": 6.5, 
     "downPay": 0.0, 
     "amount": 282.0 
    }, 
    "privatepartygood": { 
     "vehiclePrice": 15454.0, 
     "rate": 4.9, 
     "terms": 60.0, 
     "taxAndTitle": 6.5, 
     "downPay": 0.0, 
     "amount": 309.0 
    }, 
    "privatepartyverygood": { 
     "vehiclePrice": 15715.0, 
     "rate": 4.9, 
     "terms": 60.0, 
     "taxAndTitle": 6.5, 
     "downPay": 0.0, 
     "amount": 315.0 
    }, 
    "retail": { 
     "vehiclePrice": 17875.0, 
     "rate": 4.9, 
     "terms": 60.0, 
     "taxAndTitle": 6.5, 
     "downPay": 0.0, 
     "amount": 358.0 
    } 
    }, 
    "scale": { 
    "scaleLow": 14081.0, 
    "scaleHigh": 19214.0 
    }, 
    "transactions": { 
    "below": 7, 
    "between": 17, 
    "above": 3 
    } 
}, 
        adPriceRanges: {"AdPriceRange":[{"PriceMin":0,"PriceMax":8499,"AdPRValue":1},{"PriceMin":8500,"PriceMax":18499,"AdPRValue":2},{"PriceMin":18500,"PriceMax":23499,"AdPRValue":3},{"PriceMin":23500,"PriceMax":28499,"AdPRValue":4},{"PriceMin":28500,"PriceMax":33499,"AdPRValue":5},{"PriceMin":33500,"PriceMax":38499,"AdPRValue":6},{"PriceMin":38500,"PriceMax":43499,"AdPRValue":7},{"PriceMin":43500,"PriceMax":48499,"AdPRValue":8},{"PriceMin":48500,"PriceMax":53499,"AdPRValue":9},{"PriceMin":53500,"PriceMax":63499,"AdPRValue":10},{"PriceMin":63500,"PriceMax":73499,"AdPRValue":11},{"PriceMin":73500,"PriceMax":1000000,"AdPRValue":12}]}}); 
      }); 
      $('.foot-note').hide(); 
      $(window).on('popstate', function() { 
       KBB.Vehicle.Pages.PricingOverview.Buyers.stateChangeHandler(); 
      }); 
     </script> 


Scrapy Code: 

import scrapy
from scrapy.selector import Selector

from kbb.items import kbbItem


class kbbSpider(scrapy.Spider):
    name = "kbb"
    allowed_domains = ["kbb.com"]
    start_urls = [
        "http://www.kbb.com/nissan/altima/2014/25-s-sedan-4d/?vehicleid=392396&intent=buy-used&10000&good&pricetype=retail"
    ]

    def parse(self, response):
        sel = Selector(response)
        items = []
        item = kbbItem()
        # Fragile: depends on the position of the script tag and on fixed character offsets.
        item['priceMin'] = sel.xpath('//div/script').extract()[35][915:922]
        items.append(item)
        return items

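Ultimately I want to populate my item with priceMin, price, and priceMax from the fpp field, and with price from the retail field. Currently I use indexing to get these values, but I would like to know whether there is an easier way.

Answer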

5

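The problem is that the desired data is inside JavaScript code. Also, your current approach, which relies on positional indexes, is fragile and unreliable.

The idea is to locate the script tag that contains the desired data, use regular expressions to extract the object holding the prices, load that object into a Python dictionary with the help of the json module, and read the values you need.

Demo from the Scrapy Shell: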

In [1]: import re 
In [2]: import json 

In [3]: pattern = re.compile(r"KBB\.Vehicle\.Pages\.PricingOverview\.Buyers\.setup\(.*?data: ({.*?}),\W+adPriceRanges", re.MULTILINE | re.DOTALL) 
In [4]: data = response.xpath("//script[contains(., 'KBB.Vehicle.Pages.PricingOverview.Buyers.setup')]/text()").re(pattern)[0] 

In [5]: data = data.replace("//Workaround until we get cross domain working for Flash", "") 

In [6]: data_obj = json.loads(data) 

In [7]: data_obj['values']['fpp'] 
Out[7]: {u'price': 15569.0, u'priceMax': 17356.0, u'priceMin': 13781.0} 

In [8]: data_obj['values']['retail'] 
Out[8]: {u'price': 16370.0, u'priceMax': 0.0, u'priceMin': 0.0} 
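
The same steps can be sketched outside the shell, for example as a helper that a spider's parse method could call on the script text. This is a minimal sketch, not a definitive implementation: SCRIPT_TEXT is a shortened, hypothetical stand-in for the real inline script, and extract_prices and the retailPrice key are names made up for illustration.

```python
import json
import re

# Hypothetical, shortened stand-in for the page's inline <script> text.
SCRIPT_TEXT = """
$(document).ready(function() {
    KBB.Vehicle.Pages.PricingOverview.Buyers.setup({
        vehicleId: "392396",
        mileage: "10000",
        data: {
    "values": {
    "fpp": {"priceMin": 15286.0, "price": 17074.0, "priceMax": 18861.0},
    "retail": {"priceMin": 0.0, "price": 17875.0, "priceMax": 0.0}
    }
},
        adPriceRanges: {"AdPriceRange": []}});
});
"""

# Same pattern as in the shell demo: capture the object that follows "data:".
DATA_PATTERN = re.compile(
    r"KBB\.Vehicle\.Pages\.PricingOverview\.Buyers\.setup\(.*?data: ({.*?}),\W+adPriceRanges",
    re.DOTALL,
)


def extract_prices(script_text):
    """Return the fpp and retail prices, or None if the script does not match."""
    match = DATA_PATTERN.search(script_text)
    if match is None:
        return None
    values = json.loads(match.group(1))["values"]
    return {
        "priceMin": values["fpp"]["priceMin"],
        "price": values["fpp"]["price"],
        "priceMax": values["fpp"]["priceMax"],
        "retailPrice": values["retail"]["price"],
    }


prices = extract_prices(SCRIPT_TEXT)
```

Returning None when the pattern does not match lets the caller skip pages whose markup has changed, rather than raising an IndexError as `[0]` on an empty result would.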
+0

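Thanks for the amazing response @alecxe. Could you also explain how to change the "pattern" variable (mainly the regular expression) to get the mileage, condition, manufacturer, model, and similar data that sits above the "data" dictionary? – outlier123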

+1

@outlier123 Thanks! How about `response.xpath("//script[contains(., 'KBB.Vehicle.Pages.PricingOverview.Buyers.setup')]/text()").re(r'mileage: "(\d+)",')[0]`? – alecxe

+0

Awesome! So we just need to find the right regular expression. Thanks for your reply @alecxe – outlier123
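
The per-field regular expression suggested in the comments can be generalized to any quoted setup option. The following is only a sketch under stated assumptions: SETUP_SNIPPET is a hypothetical excerpt of the script text, and extract_field is a helper name invented for illustration. It assumes each option appears as `name: "value"`, as in the HTML shown in the question.

```python
import re

# Hypothetical excerpt of the options passed to the setup() call.
SETUP_SNIPPET = '''
    vehicleId: "392396",
    zipCode: "78701",
    mileage: "10000",
    condition: "good",
    manufacturer: "Nissan",
    model: "Altima",
    year: "2014",
'''


def extract_field(script_text, name):
    """Return the quoted value of a `name: "value"` option, or None if absent."""
    pattern = r'\b' + re.escape(name) + r':\s*"([^"]*)"'
    match = re.search(pattern, script_text)
    return match.group(1) if match else None


fields = {
    name: extract_field(SETUP_SNIPPET, name)
    for name in ("mileage", "condition", "manufacturer", "model")
}
```

All values come back as strings, so numeric fields such as mileage would still need an explicit int() conversion before being stored in an item.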
