
I am trying to crawl a website and extract a specific XPath. From a product page I want to scrape only the product description, nothing else. How do I restrict the spider to just that with Scrapy?

link to page

XPath: hxs.select('//div[@class="product-shop"]/p/text()').extract()

The HTML is quite large, so please see the link above.

I want to select only the product description and no other details.

If I do this:

[" ".join([i.strip() for i in hxs.select('//div[@class="product-shop"]/p/text()').extract()])] 

output:
[u'Itemcode: 12BTS28271 Brand: BASICS InStock - Ships within 2 business days. Tip: 90% of our shipments reach within 4 business days! This product is part of the Basics T.shirts line made of 100% Cotton. Stripes Muscle Fit T.shirts that come in Green Color. Casual that comes with Henley away.'] 

But I only want the element shown in the Chrome Elements panel:

[u'This product is part of the Basics T.shirts line made of 100% Cotton. Stripes Muscle Fit T.shirts that come in Green Color. Casual that comes with Henley away.'] 
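For reference, a possible way to narrow this without a regex is a positional predicate on the <p>, assuming the description is always the last paragraph inside the product-shop div (an assumption based on the joined output above, not confirmed against the page):

# hypothetical: assumes the description is the last <p> in div.product-shop
desc_parts = hxs.select('//div[@class="product-shop"]/p[last()]/text()').extract()
description = " ".join(part.strip() for part in desc_parts)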

Is there any regex or something to avoid the unwanted XPath? –

Answer


Right-clicking in the Chrome inspector tells me:


//*[@id="product_addtocart_form"]/div[2]/div[1]/p[3] 

指向

<p>This product is part of the Basics T.shirts line made of 100% Cotton.<br> 
         Stripes Muscle Fit T.shirts that come in Green Color.<br> 
         Casual that comes with Henley away.</p> 

Trying the same XPath on this page points to the description there as well:

<p>This product is part of the Basics Shirts line made of 100% Cotton.<br> 
        Plain Slim Fit Shirts that come in Orange Color.<br> 
        Casual that comes with Button Down away.</p> 

So it looks like all you need to do is call that XPath on the page and you are set. You should still verify that the XPath works in every case, since it can easily change from page to page.
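As a minimal sketch of how that XPath could be wired into a spider, assuming a recent Scrapy version where response.xpath replaces the hxs.select call from the question (the spider name and start URL below are placeholders):

import scrapy

class ProductSpider(scrapy.Spider):
    name = "product_description"
    # placeholder URL; point this at the real product page
    start_urls = ["http://example.com/product-page"]

    def parse(self, response):
        # XPath from the answer; text() returns the fragments split by <br>
        parts = response.xpath(
            '//*[@id="product_addtocart_form"]/div[2]/div[1]/p[3]/text()'
        ).extract()
        # join and strip the fragments into a single description string
        yield {"description": " ".join(p.strip() for p in parts)}

With the 2013-era API used in the question, the same XPath would go through hxs.select instead.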


Thank you, I did not know an XPath could also be written as 'div[2]' like that... thanks –


@user2217267 Glad to help! – TankorSmash