Scrapy cannot scrape a website

I have been at this for a few days, but I still can't find the answer. I am using Scrapy (Python) to scrape this webpage.

Here is my directory structure:

hotels/ 
|_ scrapy.cfg 
|_ hotels/ 
    |_ __init__.py 
    |_ items.py 
    |_ pipelines.py 
    |_ settings.py 
    |_ spiders/ 
        |_ __init__.py
        |_ hotels_spyder.py

Contents of items.py

from scrapy.item import Item, Field 

class HotelsItem(Item): 
    nameHotel = Field() 
    idHotel = Field()

Contents of hotels_spyder.py

from scrapy.spider import BaseSpider 
from scrapy.selector import Selector 

from hotels.items import HotelsItem 

class HotelsSpider(BaseSpider):
    name = "hotels"
    allowed_domains = ["hotels.com"]
    start_urls = ["http://fr.hotels.com/search.do?destination=New+York&arrivalDate=13%2F04%2F2015&departureDate=15%2F04%2F2015&rooms=1&children%5B0%5D=2&searchParams.rooms%5B0%5D.numberOfAdults=2&searchParams.rooms%5B0%5D.childrenAges%5B0%5D=7&searchParams.rooms%5B0%5D.childrenAges%5B1%5D=7&searchParams.landmark=&searchParams.resolvedLocation=CITY%3A1506246%3AEXACT_RED%3AHIGH&destinationId="]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//h3[@class="hotel-name"]')
        items = []
        for site in sites:
            item = HotelsItem()
            type(item)
            item['nameHotel'] = site.xpath('a/text()').extract()
            item['idHotel'] = site.xpath('a/@id').extract()
            items.append(item)
        return items

Contents of settings.py

BOT_NAME = 'hotels' 

SPIDER_MODULES = ['hotels.spiders'] 
NEWSPIDER_MODULE = 'hotels.spiders'

So all of this works fine. It does what I want (I still need to clean up whitespace and the like).
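For the whitespace cleanup mentioned above, a minimal sketch in plain Python could look like the following. It assumes the extracted values arrive as a list of strings, which is what .extract() returns in the spider. The helper name clean_text and the sample hotel name are made up for illustration.

```python
def clean_text(values):
    """Join extracted text fragments and collapse runs of whitespace.

    `values` is assumed to be the list of strings returned by .extract().
    """
    joined = " ".join(values)
    # str.split() with no argument splits on any whitespace run and
    # drops leading/trailing whitespace, so re-joining normalises it.
    return " ".join(joined.split())


if __name__ == "__main__":
    # Hypothetical sample value, not taken from the real site.
    print(clean_text(["\n    Example   Hotel \n"]))
```

In the spider this could be applied when filling the item, for example item['nameHotel'] = clean_text(site.xpath('a/text()').extract()).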

But my ultimate goal is to scrape the US version of the site. So I tried replacing the start_urls list in hotels_spyder.py with this: http://www.hotels.com/search.do?destination=New+York&arrivalDate=03%2F18%2F15&departureDate=03%2F20%2F15&rooms=1&children[0]=2&searchParams.rooms[0].numberOfAdults=2&searchParams.rooms[0].childrenAges[0]=7&searchParams.rooms[0].childrenAges[1]=7&searchParams.landmark=&searchParams.resolvedLocation=CITY%3A1506246%3AEXACT_RED%3AHIGH&destinationId=

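And it doesn't work. I checked the page source behind both links and it is the same. I really don't know why it isn't working, and it has been driving me crazy for a week.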
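When comparing two search URLs like these, it can help to decode the query strings and look only at the parameters that actually differ. The sketch below uses the standard library for that. The function name diff_query is made up, and the two URLs are shortened versions of the ones in the question. It only shows how the requests differ and does not by itself explain why the US page yields no items.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened versions of the two URLs from the question.
FR_URL = ("http://fr.hotels.com/search.do?destination=New+York"
          "&arrivalDate=13%2F04%2F2015&children%5B0%5D=2")
US_URL = ("http://www.hotels.com/search.do?destination=New+York"
          "&arrivalDate=03%2F18%2F15&children[0]=2")


def diff_query(url_a, url_b):
    """Return {param: (values_in_a, values_in_b)} for params that differ."""
    qa = parse_qs(urlsplit(url_a).query, keep_blank_values=True)
    qb = parse_qs(urlsplit(url_b).query, keep_blank_values=True)
    return {
        key: (qa.get(key), qb.get(key))
        for key in sorted(set(qa) | set(qb))
        if qa.get(key) != qb.get(key)
    }


if __name__ == "__main__":
    # Percent-encoded and literal brackets decode to the same key, so
    # only the date format (day/month order, 2-digit year) shows up.
    print(diff_query(FR_URL, US_URL))
```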

Thanks in advance, Philippe

2013-12-10 l3aronsansgland

How is it not working? What error do you get?

It can't work at the moment because the query in the link specifies an arrival date earlier than today, so the site returns an error page.

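You're right, that was an old version; I've edited my question. It still doesn't work. The point is that I don't get any error, but I don't get any output either. – l3aronsansgland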
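One way to narrow down "no error, no output" is to check whether the HTML that was actually downloaded contains the elements the XPath is looking for. The sketch below uses only the standard library and counts h3 elements whose class attribute is exactly "hotel-name", mirroring the XPath in the spider. The class HotelNameCounter, the function count_hotel_names and the sample HTML are illustrative assumptions, not part of Scrapy.

```python
from html.parser import HTMLParser


class HotelNameCounter(HTMLParser):
    """Counts <h3 class="hotel-name"> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; exact match on class,
        # like the XPath //h3[@class="hotel-name"].
        if tag == "h3" and dict(attrs).get("class") == "hotel-name":
            self.count += 1


def count_hotel_names(html):
    parser = HotelNameCounter()
    parser.feed(html)
    return parser.count


if __name__ == "__main__":
    # Made-up sample markup for illustration only.
    sample = ('<h3 class="hotel-name"><a id="1">A</a></h3>'
              '<h3 class="other">B</h3>')
    print(count_hotel_names(sample))
```

A count of zero on the downloaded page would suggest the server sent different markup than the one seen in the browser, rather than a bug in the item-building loop.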

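I took your code and checked whether it works. In the end I realized that your start_urls needs to be different for the English version of the site.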

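The URL you are using is http://www.hotels.com .... To get the English version of the site, you need the right prefix. For the French version you scraped, it is fr. For the English version, it is uk.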

Try the following start_urls. It works in my crawler:

start_urls = ['http://uk.hotels.com/search.do?destination=New+York&arrivalDate=13%2F04%2F2015&departureDate=15%2F04%2F2015&rooms=1&children[0]=2&searchParams.rooms[0].numberOfAdults=2&searchParams.rooms[0].childrenAges[0]=7&searchParams.rooms[0].childrenAges[1]=7&searchParams.landmark=&searchParams.resolvedLocation=CITY%3A1506246%3AEXACT_RED%3AHIGH&destinationId=']
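Swapping the prefix by hand is error-prone with a URL this long, so a small helper can do it. This is only a sketch: the function name with_subdomain is made up, it assumes the host always starts with a subdomain label such as fr, uk or www, and the example URL is shortened.

```python
from urllib.parse import urlsplit


def with_subdomain(url, prefix):
    """Return `url` with the first host label replaced by `prefix`.

    Assumes the host has the form <label>.<rest>, e.g. fr.hotels.com.
    The path and query string are kept unchanged.
    """
    parts = urlsplit(url)
    rest = parts.netloc.split(".", 1)[1]  # drop the first label
    return parts._replace(netloc=f"{prefix}.{rest}").geturl()


if __name__ == "__main__":
    fr_url = "http://fr.hotels.com/search.do?destination=New+York&destinationId="
    print(with_subdomain(fr_url, "uk"))
```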

2013-12-11 08:35:50 Jon

Actually, I need the prices in US dollars, and the only version of the site that shows dollars is apparently http://www.hotels.com/.

The confusing part is that it works for http://fr.hotels.com or uk.hotels.com, but not for the US version, http://www.hotels.com

2013-12-11 18:40:14 l3aronsansgland
