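Scrapy recursive crawler problem

I'm trying to scrape viagogo.com. I want to scrape every show listed on this page: http://www.viagogo.com/Concert-Tickets/Rock-and-Pop. I can get the shows on the first page, but when I try to move on to the next page, it just doesn't crawl. Here is my code: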
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from viagogo.items import ViagogoItem
from scrapy.http import Request, FormRequest


class viagogoSpider(CrawlSpider):
    name = "viagogo"
    allowed_domains = ['viagogo.com']
    start_urls = ["http://www.viagogo.com/Concert-Tickets/Rock-and-Pop"]

    rules = (
        # Running on pages
        Rule(SgmlLinkExtractor(restrict_xpaths=('//*[@id="clientgridtable"]/div[2]/div[2]/div/ul/li[7]/a')), callback='Parse_Page', follow=True),
        # Running on artists in title
        Rule(SgmlLinkExtractor(restrict_xpaths=('//*[@id="clientgridtable"]/table/tbody')), callback='Parse_artists_Tickets', follow=True),
    )

    #all_list = response.xpath('//a[@class="t xs"]').extract()

    def Parse_Page(self, response):
        item = ViagogoItem()
        item["title"] = response.xpath('//title/text()').extract()
        item["link"] = response.url
        print 'Page!' + response.url
        yield Request(url=response.url, meta={'item': item}, callback=self.Parse_Page)

    def Parse_artists_Tickets(self, response):
        item = ViagogoItem()
        item["title"] = response.xpath('//title/text()').extract()
        item["link"] = response.url
        print response.url
        with open('viagogo_output', 'a') as f:
            f.write(str(item["title"]) + '\n')
        return item
I don't understand what I'm doing wrong, but the output (in the file) only contains the shows from the first page.
Thanks!
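I don't get it. When I get the first response, response.url is the next page, so the response isn't the same. – SomeNiceGuy21 2014-12-13 12:55:58
@SomeNiceGuy21 'response.url' is the URL of the request that produced the current response object, which is passed to the callback as its argument. When you do 'yield Request(url=response.url, ...)', you are scheduling a request for that same URL again, and it gets skipped. Does that help? – elias 2014-12-13 15:41:28
Not quite. I realized that the 'Next' button is calling a JS function, but I don't know how to call it. Can you help? – SomeNiceGuy21 2014-12-13 15:45:49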
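
To illustrate elias's point about the repeated request being skipped, here is a minimal toy model of a scheduler with a duplicate filter. It is only a sketch and not Scrapy's actual implementation: the ToyScheduler class and its enqueue method are hypothetical names made up for this example. The dont_filter flag is modelled on the argument of the same name that Scrapy's Request accepts.

```python
# Toy model of a crawl scheduler with a duplicate filter.
# NOT Scrapy's implementation: ToyScheduler and enqueue are hypothetical names.


class ToyScheduler:
    def __init__(self):
        self.seen = set()   # URLs that have already been scheduled
        self.queue = []     # requests waiting to be downloaded

    def enqueue(self, url, dont_filter=False):
        """Return True if the request was scheduled, False if it was dropped."""
        if not dont_filter and url in self.seen:
            return False    # duplicate request: silently skipped
        self.seen.add(url)
        self.queue.append(url)
        return True


scheduler = ToyScheduler()
page_url = "http://www.viagogo.com/Concert-Tickets/Rock-and-Pop"

first = scheduler.enqueue(page_url)                     # the request that produced the response
again = scheduler.enqueue(page_url)                     # same URL again, as Parse_Page does
forced = scheduler.enqueue(page_url, dont_filter=True)  # explicitly bypasses the filter

print(first, again, forced)
```

In this model the second call is dropped because the URL has already been seen, which mirrors why the request yielded from Parse_Page never leads to a new page. Bypassing the filter would only fetch the same page again, so it does not solve the pagination problem; reaching the next page still needs a request for a different URL.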