2015-04-12 50 views
0

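Checking one array against another array in Python and Scrapy

I want to set a variable in Python to a string element from one list, chosen according to which string element of another list is in use. I am stuck on how to do this.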

Here are the two lists:

genre = ["Dance", 
    "Festivals", 
    "Rock/pop" 
    ] 

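I am trying to print the genre that corresponds to the element in use from the other list of three, i.e. when start_urls[0] is being scraped, the genre should be genre[0]: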

start_urls = [ 
    "http://www.allgigs.co.uk/whats_on/London/clubbing-1.html", 
    "http://www.allgigs.co.uk/whats_on/London/festivals-1.html", 
    "http://www.allgigs.co.uk/whats_on/London/tours-1.html" 
] 

Full code:

genre = ["Dance", 
    "Festivals", 
    "Rock/pop" 
    ] 

class AllGigsSpider(CrawlSpider):
    name = "allGigs"  # Name of the spider. At the command prompt, from the correct folder, enter "scrapy crawl allGigs".
    allowed_domains = ["www.allgigs.co.uk"]  # Allowed domains are strings, NOT URLs.
    start_urls = [
        "http://www.allgigs.co.uk/whats_on/London/clubbing-1.html",
        "http://www.allgigs.co.uk/whats_on/London/festivals-1.html",
        "http://www.allgigs.co.uk/whats_on/London/tours-1.html"
    ]

    rules = [
        Rule(SgmlLinkExtractor(restrict_xpaths='//div[@class="more"]'),  # Search the start URL's for
             callback="parse_item",
             follow=True),
    ]

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):  # http://stackoverflow.com/questions/15836062/scrapy-crawlspider-doesnt-crawl-the-first-landing-page
        for info in response.xpath('//div[@class="entry vevent"]'):
            item = TutorialItem()  # Extract items from the items folder.
            item['artist'] = info.xpath('.//span[@class="summary"]//text()').extract()  # Extract artist information.
            item['date'] = info.xpath('.//span[@class="dates"]//text()').extract()  # Extract date information.
            preview = ''.join(str(s) for s in item['artist'])
            #item['genre'] = i.xpath('.//li[@class="style"]//text()').extract()
            client = soundcloud.Client(client_id='401c04a7271e93baee8633483510e263', client_secret='b6a4c7ba613b157fe10e20735f5b58cc', callback='http://localhost:9000/#/callback.html')
            tracks = client.get('/tracks', q=preview, limit=1)
            for track in tracks:
                print track.id
                for i, val in enumerate(genre):
                    print '{} {}'.format(genre[i], start_urls[i])
                print genre
                #for i, val in enumerate(genre):
                #    print '{} {}'.format(genre[i], start_urls[i])
                item['trackz'] = track.id
                yield item

Any help is appreciated.

+0

If you want to map the two lists to each other, could you use a dict? – Zero
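
A minimal sketch of the dict idea, using the two lists from the question. The names `genre_by_url` and `genre_for` are hypothetical, introduced only for illustration, and the lookup assumes both lists stay in the same order.

```python
# The two lists from the question, paired up by position.
genre = ["Dance", "Festivals", "Rock/pop"]
start_urls = [
    "http://www.allgigs.co.uk/whats_on/London/clubbing-1.html",
    "http://www.allgigs.co.uk/whats_on/London/festivals-1.html",
    "http://www.allgigs.co.uk/whats_on/London/tours-1.html",
]

# zip() pairs element i of one list with element i of the other;
# dict() turns those pairs into a URL -> genre lookup table.
genre_by_url = dict(zip(start_urls, genre))


def genre_for(url):
    """Return the genre for a start URL, or None if the URL is not listed."""
    return genre_by_url.get(url)


print(genre_for(start_urls[0]))
```

A dict avoids looping over both lists on every lookup, but it only matches URLs that appear in `start_urls` exactly.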

+0

Please post your expected output. – itzMEonTV

+0

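My expected output is simply for item['genre'] to be set to whatever corresponds to the link being scraped. So the first URL would send only the string "Dance" to my database. –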

Answers

0
for i, val in enumerate(genre): 
    print '{} {}'.format(genre[i], start_urls[i]) 

should work.

+0

I get an error saying the global variable 'start_urls' is not defined. I will edit the question with the full code... and thank you :) –

+0

Your start_urls is a class attribute, so you have to access it through self, like this: self.start_urls[i] –
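
To illustrate the scoping rule behind this comment: names defined in a class body are not visible as bare names inside the class's methods, so they have to be reached through `self` (or the class itself). `DemoSpider` below is a hypothetical stand-in that does not use Scrapy; only the attribute lookup is the point.

```python
genre = ["Dance", "Festivals", "Rock/pop"]  # module-level name, visible everywhere


class DemoSpider(object):  # hypothetical stand-in for the CrawlSpider subclass
    start_urls = [
        "http://www.allgigs.co.uk/whats_on/London/clubbing-1.html",
        "http://www.allgigs.co.uk/whats_on/London/festivals-1.html",
        "http://www.allgigs.co.uk/whats_on/London/tours-1.html",
    ]

    def broken(self):
        # Bare name: Python looks in the method's locals, then module globals,
        # and never in the class body, so this raises NameError.
        return start_urls[0]

    def fixed(self):
        # Attribute lookup on the instance falls back to the class attribute.
        return self.start_urls[0]


spider = DemoSpider()
print(spider.fixed())

try:
    spider.broken()
except NameError as exc:
    print("NameError: {}".format(exc))
```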

+0

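That's awesome, it works better. However, this prints all three genres and all three URLs. I just want to print the genre that matches the URL being scraped, if that makes sense? –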
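
One possible way to get only the matching genre is to derive it from the URL of the response being parsed, rather than looping over both lists. The sketch below assumes that every page of a category keeps the same leading word in its file name (for example `clubbing-2.html` after `clubbing-1.html`), which the question does not confirm. `GENRE_BY_SLUG` and `genre_for_url` are hypothetical names.

```python
# Hypothetical mapping from the leading word of the page file name to a genre.
# The three keys come from the start URLs in the question.
GENRE_BY_SLUG = {
    "clubbing": "Dance",
    "festivals": "Festivals",
    "tours": "Rock/pop",
}


def genre_for_url(url):
    """Return the genre for a listing URL, or None if the slug is unknown."""
    page = url.rsplit("/", 1)[-1]   # e.g. "clubbing-1.html"
    slug = page.split("-", 1)[0]    # e.g. "clubbing"
    return GENRE_BY_SLUG.get(slug)


print(genre_for_url("http://www.allgigs.co.uk/whats_on/London/clubbing-1.html"))
```

Inside `parse_item`, this could be used as `item['genre'] = genre_for_url(response.url)`. Whether pages reached through the followed links also match depends on the URL assumption above.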
