2015-10-17 75 views
0

我想扫描一个网站并下载其中的图像。 例如,对于像这样的网站URL:a.example.com/2vZBkE.jpg,我需要一个机器人从a.example.com/aaaaaa.jpga.example.com/AAAAAA.jpga.example.com/999999.jpg,如果有图像,请保存URL或下载图像。扫描图片的网址格式?

我尝试过使用Python和Scrapy,但我对它很陌生。 这是据我可以去:

import scrapy 

from scrapy.contrib.spiders import Rule, CrawlSpider 
from scrapy.contrib.linkextractors import LinkExtractor 
from example.items import ExampleItem 

class exampleSpider(CrawlSpider): 
    name = 'example' 
    allowed_domains = ['example.com'] 
    start_urls = ['http://a.example.com/2vZBkE'] 
    #rules = [Rule(LinkExtractor(allow=['/.*']),'parse_example')] 
    rules = (Rule(SgmlLinkExtractor(allow=('\/%s\/.*',)), callback='parse_example'), 
    ) 

    def parse_example(self,response): 
     image = ExampleItem() 
     image['title']=response.xpath(\ 
      "//h5[@id='image-title']/text()").extract() 

     rel = response.xpath("//img/@src").extract() 
     image ['image_urls'] = ['http:'+rel[0]] 
     return image 

我想我需要改变这一行:

rules = (Rule(SgmlLinkExtractor(allow=('\/%s\/.*',)), callback='parse_example'), 
     ) 

以某种方式限制%s 6个字符,并Scrapy尝试可能的组合。有任何想法吗?

+0

因此,您想以“a.example.com/ {id} .jpg”的形式下载所有图像吗? – Chaker

+0

是的。我需要检查该ID是否有图像,然后下载它。 – Scraper

回答

0

我不知道Scrapy。

“= \:但是你可以用requestsitertools

from string import ascii_letters, digits 
from itertools import product 
import requests 

# You need to implement this function to download images 
# check this http://stackoverflow.com/questions/13137817 
def download_image(url): 
    print url 

def check_link(url): 
    r = requests.head(url) 
    if r.status_code == 200: 
     return True 
    return False 

# Check all possible combinations and check them 
def generate_links(): 
    base_url = "http://a.example.com/{}.jpg" 
    for combination in product(ascii_letters+digits, repeat=5): 
     url = base_url.format("".join(combination)) 
     if check_link(url): 
      download_image(url) 
+0

你确定这工作?我更改网址并运行,但没有发生任何事情。 – Scraper

0

提取像 HREF = “a.example.com/123456.jpg”

使用下面的正则表达式的链接做“(\ S +/[\ w \ d] {6} .jpg)”