
How to crawl multiple keywords with the Python icrawler

I have an array of keywords:

array = ['table', 'chair', 'pen'] 

For each item in my array, I want to crawl 5 images from Google Image Search with the Python icrawler.

Here is the initialization:

from icrawler.builtin import GoogleImageCrawler 

google_crawler = GoogleImageCrawler(
    parser_threads=2, 
    downloader_threads=4, 
    storage={ 'root_dir': 'images' } 
) 

I use a loop to crawl each item:

for item in array:
    google_crawler.crawl(
        keyword=item,
        offset=0,
        max_num=5,
        min_size=(500, 500)
    )

However, I get this error log:

File "crawler.py", line 20, in <module> 
    min_size=(500, 500) 
    File "/home/user/opt/miniconda3/envs/pak/lib/python3.6/site-packages/icrawler/builtin/google.py", line 83, in crawl 
    feeder_kwargs=feeder_kwargs, downloader_kwargs=downloader_kwargs) 
    File "/home/user/opt/miniconda3/envs/pak/lib/python3.6/site-packages/icrawler/crawler.py", line 166, in crawl 
    self.feeder.start(**feeder_kwargs)         
    File "/home/user/opt/miniconda3/envs/pak/lib/python3.6/site-packages/icrawler/utils/thread_pool.py", line 66, in start 
    worker.start()              
    File "/home/user/opt/miniconda3/envs/pak/lib/python3.6/threading.py", line 842, in start 
    raise RuntimeError("threads can only be started once") 
RuntimeError: threads can only be started once 

It seems that google_crawler.crawl cannot be called more than once. How can I fix this?
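The bottom of the traceback points at Python's own threading module rather than icrawler itself: a threading.Thread object can only be started once, and a second crawl() call tries to restart the same worker threads. As a minimal sketch, independent of icrawler (not from the original post), the same RuntimeError can be reproduced with a bare Thread:

import threading

# A Thread object cannot be restarted; this is the same RuntimeError
# that icrawler's worker pool hits on the second crawl() call.
worker = threading.Thread(target=lambda: None)
worker.start()
worker.join()

try:
    worker.start()  # second start() on the same Thread object
except RuntimeError as err:
    print(err)  # threads can only be started once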

Answer


In the latest version, you can use it like this:

from icrawler.builtin import GoogleImageCrawler 

google_crawler = GoogleImageCrawler(
    parser_threads=2, 
    downloader_threads=4, 
    storage={'root_dir': 'images'} 
) 

for keyword in ['cat', 'dog']:
    google_crawler.crawl(
        keyword=keyword, max_num=5, min_size=(500, 500), file_idx_offset='auto')
    # Setting `file_idx_offset` to 'auto' prevents the 5 dog images from
    # being named 000001.jpg to 000005.jpg (which would overwrite the cat
    # images); they are numbered starting from 000006.jpg instead.
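Note that with this approach the images for every keyword share the single images directory; file_idx_offset='auto' only keeps the numbering continuous across crawl() calls so later keywords do not overwrite earlier downloads.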

Alternatively, if you want the images for each keyword downloaded into a separate folder, you can simply create one GoogleImageCrawler instance per keyword.

from icrawler.builtin import GoogleImageCrawler 

for keyword in ['cat', 'dog']:
    google_crawler = GoogleImageCrawler(
        parser_threads=2,
        downloader_threads=4,
        storage={'root_dir': 'images/{}'.format(keyword)}
    )
    google_crawler.crawl(
        keyword=keyword, max_num=5, min_size=(500, 500))
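Since each loop iteration builds a brand-new crawler, its worker threads are created fresh each time, which sidesteps the "threads can only be started once" error from the question.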

PS: You can open an issue on GitHub to get a faster response.