2016-03-05 59 views
6

This question has been asked countless times, but all of the answers are at least a couple of years old and are based on the ajax.googleapis.com API, which is no longer supported. How do I download Google Image Search results in Python?

Does anyone know of another way? I'm trying to download a hundred or so search results. Besides Python APIs, I've tried many desktop-based, browser-based, and browser-plugin programs for this, and all of them have failed.

Thanks!

+1

Have you tried Selenium? –

+0

What do you mean by "Google image search results"? – wong2

+0

Selenium solved it! I used the code from https://simplypython.wordpress.com/2015/05/18/saving-images-from-google-search-using-selenium-and-python/ with a slight change to the scrolling code. (Jumping straight to the bottom of the page doesn't *necessarily* make a lazily loaded page load all of its images, so I made it scroll gradually.) – xanderflood
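The gradual-scroll change described in this comment can be sketched roughly as follows. This is a hedged sketch, not the commenter's actual code: `scroll_offsets` is a hypothetical helper, `driver` would be a Selenium WebDriver, and the 1000-pixel step and 0.2-second pause are tuning guesses.

```python
def scroll_offsets(page_height, step=1000):
    """Successive scrollTo targets that walk down the page in step-pixel hops,
    instead of jumping straight to the bottom."""
    return list(range(step, page_height + step, step))

# With a live Selenium driver the loop would look roughly like:
#   import time
#   height = driver.execute_script("return document.body.scrollHeight")
#   for y in scroll_offsets(height):
#       driver.execute_script("window.scrollTo(0, arguments[0])", y)
#       time.sleep(0.2)  # give the lazy loader time to fetch images
```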

Answers

4

Use Google Custom Search to achieve what you want. See @i08in's answer to "Python - Download Images from google Image search?" — it has a great description, example scripts, and library references.

Good luck!
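For reference, a request to the Custom Search JSON API for image results can be built like this. This is a minimal sketch, assuming you already have an API key and a custom search engine ID from the Developer Console — `YOUR_API_KEY` and `YOUR_CX` below are placeholders, not real values.

```python
from urllib.parse import urlencode

CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def build_image_search_url(query, api_key, cx, start=1):
    """Build a Custom Search JSON API request URL for image results.

    The API returns at most 10 results per request; page through larger
    result sets by advancing the `start` parameter.
    """
    params = {
        "key": api_key,
        "cx": cx,
        "q": query,
        "searchType": "image",  # restrict results to images
        "num": 10,
        "start": start,
    }
    return CSE_ENDPOINT + "?" + urlencode(params)

# Example: the second page of results for "kittens"
url = build_image_search_url("kittens", "YOUR_API_KEY", "YOUR_CX", start=11)
```

Fetching `url` (with real credentials) returns a JSON document whose `items` each carry the image link.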

+0

I'm accepting this because it certainly answers the question! I also want to point out that Google's APIs have restrictions — for example, they prohibit users from using them to automatically collect search results, which is what I was trying to do — so this approach may run into licensing problems. @Morgan G's suggestion to use Selenium worked great for me! – xanderflood

0

You need to use the Custom Search API. There's a handy explorer here. I use urllib2. You'll also need to create an API key for your application in the Developer Console.
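Since each Custom Search request returns at most 10 items, fetching the hundred or so images the question asks for means paging with the API's `start` parameter. A hedged sketch of the offset arithmetic (`page_starts` is a hypothetical helper, not part of any Google library):

```python
def page_starts(total, per_page=10):
    """Start offsets for paging through `total` results, `per_page` per request.

    The Custom Search API indexes results from 1, so 100 results need
    requests with start=1, 11, 21, ..., 91.
    """
    pages = (total + per_page - 1) // per_page  # ceiling division
    return [1 + i * per_page for i in range(pages)]
```

Each offset would then go into one API request's `start` query parameter.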

2

I've been using this script to download images from Google search, and I've been using the results to train my classifiers. The code below can download 100 images related to a query:

from bs4 import BeautifulSoup
import json
import os
import urllib2

def get_soup(url, header):
    return BeautifulSoup(urllib2.urlopen(urllib2.Request(url, headers=header)), 'html.parser')


query = raw_input("query image: ")  # you can change the query for the image here
image_type = "ActiOn"
query = '+'.join(query.split())
url = "https://www.google.co.in/search?q=" + query + "&source=lnms&tbm=isch"
print url

# add the directory for your images here
DIR = "Pictures"
header = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
soup = get_soup(url, header)


ActualImages = []  # contains the links to the large original images and their types
for a in soup.find_all("div", {"class": "rg_meta"}):
    meta = json.loads(a.text)  # parse the metadata JSON once
    ActualImages.append((meta["ou"], meta["ity"]))

print "there are a total of", len(ActualImages), "images"

if not os.path.exists(DIR):
    os.mkdir(DIR)
DIR = os.path.join(DIR, query.split('+')[0])  # subdirectory named after the first query word

if not os.path.exists(DIR):
    os.mkdir(DIR)

for i, (img, img_type) in enumerate(ActualImages):
    try:
        req = urllib2.Request(img, headers=header)  # pass the header dict itself, not nested under 'User-Agent'
        raw_img = urllib2.urlopen(req).read()

        cntr = len([f for f in os.listdir(DIR) if image_type in f]) + 1
        print cntr
        ext = img_type if img_type else "jpg"  # fall back to .jpg when no type is reported
        f = open(os.path.join(DIR, image_type + "_" + str(cntr) + "." + ext), 'wb')
        f.write(raw_img)
        f.close()
    except Exception as e:
        print "could not load:", img
        print e
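The scraping above hinges on parsing the JSON blob inside each `rg_meta` div for its `"ou"` (original URL) and `"ity"` (image type) keys. That step can be isolated into a helper that tolerates malformed entries — a sketch, where `meta_blobs` stands for the `.text` of each div the script finds:

```python
import json

def extract_image_links(meta_blobs):
    """Extract (url, type) pairs from the JSON text of rg_meta divs.

    Entries that are not valid JSON or are missing the expected keys
    are skipped rather than crashing the whole scrape.
    """
    links = []
    for blob in meta_blobs:
        try:
            meta = json.loads(blob)
            links.append((meta["ou"], meta["ity"]))
        except (ValueError, KeyError):
            continue
    return links

# Example with the two keys the script relies on:
blobs = ['{"ou": "http://example.com/cat.png", "ity": "png"}', 'not json']
print(extract_image_links(blobs))  # [('http://example.com/cat.png', 'png')]
```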
2

Download any number of images from a Google Image search using Selenium:

from selenium import webdriver
import json
import os
import sys
import time
import urllib2

# add the path to geckodriver to the OS PATH variable,
# assuming it is stored in the same directory as this script
os.environ["PATH"] += os.pathsep + os.getcwd()
download_path = "dataset/"

def main():
    searchtext = sys.argv[1]  # the search query
    num_requested = int(sys.argv[2])  # number of images to download
    number_of_scrolls = num_requested / 400 + 1
    # number_of_scrolls * 400 images will be opened in the browser

    target_dir = download_path + searchtext.replace(" ", "_")
    if not os.path.exists(target_dir):
        os.makedirs(target_dir)

    url = "https://www.google.co.in/search?q=" + searchtext + "&source=lnms&tbm=isch"
    driver = webdriver.Firefox()
    driver.get(url)

    headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"}
    extensions = {"jpg", "jpeg", "png", "gif"}
    img_count = 0
    downloaded_img_count = 0

    for _ in xrange(number_of_scrolls):
        for __ in xrange(10):
            # multiple scrolls are needed to show all 400 images
            driver.execute_script("window.scrollBy(0, 1000000)")
            time.sleep(0.2)
        # pause, then load the next 400 images
        time.sleep(0.5)
        try:
            driver.find_element_by_xpath("//input[@value='Show more results']").click()
        except Exception as e:
            print "Fewer images found:", e
            break

    # imges = driver.find_elements_by_xpath('//div[@class="rg_meta"]')  # not working anymore
    imges = driver.find_elements_by_xpath('//div[contains(@class,"rg_meta")]')
    print "Total images:", len(imges), "\n"
    for img in imges:
        img_count += 1
        meta = json.loads(img.get_attribute('innerHTML'))  # parse the metadata JSON once
        img_url = meta["ou"]
        img_type = meta["ity"]
        print "Downloading image", img_count, ":", img_url
        try:
            if img_type not in extensions:
                img_type = "jpg"
            req = urllib2.Request(img_url, headers=headers)
            raw_img = urllib2.urlopen(req).read()
            f = open(target_dir + "/" + str(downloaded_img_count) + "." + img_type, "wb")
            f.write(raw_img)
            f.close()  # close() must be called, not just referenced
            downloaded_img_count += 1
        except Exception as e:
            print "Download failed:", e
        finally:
            print
        if downloaded_img_count >= num_requested:
            break

    print "Total downloaded:", downloaded_img_count, "/", img_count
    driver.quit()

if __name__ == "__main__":
    main()

The full code is here.

+0

It's not working — could you modify it? –

+0

Can you show your error? – atif93

+1

I've updated the code; it should work now. – atif93