2016-02-16 123 views

I want to batch-download images from Google Image Search, scraping Google Images with Python 3 (requests + BeautifulSoup).

My first approach — downloading the page source to a file and then opening it with open() — works fine, but I want to be able to fetch the image URLs just by running the script and changing the keyword.

First approach: go to the image search (https://www.google.no/search?q=tower&client=opera&hs=UNl&source=lnms&tbm=isch&sa=X&ved=0ahUKEwiM5fnf4_zKAhWIJJoKHYUdBg4Q_AUIBygB&biw=1920&bih=982). View the page source in the browser and save it as an HTML file. When I then open() that HTML file with the script, everything works as expected and I get a tidy list of the URLs of all the images on the search page. That is line 6 of the script (uncomment it to test).

But if I use the requests.get() function to fetch the page, as on line 7 of the script, I get a different HTML document that does not contain the full URLs of the images, so I cannot extract them.

Please help me extract the correct image URLs.

Edit: link to the tower.html file I used: https://www.dropbox.com/s/yy39w1oc8sjkp3u/tower.html?dl=0

Here is the code I have written so far:

import requests 
from bs4 import BeautifulSoup 

# define the url to be scraped 
url = 'https://www.google.no/search?q=tower&client=opera&hs=cTQ&source=lnms&tbm=isch&sa=X&ved=0ahUKEwig3LOx4PzKAhWGFywKHZyZAAgQ_AUIBygB&biw=1920&bih=982' 

# top line is using the attached "tower.html" as source, bottom line is using the url. The html file contains the source of the above url. 
#page = open('tower.html', 'r').read() 
page = requests.get(url).text 

# parse the text as html 
soup = BeautifulSoup(page, 'html.parser') 

# iterate over all "a" elements 
for raw_link in soup.find_all('a'): 
    link = raw_link.get('href') 
    # if the link is a string and contains "imgurl" (there are other links on the page that are not interesting)... 
    if type(link) == str and 'imgurl' in link: 
        # print the part of the link between "=" and "&" (which is the actual url of the image) 
        print(link.split('=')[1].split('&')[0]) 
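Splitting on "=" and "&" is fragile if the parameter order ever changes; the standard library's urllib.parse can pull out the imgurl parameter explicitly. A short sketch, using a hypothetical href of the shape found in the saved tower.html (the example link below is made up for illustration):

```python
from urllib.parse import urlparse, parse_qs

# hypothetical href of the shape found in the saved page source
link = '/imgres?imgurl=http%3A%2F%2Fexample.com%2Ftower.jpg&imgrefurl=http%3A%2F%2Fexample.com%2F&h=1080&w=1920'

# parse_qs decodes the percent-encoding and returns a dict of lists
params = parse_qs(urlparse(link).query)
print(params['imgurl'][0])  # http://example.com/tower.jpg
```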

Answer


Just to make you aware:

# http://www.google.com/robots.txt 

User-agent: * 
Disallow: /search 
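Python's standard library can evaluate rules like these directly. A minimal sketch with urllib.robotparser, feeding it the two lines above rather than fetching the file over the network:

```python
from urllib.robotparser import RobotFileParser

# parse the robots.txt rules shown above
rp = RobotFileParser()
rp.parse([
    'User-agent: *',
    'Disallow: /search',
])

print(rp.can_fetch('*', 'https://www.google.com/search?q=tower'))  # False
print(rp.can_fetch('*', 'https://www.google.com/maps'))            # True
```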



The gist of my answer is that Google relies heavily on scripts. You are most likely getting different results because the page you retrieve via requests does nothing with the scripts it contains, whereas loading the page in a web browser executes them.

Here's what I get when I request the URL you supplied.

The text I get back from requests.get(url).text does not contain 'imgurl' anywhere. Your script is looking for that as part of its criteria, and it simply isn't there.

However, I do see a bunch of <img> tags whose src attribute is set to an image URL. If that's what you're after, try this script:

import requests 
from bs4 import BeautifulSoup 

url = 'https://www.google.no/search?q=tower&client=opera&hs=cTQ&source=lnms&tbm=isch&sa=X&ved=0ahUKEwig3LOx4PzKAhWGFywKHZyZAAgQ_AUIBygB&biw=1920&bih=982' 

# page = open('tower.html', 'r').read() 
page = requests.get(url).text 

soup = BeautifulSoup(page, 'html.parser') 

for raw_img in soup.find_all('img'): 
    link = raw_img.get('src') 
    if link: 
        print(link) 
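Since the goal is batch downloading, the printed src URLs can also be saved to disk. A minimal sketch using only the standard library; the browser-like User-Agent header and the "thumbs" directory name are assumptions on my part, not something from the original answer:

```python
import os
import urllib.request

def thumb_path(i, dest='thumbs'):
    # thumbs/img_000.jpg, thumbs/img_001.jpg, ...
    return os.path.join(dest, 'img_%03d.jpg' % i)

def download_all(urls, dest='thumbs'):
    """Save each image URL in urls to a numbered file under dest."""
    os.makedirs(dest, exist_ok=True)
    for i, u in enumerate(urls):
        # Google tends to reject urllib's default User-Agent, so send a
        # browser-like one (assumed workaround; adjust as needed)
        req = urllib.request.Request(u, headers={'User-Agent': 'Mozilla/5.0'})
        with urllib.request.urlopen(req) as resp, open(thumb_path(i, dest), 'wb') as f:
            f.write(resp.read())

# usage: download_all(list_of_src_urls)
```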

It returns results like these:

https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQyxRHrFw0NM-ZcygiHoVhY6B6dWwhwT4va727380n_IekkU9sC1XSddAg 
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRfuhcCcOnC8DmOfweuWMKj3cTKXHS74XFh9GYAPhpD0OhGiCB7Z-gidkVk 
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSOBZ9iFTXR8sGYkjWwPG41EO5Wlcv2rix0S9Ue1HFcts4VcWMrHkD5y10 
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcTEAZM3UoqqDCgcn48n8RlhBotSqvDLcE1z11y9n0yFYw4MrUFucPTbQ0Ma 
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcSJvthsICJuYCKfS1PaKGkhfjETL22gfaPxqUm0C2-LIH9HP58tNap7bwc 
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQGNtqD1NOwCaEWXZgcY1pPxQsdB8Z2uLGmiIcLLou6F_1c55zylpMWvSo 
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSdRxvQjm4KWaxhAnJx2GNwTybrtUYCcb_sPoQLyAde2KMBUhR-65cm55I 
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQLVqQ7HLzD7C-mZYQyrwBIUjBRl8okRDcDoeQE-AZ2FR0zCPUfZwQ8Q20 
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQHNByVCZzjSuMXMd-OV7RZI0Pj7fk93jVKSVs7YYgc_MsQqKu2v0EP1M0 
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcS_RUkfpGZ1xJ2_7DCGPommRiIZOcXRi-63KIE70BHOb6uRk232TZJdGzc 
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSxv4ckWM6eg_BtQlSkFP9hjRB6yPNn1pRyThz3D8MMaLVoPbryrqiMBvlZ 
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQWv_dHMr5ZQzOj8Ort1gItvLgVKLvgm9qaSOi4Uomy13-gWZNcfk8UNO8 
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRRwzRc9BJpBQyqLNwR6HZ_oPfU1xKDh63mdfZZKV2lo1JWcztBluOrkt_o 
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQdGCT2h_O16OptH7OofZHNvtUhDdGxOHz2n8mRp78Xk-Oy3rndZ88r7ZA 
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRnmn9diX3Q08e_wpwOwn0N7L1QpnBep1DbUFXq0PbnkYXfO0wBy6fkpZY 
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSaP9Ok5n6dL5K1yKXw0TtPd14taoQ0r3HDEwU5F9mOEGdvcIB0ajyqXGE 
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTcyaCvbXLYRtFspKBe18Yy5WZ_1tzzeYD8Obb-r4x9Yi6YZw83SfdOF5fm 
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTnS1qCjeYrbUtDSUNcRhkdO3fc3LTtN8KaQm-rFnbj_JagQEPJRGM-DnY0 
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSiX_elwJQXGlToaEhFD5j2dBkP70PYDmA5stig29DC5maNhbfG76aDOyGh 
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQb3ughdUcPUgWAF6SkPFnyiJhe9Eb-NLbEZl_r7Pvt4B3mZN1SVGv0J-s 

I have tried using urllib, which mostly gave me "Forbidden" back when scraping — which I believe is because of the Disallow you mentioned. urllib works for everything except Google Images. I understand there is no 'imgurl' anywhere in the text that requests retrieves. The results you are getting are thumbnails of the images; that's better than nothing, but I'd like to harvest the full-resolution images, and the fetched page never contains those. Is there any way to make requests follow the scripts and actually get the addresses of the source images? –


That's why it gives you "Forbidden" back. They've built an entire module for parsing a site's robots.txt file and determining whether crawling is allowed. You could try the 're' library and use regular expressions to find the values, but I think Google's search pages are hard to parse... and they make them hard for a reason. – ngoue


Anyway, thanks for the thumbnail-extraction edit :) –