2016-05-12

How can we get all the images from this site: http://www.theft-alerts.com? We need the images from 19 pages. So far we have this code, but it doesn't work. We want the images saved into a new folder. How do we scrape the images from the website?

#!/usr/bin/python 

import urllib2 
from bs4 import BeautifulSoup 
from urlparse import urljoin 

url = "http://www.theft-alerts.com/index-%d.html" 
page = urllib2.urlopen(url).read() 
soup = BeautifulSoup(page, "html.parser") 

base = "http://www.theft-alerts.com" 

images = [urljoin(base,a["href"]) for a in soup.select('td a[href^="images/"]')] 

for url in images: 
    img = BeautifulSoup(urllib2.urlopen(url).read(),"lxml").find("img")["src"] 
    with open("myimages/{}".format(img), "w") as f: 
        f.write(urllib2.urlopen("{}/{}".format(url.rsplit("/", 1)[0], img)).read()) 

"It doesn't work" — do you know why? At the very least, your url contains a parameter that you haven't filled in. –

Answer


You need to iterate over each page and extract the images. You can keep looping until there is no longer an anchor with the text "Next" inside the code tag with class resultnav:

import requests 

from bs4 import BeautifulSoup 
from urlparse import urljoin 

def get_pages(start): 
    # Yield the image urls from the first page, then keep following 
    # the "Next" link until it disappears (i.e. the last page). 
    soup = BeautifulSoup(requests.get(start).content, "html.parser") 
    yield [img["src"] for img in soup.select("div.itemspacingmodified a img")] 
    nxt = soup.select("code.resultnav a")[-1] 
    while nxt.text == "Next": 
        soup = BeautifulSoup(requests.get(urljoin(start, nxt["href"])).content, "html.parser") 
        yield [img["src"] for img in soup.select("div.itemspacingmodified a img")] 
        nxt = soup.select("code.resultnav a")[-1] 

url = "http://www.theft-alerts.com/" 

for images in get_pages(url): 
    print(images) 

That gives you the images from all 19 pages.
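The generator above only yields lists of image URLs; it doesn't save anything. To also write the files into a new folder, as the question asks, something along these lines should work — `save_images` and `image_filename` are hypothetical helper names, and the `myimages` folder name is taken from the question's own code:

```python
import os
import posixpath


def image_filename(src):
    # Derive a local filename from an image src,
    # e.g. "images/123.jpg" or a full URL -> "123.jpg".
    return posixpath.basename(src)


def save_images(image_urls, dest="myimages"):
    # Download every image URL into dest, creating the folder first.
    import requests  # same library the answer already uses
    if not os.path.isdir(dest):
        os.makedirs(dest)
    for src in image_urls:
        with open(os.path.join(dest, image_filename(src)), "wb") as f:
            f.write(requests.get(src).content)
```

You would call `save_images(images)` inside the `for images in get_pages(url)` loop instead of `print(images)`. Note the `"wb"` mode: images are binary, so writing them in text mode would corrupt them on some platforms.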