Python BeautifulSoup web image scraper IOError: [Errno 2] No such file or directory

I wrote the following Python code to scrape images from the website www.style.com:

import urllib2, urllib, random, threading
from bs4 import BeautifulSoup
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

class Images(threading.Thread):
    def __init__(self, lock, src):
        threading.Thread.__init__(self)
        self.src = src
        self.lock = lock

    def run(self):
        self.lock.acquire()
        urllib.urlretrieve(self.src,'./img/'+str(random.choice(range(9999))))
        print self.src+'get'
        self.lock.release()

def imgGreb():
    lock = threading.Lock()
    site_url = "http://www.style.com"
    html = urllib2.urlopen(site_url).read()
    soup = BeautifulSoup(html)
    img=soup.findAll(['img'])
    for i in img:
        print i.get('src')
        Images(lock, i.get('src')).start()

if __name__ == '__main__':
    imgGreb()

When I run it, I get this error:

IOError: [Errno 2] No such file or directory: '/images/homepage-2013-october/header/logo.png'

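How can I fix this?

Also, is it possible to recursively find all the images on the site? I mean the images that are not on the home page.

Thanks!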


2013-11-03 randomp

The error you mention does not appear anywhere in the code. – aIKid

You should post the full traceback given by Python. –

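When you try to retrieve the image, you are passing a relative path without the domain. urllib.urlretrieve treats a string with no scheme as a local file path, which is why it raises IOError: [Errno 2].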
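To illustrate the point about relative paths, here is a minimal sketch of resolving a scraped src against the page URL. It assumes Python 3, where urljoin lives in urllib.parse (in Python 2 it is in the urlparse module). The helper name absolute_src is hypothetical and not part of the original code.

```python
from urllib.parse import urljoin


def absolute_src(page_url, src):
    """Resolve a possibly relative src attribute against the page URL.

    absolute_src is a hypothetical helper name used only for illustration.
    """
    return urljoin(page_url, src)


# The relative path from the traceback becomes a fetchable URL.
print(absolute_src("http://www.style.com",
                   "/images/homepage-2013-october/header/logo.png"))
```

Unlike plain string concatenation, urljoin also leaves a src that is already absolute unchanged, so it should be safe to apply to every src on the page.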
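Some of the images are JavaScript-based, so the src you get is javascript:void(0);, which can never be fetched as a page. I added a try/except to get around that error. Alternatively, you can be smarter and check whether the URL ends in jpg/gif/png. I'll leave that part to you :)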
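The extension check mentioned above could look roughly like the following sketch. It assumes Python 3 and a hypothetical helper named is_image_url. It also rejects anything that is not an http(s) URL, which covers the javascript:void(0); case.

```python
import posixpath
from urllib.parse import urlparse

# The three extensions named in the answer; extend the set as needed.
IMAGE_EXTENSIONS = {".jpg", ".gif", ".png"}


def is_image_url(url):
    """Return True for http(s) URLs whose path ends in a known image extension.

    is_image_url is a hypothetical helper name used only for illustration.
    """
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False  # rejects javascript:void(0); and other non-web schemes
    # Look only at the path, so a query string such as ?v=2 does not interfere.
    ext = posixpath.splitext(parts.path)[1].lower()
    return ext in IMAGE_EXTENSIONS
```

Checking the parsed path rather than the raw string is a design choice: splitting the whole URL on '.' would misjudge URLs that carry query parameters.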
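By the way, not all of the images are present in the HTML returned for the URL. Some pictures, often the nicest ones, are loaded by JavaScript calls, and there is nothing we can do about those with urllib and BeautifulSoup alone. If you really want to challenge yourself, you could try learning Selenium, which is a more powerful tool.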

Try the code below directly:

import urllib2
from urllib import urlretrieve
from urlparse import urljoin
from bs4 import BeautifulSoup


def imgGreb():
    site_url = "http://www.style.com"
    html = urllib2.urlopen(site_url).read()
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.findAll('img'):
        src = img.get('src')
        if not src:
            # skip <img> tags that have no src attribute
            continue
        try:
            # build the complete URL from the site URL and the scraped src
            url = urljoin(site_url, src)
            # use the last path segment as the local file name
            name = "result_" + url.split('/')[-1]
            # only download the picture types we want
            ext = name.split('.')[-1].lower()
            if ext in ('jpg', 'png', 'gif'):
                urlretrieve(url, name)
        except Exception:
            # skip anything that cannot be fetched, e.g. javascript:void(0);
            pass

if __name__ == '__main__':
    imgGreb()
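The question also asks about finding images beyond the home page, which the code above does not cover. One possible approach is a breadth-first crawl that follows links on the same host and collects image URLs from each page. The sketch below is only an illustration under several assumptions: it targets Python 3, it uses the standard-library HTMLParser instead of BeautifulSoup so that it is self-contained, and the names LinkAndImageParser, crawl_images, fetch and max_pages are all hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkAndImageParser(HTMLParser):
    """Collect href values of <a> tags and src values of <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])


def is_web_url(url):
    """True for http(s) URLs; filters out javascript:, mailto: and the like."""
    return urlparse(url).scheme in ("http", "https")


def crawl_images(start_url, fetch, max_pages=50):
    """Breadth-first crawl of same-host pages; returns a set of image URLs.

    fetch is a caller-supplied function that maps a URL to its HTML text
    (hypothetical parameter, so the crawl logic needs no network by itself).
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    images = set()
    pages = 0
    while queue and pages < max_pages:
        page_url = queue.popleft()
        pages += 1
        try:
            html = fetch(page_url)
        except Exception:
            continue  # skip pages that fail to load
        parser = LinkAndImageParser()
        parser.feed(html)
        for src in parser.images:
            image_url = urljoin(page_url, src)
            if is_web_url(image_url):
                images.add(image_url)
        for href in parser.links:
            link = urljoin(page_url, href).split("#")[0]  # drop fragments
            same_host = urlparse(link).netloc == host
            if is_web_url(link) and same_host and link not in seen:
                seen.add(link)
                queue.append(link)
    return images
```

For real use, fetch could wrap urllib.request.urlopen and decode the response bytes. The max_pages limit is there because a site-wide crawl can otherwise grow without bound, and images injected by JavaScript will still be missed.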


2013-11-03 17:29:37

Here is the error from the full traceback: InvalidURL: nonnumeric port: 'void(0);' – randomp

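@randomp I removed your OOP part for now, because it was confusing to start with. Maybe you can try this code and see whether it works. If it does, you can bring the OOP back. –

Sure. Thanks a lot! – randomp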
