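How to catch all possible errors with URL fetch (Python)?

In my application, the user enters a URL, and I try to open the link and get the page title. But I realized there can be many different kinds of errors, including unicode characters or newlines in the title, as well as AttributeError and IOError. At first I tried to catch each error individually, but now, if any URL fetch error occurs, I want to redirect to an error page where the user enters the title manually. How do I catch all possible errors? This is the code I have now: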

title = "title"

try:
    soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
    title = str(soup.html.head.title.string)

    if title == "404 Not Found":
        self.redirect("/urlparseerror")
    elif title == "403 - Forbidden":
        self.redirect("/urlparseerror")
    else:
        title = str(soup.html.head.title.string).lstrip("\r\n").rstrip("\r\n")

except UnicodeDecodeError:
    self.redirect("/urlparseerror?error=UnicodeDecodeError")

except AttributeError:
    self.redirect("/urlparseerror?error=AttributeError")

# https url:
except IOError:
    self.redirect("/urlparseerror?error=IOError")

# I tried this else clause to catch any other error
# but it does not work
# this is executed when none of the errors above is true:
#
# else:
#     self.redirect("/urlparseerror?error=some-unknown-error-caught-by-else")

UPDATE

As @Wooble suggested in the comments, I added try...except around writing title to the database:

try:
    new_item = Main(
        ....
        title = unicode(title, "utf-8"))

    new_item.put()

except UnicodeDecodeError:
    self.redirect("/urlparseerror?error=UnicodeDecodeError")

This works, although the out-of-range character â€” is still in title according to the logging output:

***title: 7.2. re â€” Regular expression operations &mdash; Python v2.7.1 documentation**

Do you know why?


2011-03-05 Zeynel

A UnicodeDecodeError is almost certainly caused by your code not handling unicode correctly, not by the user entering invalid data. You should fix your application to handle unicode. – 2011-03-07 23:52:47
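As a rough illustration of what handling unicode could mean here, the sketch below decodes the raw bytes of a title explicitly instead of relying on the implicit ASCII codec. It is written for Python 3 (the code in this thread is Python 2), and `RAW_TITLE`, `decode_title` and `ascii_decode_fails` are hypothetical names chosen for the example, not part of the original code.

```python
# Hypothetical sample: a page title containing an em dash, as UTF-8 bytes.
RAW_TITLE = "7.2. re \u2014 Regular expression operations".encode("utf-8")


def decode_title(raw, encoding="utf-8"):
    """Decode raw title bytes into text and trim surrounding whitespace.

    Bytes that are invalid in the given encoding become U+FFFD instead of
    raising UnicodeDecodeError.
    """
    return raw.decode(encoding, errors="replace").strip()


def ascii_decode_fails(raw):
    """Return True if decoding with the ASCII codec raises UnicodeDecodeError."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError:
        return True
    return False


# The UTF-8 encoding of the em dash starts with byte 0xe2, which ASCII rejects.
print(ascii_decode_fails(RAW_TITLE))
print(decode_title(RAW_TITLE))
```

Assuming the page really is UTF-8, decoding once at the boundary like this leaves a text value that can be stored without a second implicit conversion.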

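You can use except without specifying any type to catch all exceptions.

From the Python docs, http://docs.python.org/tutorial/errors.html (in this example, the last clause handles any exception that is not an IOError or a ValueError):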

import sys 

try: 
    f = open('myfile.txt') 
    s = f.readline() 
    i = int(s.strip()) 
except IOError as (errno, strerror): 
    print "I/O error({0}): {1}".format(errno, strerror) 
except ValueError: 
    print "Could not convert data to an integer." 
except: 
    print "Unexpected error:", sys.exc_info()[0] 
    raise

The last except will catch any exception that has not been caught before.
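One caveat worth knowing: a bare 'except' also catches SystemExit and KeyboardInterrupt, which is why the documentation example re-raises after printing. The sketch below shows the difference from 'except Exception'. It uses Python 3 syntax, and both function names are hypothetical.

```python
def caught_by_bare_except(exc):
    """Raise exc and report whether a bare except clause catches it."""
    try:
        raise exc
    except:  # catches everything, including SystemExit and KeyboardInterrupt
        return True


def caught_by_exception_clause(exc):
    """Raise exc and report whether an 'except Exception' clause catches it."""
    try:
        raise exc
    except Exception:
        return True
    except BaseException:
        # Reached only for exceptions outside the Exception branch.
        return False


print(caught_by_bare_except(SystemExit()))
print(caught_by_exception_clause(SystemExit()))
print(caught_by_exception_clause(ValueError("bad value")))
```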


2011-03-05 23:32:00 Hernan

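OK. I changed the code to use the last 'except' clause, but even now the 'UnicodeDecodeError' is not caught: 'UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 12: ordinal not in range(128)' (there is an em-dash at this URL: 'http://docs.python.org/library/string.html'). What am I doing wrong? – Zeynel 2011-03-05 23:56:32

Thanks for your answer. That solved the problem. – Zeynel 2011-03-06 19:45:00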

You can use the top-level exception type Exception, which will catch any exception that has not been caught before.

http://docs.python.org/library/exceptions.html#exception-hierarchy

try: 

    soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url)) 
    title = str(soup.html.head.title.string) 

    if title == "404 Not Found":
        self.redirect("/urlparseerror")
    elif title == "403 - Forbidden":
        self.redirect("/urlparseerror")
    else:
        title = str(soup.html.head.title.string).lstrip("\r\n").rstrip("\r\n")

except UnicodeDecodeError:  
    self.redirect("/urlparseerror?error=UnicodeDecodeError") 

except AttributeError:   
    self.redirect("/urlparseerror?error=AttributeError") 

#https url:  
except IOError:   
    self.redirect("/urlparseerror?error=IOError") 

except Exception, ex: 
    print "Exception caught: %s" % ex.__class__.__name__
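The same clause ordering can be tried in isolation with a minimal sketch. To keep it runnable without a web framework, the call to self.redirect is replaced by returning the redirect path as a string. `error_redirect_path` is a hypothetical helper, not part of the original handler, and the syntax is Python 3 ('except Exception as exc' instead of 'except Exception, ex').

```python
def error_redirect_path(action):
    """Run action() and return None on success, or a redirect path naming the error."""
    try:
        action()
    except UnicodeDecodeError:
        return "/urlparseerror?error=UnicodeDecodeError"
    except AttributeError:
        return "/urlparseerror?error=AttributeError"
    except IOError:
        return "/urlparseerror?error=IOError"
    except Exception as exc:
        # Catch-all: it must come after the specific clauses, because the
        # first matching clause wins.
        return "/urlparseerror?error=" + type(exc).__name__
    return None


def divide_by_zero():
    return 1 / 0


print(error_redirect_path(divide_by_zero))
print(error_redirect_path(lambda: b"\xe2".decode("ascii")))
print(error_redirect_path(lambda: None))
```

Any error raised inside action() ends up on the error page path, which is the behaviour the question asks for, as long as the failing statement is actually inside the try block.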


2011-03-05 23:56:49 ssoler

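Thanks. But this does not catch the unicode error either. I don't know what I am doing wrong. – Zeynel 2011-03-06 00:04:30

@Zeynel, you can see in Python's exception hierarchy (http://docs.python.org/library/exceptions.html#exception-hierarchy) that UnicodeDecodeError is a subtype of Exception, so it should be caught. Maybe your error is raised in a different part of your code. – ssoler 2011-03-06 00:31:25

@ssoler: Yes, the error happens when I try to write the title to the database. There is a unicode error in the title and it will not write. The whole point of trying to catch the URL fetch errors was to avoid dealing with the Python unicode nightmare. There seems to be no way to catch the unicode error with 'try ... except'. I don't want to deal with unicode issues, so I am giving up... which means the user will need to enter the title when submitting a URL. I am surprised that at this stage of internet technology I cannot get the title of a page without errors! Well, I don't know what to say... – Zeynel 2011-03-06 00:53:03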
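Both points in this comment can be checked with a short sketch: the class hierarchy can be inspected directly, and a try block only guards the statements inside it. The code is Python 3, and `fetch_guarded_store_unguarded` is a hypothetical stand-in for the handler, with the fetch step and the database write each reduced to one line.

```python
# UnicodeDecodeError sits below Exception in the hierarchy.
print(issubclass(UnicodeDecodeError, Exception))
print([cls.__name__ for cls in UnicodeDecodeError.__mro__])


def fetch_guarded_store_unguarded(raw):
    """Guard the 'fetch' step only; the 'store' step runs outside the try."""
    try:
        title = raw  # stand-in for fetching the title; nothing fails here
    except Exception:
        return "caught"
    # Stand-in for the database write: the implicit ASCII decode happens
    # here, after the try block has already finished.
    return title.decode("ascii")


try:
    fetch_guarded_store_unguarded(b"\xe2\x80\x94")
except UnicodeDecodeError as exc:
    print("escaped the handler:", type(exc).__name__)
```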
