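Subclassing the BeautifulSoup HTML parser, getting a TypeError

I wrote a small wrapper around the great BeautifulSoup HTML parser.

Recently I tried to improve the code by making all of the BeautifulSoup methods available directly on the wrapper class (instead of through a class attribute), and I thought subclassing the BeautifulSoup parser would be the best way to achieve this.

Here is the class: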

import re
import urllib2

from BeautifulSoup import BeautifulSoup


class ScrapeInputError(Exception):
    pass


class Scrape(BeautifulSoup):
    """Base class to be subclassed.

    Basically a subclassed BeautifulSoup wrapper that provides
    basic URL fetching with urllib2,
    basic HTML parsing with BeautifulSoup,
    and some basic cleaning of head, scripts, etc.
    """

    def __init__(self, file):
        self._file = file
        # very basic input validation
        if not re.search(r"^http://", self._file):
            raise ScrapeInputError("please enter a url that starts with http://")

        self._page = urllib2.urlopen(self._file)  # fetching the page
        BeautifulSoup.__init__(self, self._page)
        # self._soup = BeautifulSoup(self._page)  # calling the html parser

This way I can initialize the class with

x = Scrape("http://someurl.com")

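and then traverse the tree with x.elem or x.find.

This works wonderfully with some of the BeautifulSoup methods (see above) but fails with others, namely those that use iterators, like "for e in x:".

The error message: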

Traceback (most recent call last): 
    File "<pyshell#86>", line 2, in <module> 
    print e 
    File "C:\Python27\lib\idlelib\rpc.py", line 595, in __call__ 
    value = self.sockio.remotecall(self.oid, self.name, args, kwargs) 
    File "C:\Python27\lib\idlelib\rpc.py", line 210, in remotecall 
    seq = self.asynccall(oid, methodname, args, kwargs) 
    File "C:\Python27\lib\idlelib\rpc.py", line 225, in asynccall 
    self.putmessage((seq, request)) 
    File "C:\Python27\lib\idlelib\rpc.py", line 324, in putmessage 
    s = pickle.dumps(message) 
    File "C:\Python27\lib\copy_reg.py", line 77, in _reduce_ex 
    raise TypeError("a class that defines __slots__ without " 
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled

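I researched the error message but couldn't find anything I could work with, because I don't want to play with BeautifulSoup's internals (and honestly, I don't know or understand `__slots__` or `__getstate__`). I just want to use the functionality.

Instead of subclassing, I also tried returning a BeautifulSoup object from the class's `__init__`, but the `__init__` method returns None.

I'd be glad for any help here.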

Source

2011-10-07 alonisser

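Side note: don't use `re` to test whether a string starts with a substring; that's overkill. Use `str.startswith()` instead (`if not file.startswith("http://"):`).

Thanks, Ferdinand! – alonisser

Another side note: do you really want to prohibit `https://`? (Or `ftp://`, or `file://`?) You might want to rely on `urlopen`'s own validation; it raises `urllib2.URLError` on an unsupported scheme (or `ValueError` if the string has no scheme at all).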
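The two side notes above could be combined roughly like this. It is only a sketch: the helper name `check_url` and the `allowed_schemes` parameter are made up for illustration, and a plain `ValueError` stands in for the question's `ScrapeInputError` to keep the snippet self-contained.

```python
def check_url(url, allowed_schemes=("http://", "https://")):
    """Return url unchanged if it starts with an allowed scheme.

    Hypothetical helper: str.startswith accepts a tuple of prefixes,
    so no regular expression is needed.
    """
    if not url.startswith(tuple(allowed_schemes)):
        raise ValueError(
            "please enter a url that starts with one of: "
            + ", ".join(allowed_schemes)
        )
    return url


# Usage: accepted URLs pass through, anything else raises ValueError.
checked = check_url("https://someurl.com")
```

Whether to keep a whitelist like this or to let `urlopen` reject bad input on its own is a design choice; the whitelist gives a friendlier message, while relying on `urlopen` avoids duplicating its rules.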

That error does not happen in the BeautifulSoup code. Rather, your IDLE is unable to retrieve and print the object. Try `print str(e)` instead.
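The traceback shows the failure inside `pickle.dumps`, which IDLE uses to send the object to its shell process. The same message can be reproduced without BeautifulSoup or IDLE, as in the sketch below. `Slotted` and `pickle_error` are hypothetical names invented for the demonstration, and protocol 0 is chosen because it is the default for Python 2's `pickle.dumps`.

```python
import pickle


class Slotted(object):
    """Hypothetical stand-in for an object whose class defines __slots__."""
    __slots__ = ("text",)

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


def pickle_error(obj, protocol=0):
    """Return the TypeError message raised while pickling obj, or None."""
    try:
        pickle.dumps(obj, protocol)
    except TypeError as exc:
        return str(exc)
    return None


e = Slotted("<p>hello</p>")
slotted_error = pickle_error(e)      # old protocol plus __slots__: fails
string_error = pickle_error(str(e))  # a plain string pickles without trouble
```

This is why converting to a plain string before printing sidesteps the problem: the shell then only has to transfer a `str`, not the original object.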

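In any case, subclassing BeautifulSoup is probably not the best idea here. Do you really want to inherit all of the parsing methods (such as `convert_charref`, `handle_pi`, or `error`)? Worse, if you override something that BeautifulSoup itself uses, it may break in ways that are hard to track down.

I don't know your exact situation, but I suggest preferring composition over inheritance (i.e., keeping a BeautifulSoup object in an attribute). You can easily (if in a slightly hacky way) expose specific methods like this: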

class Scrape(object):
    def __init__(self, ...):
        self.soup = ...
        ...
        self.find = self.soup.find
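A runnable version of the same idea might look like the sketch below. To keep it free of network access and third-party imports, it uses a hypothetical `FakeSoup` class in place of a real BeautifulSoup object and passes the parsed object into the constructor; both choices are assumptions made for the example, not part of the answer above.

```python
class FakeSoup(object):
    """Hypothetical stand-in for a parsed BeautifulSoup document."""

    def __init__(self, tags):
        self._tags = list(tags)

    def find(self, name):
        # Return the first matching tag name, or None if there is none.
        for tag in self._tags:
            if tag == name:
                return tag
        return None

    def findAll(self, name):
        # Return every matching tag name.
        return [tag for tag in self._tags if tag == name]


class Scrape(object):
    """Wrapper that holds a soup object instead of inheriting from it."""

    def __init__(self, soup):
        self.soup = soup
        # Expose only the methods the wrapper is meant to offer.
        self.find = self.soup.find
        self.findAll = self.soup.findAll


x = Scrape(FakeSoup(["head", "p", "p"]))
first_p = x.find("p")
all_p = x.findAll("p")
```

Only the names bound in `__init__` are visible on the wrapper, so parser internals such as `handle_pi` never leak into its interface.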

Source

2011-10-07 09:23:40

Thanks, Petr Viktorin! I'll try the composition approach! – alonisser

Does this approach also work for the `__iter__` and `__key__` methods? – alonisser

[No](http://docs.python.org/reference/datamodel.html#special-method-lookup-for-new-style-classes), but you can still do `def __iter__(self): return iter(self.soup)`.
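The difference comes from special methods being looked up on the class, not on the instance. The sketch below illustrates it with a plain list standing in for the soup object; the class names `Broken` and `Working` and the helper `can_iterate` are invented for the example.

```python
class Broken(object):
    """Binds __iter__ on the instance, which iteration ignores."""

    def __init__(self, items):
        self.soup = list(items)
        self.__iter__ = self.soup.__iter__  # not consulted by iter() or for


class Working(object):
    """Defines __iter__ on the class and delegates to the wrapped object."""

    def __init__(self, items):
        self.soup = list(items)

    def __iter__(self):
        return iter(self.soup)


def can_iterate(obj):
    """Return True if iter(obj) succeeds, False if it raises TypeError."""
    try:
        iter(obj)
    except TypeError:
        return False
    return True


broken_ok = can_iterate(Broken(["a", "b"]))
working_items = list(Working(["a", "b"]))
```

The same pattern applies to other special methods such as `__len__` or `__getitem__`: each one needs a small delegating method defined on the wrapper class.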
