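I use BeautifulSoup, the great HTML parser, and I get a TypeError when I subclass the BeautifulSoup HTML parser
Recently I tried to improve my code by using class attributes and by exposing all of the BeautifulSoup methods directly on my wrapper class (instead of writing a small wrapper around each one). I thought that inheriting from the BeautifulSoup parser would be the best way to achieve this.
Here is the class: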
import re
import urllib2

from BeautifulSoup import BeautifulSoup


class ScrapeInputError(Exception):
    pass


class Scrape(BeautifulSoup):
    """Base class to be subclassed.

    Basically a subclassed BeautifulSoup wrapper that provides
    basic URL fetching with urllib2,
    basic HTML parsing with BeautifulSoup,
    and some basic cleaning of head, scripts, etc.
    """

    def __init__(self, file):
        self._file = file
        # very basic input validation
        if not re.search(r"^http://", self._file):
            raise ScrapeInputError("please enter a url that starts with http://")
        self._page = urllib2.urlopen(self._file)  # fetching the page
        BeautifulSoup.__init__(self, self._page)
        # self._soup = BeautifulSoup(self._page)  # calling the HTML parser
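This way I can instantiate the class with
x = Scrape("http://someurl.com")
and traverse the tree with x.elem or x.find.
This works wonderfully with some of the BeautifulSoup methods (see above) but fails with others, namely those that use iteration, such as "for e in x:".
The error message: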
Traceback (most recent call last):
  File "<pyshell#86>", line 2, in <module>
    print e
  File "C:\Python27\lib\idlelib\rpc.py", line 595, in __call__
    value = self.sockio.remotecall(self.oid, self.name, args, kwargs)
  File "C:\Python27\lib\idlelib\rpc.py", line 210, in remotecall
    seq = self.asynccall(oid, methodname, args, kwargs)
  File "C:\Python27\lib\idlelib\rpc.py", line 225, in asynccall
    self.putmessage((seq, request))
  File "C:\Python27\lib\idlelib\rpc.py", line 324, in putmessage
    s = pickle.dumps(message)
  File "C:\Python27\lib\copy_reg.py", line 77, in _reduce_ex
    raise TypeError("a class that defines __slots__ without "
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled
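I researched the error message but could not find anything I can work with, because I don't want to play with the internals of BeautifulSoup (and honestly I don't know or understand __slots__ or __getstate__); I just want to use the functionality.
Instead of subclassing, I also tried to return the BeautifulSoup object from the class's __init__, but the __init__ method returns None.
I'd be glad for any help here.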
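For readers following the last point, here is a minimal sketch of why returning an object from __init__ does not work, plus one common alternative, a plain factory function. FakeSoup, BadScrape and make_scraper are hypothetical names used only for illustration; FakeSoup stands in for a parser class and is not part of BeautifulSoup.

```python
class FakeSoup(object):
    """Hypothetical stand-in for a parser class such as BeautifulSoup."""

    def __init__(self, markup):
        self.markup = markup


class BadScrape(object):
    """Illustrates the attempt described above (hypothetical class)."""

    def __init__(self, markup):
        # __init__ is expected to return None; returning any other
        # object raises TypeError when the class is instantiated.
        return FakeSoup(markup)


def make_scraper(markup):
    """Hypothetical factory: build and return the parser object directly."""
    return FakeSoup(markup)


try:
    BadScrape("<p>hi</p>")
except TypeError as exc:
    print("instantiation failed: %s" % exc)

soup = make_scraper("<p>hi</p>")
print(type(soup).__name__)
```

With a factory function the caller receives the parser object itself, so no subclassing is involved; whether that fits the original design is a judgment call.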
Side note: don't use 're' to test whether a string starts with a substring; that is overkill. Use 'str.startswith()' instead ('if not file.startswith("http://"):').
Thanks, Ferdinand! – alonisser
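A small sketch of the suggestion above, comparing the regular-expression check from the question with str.startswith(). The function names are made up for this illustration.

```python
import re


def starts_with_http_re(url):
    """The check as written in the question, using a regular expression."""
    return re.search(r"^http://", url) is not None


def starts_with_http(url):
    """The same check using str.startswith(), with no regex involved."""
    return url.startswith("http://")


# Both checks agree on these sample inputs.
for candidate in ["http://someurl.com", "https://someurl.com", "someurl.com"]:
    assert starts_with_http_re(candidate) == starts_with_http(candidate)
```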
Another side note: do you really want to forbid 'https://'? (Or 'ftp://', or 'file://'?) You may want to rely on 'urlopen''s own validation; it raises 'urllib2.URLError' on invalid URLs.
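A rough sketch of that idea: skip the prefix check and let urlopen reject bad input. The helper name fetch is an assumption, and the import fallback is only there so the snippet runs on both Python 2 (urllib2) and Python 3 (urllib.request). Note that a string with no scheme at all is rejected with ValueError rather than URLError, so the sketch catches both.

```python
try:
    from urllib2 import urlopen, URLError  # Python 2, as in the question
except ImportError:
    from urllib.request import urlopen  # Python 3 equivalent
    from urllib.error import URLError


class ScrapeInputError(Exception):
    pass


def fetch(url):
    """Hypothetical helper: open url and translate failures into ScrapeInputError."""
    try:
        return urlopen(url)
    except (URLError, ValueError) as exc:
        # URLError covers unknown schemes and network failures;
        # ValueError is raised for strings that have no scheme at all.
        raise ScrapeInputError("could not open %r: %s" % (url, exc))


try:
    fetch("notaurl")  # no scheme, so this fails before any network access
except ScrapeInputError as exc:
    print(exc)
```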