2012-11-09 27 views
1

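Beautiful Soup error: '<class 'bs4.element.Tag'>' object has no attribute 'contents'?

I am writing a script that extracts the content from an article and removes anything unnecessary, e.g. scripts and styles. Beautiful Soup keeps raising the following exception: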

'<class 'bs4.element.Tag'>' object has no attribute 'contents' 

Here is the code of the trim function (element is an HTML element containing the content of the web page):

def trim(element):
    elements_to_remove = ('script', 'style', 'link', 'form', 'object', 'iframe')
    for i in elements_to_remove:
        remove_all_elements(element, i)

    attributes_to_remove = ('class', 'id', 'style')
    for i in attributes_to_remove:
        remove_all_attributes(element, i)

    remove_all_comments(element)

    # Remove divs that have more non-p elements than p elements
    for div in element.find_all('div'):
        p = len(div.find_all('p'))
        img = len(div.find_all('img'))
        li = len(div.find_all('li'))
        a = len(div.find_all('a'))

        if p == 0 or img > p or li > p or a > p:
            div.decompose()

Looking at the stack trace, the problem seems to come from this function, on the statement right after the comment:

# Remove divs that have more non-p elements than p elements
for div in element.find_all('div'):
    p = len(div.find_all('p'))  # <-- div.find_all('p')

I don't understand why this instance of bs4.element.Tag has no attribute 'contents'. I tried it on an actual web page, and the element is full of p and img tags...

Here is the traceback (this is part of a Django project I'm working on):

Environment: 


Request Method: POST 
Request URL: http://localhost:8000/read/add/ 

Django Version: 1.4.1 
Python Version: 2.7.3 
Installed Applications: 
('django.contrib.auth', 
'django.contrib.contenttypes', 
'django.contrib.sessions', 
'django.contrib.sites', 
'django.contrib.messages', 
'django.contrib.staticfiles', 
'home', 
'account', 
'read', 
'review') 
Installed Middleware: 
('django.middleware.common.CommonMiddleware', 
'django.contrib.sessions.middleware.SessionMiddleware', 
'django.middleware.csrf.CsrfViewMiddleware', 
'django.contrib.auth.middleware.AuthenticationMiddleware', 
'django.contrib.messages.middleware.MessageMiddleware') 


Traceback: 
File "/home/marco/.virtualenvs/sandra/local/lib/python2.7/site-packages/django/core/handlers/base.py" in get_response 
    111.       response = callback(request, *callback_args, **callback_kwargs) 
File "/home/marco/sandra/read/views.py" in add 
    24.    Article.objects.create_article(request.user, url) 
File "/home/marco/sandra/read/models.py" in create_article 
    11.   title, content = logic.process_html(web_page.read()) 
File "/home/marco/sandra/read/logic.py" in process_html 
    7.  soup = htmlbarber.give_haircut(BeautifulSoup(html_code, 'html5lib')) 
File "/home/marco/sandra/read/htmlbarber/__init__.py" in give_haircut 
    45.  scissor.trim(element) 
File "/home/marco/sandra/read/htmlbarber/scissor.py" in trim 
    35.   p = len(div.find_all('p')) 
File "/home/marco/.virtualenvs/sandra/local/lib/python2.7/site-packages/bs4/element.py" in find_all 
    1128.   return self._find_all(name, attrs, text, limit, generator, **kwargs) 
File "/home/marco/.virtualenvs/sandra/local/lib/python2.7/site-packages/bs4/element.py" in _find_all 
    413.     return [element for element in generator 
File "/home/marco/.virtualenvs/sandra/local/lib/python2.7/site-packages/bs4/element.py" in descendants 
    1140.   if not len(self.contents): 
File "/home/marco/.virtualenvs/sandra/local/lib/python2.7/site-packages/bs4/element.py" in __getattr__ 
    924.    "'%s' object has no attribute '%s'" % (self.__class__, tag)) 

Exception Type: AttributeError at /read/add/ 
Exception Value: '<class 'bs4.element.Tag'>' object has no attribute 'contents' 

Here is the source code of the remove_all_* functions:

from bs4 import Comment  # needed by remove_all_comments

def remove_all_elements(element_to_clean, unwanted_element_name):
    for to_remove in element_to_clean.find_all(unwanted_element_name):
        to_remove.decompose()

def remove_all_attributes(element_to_clean, unwanted_attribute_name):
    for to_inspect in [element_to_clean] + element_to_clean.find_all():
        try:
            del to_inspect[unwanted_attribute_name]
        except KeyError:
            pass

def remove_all_comments(element_to_clean):
    for comment in element_to_clean.find_all(text=lambda text: isinstance(text, Comment)):
        comment.extract()
+0

That's strange. It would help if you posted the full traceback. – Iguananaut

+0

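You might want to consolidate all the 'remove_all_elements' calls into a for loop. It won't solve the current problem, but it will improve the readability of the code. – iTayb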

+0

@iTayb Thanks for the tip :) –

Answers

1

I think the problem is that in remove_all_elements, or somewhere else in your code, you are deleting the contents attribute of some of your tags.

It looks like this happens when you call to_remove.decompose(). Here is the source of that method:

def decompose(self):
    """Recursively destroys the contents of this tree."""
    self.extract()
    i = self
    while i is not None:
        next = i.next_element
        i.__dict__.clear()
        i = next

Here is what happens if you call this method manually:

>>> soup = BeautifulSoup('<div><p>hi</p></div>')
>>> d0 = soup.find_all('div')[0]
>>> d0
<div><p>hi</p></div>
>>> d0.decompose()
>>> d0
Traceback (most recent call last):
...
AttributeError: '<class 'bs4.element.Tag'>' object has no attribute 'contents'

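It seems that once you have called decompose on a tag, you should never try to use that tag again. I'm not quite sure how that is happening here, though.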

One thing I would try is checking that len(element.__dict__) > 0 holds at every point in your trim() function.
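One plausible way a decomposed tag ends up being reused: the traceback shows find_all building its complete list of matches up front, so nested divs sit in that list alongside their parents. Decomposing an outer div then wipes the inner divs that the loop has not reached yet. The sketch below models this with a toy Node class. Node and its methods are hypothetical stand-ins written for illustration, not Beautiful Soup's actual Tag implementation.

```python
class Node:
    """Toy stand-in for a parsed tag; NOT Beautiful Soup's real Tag class."""

    def __init__(self, name, children=()):
        self.name = name
        self.contents = list(children)

    def descendants(self):
        # Depth-first walk that returns every node below this one as a list.
        found = []
        for child in self.contents:
            found.append(child)
            found.extend(child.descendants())
        return found

    def decompose(self):
        # Wipe this node and everything below it, like the method quoted above.
        # (The real method also detaches the subtree first; omitted here.)
        for node in [self] + self.descendants():
            node.__dict__.clear()


inner = Node('div')
outer = Node('div', [inner])
root = Node('body', [outer])

# The list of matches is built once, before anything is removed.
divs = [n for n in root.descendants() if n.name == 'div']  # [outer, inner]

outer.decompose()  # also wipes `inner`, which is still waiting in `divs`

try:
    divs[1].descendants()  # what the next loop iteration would attempt
    error = None
except AttributeError as exc:
    error = str(exc)

print(error)
```

If this is what happens in trim(), any page with a div nested inside a div that gets removed would trigger the exception on a later iteration.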

+0

I'm not sure about that; the only place where I use a del statement is del to_inspect[unwanted_attribute_name]. –

+0

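I commented out all the remove_all_* functions, but it still didn't work. Then I uncommented them all and replaced p = len(div.find_all('p')) with p = 99, and surprisingly it stopped throwing the exception! :O I find that weird... –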

+0

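I replaced div.decompose() with div.extract() and it worked :) Maybe it's because div is defined in the for statement as an alias for the current element of element.find_all('div'), and 'decomposing' div causes these problems? –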
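The extract() workaround described in the comment above can be sketched roughly as follows. The function name remove_sparse_divs is hypothetical (it stands for the div-pruning loop from trim()), and the sample HTML and the use of the built-in 'html.parser' are assumptions made for illustration; the original project parsed with 'html5lib'.

```python
from bs4 import BeautifulSoup


def remove_sparse_divs(element):
    """Detach divs with no <p>, or with more img/li/a tags than <p> tags."""
    for div in element.find_all('div'):
        p = len(div.find_all('p'))
        img = len(div.find_all('img'))
        li = len(div.find_all('li'))
        a = len(div.find_all('a'))

        if p == 0 or img > p or li > p or a > p:
            # extract() only detaches the subtree from the document. Nested
            # divs that are still in the list returned by find_all() remain
            # usable Tag objects, so later iterations can inspect them safely.
            div.extract()


# Assumed sample page: a link-only navigation block and a text block.
html = ('<body><div id="nav"><div><a href="#">x</a></div></div>'
        '<div id="main"><p>text</p></div></body>')
soup = BeautifulSoup(html, 'html.parser')
remove_sparse_divs(soup)

remaining = [d.get('id') for d in soup.find_all('div')]
print(remaining)
```

The trade-off is that extract() leaves the detached subtrees intact in memory until nothing references them, whereas decompose() destroys them immediately.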