Python Beautiful Soup .contents property

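What does BeautifulSoup's .contents do? I am working through the crummy.com tutorial and I don't quite understand what .contents is for. I looked through the forums and didn't find an answer. Take a look at the code below: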

from BeautifulSoup import BeautifulSoup 
import re 



doc = ['<html><head><title>Page title</title></head>', 
     '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.', 
     '<p id="secondpara" align="blah">This is paragraph <b>two</b>.', 
     '</html>'] 

soup = BeautifulSoup(''.join(doc)) 
print soup.contents[0].contents[0].contents[0].contents[0].name

I expected the last line of the code to print "body", but instead I get:

File "pe_ratio.py", line 29, in <module> 
    print soup.contents[0].contents[0].contents[0].contents[0].name 
    File "C:\Python27\lib\BeautifulSoup.py", line 473, in __getattr__ 
    raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr) 
AttributeError: 'NavigableString' object has no attribute 'name'

Does .contents only cover html, head and title? If so, why?

Thanks in advance for your help.


2013-10-26 Robert Birch

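I suspected the code above failed because .contents initially covers html, title and head but not body, since body sits in a different class within the HTML hierarchy. Later in the tutorial, crummy uses the code below to print body, which made me suspect that body belongs to a different level of the hierarchy.

head.nextSibling.name

If anyone stumbles on this post, it is important to read up on HTML structure. Check out http://www.w3.org/TR/REC-html40/struct/global.html#h-7.5.1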

It just gives you what is inside the tag. Let me demonstrate with an example:

html_doc = """ 
<html><head><title>The Dormouse's story</title></head> 

<p class="title"><b>The Dormouse's story</b></p> 

<p class="story">Once upon a time there were three little sisters; and their names were 
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, 
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
and they lived at the bottom of a well.</p> 

<p class="story">...</p> 
""" 

from bs4 import BeautifulSoup 
soup = BeautifulSoup(html_doc) 
head = soup.head 

print head.contents

The code above gives me a list, [<title>The Dormouse's story</title>], because that is what is inside the head tag. So indexing it with [0] gives you the first item in that list.
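For anyone following along on Python 3, here is a rough equivalent of the snippet above. It assumes the bs4 package and the built-in "html.parser" backend (the original code does not name a parser), and it uses a shortened copy of html_doc, so treat it as a sketch rather than the tutorial's own code:

```python
from bs4 import BeautifulSoup

# Shortened copy of the answer's html_doc, just enough to show head.contents.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
"""

# Assumption: parse with the standard-library backend.
soup = BeautifulSoup(html_doc, "html.parser")

head_children = soup.head.contents   # list of the direct children of <head>
first_child = head_children[0]       # the <title> tag

print(head_children)
print(first_child.name)
```

.contents only lists direct children, so the text inside title does not show up in head.contents; you have to go one level further down to reach it.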

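The reason you get an error is that soup.contents[0].contents[0].contents[0].contents[0] returns something that is not a tag (and therefore has no name attribute). In your code it returns Page title: the first contents[0] gives you the html tag, the second gives you the head tag, the third gives you the title tag, and the fourth gives you the actual text inside it, which is a NavigableString. So when you ask for name, there is no tag to take it from.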
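To see that chain step by step, something like the following may help. It is a sketch for Python 3 with bs4 and "html.parser" (both assumptions, since the question uses the older BeautifulSoup 3 import), and describe_chain is a made-up helper name, not part of the library:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']

soup = BeautifulSoup(''.join(doc), 'html.parser')

def describe_chain(node, depth):
    """Hypothetical helper: follow .contents[0] `depth` times and
    record whether each step landed on a tag or on a plain string."""
    steps = []
    for _ in range(depth):
        node = node.contents[0]
        if isinstance(node, Tag):
            steps.append(('tag', node.name))
        else:
            # Not a Tag: this is the text node inside the last tag.
            steps.append(('string', str(node)))
    return steps

chain = describe_chain(soup, 4)
for kind, value in chain:
    print(kind, value)
```

The first three steps are tags (html, head, title) and the fourth is the string Page title, which is why asking for a tag name at that point fails in the question's code.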

If you want to print the body, you can do the following:

soup = BeautifulSoup(''.join(doc)) 
print soup.body

If you want to use only contents, then use the following to get to body:

soup = BeautifulSoup(''.join(doc)) 
print soup.contents[0].contents[1].name

You won't get it with [0] as the index, because body is the second child of html, after head.
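Indexing by position works here because the joined document has no whitespace between tags. In markup that does have line breaks between tags, the whitespace shows up as extra string entries in contents, so a fixed index can point at the wrong thing. A slightly more defensive sketch, again assuming Python 3, bs4 and "html.parser", looks the child up by name instead:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']

soup = BeautifulSoup(''.join(doc), 'html.parser')
html_tag = soup.contents[0]

# Names of the direct children of <html>, skipping any plain-text nodes.
child_names = [child.name for child in html_tag.contents
               if isinstance(child, Tag)]

# Pick <body> by name rather than by position.
body = next(child for child in html_tag.contents
            if isinstance(child, Tag) and child.name == 'body')

print(child_names)
print(body.name)
```

In practice soup.body is the shorter way to get the same tag; the loop is only there to show what contents holds at that level.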


2013-10-26 03:46:06
