How to extract the full text of a paragraph tag with BeautifulSoup

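I need to extract the complete text from the following HTML, without the <p>, <a href>, rel and similar markup. How can I extract the full text inside a paragraph tag with BeautifulSoup?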

<p>Many of the features that made the Samsung Galaxy S4 one of the most anticipated phones in recent history -- such as its 5-inch 1920 x 1080 <a href="http://www.bubblews.com/news/421662-samsung-galaxy-s4-worlds-first-full-hd-super-amoled-display" rel="nofollow" target="_blank">Full HD Super AMOLED display</a>, its powerful processors (<a href="http://www.samsung.com/global/business/semiconductor/minisite/Exynos/blog_Spotlight_on_the_Exynos5Octa.html" rel="nofollow" target="_blank">Samsung Exynos 5 Octa</a> in the international version and <a href="http://www.qualcomm.com/snapdragon/blog/topics/snapdragon 600" rel="nofollow" target="_blank">Qualcomm Snapdragon 600</a> in the U.S. version) and 16GB, 32GB and 64GB storage options -- are now bringing grief to those who rushed to purchase the fourth-generation Galaxy S series smartphone upon its late April release.</p>

I tried the code below:

from bs4 import BeautifulSoup
from urllib2 import urlopen

BASE_URL = "http://www.chicagoreader.com"

def get_category_links(section_url):
    html = urlopen(section_url).read()
    soup = BeautifulSoup(html, "lxml")
    for div in soup.find_all("div", attrs={'class':'field-content'}):
        # contents[0] is only the first child node of the <p>
        print div.find("p").contents[0]

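But it gives the following output:

Many of the features that made the Samsung Galaxy S4 one of the most anticipated phones in recent history -- such as its 5-inch 1920 x 1080

I can't get the full text. It should also include the text that comes after the tags with href, rel and so on. Please tell me how to get the output below.

Many of the features that made the Samsung Galaxy S4 one of the most anticipated phones in recent history -- such as its 5-inch 1920 x 1080 Full HD Super AMOLED display, its powerful processors (Samsung Exynos 5 Octa in the international version and Qualcomm Snapdragon 600 in the U.S. version) and 16GB, 32GB and 64GB storage options -- are now bringing grief to those who rushed to purchase the fourth-generation Galaxy S series smartphone upon its late April release.

Thanks.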

Source

2013-05-06 vittal cherala

You can use .text:

>>> from bs4 import BeautifulSoup 
>>> html = '<p>Many of the features that made the Samsung Galaxy S4 one of the most anticipated phones in recent history -- such as its 5-inch 1920 x 1080 <a href="http://www.bubblews.com/news/421662-samsung-galaxy-s4-worlds-first-full-hd-super-amoled-display" rel="nofollow" target="_blank">Full HD Super AMOLED display</a>, its powerful processors (<a href="http://www.samsung.com/global/business/semiconductor/minisite/Exynos/blog_Spotlight_on_the_Exynos5Octa.html" rel="nofollow" target="_blank">Samsung Exynos 5 Octa</a> in the international version and <a href="http://www.qualcomm.com/snapdragon/blog/topics/snapdragon 600" rel="nofollow" target="_blank">Qualcomm Snapdragon 600</a> in the U.S. version) and 16GB, 32GB and 64GB storage options -- are now bringing grief to those who rushed to purchase the fourth-generation Galaxy S series smartphone upon its late April release.</p>' 
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(soup.p.text)
Many of the features that made the Samsung Galaxy S4 one of the most anticipated phones in recent history -- such as its 5-inch 1920 x 1080 Full HD Super AMOLED display, its powerful processors (Samsung Exynos 5 Octa in the international version and Qualcomm Snapdragon 600 in the U.S. version) and 16GB, 32GB and 64GB storage options -- are now bringing grief to those who rushed to purchase the fourth-generation Galaxy S series smartphone upon its late April release.
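
To see why the question's code stopped at "1920 x 1080", it may help to compare the first child node of a paragraph with its full text. The following is a minimal sketch, not part of the original answer: the shortened HTML string and the example.com URL are made up for illustration, and the "html.parser" backend is chosen only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the paragraph in the question.
html = ('<p>A 5-inch <a href="http://example.com/display" rel="nofollow">'
        'Full HD display</a>, and more.</p>')

soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# contents is the list of direct children; index 0 is only the text
# node that precedes the first <a> tag.
first_child = p.contents[0]

# get_text() (also available as .text) joins the text of all descendants.
full_text = p.get_text()

print(first_child)
print(full_text)
```

The paragraph has three direct children here (text, an <a> tag, text), which is why indexing into contents returns only a fragment.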

Source

2013-05-06 10:49:53 TerryA

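Thanks, it works, but I need to extract the full text from the website. I shouldn't hard-code the value in the html variable as in your code above; it should be fetched from the URL, as in my code. Please suggest how to do this. – 2013-05-06 11:14:14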

@vittalcherala Sorry, I tried, but I can't seem to get your code to work. Maybe the website changed? – TerryA 2013-05-06 11:25:42
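
For the follow-up about reading from the URL instead of a hard-coded string, one possible approach is to keep the loop from the question and swap contents[0] for get_text(). This is only a sketch under assumptions: the function names paragraph_texts and get_category_texts are hypothetical, the div class 'field-content' is taken from the question and may no longer match the live page, and urllib.request is the Python 3 counterpart of the urllib2 import used in the question.

```python
from urllib.request import urlopen  # Python 3 replacement for urllib2

from bs4 import BeautifulSoup


def paragraph_texts(html):
    """Return the full text of the first <p> in each div.field-content."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for div in soup.find_all("div", attrs={"class": "field-content"}):
        p = div.find("p")
        if p is not None:  # skip divs that contain no paragraph
            texts.append(p.get_text())
    return texts


def get_category_texts(section_url):
    """Download the page and extract its paragraph text (needs network)."""
    html = urlopen(section_url).read()
    return paragraph_texts(html)


# Offline check with a made-up fragment, so nothing is fetched here.
sample = ('<div class="field-content"><p>its <a href="#">Full HD</a> '
          'display</p></div><div class="field-content"></div>')
print(paragraph_texts(sample))
```

Separating the parsing from the download means the extraction can be checked on a saved or made-up fragment even if the site's markup has changed.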
