2012-03-18 100 views
1

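Using BeautifulSoup to get clean text from a text/html document

I have a document with two content types: text/xml and text/html. I want to parse the document with BeautifulSoup and end up with a clean text version. The document starts out as a tuple, so I have been using repr to turn it into something BeautifulSoup recognizes, and then using find_all to locate the text/html part of the document by searching for divs, like so: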

soup = BeautifulSoup(repr(msg_data)) 
text = soup.html.find_all("div") 

I then convert the result back to a string, save it to a variable, put it back into a soup object and call get_text on it, like so:

str_text = str(text) 
soup_text = BeautifulSoup(str_text) 
soup_text.get_text() 

However, this then changes the encoding to unicode, like so:

u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway, 9:17  
PM\xa0me: yeah, \xa0Erica: so feel free to make it shorter, \xa0\xa0or rather, please do, 
9:18 PM\xa0nobody wants to read about that shit for 2 pages, \xa0me: :), \xa0Erica: while 
browsing their site, \xa0me: srsly, \xa0Erica: unless of course your writing is magic, 
\xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out and gives 
them a scratch behind the ears and some kibble, \xa0Erica: Maine is weird, \xa0me: haha]' 

When I try to re-encode it as UTF-8, like so:

soup.encode('utf-8') 

I am back to the unparsed type.

I want to end up with the clean text saved as a string, so that I can then find specific things in it (for example, "puppies" in the text above).

Basically, I'm running around in circles here. Can anyone help? As always, thanks very much for any help you can give.

Answer

2

The encoding isn't being destroyed; it is exactly what it should be. '\xa0' is the Unicode character for a non-breaking space.
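To see this for yourself, here is a minimal sketch. It assumes Python 3 (where every str is Unicode) rather than the Python 2 used in the question, and the sample is a shortened piece of the output above. It shows that the `\xa0` you see is only how repr displays a single character; the data itself is intact:

```python
import unicodedata

# Shortened sample taken from the output in the question.
sample = '9:16 PM\xa0Erica: with images'

# '\xa0' is one character: U+00A0, whose official name is NO-BREAK SPACE.
nbsp_name = unicodedata.name('\xa0')

# repr() displays the character as the escape sequence \xa0,
# but it still counts as a single character in the string.
escaped = repr(sample)
length = len(sample)

print(nbsp_name)
print(escaped)
print(length)
```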

If you want this (Unicode) string encoded as ASCII, you can tell the codec to ignore any characters it doesn't understand:

>>> x = u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway, 9:17 PM\xa0me: yeah, \xa0Erica: so feel free to make it shorter, \xa0\xa0or rather, please do, 9:18 PM\xa0nobody wants to read about that shit for 2 pages, \xa0me: :), \xa0Erica: while browsing their site, \xa0me: srsly, \xa0Erica: unless of course your writing is magic, \xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out and gives them a scratch behind the ears and some kibble, \xa0Erica: Maine is weird, \xa0me: haha]' 
>>> x.encode('ascii', 'ignore') 
'[9:16 PMErica: with images, and that seemed long to me anyway, 9:17 PMme: yeah, Erica: so feel free to make it shorter, or rather, please do, 9:18 PMnobody wants to read about that shit for 2 pages, me: :), Erica: while browsing their site, me: srsly, Erica: unless of course your writing is magic, me: My writing saves drowning puppies, Just plucks him right out and gives them a scratch behind the ears and some kibble, Erica: Maine is weird, me: haha]' 
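Note that 'ignore' simply drops the non-breaking spaces, so "PM\xa0Erica" comes out as "PMErica". If you would rather keep the word boundaries, one option is to swap each non-breaking space for an ordinary space before searching. This is only a sketch, again assuming Python 3; the helper name clean_text is made up for illustration and is not part of BeautifulSoup:

```python
def clean_text(text):
    """Replace non-breaking spaces with ordinary spaces and collapse whitespace runs."""
    # clean_text is a hypothetical helper, not a BeautifulSoup function.
    return ' '.join(text.replace('\xa0', ' ').split())

# Shortened sample taken from the output in the question.
sample = 'PM\xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out'

cleaned = clean_text(sample)

# A plain substring test works directly on the Unicode string.
found = 'puppies' in cleaned

print(cleaned)
print(found)
```

Searching does not require re-encoding at all: the substring test works on the Unicode string as it is.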

If you have time, you should watch Ned Batchelder's recent video, Pragmatic Unicode. It will make everything clear!

+0

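Yeah, it occurred to me just as I posted that "destroyed" was a bit strong; editing it now. Thanks for the video, I'll take a look. Do you have any text resources I could read through (I know these are just a Google search away, but are there any you particularly like)? – spikem 2012-03-18 19:59:40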

+0

@spikem What were you expecting? You have a string with non-ASCII characters (the non-breaking spaces). You can't magic them away. – katrielalex 2012-03-18 20:01:55

+0

I don't think I asked for, or expected, them to be magicked away. I'm just not entirely familiar with unicode, which is why I came here to ask. – spikem 2012-03-18 20:05:18
