How do I get plain text from a string in Python?

I have a string that contains many HTML tags, such as the following:
u'find /home/tiger/workspace  -name "[0-9]*" find /home/tiger/workspace  -name "[!0-9]*" find /home/tiger/workspace  -name "[^0-9]*" \u627e\u51fa\u6240\u6709\u5305\u542b\u6570\u5b57\u7684\u6587\u4ef6\uff0c\u4e0d\u5305\u542b\u6570\u5b57\u7684\u6587\u4ef6\u3002 [email protected]:~$ find /home/tiger  -name "*[0-9]*"  >kan1 [email protected]:~$ find /home/tiger  -name "[0-9]*"  >kan2 [email protected]:~$ find /home/tiger  -name "*[0-9]"  >kan3 \u5305\u542b\u6570\u5b57\uff0c\u6570\u5b57\u5f00\u5934\uff0c\u6570\u5b57\u7ed3\u5c3e'

How can I strip the HTML tags and get only the plain text from the string?

2013-04-07 it_is_a_literature

Possible duplicate of [Extracting text from HTML file using Python](http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python) – karthikr 2013-04-07 04:21:43

Use the html2text library:

>>> import html2text
>>> print html2text.html2text(s) 
find /home/tiger/workspace&nbsp_place_holder; -name "[0-9]*" 

find /home/tiger/workspace&nbsp_place_holder; -name "[!0-9]*" 

find /home/tiger/workspace&nbsp_place_holder; -name "[^0-9]*" 


找出所有包含数字的文件，不包含数字的文件。 

[email protected]:~$ find /home/tiger&nbsp_place_holder; -name 
"*[0-9]*"&nbsp_place_holder; >kan1 

[email protected]:~$ find /home/tiger&nbsp_place_holder; -name 
"[0-9]*"&nbsp_place_holder; >kan2 

[email protected]:~$ find /home/tiger&nbsp_place_holder; -name 
"*[0-9]"&nbsp_place_holder; >kan3 



包含数字，数字开头，数字结尾

See also Extracting text from HTML file using Python.
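
If installing a third-party package is not an option, a rough alternative is to strip the tags with the standard library's `html.parser` module. This is a different technique from html2text, not a reimplementation of it: it produces no Markdown and only treats a handful of tags as line breaks. The sketch below assumes Python 3; the names `TextExtractor`, `BLOCK_TAGS`, and `strip_tags` are made up for illustration, and the sample markup is an assumption about what the original string looked like before its tags were lost.

```python
from html.parser import HTMLParser

# Tags treated as line breaks in the output (an illustrative, incomplete set).
BLOCK_TAGS = {"p", "div", "br", "li", "tr"}


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and discards the tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of `html`, one line per block element."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Collapse whitespace inside each line (str.split also treats the
    # non-breaking space from &nbsp; as whitespace) and drop empty lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Assumed markup, loosely modelled on the string in the question.
sample = (
    '<p>find /home/tiger/workspace&nbsp; -name "[0-9]*"</p>'
    "<p>\u627e\u51fa\u6240\u6709</p>"
)
print(strip_tags(sample))
```

For real-world pages with scripts, styles, or malformed markup, a dedicated library such as html2text is likely to give better results than this sketch.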

2013-04-07 05:10:14 jterrace
