2013-04-07 45 views

How to get plain text from a string in Python?

I have a string containing many HTML tags, such as the following:

u'find /home/tiger/workspace&nbsp; -name "[0-9]*"<br />find /home/tiger/workspace&nbsp; -name "[!0-9]*"<br />find /home/tiger/workspace&nbsp; -name "[^0-9]*"<br /><br />\u627e\u51fa\u6240\u6709\u5305\u542b\u6570\u5b57\u7684\u6587\u4ef6\uff0c\u4e0d\u5305\u542b\u6570\u5b57\u7684\u6587\u4ef6\u3002<br />[email protected]:~$ find /home/tiger&nbsp; -name "*[0-9]*"&nbsp; &gt;kan1<br />[email protected]:~$ find /home/tiger&nbsp; -name "[0-9]*"&nbsp; &gt;kan2<br />[email protected]:~$ find /home/tiger&nbsp; -name "*[0-9]"&nbsp; &gt;kan3<br /><br /><br />\u5305\u542b\u6570\u5b57\uff0c\u6570\u5b57\u5f00\u5934\uff0c\u6570\u5b57\u7ed3\u5c3e'

How can I remove the HTML tags and get only the plain text from this string?

Possible duplicate of [Extracting text from HTML file using Python](http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python) – karthikr 2013-04-07 04:21:43

Answer

Use the html2text library (here `s` is the string from the question):

>>> import html2text
>>> print(html2text.html2text(s))
find /home/tiger/workspace&nbsp_place_holder; -name "[0-9]*" 

find /home/tiger/workspace&nbsp_place_holder; -name "[!0-9]*" 

find /home/tiger/workspace&nbsp_place_holder; -name "[^0-9]*" 


找出所有包含数字的文件,不包含数字的文件。 

[email protected]:~$ find /home/tiger&nbsp_place_holder; -name 
"*[0-9]*"&nbsp_place_holder; >kan1 

[email protected]:~$ find /home/tiger&nbsp_place_holder; -name 
"[0-9]*"&nbsp_place_holder; >kan2 

[email protected]:~$ find /home/tiger&nbsp_place_holder; -name 
"*[0-9]"&nbsp_place_holder; >kan3 



包含数字,数字开头,数字结尾 

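For reference, see [Extracting text from HTML file using Python](http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python).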
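
If installing a third-party package is not an option, the same idea can be sketched with the standard library alone. This is not the html2text approach shown above but a swapped-in alternative, assuming Python 3 and its `html.parser` module. The names `TextExtractor` and `strip_tags` are hypothetical, chosen only for illustration, and the sample string is a shortened piece of the one in the question.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and turns <br> tags into newlines (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &gt; and &nbsp; before passing text to handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <br /> also ends up here via handle_startendtag.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # &nbsp; decodes to U+00A0; replace it with an ordinary space.
    return "".join(parser.parts).replace("\xa0", " ")


sample = 'find /home/tiger&nbsp; -name "*[0-9]*"&nbsp; &gt;kan1<br />next line'
print(strip_tags(sample))
```

Unlike html2text, this sketch does no Markdown conversion or line wrapping. It only drops tags, decodes entities, and maps `<br>` to a newline, which may be enough for simple fragments like the one in the question.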