Removing HTML tags with Python?

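I know there are probably a million questions like this, but I'd like to know how to remove these tags without importing or using HTMLParser or regular expressions. I've tried a bunch of different replace statements to remove the parts of the string enclosed in <>, but to no avail.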

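Basically, what I'm working with is:

# Python 3 import; in Python 2, urlopen lives in urllib2
from urllib.request import urlopen

response = urlopen(url)
html = response.read()
html = html.decode()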

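From here I'm just trying to manipulate the string variable html to do the above. Is there a way to do it the way I specified, or do you have to use the methods I've seen before?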

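I also tried writing a for loop that goes through every character and checks whether it is enclosed in a tag, but for some reason it doesn't give me the correct output. Here it is: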

for i in html:
    if i == '<':
        html.replace(i, '')
        delete = True
    if i == '>':
        html.replace(i, '')
        delete = False
    if delete == True:
        html.replace(i, '')

Any input would be appreciated.

Source

2014-02-26 user2909869

Please don't parse HTML with regular expressions. It won't work; see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags for an entertaining explanation.

_Without importing or using HTMLParser or regex._ Why would you give yourself such a silly restriction?

A misleading title – Totem

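str.replace returns a copy of the string with all occurrences of the substring replaced by the new one. Strings are immutable, so calling html.replace(...) without assigning the result leaves html unchanged; you can't use it the way you did. You also shouldn't modify the string your loop is iterating over. Using an extra list is one way to go: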

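txt = []
delete = False  # not inside a tag yet; must be set before the loop reads it
for i in html:
    if i == '<':
        delete = True   # entering a tag: skip everything up to the closing '>'
        continue
    if i == '>':
        delete = False  # tag closed: resume collecting characters
        continue
    if delete:
        continue

    txt.append(i)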

Now the txt list contains the resulting text, which you can join:

print(''.join(txt))

Demo:

html = '<body><div>some</div><div>text</div></body>'
# ... run the loop above ...
>>> txt
['s', 'o', 'm', 'e', 't', 'e', 'x', 't']
>>> ''.join(txt)
'sometext'
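
For reuse, the same loop can be wrapped in a function. This is only a minimal sketch of the approach above, not a robust HTML stripper: the name strip_tags and the inside_tag flag are hypothetical and do not appear in the original answer. It assumes every '<' opens a tag and every '>' closes one, so a literal '<' or '>' in text, attribute values, comments, or script content will confuse it.

```python
def strip_tags(html):
    """Naively drop everything between '<' and '>' (hypothetical helper)."""
    txt = []
    inside_tag = False  # True while we are between '<' and '>'
    for ch in html:
        if ch == '<':
            inside_tag = True
        elif ch == '>':
            inside_tag = False
        elif not inside_tag:
            txt.append(ch)  # keep only characters outside tags
    # Joining a list once avoids building a new string on every character.
    return ''.join(txt)


print(strip_tags('<body><div>some</div><div>text</div></body>'))  # sometext
```

Returning the joined string rather than modifying html in place sidesteps both problems in the question: nothing depends on str.replace changing the original, and the string being iterated over is never touched.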

Source

2014-02-26 14:11:05 ndpu

Thanks, I was looking for a way to do this without having to use some pre-implemented method, because I wouldn't learn anything from that. – user2909869
