Replacing HTML codes in Python


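I'm using regular expressions to parse a website's source code and display news headlines in a Tkinter window. I've been told that parsing HTML with regular expressions isn't the best idea, but unfortunately there's no time to change that now.

I can't seem to replace the HTML codes for special characters such as the apostrophe (').

Currently, I have the following: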

union_url = 'http://www.news.com.au/sport/rugby' 

def union(): 
    union_string = urlopen(union_url).read() 
    union_string.replace("&#8217;", "'") 
    union_headline = re.findall('(?:sport/rugby/.*) >(.*)<', union_string) 
    union_headline_label= Label(union_window, text = union_headline[0], font=('Times',20,'bold'), bg = 'White', width = 85, height = 3, wraplength = 500)

This doesn't get rid of the HTML character codes. For example, a headline is printed as

Larkham: Real worth of &#8216;Giteau&#8217;s Law&#8217;

I've tried to find an answer without any luck. Any help is much appreciated.


2015-10-14 BlizzzX

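Are you trying to fetch the data, or to parse data out of the HTML source? – Ja8zyjits

Sorry - fetching data to display in a tkinter widget – BlizzzX

Have you heard of [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/)? Your life will be much better using it... parsing HTML can be hard. – Ja8zyjits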
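As an illustration of the parser-based approach suggested in the comment above, here is a minimal sketch that uses the standard library's `html.parser.HTMLParser` in place of Beautiful Soup (Python 3 is assumed). The `HeadlineParser` class and the sample markup are hypothetical; the real page's structure may differ. The parser decodes character references such as `&#8217;` on its own, so no manual replacement is needed.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text of <a> elements whose href contains a path fragment."""

    def __init__(self, path_fragment):
        # convert_charrefs=True makes the parser decode &#8217; and friends.
        super().__init__(convert_charrefs=True)
        self.path_fragment = path_fragment
        self.headlines = []
        self._in_link = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if self.path_fragment in href:
                self._in_link = True
                self._parts = []

    def handle_data(self, data):
        if self._in_link:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            text = "".join(self._parts).strip()
            if text:
                self.headlines.append(text)
            self._in_link = False


# Hypothetical markup, made up for this example.
sample = (
    '<h4><a href="/sport/rugby/larkham-story">'
    "Larkham: Real worth of &#8216;Giteau&#8217;s Law&#8217;</a></h4>"
)

parser = HeadlineParser("sport/rugby/")
parser.feed(sample)
parser.close()
print(parser.headlines[0])
```

The resulting string can be passed straight to the `text` argument of a Tkinter `Label`.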

You can use re.sub() with a callback function to unescape (or remove) any escapes:

>>> import re 
>>> def htmlUnescape(m): 
...     # &#8217; is a decimal reference, so parse it in base 10, not 16.
...     # unichr() is Python 2; on Python 3 use chr() instead.
...     return unichr(int(m.group(1))) 
... 
>>> re.sub(r'&#(\d+);', htmlUnescape, "This is something &#8217; with an HTML-escaped character in it.") 
u'This is something \u2019 with an HTML-escaped character in it.' 
>>>
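On Python 3.4 or later, the standard library's `html.unescape` does the same job without a hand-written callback. The sketch below assumes Python 3 and uses the headline from the question as input; the variable names are illustrative. Note also that string methods such as `replace` return a new string rather than modifying the original, so the result has to be assigned to a name to have any effect.

```python
import html

# Headline as shown in the question, numeric character references included.
raw_headline = "Larkham: Real worth of &#8216;Giteau&#8217;s Law&#8217;"

# Strings are immutable: keep the returned value instead of discarding it.
clean_headline = html.unescape(raw_headline)

print(clean_headline)
```

`html.unescape` also handles named references such as `&amp;` and hexadecimal ones such as `&#x2019;`.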


2015-10-14 09:25:58
