HTMLParser and BeautifulSoup not decoding HTML entities correctly

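I am trying to decode HTML entities in a snippet of HTML source code, using both HTMLParser and BeautifulSoup.

However, neither seems to work completely. Specifically, they do not decode the slashes.

My Python version is 2.7.11, with BeautifulSoup version 3.2.1.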

print 'ORIGINAL STRING: %s \n' % original_url_string 

#clean up 
try: 
    # Python 2.6-2.7 
    from HTMLParser import HTMLParser 
except ImportError: 
    # Python 3 
    from html.parser import HTMLParser 

h = HTMLParser() 
url_string = h.unescape(original_url_string) 

print 'CLEANED WITH html.parser: %s \n' % url_string 

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x import path
decoded = BeautifulSoup(original_url_string, convertEntities=BeautifulSoup.HTML_ENTITIES)

print 'CLEANED WITH BeautifulSoup: %s \n' % decoded.contents

This gives me the following output:

ORIGINAL STRING: api.soundcloud.com%2Ftracks%2F277561480&#038;show_artwork=true&#038;maxwidth=1050&#038;maxheight=1000 

CLEANED WITH html.parser: api.soundcloud.com%2Ftracks%2F277561480&show_artwork=true&maxwidth=1050&maxheight=1000 

CLEANED WITH BeautifulSoup: [u'api.soundcloud.com%2Ftracks%2F277561480&show_artwork=true&maxwidth=1050&maxheight=1000']

What am I missing here?

Should I try to decode the whole HTML page before extracting the URL?

Is there a better way to do this in Python?

2016-08-30 ian

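Are you trying to decode the slashes in the URL, or the HTML of the page at that URL?

If you are trying to decode the slashes: they are not HTML entities but percent-encoded characters (%2F is a percent-encoded "/"), which is why neither HTMLParser nor BeautifulSoup touches them.

urllib has the method you need: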

>>> import urllib
>>> urllib.unquote(original_url_string)  # Python 2; in Python 3 this is urllib.parse.unquote
'api.soundcloud.com/tracks/277561480&#038;show_artwork=true&#038;maxwidth=1050&#038;maxheight=1000'
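
Since the string in the question carries both kinds of encoding, the two steps can be combined. The following is a minimal sketch, not a definitive implementation: the helper name clean_url is hypothetical, and the import fallback assumes you want the same code to run on Python 2.7 and Python 3. It decodes the HTML entities first (the outer layer, added when the URL was embedded in HTML) and the percent-encoding second.

```python
try:
    # Python 3.4+
    from html import unescape
    from urllib.parse import unquote
except ImportError:
    # Python 2.6-2.7
    from HTMLParser import HTMLParser
    from urllib import unquote
    unescape = HTMLParser().unescape

# The string from the question: percent-encoded slashes plus HTML entities.
original_url_string = (
    "api.soundcloud.com%2Ftracks%2F277561480"
    "&#038;show_artwork=true&#038;maxwidth=1050&#038;maxheight=1000"
)


def clean_url(raw):
    """Hypothetical helper: decode HTML entities, then percent-encoding."""
    return unquote(unescape(raw))


cleaned = clean_url(original_url_string)
print(cleaned)
# api.soundcloud.com/tracks/277561480&show_artwork=true&maxwidth=1050&maxheight=1000
```

Note that unquote decodes every percent-escape in the string, so if a query parameter value itself contains an escaped "&" or "=", it may be safer to split the URL into its parts before unquoting.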

If you want to decode the HTML instead, you first have to fetch it with a package such as requests or urllib.
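
As for decoding the whole page before extracting the URL: when the URL is read from a tag attribute through an HTML parser, the parser already decodes entities in attribute values, so only the percent-encoding is left to handle. Here is a rough Python 3 sketch of that idea. The class name SrcCollector and the iframe markup are made up for illustration; in practice the page text would come from something like requests.get(url).text rather than a literal string.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class SrcCollector(HTMLParser):
    """Hypothetical helper: collect percent-decoded src attributes."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Attribute values arrive with HTML entities already decoded,
        # so "&#038;" is a plain "&" by the time we see it here.
        for name, value in attrs:
            if name == "src" and value:
                self.sources.append(unquote(value))


# Assumed sample markup; a real page would be fetched over HTTP first.
page = (
    '<iframe src="api.soundcloud.com%2Ftracks%2F277561480'
    '&#038;show_artwork=true"></iframe>'
)

collector = SrcCollector()
collector.feed(page)
collector.close()
print(collector.sources)
# ['api.soundcloud.com/tracks/277561480&show_artwork=true']
```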

2016-08-31 10:29:41 4140tm
