How can I make urllib unquote only valid %-encoded sequences? Python's urllib unquote mangles invalid ones
import HTMLParser
import urllib2
html_parser = HTMLParser.HTMLParser()
url = '[email protected]#*%ed%20&'
print urllib2.unquote(url)       # percent-decoding
print html_parser.unescape(url)  # HTML entity decoding
The result is:
[email protected]#*� &
[email protected]#*%ed%20&
urllib unquotes '%20' to ' ', but it also wrongly unquotes '%ED' to '�'.
HTMLParser can unescape '&amp;' to '&', but it cannot unquote '%20' to ' '.
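One way to decode only the escapes that form valid text is sketched below. This is a rough sketch, not urllib's own behavior: it assumes Python 3, assumes the escapes are meant to be UTF-8, and the names `safe_unquote` and `_decode_run` are hypothetical helpers. The sample string uses a placeholder prefix (`name`) because the original one is obscured in the post.

```python
import re

# A run of one or more well-formed %XX escapes, and a single escape.
_PCT_RUN = re.compile(r'(?:%[0-9A-Fa-f]{2})+')
_PCT_ONE = re.compile(r'%[0-9A-Fa-f]{2}')


def _decode_run(match):
    """Decode one run of %XX escapes, keeping escapes that are not valid UTF-8."""
    tokens = _PCT_ONE.findall(match.group(0))
    raw = bytes(int(tok[1:], 16) for tok in tokens)
    # surrogateescape maps every undecodable byte to U+DC80..U+DCFF
    # instead of raising or substituting U+FFFD.
    text = raw.decode('utf-8', errors='surrogateescape')
    out = []
    pos = 0  # index of the current byte, which is also the index of its token
    for ch in text:
        if '\udc80' <= ch <= '\udcff':
            out.append(tokens[pos])  # undecodable byte: keep its original escape
            pos += 1
        else:
            out.append(ch)
            pos += len(ch.encode('utf-8'))
    return ''.join(out)


def safe_unquote(text):
    """Percent-decode text, leaving invalid or malformed escapes untouched."""
    return _PCT_RUN.sub(_decode_run, text)


print(safe_unquote('name#*%ed%20&'))  # name#*%ed &
```

Here '%20' becomes a space, while '%ed', which is not a complete UTF-8 sequence on its own, is left exactly as written.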
--------------Edit------
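I apologize for not explaining my problem well. I actually have many strings to process; some are URLs and some are not. The original string is [email protected]#*%ed, and I changed it to [email protected]#*%ed%20& so that it covers both cases. It turns out that handling both cases in a single line of code is hard. After reading the answers, I wrote my own function: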
#!/usr/bin/env python
# coding: utf8
import HTMLParser
import re
import urllib

html_parser = HTMLParser.HTMLParser()
# Treat a string as a URL only if it starts with an ftp/http/https scheme.
url_pattern = re.compile(r'^(ftp|http|https)://.{4,}', flags=re.I)

def unquote_string(url):
    if url_pattern.search(url):
        # URL: percent-decode repeatedly until nothing changes.
        while True:
            url1 = urllib.unquote(url)
            if url1 == url:
                break
            url = url1
    else:
        # Not a URL: unescape HTML entities repeatedly until nothing changes.
        while True:
            url1 = html_parser.unescape(url)
            if url1 == url:
                break
            url = url1
    return url

url = '[email protected]#*%ed%20&'
print urllib.unquote(url)
print html_parser.unescape(url)
print unquote_string(url)
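For comparison, the same idea can be sketched for Python 3, where `HTMLParser.unescape` and `urllib.unquote` are replaced by `html.unescape` and `urllib.parse.unquote`. This is only an illustrative port, not part of the original post; the helper `_until_stable` is a hypothetical name, and the example inputs are made up.

```python
import html
import re
from urllib.parse import unquote

# Same rule as above: only strings with an ftp/http/https scheme count as URLs.
URL_PATTERN = re.compile(r'^(?:ftp|https?)://.{4,}', flags=re.I)


def _until_stable(func, text):
    """Apply func repeatedly until the text stops changing."""
    while True:
        decoded = func(text)
        if decoded == text:
            return text
        text = decoded


def unquote_string(text):
    """Percent-decode URLs; unescape HTML entities in everything else."""
    if URL_PATTERN.search(text):
        return _until_stable(unquote, text)
    return _until_stable(html.unescape, text)


print(unquote_string('http://example.com/a%2520b'))  # http://example.com/a b
print(unquote_string('x &amp;amp; y%20'))            # x & y%20
```

Note that a non-URL string keeps its '%20' here, because only one of the two decoders is applied to any given input.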
It looks like I mixed up '%' and '&' :-(. Post updated.