2014-12-19 49 views

Python URL decode %E3

I got some Wikipedia URLs from a Freebase dump:

URL 1: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa

URL 2: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brand%E3o_Costa

They both point to the same Wikipedia page:

URL 3: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brandão_Costa

urllib.unquote works on URL 1:

import urllib  # Python 2

url = 'Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa'
url = urllib.unquote(url)  # first pass turns %25 back into %
url = urllib.unquote(url)  # second pass decodes %C3%A3
print url

The result is:

Pedro_Miguel_de_Castro_Brandão_Costa 

But it does not work on URL 2:

url = 'Pedro_Miguel_de_Castro_Brand%E3o_Costa' 
url = urllib.unquote(url) 
print url 

The result is:

Pedro_Miguel_de_Castro_Brand�o_Costa  

Is something wrong?

Answer


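The former is doubly quoted (percent-encoded twice) UTF-8, which prints correctly because your terminal uses UTF-8. The latter is quoted Latin-1, which needs to be decoded first.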

>>> print 'Pedro_Miguel_de_Castro_Brand\xe3o_Costa' 
Pedro_Miguel_de_Castro_Brand�o_Costa 
>>> print 'Pedro_Miguel_de_Castro_Brand\xe3o_Costa'.decode('latin-1') 
Pedro_Miguel_de_Castro_Brandão_Costa 
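
For readers on Python 3, where `urllib.unquote` has moved to `urllib.parse`, the same idea might be sketched as below. This is only an illustration, not part of the original answer: `decode_wiki_title` and its `max_rounds` parameter are hypothetical names, and the "try UTF-8, fall back to Latin-1" rule is an assumption that happens to fit these two URLs.

```python
from urllib.parse import unquote_to_bytes


def decode_wiki_title(quoted, max_rounds=2):
    """Hypothetical helper: undo percent-encoding, trying UTF-8 before Latin-1."""
    text = quoted
    for _ in range(max_rounds):
        # Turn %XX escapes into raw bytes without guessing an encoding yet.
        raw = unquote_to_bytes(text)
        try:
            decoded = raw.decode('utf-8')
        except UnicodeDecodeError:
            # Assumption: bytes that are not valid UTF-8 are Latin-1.
            decoded = raw.decode('latin-1')
        if decoded == text:
            break  # nothing left to unquote
        text = decoded
    return text


title_1 = decode_wiki_title('Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa')
title_2 = decode_wiki_title('Pedro_Miguel_de_Castro_Brand%E3o_Costa')
print(title_1 == title_2)  # True
```

The fallback is a heuristic: a Latin-1 byte sequence that also happens to be valid UTF-8 would be decoded as UTF-8, and a title that genuinely contains a literal `%25` would be unquoted one time too many.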

I needed to add `encode('utf8')` to get it to print correctly, i.e. `print '...'.decode('latin-1').encode('utf8')`. Thank you very much for the quick help. – icycandy 2014-12-19 07:19:33