2013-05-10

Getting the same Unicode string length in Python 2 and Python 3?

Ugh, Python 2/3 is so frustrating... Consider this example, test.py:

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 

import sys
if sys.version_info[0] < 3:
    text_type = unicode
    binary_type = str
    def b(x):
        return x
    def u(x):
        return unicode(x)
else:
    text_type = str
    binary_type = bytes
    import codecs
    def b(x):
        return codecs.latin_1_encode(x)[0]
    def u(x):
        return x

tstr = " ▲ " 

sys.stderr.write(tstr) 
sys.stderr.write("\n") 
sys.stderr.write(str(len(tstr))) 
sys.stderr.write("\n") 

Running it:

$ python2.7 test.py 
▲ 
5 
$ python3.2 test.py 
▲ 
3 
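
For what it's worth, the 5 and the 3 appear to be two different measurements of the same string: Python 2.7 treats the plain literal as a byte string and counts its UTF-8 bytes, while Python 3.2 treats it as text and counts code points. A minimal sketch that should run unchanged on both versions, assuming test.py is saved as UTF-8 (the names raw, text, byte_count and code_point_count are only illustrative):

```python
# -*- coding: utf-8 -*-
# The UTF-8 bytes of " ▲ ", written with escapes so the literal means the
# same thing on Python 2.7 and Python 3.2.
raw = b" \xe2\x96\xb2 "

# Decoding turns the bytes into text: unicode on Python 2, str on Python 3.
text = raw.decode("utf-8")

byte_count = len(raw)         # bytes: space + 3 bytes for U+25B2 + space
code_point_count = len(text)  # code points: space + U+25B2 + space

print("%d bytes, %d code points" % (byte_count, code_point_count))
```

Either interpreter should print `5 bytes, 3 code points`.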

Great, I get two different string sizes. Hopefully wrapping the string in one of these wrappers I found around the net will help?

For tstr = text_type(" ▲ "):

$ python2.7 test.py 
Traceback (most recent call last): 
    File "test.py", line 21, in <module> 
    tstr = text_type(" ▲ ") 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128) 
$ python3.2 test.py 
▲ 
3 

For tstr = u(" ▲ "):

$ python2.7 test.py 
Traceback (most recent call last): 
    File "test.py", line 21, in <module> 
    tstr = u(" ▲ ") 
    File "test.py", line 11, in u 
    return unicode(x) 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128) 
$ python3.2 test.py 
▲ 
3 

For tstr = b(" ▲ "):

$ python2.7 test.py 
▲ 
5 
$ python3.2 test.py 
Traceback (most recent call last): 
    File "test.py", line 21, in <module> 
    tstr = b(" ▲ ") 
    File "test.py", line 17, in b 
    return codecs.latin_1_encode(x)[0] 
UnicodeEncodeError: 'latin-1' codec can't encode character '\u25b2' in position 1: ordinal not in range(256) 

For tstr = binary_type(" ▲ "):

$ python2.7 test.py 
▲ 
5 
$ python3.2 test.py 
Traceback (most recent call last): 
    File "test.py", line 21, in <module> 
    tstr = binary_type(" ▲ ") 
TypeError: string argument without an encoding 
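
The tracebacks above seem to share one cause: each wrapper converts between bytes and text either with no encoding given (so Python 2 falls back to its default, normally ASCII, and Python 3's bytes() refuses outright) or with Latin-1, which cannot represent ▲. The sketch below reproduces the two codec failures and the working UTF-8 round trip on either version; the variable names are illustrative and not part of test.py:

```python
# -*- coding: utf-8 -*-
raw = b" \xe2\x96\xb2 "  # UTF-8 bytes of " ▲ "

# 1. Decoding as ASCII, which is what Python 2's unicode(x) does when no
#    encoding is passed, stops at the first byte above 0x7f.
try:
    raw.decode("ascii")
    ascii_error_position = None
except UnicodeDecodeError as exc:
    ascii_error_position = exc.start  # index of the offending byte 0xe2

# 2. Decoding with the encoding the bytes are actually in succeeds.
text = raw.decode("utf-8")

# 3. Encoding to Latin-1 fails because U+25B2 is above U+00FF.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# 4. Encoding back to UTF-8 round-trips to the original bytes.
round_trip = text.encode("utf-8")
```

Here ascii_error_position matches the "position 1" reported in the Python 2.7 tracebacks.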

Well, that certainly makes things easy.

So, how do I get the same string length (in this case, 3) in both Python 2.7 and 3.2?

Answer


Well, it turns out that unicode() in Python 2.7 takes an encoding argument, and that apparently helps:

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 

import sys 
if sys.version_info[0] < 3:
    text_type = unicode
    binary_type = str
    def b(x):
        return x
    def u(x):
        return unicode(x, "utf-8")  # decode the UTF-8 source bytes explicitly
else:
    text_type = str
    binary_type = bytes
    import codecs
    def b(x):
        return codecs.latin_1_encode(x)[0]
    def u(x):
        return x

tstr = u(" ▲ ") 

sys.stderr.write(tstr) 
sys.stderr.write("\n") 
sys.stderr.write(str(len(tstr))) 
sys.stderr.write("\n") 

Running this, I get what I need:

$ python2.7 test.py 
▲ 
3 
$ python3.2 test.py 
▲ 
3
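
As a possible variation on the same idea, the version check can be dropped by testing the value's type instead, since bytes is an alias for str on Python 2.7. This is only a sketch under the assumption that the source file really is UTF-8; to_text and to_bytes are hypothetical helper names, not something from a library:

```python
# -*- coding: utf-8 -*-

def to_text(x, encoding="utf-8"):
    """Return x as text (unicode on Python 2, str on Python 3)."""
    if isinstance(x, bytes):  # on Python 2.7, bytes is the same type as str
        return x.decode(encoding)
    return x


def to_bytes(x, encoding="utf-8"):
    """Return x as a byte string, encoding text if necessary."""
    if isinstance(x, bytes):
        return x
    return x.encode(encoding)


tstr = to_text(" ▲ ")
length = len(tstr)                    # code points
encoded_length = len(to_bytes(tstr))  # UTF-8 bytes
```

With these helpers, length should be 3 and encoded_length 5 under both Python 2.7 and Python 3.2.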