Python requests downloads the wrong sound file from Google Translate

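I'm using the script below to download the audio for the Chinese word 老師 ("teacher"), but when I run it, the file I get is different from the one at that URL. I think this is an encoding problem, but since I specified UTF-8, I don't know what is going on.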

#!/usr/bin/python 
# -*- coding: utf-8 -*- 

import requests 

url = "http://translate.google.com/translate_tts?tl=zh-CN&q=老師" 

r = requests.get(url) 

with open('test.mp3', 'wb') as test: 
    test.write(r.content)

UPDATE:

Following @abarnert's suggestions, I have checked that the file is UTF-8 with a BOM and tested the code with 'idna'.

#!/usr/bin/python3 
# -*- coding: utf-8 -*- 

import requests 

url_1 = "http://translate.google.com/translate_tts?tl=zh-CN&q=老師" 
url_2 = "http://translate.google.com/translate_tts?tl=zh-CN&q=\u8001\u5e2b" 

r_1 = requests.get(url_1) 
r_1_b = requests.get(url_1.encode('idna')) 
r_2 = requests.get(url_2) 
r_2_b = requests.get(url_2.encode('idna')) 

# This downloads nonsense: 
with open('r_1.mp3', 'wb') as test: 
    test.write(r_1.content) 

# This throws the error specified at bottom: 
with open('r_1_b.mp3', 'wb') as test: 
    test.write(r_1_b.content) 

# This parses the characters individually, producing 
# a file consisting of "u, eight, zero..." in Mandarin 
with open('r_2.mp3', 'wb') as test: 
    test.write(r_2.content) 

# This produces a sound file consisting of "u, eight, zero, zero..." in Mandarin 
with open('r_2_b.mp3', 'wb') as test: 
    test.write(r_2_b.content)

The error I get is:

Traceback (most recent call last): 
    File "/home/MZ/Desktop/tts3.py", line 12, in <module> 
    r_1_b = requests.get(url_1.encode('idna')) 
    File "/usr/lib64/python2.7/encodings/idna.py", line 164, in encode 
    result.append(ToASCII(label)) 
    File "/usr/lib64/python2.7/encodings/idna.py", line 76, in ToASCII 
    label = nameprep(label) 
    File "/usr/lib64/python2.7/encodings/idna.py", line 21, in nameprep 
    newlabel.append(stringprep.map_table_b2(c)) 
    File "/usr/lib64/python2.7/stringprep.py", line 197, in map_table_b2 
    b = unicodedata.normalize("NFKC", al) 
TypeError: must be unicode, not str 
[Finished in 15.3s with exit code 1]

Source

2015-05-06 zadrozny

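Where did you specify UTF-8? Not in your code, your URL, your source file encoding, or anything else I can see. – abarnert

Also, is this Python 2 or 3? – abarnert

Sorry, I left out the header lines. I've tried it in both 2 and 3. – zadrozny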

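I've been able to reproduce your problem in Python 2 on both Linux and Windows (although the nonsense I get is different on each). But I can't reproduce it in Python 3, and I don't think you really did either.

The short version is: you always want to use Unicode string literals if you want to include non-ASCII characters. In Python 2, that means a u prefix (in Python 3, the u prefix is meaningless but harmless):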

url = u"http://translate.google.com/translate_tts?tl=zh-CN&q=老師"

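And the safest thing to do (because then getting the encoding wrong in your text editor or in your coding declaration can't affect anything) is: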

url_2 = u"http://translate.google.com/translate_tts?tl=zh-CN&q=\u8001\u5e2b"

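Without this, you're passing a bunch of UTF-8 bytes to requests without telling it that they're UTF-8.

What I'd expect it to do in that case is look at sys.getdefaultencoding(), which is probably 'ascii' at least on Mac and Linux, try to decode with that, and get an exception. On Windows, it might be 'cp1252' or 'big5' or whatever your system is set to, so it might send mojibake.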
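To make that expectation concrete, here is a minimal Python 3 sketch (not what requests does internally, just the two decoding outcomes described above). The variable names are illustrative, and errors="replace" is an assumption added so the cp1252 case does not raise on bytes that codec leaves undefined.

```python
# Sketch only: decode the UTF-8 bytes of 老師 with the "wrong" codecs.
text = "\u8001\u5e2b"                # 老師
utf8_bytes = text.encode("utf-8")    # the bytes a UTF-8 source file would contain

# Decoding as ASCII fails, because every byte is >= 0x80.
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding as cp1252 "succeeds" but yields mojibake.
# errors="replace" substitutes U+FFFD for bytes cp1252 does not define.
mojibake = utf8_bytes.decode("cp1252", errors="replace")

print(ascii_failed)
print(repr(mojibake))
```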

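But that's not what it actually does. I'm not sure what it is doing, but it correctly guesses UTF-8 on Mac; on Linux it does something weird that results in three different tones of "I uh" (I think it's just interpreting the bytes as the equivalent code points, so 老 becomes U+00E8, U+0080, U+0081?); and on Windows it does something different and weird that starts with the same first syllable but continues with different ones.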
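The byte-to-code-point guess can be checked by hand. The sketch below uses Latin-1 decoding as a stand-in for "treat each byte as the code point with the same value"; whether requests did exactly this on Linux is not confirmed here.

```python
# Sketch only: reinterpret each UTF-8 byte of 老 as a code point of the same value.
text = "\u8001"                              # 老
utf8_bytes = text.encode("utf-8")            # three bytes: E8 80 81

# Latin-1 maps byte N to code point U+00NN, so it models the guess above.
as_code_points = utf8_bytes.decode("latin-1")
code_points = ["U+%04X" % ord(ch) for ch in as_code_points]

print(code_points)
```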

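For url_2, it's a bit simpler: in a non-Unicode string in 2.x, \u8001 is not treated as an escape sequence; it's just six characters: backslash, u, 8, 0, 0, 1. requests will dutifully send those to Google, and Google will translate them and send them back to you as someone reading out those characters.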
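The same effect can be seen in Python 3 with a raw string, which is used here only as a stand-in for a Python 2 non-Unicode literal (an assumption for illustration; the two are not the same type).

```python
# Sketch only: an unprocessed \u8001 is six separate characters,
# while a processed escape is the single character 老.
literal = r"\u8001"    # raw string: the backslash escape is NOT interpreted
escaped = "\u8001"     # normal text string: the escape IS interpreted

print(list(literal))   # the individual characters that would be sent
print(len(escaped))
```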

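But if you add the u prefix, they both work.

And in Python 3, with or without the u prefix, they both work. (Interestingly, in 3.x it even works with a b prefix... but apparently only because it always assumes that bytes in 3.x are UTF-8; if I give it Big5 bytes, it interprets them as UTF-8, even when my sys.getdefaultencoding is set correctly.)

Also, manually percent-encoding the query string works, but it's not necessary and doesn't make any difference.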
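For reference, manual encoding of the query string might look like the following Python 3 sketch. It uses only the standard library's urllib.parse; the base and query names are illustrative, and no request is actually sent.

```python
# Sketch only: percent-encode the query by hand instead of relying on requests.
from urllib.parse import quote, urlencode

base = "http://translate.google.com/translate_tts"

# quote() encodes the text as UTF-8 and percent-escapes each byte.
encoded_q = quote("\u8001\u5e2b")            # 老師

# urlencode() does the same for a whole set of parameters.
query = urlencode({"tl": "zh-CN", "q": "\u8001\u5e2b"})
url = base + "?" + query

print(encoded_q)
print(url)
```

The resulting URL is plain ASCII, so it no longer matters how the source file or the string literal was encoded.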

Source

2015-05-07 03:54:14 abarnert
