Python: open and read a file containing German umlauts as unicode

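I have written a program that reads words from a text file and enters them into a sqlite database, treating them as strings. But I need to enter some words containing German characters: ä, ö, ß.

Here is a prepared piece of code:

I tried both # -*- coding: iso-8859-15 -*- and # -*- coding: utf-8 -*-, with no difference(!)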

    # -*- coding: iso-8859-15 -*-
    import sqlite3 

    dbname = 'sampledb.db' 
    filename ='text.txt' 


    con = sqlite3.connect(dbname) 
    cur = con.cursor() 
    cur.execute('''create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY,name)''')  

    #f=open(filename) 
    #text = f.readlines() 
    #f.close() 

    text = u'süß' 

    print (text) 
    cur.execute("insert into table1 (id,name) VALUES (NULL,?)",(text,))  

    con.commit() 

    sentence = "The name is: %s" %(text,) 

    print (sentence) 
    con.close()

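The code above runs fine. But I need to read 'text' from a file containing the word 'süß'. So when I uncomment the 3 lines (f=open(filename) ...) and comment out text = u'süß', it raises this error: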

sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type.

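I tried the codecs module to read the file as utf-8 and as iso-8859-15. But I could not decode the result to the string 'süß', which I need to complete my sentence at the end of the code.

I also tried decoding to utf-8 before inserting into the database. That works, but I cannot use the result as a string.

Is there a way to import süß from a file and use it both for inserting into sqlite and as a string?

More details:

Here I add more details for clarification. I have used codecs.open before. The text file containing the word süß is saved as utf-8. Using f=codecs.open(filename, 'r', 'utf-8') and text=f.read(), I read the file as unicode u'\ufeffs\xfc\xdf'. Inserting this unicode into sqlite3 works without problems: cur.execute("insert into table1 (id,name) VALUES (NULL,?)",(text,)).

The problem is here: sentence = "The name is: %s" %(text,) gives u'The name is: \ufeffs\xfc\xdf', and I also need print(text) to output süß, whereas print(text) raises this error: UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>.

Thanks.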

Source

2014-03-03 Amin

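The encoding parameter *should* make a big difference for your 'text' literal. –

Clarification: the encoding declaration at the top of the module affects the "text = u'süß'" literal specified in the source code. It has *no* effect on text read from a file. You can use 'codecs.open()' for the latter. – jfs

'readlines' returns a list. Use 'f.read().strip()' to get the file's text as a string. Then you can start worrying about encodings. – alexis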
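To illustrate the comment about 'readlines', here is a minimal sketch. It assumes a throwaway UTF-8 file created in a temporary directory (the path is hypothetical, not the asker's text.txt) and uses io.open, which accepts an encoding argument on both Python 2.6+ and Python 3.

```python
import io
import os
import tempfile

# Hypothetical sample file, created only so the sketch is self-contained.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "text.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"s\xfc\xdf\n")          # the word "süß" followed by a newline

with io.open(path, "r", encoding="utf-8") as f:
    lines = f.readlines()            # a list with one entry per line

with io.open(path, "r", encoding="utf-8") as f:
    text = f.read().strip()          # one unicode string, newline stripped

print(type(lines).__name__)          # the container type returned by readlines
print(len(text))                     # number of characters in the stripped word

os.remove(path)
os.rmdir(tmpdir)
```

The list is what cannot be bound as a single SQL parameter; the stripped string can.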

I was able to sort out the problem. Thanks for your help.

Here it is:

# -*- coding: iso-8859-1 -*- 

import sys 
import codecs 
import sqlite3 

f = codecs.open("suess_sweet.txt", "r", "utf-8") # suess_sweet.txt file contains two 
text_in_unicode = f.read()       # comma-separated words: süß, sweet 
f.close() 

stdout_encoding = sys.stdout.encoding or sys.getfilesystemencoding() 

con = sqlite3.connect('dict1.db') 
cur = con.cursor() 
cur.execute('''create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY,German,English)''')  

[ger,eng] = text_in_unicode.split(',') 

cur.execute('''insert into table1 (id,German,English) VALUES (NULL,?,?)''',(ger,eng))  

con.commit() 

sentence = "The German word is: %s" %(ger,) 

print sentence.encode(stdout_encoding) 

con.close()

I got some help from this page (which is in German).

The output is:

The German word is: ?süß

There is still a small problem, which is the "?". I thought that after encoding, the unicode u' was replaced by ?. sentence gives:

>>> sentence 
u'The German word is: \ufeffs\xfc\xdf '

and the encoded sentence gives:

>>> sentence.encode(stdout_encoding) 
'The German word is: ?s\xfc\xdf '

So it was not what I thought.

A simple solution that came to my mind for getting rid of the question mark is to use the replace function:

sentence = "The German word is: %s" %(ger,) 
to_print = sentence.encode(stdout_encoding) 
to_print = to_print.replace('?','') 

>>> print(to_print) 
The German word is: süß
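One caveat with a blanket replace('?','') is that it also removes question marks that belong to the text. A possible alternative, sketched below, is to strip u'\ufeff' from the unicode string before formatting and encoding. The sketch assumes cp1252 as a stand-in for the console encoding and uses the "replace" error handler to reproduce the "?"; the example sentence is made up for illustration.

```python
ger = u"\ufeffs\xfc\xdf"                      # "süß" with a leading BOM, as read from the file

# Assumption: cp1252 stands in for stdout_encoding; "replace" turns
# unencodable characters (here the BOM) into "?".
question = u"Is the German word %s?" % (ger,)
encoded = question.encode("cp1252", "replace")

# Blanket replacement drops the BOM placeholder AND the real question mark.
blanket = encoded.replace(b"?", b"")

# Stripping the BOM from the unicode string first keeps the real "?".
clean = (u"Is the German word %s?" % (ger.lstrip(u"\ufeff"),)).encode("cp1252")

print(blanket.endswith(b"?"))
print(clean.endswith(b"?"))
```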

Thank you :)

Source

2014-03-08 22:41:05 Amin

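When you open and read a file, you get 8-bit strings, not Unicode. To get Unicode strings, use codecs.open to open the file instead:

f=codecs.open(filename, 'r', 'utf-8')

Of course, you may need to use 'iso-8859-15' instead, depending on how the file was written.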
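Which codec is right depends on the bytes actually stored in the file. As a rough sketch, the snippet below writes "süß" as iso-8859-15 to a hypothetical temporary file and then reads it back with the matching codec and with utf-8. It uses io.open rather than codecs.open only because io.open takes the same encoding name and is available on both Python 2.6+ and Python 3.

```python
import io
import os
import tempfile

# Hypothetical file holding "süß" encoded as iso-8859-15 (bytes 73 FC DF).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "latin9.txt")
with io.open(path, "wb") as f:
    f.write(u"s\xfc\xdf".encode("iso-8859-15"))

# Reading with the codec the file was written in gives the expected text.
with io.open(path, "r", encoding="iso-8859-15") as f:
    text = f.read()

# Reading the same bytes as utf-8 fails, because 0xFC is not valid UTF-8.
try:
    with io.open(path, "r", encoding="utf-8") as f:
        f.read()
    wrong_codec_error = None
except UnicodeDecodeError as exc:
    wrong_codec_error = exc

print(len(text))
print(type(wrong_codec_error).__name__)

os.remove(path)
os.rmdir(tmpdir)
```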

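Edit: One big difference between your test code and the commented-out code is that reading from the file produces a list, while the test uses a single string. Maybe your problem has nothing to do with Unicode at all. Try making this substitution in your test code and see whether it produces the same error: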

text = [u'süß']

Unfortunately, I don't have enough experience with SQL in Python to help you any further.
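The list-versus-string hypothesis can be checked with a small sketch like the following. It assumes an in-memory database and catches the generic sqlite3.Error, since the exact exception class and message for an unsupported parameter type differ between Python versions.

```python
import sqlite3

con = sqlite3.connect(":memory:")            # throwaway in-memory database
cur = con.cursor()
cur.execute("create table table1 (id INTEGER PRIMARY KEY, name)")

as_list = [u"s\xfc\xdf\n"]                   # what readlines() would hand back

# Binding the whole list as one parameter is rejected by sqlite3.
try:
    cur.execute("insert into table1 (id, name) VALUES (NULL, ?)", (as_list,))
    list_error = None
except sqlite3.Error as exc:
    list_error = exc

# Binding a single unicode string works.
as_text = as_list[0].strip()
cur.execute("insert into table1 (id, name) VALUES (NULL, ?)", (as_text,))
con.commit()

rows = cur.execute("select name from table1").fetchall()
con.close()

print(list_error is not None)
print(len(rows))
```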

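Also, when you print a list rather than a single string, the Unicode characters are replaced with their equivalent escape sequences. To see what a string really looks like, print the strings one at a time. If you're curious, this is the difference between __str__ and __repr__.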
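A short sketch of that difference: converting a list to text uses the repr of each element, so characters show up as escape sequences there. Exactly which characters get escaped depends on the Python version (Python 2 escapes every non-ASCII character, Python 3 only non-printable ones), so the check below sticks to the invisible u'\ufeff', which is escaped on both.

```python
text = u"\ufeffs\xfc\xdf"          # BOM followed by "süß"

# str() of a list builds its output from repr() of every element.
as_list_text = str([text])
same_as_repr = (as_list_text == "[" + repr(text) + "]")

# The list form contains the six-character escape sequence \ufeff ...
has_escape = "\\ufeff" in as_list_text

print(same_as_repr)
print(has_escape)
print(len(text))                   # ... while the string itself holds one BOM character
```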

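Edit 2: The character u'\ufeff' is called the Byte Order Mark, or BOM, and is inserted by some editors to indicate that the file really is UTF-8. You should remove it before using the string. There should only be one, at the very beginning of the file. See e.g. Reading Unicode file data with BOM chars in Python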
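One way to drop the BOM while reading is Python's 'utf-8-sig' codec, which skips a leading BOM if one is present. The sketch below assumes a hypothetical temporary file that starts with a BOM and contrasts the two codecs; io.open is used so the same code runs on Python 2.6+ and Python 3.

```python
import codecs
import io
import os
import tempfile

# Hypothetical UTF-8 file that starts with a BOM, as some editors write it.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "suess_sweet.txt")
with io.open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + u"s\xfc\xdf, sweet".encode("utf-8"))

# Plain utf-8 keeps the BOM as the first character of the text.
with io.open(path, "r", encoding="utf-8") as f:
    with_bom = f.read()

# utf-8-sig consumes a leading BOM, if any, before returning the text.
with io.open(path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()

print(with_bom.startswith(u"\ufeff"))
print(without_bom.startswith(u"\ufeff"))

os.remove(path)
os.rmdir(tmpdir)
```

The string read with 'utf-8-sig' can then be split, inserted into sqlite and formatted into a sentence without any BOM placeholder appearing in the output.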

Source

2014-03-03 03:10:55

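I have added more details to the question. – Amin

@Amin, next time you tell someone that you've added details to the question, please do it **after** making the edit. I completely missed it until I had finished my own edit. –

Sorry, and thanks for the advice. – Amin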
