Struggling with Unicode in Python

I'm trying to automate extracting data from a large number of files, and it works for the most part. It falls over when it hits non-ASCII characters:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 5: ordinal not in range(128)

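How do I set my 'brand' to UTF-8? My code is being repurposed from something else (which used lxml) that had no such problem. I've seen lots of discussion about encode/decode, but I don't understand how I'm supposed to implement it. The code below is cut down to just the relevant parts - I've removed the rest.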

i = 0 

filenames = [y for x in os.walk("Distributor") for y in glob(os.path.join(x[0], '*.xml'))] 

for i in range (len(filenames)): 
    pathname = filenames[i] 

    fin = open(pathname, 'r') 
    with codecs.open(('Assets'+'.log'), mode='w', encoding='utf-8') as f: 
     f.write(u'File Path|Brand\n') 
     lines = fin.read() 
     brand_start = lines.find("Brand Title") 
     brand_end = lines.find("/>",brand_start) 
     brand = lines [brand_start+47:brand_end-2] 
     f.write(u'{}|{}\n'.format(pathname[4:35],brand)) 

flog.close()

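I'm sure there's a better way to write the whole thing, but at the moment my focus is just on working out how to get the lines/read functions to work with UTF-8.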

2015-04-20 Nick

You should show the full error, including the traceback. Among other things, it says which line the error occurred on.

http://nedbatchelder.com/text/unipain.html – tripleee

You are mixing Unicode values with byte strings; your fin file object produces byte strings, and you are mixing them with Unicode here:

f.write(u'{}|{}\n'.format(pathname[4:35],brand))

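brand is a byte string being interpolated into a Unicode format string. Either decode brand there, or better still, use io.open() (rather than codecs.open(), which is not as robust as the newer io module) to manage both your files: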

with io.open('Assets.log', 'w', encoding='utf-8') as f,\
     io.open(pathname, encoding='utf-8') as fin:
    f.write(u'File Path|Brand\n')
    lines = fin.read()
    brand_start = lines.find(u"Brand Title")
    brand_end = lines.find(u"/>", brand_start)
    brand = lines[brand_start + 47:brand_end - 2]
    f.write(u'{}|{}\n'.format(pathname[4:35], brand))
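
For comparison, here is a rough sketch of the first option mentioned above: reading bytes and decoding them yourself before any slicing or formatting. The helper name extract_between and the sample markup are made up for illustration, and it assumes the files really are UTF-8.

```python
def extract_between(raw_bytes, start_marker, end_marker, encoding='utf-8'):
    """Decode bytes to text first, then slice out what sits between two markers.

    Hypothetical helper for illustration; returns None if a marker is missing.
    """
    text = raw_bytes.decode(encoding)
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]


# Invented sample: the first letter encodes to the bytes 0xc5 0xa0 in UTF-8,
# the same lead byte that appears in the error message.
sample = u'<Brand Title="\u0160koda"/>'.encode('utf-8')
brand = extract_between(sample, u'Title="', u'"/>')

# brand is now text, so it can be interpolated into a text format string.
row = u'{}|{}\n'.format(u'some/path.xml', brand)
```

Searching for the markers and using their lengths also avoids hard-coded offsets such as + 47, which only hold while the surrounding markup keeps exactly the same width.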

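It also looks like you are parsing the XML files by hand; perhaps you want to use the ElementTree API to pull out those values instead. In that case you would open the file without io.open(), so that it produces byte strings and the XML parser can decode the data to Unicode values correctly itself.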
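
A minimal sketch of that approach, using the standard library's xml.etree.ElementTree. The document structure here (an Assets root with a Brand element carrying a Title attribute) is invented for the example; the real files will differ.

```python
import io
import xml.etree.ElementTree as ET

# Invented stand-in for a file on disk, held as UTF-8 bytes.
xml_bytes = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<Assets><Brand Title="\u0160koda"/></Assets>'
).encode('utf-8')

# Hand the parser bytes (a binary file object, or just a filename) and let
# it apply the encoding declared in the XML itself.
tree = ET.parse(io.BytesIO(xml_bytes))
brand_el = tree.getroot().find('Brand')
title = brand_el.get('Title') if brand_el is not None else u'Missing'
```

With real files, ET.parse(pathname) or ET.parse(open(pathname, 'rb')) would take the place of the BytesIO object.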

2015-04-20 18:11:10

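Thanks, that solved the underlying problem. The last issue is that it overwrites the file contents, so I only get two lines: "File Path | Brand" and "SYNT0000000000001045-20150331T095311Z | Something Here |". I changed 'w' to 'a', but then "File Path|Brand" is repeated on every other line. Suggestions? – Nick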

@Nick: why not create the file outside of any loop?
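
A small sketch of what that suggestion could look like: the log is opened once in 'w' mode, the header is written once, and each file only adds a row inside the loop. The write_log helper and the sample rows are hypothetical, and a temporary directory stands in for the real output location.

```python
import io
import os
import tempfile


def write_log(log_path, rows):
    """Write the header once, then one line per (path, brand) pair."""
    with io.open(log_path, 'w', encoding='utf-8') as f:
        f.write(u'File Path|Brand\n')
        for path, brand in rows:
            f.write(u'{}|{}\n'.format(path, brand))


# Invented rows standing in for values extracted from the XML files.
rows = [(u'a.xml', u'\u0160koda'), (u'b.xml', u'Citro\u00ebn')]
log_path = os.path.join(tempfile.mkdtemp(), 'Assets.log')
write_log(log_path, rows)

# Read the log back to check what was written.
with io.open(log_path, encoding='utf-8') as f:
    log_lines = f.read().splitlines()
```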

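Also, you are right. I have been using lxml to parse part of the XML. This was meant to be a quick and dirty solution, because I couldn't work out how to handle this particular scenario (there are many similar children in the structure). Once I've dealt with the immediate need to get the information out of the files, I'll open a separate thread to get it working properly. – Nick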

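This is my final code, using the guidance above. It's not pretty, but it solves the problem. I'll look at getting it all working with lxml at a later date (this is something I've run into before when working with different, larger XML files):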

import io
import os
from glob import glob

from lxml import etree

nsmap = {'xmlns': 'thisnamespace'}

filenames = [y for x in os.walk("Distributor") for y in glob(os.path.join(x[0], '*.xml'))]

# Open the log once, outside the loop, so the header is written only once
with io.open('Assets.log', 'w', encoding='utf-8') as f:
    f.write(u'File Path|Series|Brand\n')

    for pathname in filenames:
        tree = etree.parse(pathname, etree.XMLParser())
        root = tree.getroot()

        # Read the raw text once per file; a second fin.read() would return u''
        with io.open(pathname, encoding='utf-8') as fin:
            lines = fin.read()

        for info in root.xpath('//somepath'):
            series_x = info.find('./somemorepath')
            series = series_x.get('Asset_Name') if series_x is not None else 'Missing'
            brand_start = lines.find(u"sometext")
            brand_end = lines.find(u"/>", brand_start)
            brand = lines[brand_start:brand_end - 2]
            brand = brand[brand.rfind(u"/") + 1:]
            f.write(u'{}|{}|{}\n'.format(pathname[5:42], series, brand))

Someone will now come along and do it all in one line!

2015-04-21 12:51:53 Nick
