Removing non-ASCII characters from a file using Python


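I have a scenario where log files sent for analysis contain some non-ASCII characters, which end up breaking one of the analysis tools that I have no control over. So I decided to clean the log myself and came up with the function below. It does the job, except that it skips the entire line whenever it sees one of these characters. I also tried going through the file character by character (see the commented-out code) so that only the offending characters are removed and the actual ASCII characters are kept, but I could not get that to work. Is there any reason why the commented-out logic fails, and are there any suggestions or solutions for this issue?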

Sample line:

1:02:54.934/174573ENQÎNULSUBáyNULEOT/29/abcdefghijg

Function to read the file and remove the lines:

def readlogfile(self, abs_file_name):
    """
    Reads and skip the non-ascii chars line from the attached log file and populate the list self.data_bytes
    abs_file_name file name should be absolute path
    """
    try:
        infile = open(abs_file_name, 'rb')
        for line in infile:
            try:
                line.decode('ascii')
                self._data_bytes.append(line)
            except UnicodeDecodeError as e:
                # print line + "Invalid line skipped in " + abs_file_name
                print line
                continue
            # while 1: #code that didn't work to remove just the non-ascii chars
            #     char = infile.read(1)   # read characters from file
            #     if not char or ord(char) > 127 or ord(char) < 0:
            #         continue
            #     else:
            #         sys.stdout.write(char)
            #         #sys.stdout.write('{}'.format(ord(char)))
            #         #print "%s ord = %d" % (char, ord(char))
            #         self._data_bytes.append(char)
    finally:
        infile.close()

Source

2015-11-04 Guruprasad

The original code in this answer should work for you: http://stackoverflow.com/questions/33511317/removing-non-ascii-characters-from-file-text/33511747#33511747

decode needs another parameter that tells it how to handle bad characters. https://docs.python.org/2/library/stdtypes.html#string-methods

Try this:

print "1:02:54.934/174573ENQÎNULSUBáyNULEOT/29/abcdefghijg".decode("ascii", "ignore")

u'1:02:54.934/174573ENQNULSUByNULEOT/29/abcdefghijg'
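For readers on Python 3, a minimal sketch of the same idea follows. It assumes the log line arrives as a `bytes` object and that the two accented characters in the sample are single Latin-1 bytes (`\xce` for Î, `\xe1` for á); the question does not state the file's encoding, so treat that as an illustration only. If the file were UTF-8 instead, each accented character would be two bytes above 127 and both would be dropped in the same way.

```python
# Sample line from the question, written as raw bytes.
# Assumption: \xce and \xe1 are the Latin-1 bytes for the two accented characters.
raw = b"1:02:54.934/174573ENQ\xceNULSUB\xe1yNULEOT/29/abcdefghijg"

# errors="ignore" silently drops every byte that is not valid ASCII
# instead of raising UnicodeDecodeError.
cleaned = raw.decode("ascii", "ignore")

print(cleaned)
```

The result is a `str` that contains only the ASCII characters of the original line, in their original order.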

Your code can be simplified to something like this:

def readlogfile(self, abs_file_name):
    """
    Reads the attached log file, drops the non-ASCII characters from each line,
    and populates the list self._data_bytes.
    abs_file_name should be an absolute path.
    """
    with open(abs_file_name, 'rb') as infile:
        while True:
            line = infile.readline()
            if not line:
                break  # end of file
            self._data_bytes.append(line.decode("ascii", "ignore"))
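A standalone Python 3 variant of this function might look like the sketch below. The name `read_ascii_lines` and the choice to return a list rather than fill `self._data_bytes` are assumptions made for illustration; they are not part of the original code.

```python
def read_ascii_lines(path):
    """Return the lines of the file at `path` with every non-ASCII byte dropped.

    Hypothetical helper: the name and return type are illustrative only.
    """
    cleaned = []
    # Binary mode, so reading never fails on an unexpected encoding.
    with open(path, "rb") as infile:
        for raw_line in infile:
            # Keep the line, discard only the bytes above 127.
            cleaned.append(raw_line.decode("ascii", "ignore"))
    return cleaned
```

Iterating over the file object directly replaces the explicit `readline()` loop. A line that contains non-ASCII bytes is kept with those bytes removed, rather than skipped as in the question's version.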

Source

2015-11-04 17:13:49

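Can you suggest how to copy the actual text with the special characters? I believe some other characters were lost when I copied it, and the parser still breaks. @Dave_750 – Guruprasad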

You can also try line.decode("ascii", "ignore").encode("ascii") if it is still picky.
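In Python 3 terms, that round trip turns `bytes` into `str` and back into `bytes`, which may help when the downstream tool expects raw bytes. A small sketch, using made-up sample bytes rather than data from the actual log:

```python
# Made-up sample: "caf", a two-byte UTF-8 sequence, a stray high byte, then "ok".
raw = b"caf\xc3\xa9 \xce ok\n"

# Step 1: decode, dropping anything that is not ASCII (gives a str).
ascii_text = raw.decode("ascii", "ignore")

# Step 2: encode again (gives bytes that contain only values below 128).
ascii_bytes = ascii_text.encode("ascii")

print(ascii_bytes)
```

The second step cannot fail, because the first step has already removed everything outside the ASCII range.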

I think this is a reasonable way to handle the offending lines on a character-by-character basis:

import codecs

class MyClass(object):
    def __init__(self):
        self._data_bytes = []

    def readlogfile(self, abs_file_name):
        """
        Reads the attached log file, removes the non-ASCII characters from
        each line, and populates the list self._data_bytes.
        abs_file_name should be an absolute path.
        """
        with codecs.open(abs_file_name, 'r', encoding='utf-8') as infile:
            for line in infile:
                try:
                    line.decode('ascii')
                except UnicodeError as e:
                    # The line has at least one non-ASCII character:
                    # rebuild it from its ASCII characters only.
                    ascii_chars = []
                    for char in line:
                        try:
                            char.decode('ascii')
                        except UnicodeError as e2:
                            continue  # ignore non-ascii characters
                        else:
                            ascii_chars.append(char)
                    line = ''.join(ascii_chars)
                self._data_bytes.append(str(line))
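The character-by-character filter can be written more compactly in Python 3, where a decoded line is a `str` and `ord()` gives each character's code point. This is only a sketch; `strip_non_ascii` is a hypothetical helper name, not something from the answer above.

```python
def strip_non_ascii(text):
    """Return `text` with every character above code point 127 removed.

    Hypothetical helper for illustration; it works on an already decoded str.
    """
    return "".join(ch for ch in text if ord(ch) < 128)


# Fragment of the sample line; \u00ce and \u00e1 are the two accented characters.
sample = "ENQ\u00ceNULSUB\u00e1y"
print(strip_non_ascii(sample))
```

Unlike `decode("ascii", "ignore")`, this operates on text that has already been decoded, so the file still has to be opened with an encoding that can read it.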

Source

2015-11-04 17:57:13 martineau
