2015-11-04 55 views

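Removing non-ASCII characters from a file using Python

I have a scenario where the log files sent for analysis contain some non-ASCII characters, which end up breaking one of the analysis tools that I have no control over. So I decided to clean the logs myself and came up with the following, which does the job except that it skips the whole line whenever it sees such characters. I also tried going through the file character by character (see the commented-out code) so that only those characters are removed and the actual ASCII characters are kept, but without success. Why does the commented-out logic fail, and is there a suggestion or solution for this issue?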

Sample line: 1:02:54.934/174573ENQÎNULSUBáyNULEOT/29/abcdefghijg

Function to read the file and skip lines:

def readlogfile(self, abs_file_name):
    """
    Reads and skip the non-ascii chars line from the attached log file and populate the list self.data_bytes
    abs_file_name file name should be absolute path
    """
    try:
        infile = open(abs_file_name, 'rb')
        for line in infile:
            try:
                line.decode('ascii')
                self._data_bytes.append(line)
            except UnicodeDecodeError as e:
                # print line + "Invalid line skipped in " + abs_file_name
                print line
                continue
            # while 1: #code that didn't work to remove just the non-ascii chars
            #     char = infile.read(1)   # read characters from file
            #     if not char or ord(char) > 127 or ord(char) < 0:
            #         continue
            #     else:
            #         sys.stdout.write(char)
            #         #sys.stdout.write('{}'.format(ord(char)))
            #         #print "%s ord = %d" % (char, ord(char))
            #         self._data_bytes.append(char)
    finally:
        infile.close()

The original code from the poster of this answer should work for you: http://stackoverflow.com/questions/33511317/removing-non-ascii-characters-from-file-text/33511747#33511747

Answers

1

decode takes another argument that says what to do with bad characters. https://docs.python.org/2/library/stdtypes.html#string-methods

Try this:

print "1:02:54.934/174573ENQÎNULSUBáyNULEOT/29/abcdefghijg".decode("ascii", "ignore")

u'1:02:54.934/174573ENQNULSUByNULEOT/29/abcdefghijg' 
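For readers on Python 3, where text strings no longer have a decode method, the same idea applies to bytes objects. The snippet below is only a sketch: encoding the sample line as Latin-1 is my assumption, made purely to obtain some bytes to work with, since the thread does not say how the log is actually encoded.

```python
# The sample line from the question. Encoding it as Latin-1 is an assumption
# made only to get bytes; the real encoding of the log file is not stated.
raw = "1:02:54.934/174573ENQÎNULSUBáyNULEOT/29/abcdefghijg".encode("latin-1")

# "ignore" drops every byte that is not valid ASCII.
ignored = raw.decode("ascii", "ignore")

# "replace" substitutes U+FFFD for each bad byte instead of dropping it.
replaced = raw.decode("ascii", "replace")

print(ignored)
print(replaced)
```

Leaving the second argument out is the same as passing "strict", which raises UnicodeDecodeError on the first bad byte. That is the behaviour the question's code relied on to detect and skip lines.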

And your code can be simplified to something like this:

def readlogfile(self, abs_file_name):
    """
    Reads the attached log file, drops the non-ASCII chars from each line,
    and populates the list self._data_bytes.
    abs_file_name should be an absolute path.
    """
    with open(abs_file_name, 'rb') as infile:
        for line in infile:
            # "ignore" discards undecodable bytes instead of raising
            self._data_bytes.append(line.decode("ascii", "ignore"))
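If you are on Python 3, another option is to let open() do the filtering through its encoding and errors parameters. This is a standalone sketch rather than a drop-in method for the class in the question; the function name read_ascii_lines is made up for illustration.

```python
def read_ascii_lines(path):
    """Return the lines of the file at path with non-ASCII bytes dropped."""
    # errors="ignore" silently discards any byte that is not valid ASCII.
    # newline="" keeps the original line endings instead of translating them.
    with open(path, "r", encoding="ascii", errors="ignore", newline="") as infile:
        return list(infile)
```

The result is a list of text strings, one per line. If the downstream tool expects bytes, each line can be turned back with line.encode("ascii").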

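Can you suggest how to copy the actual text with the special characters? I believe some other characters were lost when I copied it, and the parser still breaks. @Dave_750 – Guruprasad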


You could also try line.decode("ascii", "ignore").encode("ascii") if it is still picky
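One possible reason the parser still breaks (an assumption on my part, nothing in the thread confirms it) is that the sample line shows names such as ENQ, NUL, SUB and EOT, which may stand for ASCII control characters. Those have codes below 128, so decoding with "ignore" keeps them. A Python 3 sketch that keeps only printable ASCII plus a small whitelist of whitespace bytes could look like this; the function name and the whitelist are illustrative choices, not something from the question.

```python
# Bytes kept in addition to printable ASCII: tab, line feed, carriage return.
# This whitelist is an assumption; adjust it to whatever the parser accepts.
KEEP_CONTROL = {9, 10, 13}


def keep_printable_ascii(raw):
    """Return raw (bytes) reduced to printable ASCII plus tab, LF and CR."""
    # 32..126 is the printable ASCII range (space through tilde).
    return bytes(b for b in raw if 32 <= b < 127 or b in KEEP_CONTROL)


print(keep_printable_ascii(b"1:02\x05\xce\x00\x1a y\x00\x04/29\n"))
```

Whether stripping control characters is safe depends on the log format, so it is worth checking a few cleaned lines by hand first.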

0

I think this is a reasonable way to handle the offending lines on a character-by-character basis:

import codecs

class MyClass(object):
    def __init__(self):
        self._data_bytes = []

    def readlogfile(self, abs_file_name):
        """
        Reads and skips the non-ascii chars line from the attached log file and
        populate the list self.data_bytes abs_file_name file name should be
        absolute path
        """
        with codecs.open(abs_file_name, 'r', encoding='utf-8') as infile:
            for line in infile:
                try:
                    line.decode('ascii')
                except UnicodeError as e:
                    ascii_chars = []
                    for char in line:
                        try:
                            char.decode('ascii')
                        except UnicodeError as e2:
                            continue  # ignore non-ascii characters
                        else:
                            ascii_chars.append(char)
                    line = ''.join(ascii_chars)
                self._data_bytes.append(str(line))
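As for why the commented-out loop in the question did not work, my reading of the snippet (not confirmed anywhere in the thread) is that it calls infile.read(1) inside a loop that is already iterating over infile, and that the continue taken on an empty read never leaves the while 1 loop, so it spins forever at end of file. Below is a minimal Python 3 sketch of byte-by-byte filtering that breaks out at end of file. The function names and the chunk size are illustrative assumptions.

```python
def strip_non_ascii(raw):
    """Return raw (bytes) with every byte above 127 removed."""
    # Iterating over bytes in Python 3 yields integers in the range 0..255.
    return bytes(b for b in raw if b < 128)


def read_ascii_bytes(path, chunk_size=4096):
    """Read the file at path and return its content with non-ASCII bytes dropped."""
    kept = bytearray()
    with open(path, "rb") as infile:
        while True:
            chunk = infile.read(chunk_size)
            if not chunk:
                # An empty read means end of file, so leave the loop
                # rather than continuing.
                break
            kept.extend(strip_non_ascii(chunk))
    return bytes(kept)
```

Reading in chunks instead of one byte at a time keeps the same logic while avoiding one read call per character.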