Reading a double-byte file

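I don't know whether there is a simple way in Tcl to read a double-byte file (or so I think it is called). My problem is that the files look fine when I open them in Notepad (I'm on Win7), but when I read them in Tcl there are spaces (or rather, null characters) between every character.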

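My current workaround has been to first run a string map to remove all the nulls

string map {\0 {}} $file

and then process the information normally, but is there a simpler way to do this, through fconfigure, encoding, or some other means?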
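To see why the workaround behaves well for plain ASCII but is not a general fix, here is a small sketch. The byte values are made up for illustration; they are not taken from the actual files.

```tcl
# "Hi!" as UTF-16LE bytes: every ASCII character is followed by a 00 byte.
set raw [binary format H* 480069002100]

# The workaround from the question: drop every NUL byte.
set stripped [string map [list \0 {}] $raw]
puts $stripped   ;# prints: Hi!

# U+0100 is stored in UTF-16LE as the bytes 00 01. Here the 00 byte is
# real data, so removing it leaves a single 01 byte and loses the character.
set other [binary format H* 0001]
binary scan [string map [list \0 {}] $other] H* hex
puts $hex        ;# prints: 01
```

So stripping NULs only gives the right answer when every character in the file fits in a single byte.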

I'm not familiar with encodings, so I don't know what arguments I should use.

fconfigure $input -encoding double

fails of course, because double is not a valid encoding. Same with "doublebyte".

I'm actually working with big text files (over 2 GB) and applying my workaround on a line-by-line basis, so I believe it slows the process down.

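EDIT: As pointed out by @mhawke, the file is UTF-16-LE encoded, and this is apparently not a supported encoding. Is there an elegant way to get around this shortcoming, maybe through a proc? Or would that make things more complex than using string map?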

2014-11-24 Jerry

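I decided to write a small proc to convert the file. I am using a while loop because reading a 3 GB file into a single variable completely locked up the process... The comments make it look long, but it isn't that long.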

proc itrans {infile outfile} {
    set f [open $infile r]

    # Note: files I have been getting have CRLF, so I split on CR to keep the LF and
    # used -nonewline in puts
    fconfigure $f -translation cr -eof ""

    # Simple switch just to remove the BOM, since the result will be UTF-8
    set bom 0
    set o [open $outfile w]
    while {[gets $f l] != -1} {
        # Convert to hex digits where the specific bytes can be easily identified
        binary scan $l H* l

        # Ignore empty lines (string comparison, so "0000" is not mistaken for "00")
        if {$l eq "" || $l eq "00"} {continue}

        # If it is the first line, there's the BOM
        if {!$bom} {
            set bom 1

            # Fallback when no BOM is found, so $re is always defined
            # (assumption: little-endian, keep the first byte of each pair)
            set re "(..).."

            # Identify and remove the BOM and set what byte should be removed and kept
            if {[regexp -nocase -- {^(?:FFFE|FEFF)} $l m]} {
                regsub -- "^$m" $l "" l

                if {[string toupper $m] eq "FFFE"} {
                    set re "(..).."
                } elseif {[string toupper $m] eq "FEFF"} {
                    set re "..(..)"
                }
            }
            regsub -all -- $re $l {\1} new
        } else {
            # Regardless of utf-16-le or utf-16-be, that should work since we split on CR
            regsub -all -- {..(..)|00$} $l {\1} new
        }
        puts -nonewline $o [binary format H* $new]
    }
    close $o
    close $f
}

itrans infile.txt outfile.txt

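One last warning: this will silently mangle characters that actually use all 16 bits (for example, the code unit sequence 04 30 will lose the 04 and become 30 instead of becoming D0 B0 as it should according to Table 3-4, whereas 00 4D will correctly map to 4D), so make sure you don't mind that, or that your file contains no such characters, before trying the above.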
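For files that do contain such characters, one possible alternative is to let Tcl's channel encodings do the conversion instead of dropping bytes. The following is only a sketch under assumptions: the proc name `utf16leToUtf8` is made up, it assumes the interpreter offers either a `utf-16le` encoding (newer releases) or a `unicode` encoding that is little-endian on this machine, and it has not been measured on multi-gigabyte files.

```tcl
# Hypothetical helper: stream a UTF-16LE file into a UTF-8 file.
proc utf16leToUtf8 {infile outfile} {
    # Assumption: "utf-16le" exists on newer Tcl; otherwise fall back to
    # "unicode", which uses the platform's native byte order.
    set enc [expr {"utf-16le" in [encoding names] ? "utf-16le" : "unicode"}]

    set in  [open $infile r]
    set out [open $outfile w]
    # -translation lf leaves CR and LF untouched; -eofchar {} stops a
    # 0x1A byte from being treated as end of file.
    fconfigure $in  -encoding $enc  -translation lf -eofchar {}
    fconfigure $out -encoding utf-8 -translation lf

    # Drop a leading byte order mark (U+FEFF) if there is one.
    set first [read $in 1]
    if {$first ne "\uFEFF"} {
        puts -nonewline $out $first
    }

    # fcopy converts between the two channel encodings while streaming,
    # so the whole file is never held in one variable.
    fcopy $in $out
    close $out
    close $in
}
```

With this approach the bytes 30 04 (U+0430 in little-endian order) come out as D0 B0, and line endings are copied through unchanged.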

2014-12-23 12:29:50 Jerry

The input file is probably UTF-16 encoded, as is common on Windows.

Try:

% fconfigure $input -encoding unicode

You can get the list of available encodings with:

% encoding names 
cp860 cp861 cp862 cp863 tis-620 cp864 cp865 cp866 gb12345 gb2312-raw cp949 cp950 cp869 dingbats ksc5601 macCentEuro cp874 macUkraine gb2312 jis0201 euc-cn euc-jp iso8859-10 macThai iso2022-jp jis0208 macIceland iso2022 iso8859-13 jis0212 iso8859-14 iso8859-15 cp737 iso8859-16 big5 euc-kr macRomania gb1988 iso2022-kr macTurkish macGreek ascii cp437 macRoman iso8859-1 iso8859-2 iso8859-3 koi8-r iso8859-4 macCroatian ebcdic cp1250 iso8859-5 iso8859-6 macCyrillic cp1251 iso8859-7 cp1252 koi8-u macDingbats iso8859-8 cp1253 cp1254 iso8859-9 cp1255 cp850 cp932 cp1256 cp852 cp1257 identity cp1258 macJapan utf-8 shiftjis cp936 cp855 symbol cp775 unicode cp857
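Since the list differs between Tcl versions, the encoding name can be chosen at run time rather than hard-coded. This is a rough sketch: `pickUtf16le` and `eachLine` are made-up names, and it assumes that `unicode` means 16-bit code units in the platform's native byte order on interpreters that lack `utf-16le`.

```tcl
# Return an encoding name suitable for little-endian UTF-16 input.
proc pickUtf16le {} {
    set names [encoding names]
    if {"utf-16le" in $names} {
        return utf-16le
    }
    if {"unicode" in $names && $::tcl_platform(byteOrder) eq "littleEndian"} {
        return unicode
    }
    error "no little-endian 16-bit encoding found"
}

# Call the command prefix $cmd once per line, without loading the whole file.
proc eachLine {path cmd} {
    set f [open $path r]
    fconfigure $f -encoding [pickUtf16le] -eofchar {}
    set first 1
    while {[gets $f line] >= 0} {
        if {$first} {
            set first 0
            # Drop a leading byte order mark (U+FEFF), if present.
            if {[string index $line 0] eq "\uFEFF"} {
                set line [string range $line 1 end]
            }
        }
        {*}$cmd $line
    }
    close $f
}
```

For example, `eachLine input.txt puts` would print each decoded line, where `input.txt` stands for whatever file is being processed.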

2014-11-24 09:17:08 mhawke

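Doing that also gives me a lot of '?' characters that are not in the original file. For example, I have a datetime value '26-MAR-2014 22:03:47' in the file, and it becomes '26-MAR-2?3:47'. Maybe that helps to identify the encoding? – Jerry 2014-11-24 09:21:57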

I also opened the file in a hex editor, and the first two bytes are 'FF FE', if that helps. – Jerry 2014-11-24 09:26:05

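0xFF 0xFE is a [Byte Order Mark](http://en.wikipedia.org/wiki/Byte_order_mark#UTF-16) indicating that the file is encoded as UTF-16 with little-endian byte order. So the file should be treated as UTF-16-LE. But it seems that in Tcl "unicode" is not precisely specified (it depends on the native platform), and there are no utf-16-le or utf-16-be encoding options. – mhawke 2014-11-24 10:50:27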
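The byte order mark check described in this comment can be sketched in a few lines. The proc name `bomByteOrder` is made up, and the strings it returns are descriptive labels, not necessarily names accepted by `fconfigure -encoding`. It also ignores the fact that a UTF-32 little-endian BOM starts with the same two bytes.

```tcl
# Look at the first two bytes of a file and report the UTF-16 byte order.
proc bomByteOrder {path} {
    set f [open $path rb]
    set head [read $f 2]
    close $f

    set hex ""
    binary scan $head H* hex   ;# lowercase hex digits, e.g. "fffe"
    switch -- $hex {
        fffe    {return utf-16le}
        feff    {return utf-16be}
        default {return unknown}
    }
}
```

A file whose first two bytes are FF FE, as reported above, would yield `utf-16le`.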
