Download XML, remove BOM and encode as UTF-8

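I am downloading XML files from an FTP server and have to prepare them for my SAX parser. To do that I need to remove the BOM bytes and encode the content as UTF-8. But somehow it does not work for every file.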

Here is the code of my functions:

public static void copy(File src, File dest){ 

    try { 
     byte[] data = Files.readAllBytes(src.toPath()); 

     writeAsUTF8(dest, skipBom(data)); 

    } catch (IOException e) { 
     e.printStackTrace(); 
    } 
} 


private static void writeAsUTF8(File out, byte[] data){ 

    try { 

     FileOutputStream outStream = new FileOutputStream(out); 
     OutputStreamWriter outUTF = new OutputStreamWriter(outStream,"UTF8"); 

     outUTF.write(new String(data, "UTF8")); 
     //outUTF.write(new String(data)); 
     outUTF.flush(); 
     outStream.close(); 
     outUTF.close(); 
    } 
    catch(Exception ex){ 
     ex.printStackTrace(); 
    } 
} 

    private static byte[] skipBom(byte[] data){ 

    int skipBytes = getBomSize(data); 

    byte[] tmp = new byte[data.length - skipBytes]; 

    for(int x = 0; x < tmp.length; x++){ 
     tmp[x] = data[x + skipBytes]; 
    } 

    return tmp; 
}

Any ideas what I am doing wrong?

2014-01-27 Adam Sam

Have you tried any of the ideas from [this question](http://stackoverflow.com/questions/1835430/byte-order-mark-screws-up-file-reading-in-java)? – andyb

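Why do you want to remove the BOM bytes? You only need to read the file into a String using the file's encoding, and then write that String to a file using UTF-8 encoding.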
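A minimal sketch of that approach, assuming the source encoding is known in advance. The class name Utf8Transcoder and the method transcode are made up for illustration. Note that the decoded String may still begin with the BOM character U+FEFF, so the sketch drops it before writing:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the question's code.
final class Utf8Transcoder {

    /** Decodes src with the given charset and writes it to dest as UTF-8 without a BOM. */
    static void transcode(Path src, Charset srcCharset, Path dest) throws IOException {
        String text = new String(Files.readAllBytes(src), srcCharset);
        // A BOM decodes to the single character U+FEFF; drop it if present.
        if (text.startsWith("\uFEFF")) {
            text = text.substring(1);
        }
        Files.write(dest, text.getBytes(StandardCharsets.UTF_8));
    }
}
```

This only helps when the caller really knows the source charset; decoding with the wrong one silently produces wrong characters.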

2014-01-27 14:29:18 fatih

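If I don't, I get an exception when I read the file with the SAX parser afterwards (the symbol in line 1 is invalid, or something like that). –

What do you feed the SAX parser with? If you provide an InputSource that wraps a Reader (one that knows the bytes have to be read as UTF-8), then everything should be fine. Or did I misunderstand something? – fatih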

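@fatih: No, that does not always work. If the input stream starts with a BOM, SAX complains about an illegal byte and throws an exception. You need to get rid of those leading bytes before handing the data to SAX. – alexraasch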

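I can't figure out what is wrong with your code. I had the same problem some time ago and used the following code to solve it. First, the function below reads a file while skipping its first character (a UTF-8 BOM decodes to the single character U+FEFF). Of course, this only makes sense if you are sure that all of your files have a BOM.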

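import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public byte[] load (File inputFile, int lines) throws Exception { 

    try (BufferedReader reader 
     = new BufferedReader(
      new InputStreamReader(
       new FileInputStream(inputFile), StandardCharsets.UTF_8))) 
    { 
     // Discard the byte order mark (one char, U+FEFF, once decoded as UTF-8) 
     reader.read(); 

     String line = null; 
     int lineCount = 0; 

     StringBuilder builder = new StringBuilder(); 
     while(lineCount <= lines && (line = reader.readLine()) != null) { 
      lineCount += 1; 
      builder.append(line).append('\n'); 
     } 

     // Return inside the try block, where builder is still in scope 
     return builder.toString().getBytes(StandardCharsets.UTF_8); 
    } 
}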

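You can rewrite the function above so that it writes the data back to another file as UTF-8. I occasionally use the following method to convert files on disk from ISO-8859-1 to UTF-8: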

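import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

public static void convertToUTF8 (Path p) throws Exception { 
    Path docPath = p; 
    Path docPathUTF8 = docPath; // same path: the file is converted in place 

    // Holds the whole file in memory; put() throws BufferOverflowException 
    // for files with more than 100 million characters 
    CharBuffer cb = CharBuffer.allocate(100 * 1000 * 1000); 

    // Read everything first, decoding the bytes as ISO-8859-1 
    try (InputStreamReader in = new InputStreamReader(
      new FileInputStream(docPath.toFile()), StandardCharsets.ISO_8859_1)) { 
     int c; 
     while ((c = in.read()) != -1) { 
      cb.put((char) c); 
     } 
    } 

    // The input is closed now, so the same file can be overwritten as UTF-8 
    try (OutputStreamWriter out = new OutputStreamWriter(
      new FileOutputStream(docPathUTF8.toFile()), StandardCharsets.UTF_8)) { 
     out.write(cb.array(), 0, cb.position()); 
    } 
}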
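As a rough sketch of the rewrite suggested above, the following copies a file to another file as UTF-8 and skips the first character only when it really is a BOM, so files without one are not damaged. It assumes the input is already UTF-8; the names BomSkippingCopy and copyWithoutBom are hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; assumes src is UTF-8 encoded.
final class BomSkippingCopy {

    static void copyWithoutBom(Path src, Path dest) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(src, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(dest, StandardCharsets.UTF_8)) {
            // Peek at the first character and rewind unless it is the BOM (U+FEFF).
            reader.mark(1);
            if (reader.read() != 0xFEFF) {
                reader.reset();
            }
            // Copy the rest in chunks instead of holding the whole file in memory.
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }
}
```

Files.newBufferedReader reports malformed input with an exception rather than substituting replacement characters, which makes a wrongly assumed encoding easier to notice.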

2014-01-27 14:41:39 alexraasch

Simplify. In copy, pass the data on unchanged:

writeAsUTF8(dest, data); 

and let writeAsUTF8 skip the BOM itself:

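try { 
    // The BOM is the character U+FEFF; in UTF-8 it is the three bytes EF BB BF 
    int BOM_LENGTH = "\uFEFF".getBytes(StandardCharsets.UTF_8).length; 
    if (data.length < BOM_LENGTH 
      || !new String(data, 0, BOM_LENGTH, StandardCharsets.UTF_8).equals("\uFEFF")) { 
     BOM_LENGTH = 0; // no BOM present, write everything 
    } 
    FileOutputStream outStream = new FileOutputStream(out); 
    outStream.write(data, BOM_LENGTH, data.length - BOM_LENGTH); 
    outStream.close(); 
} 
catch(Exception ex){ 
    ex.printStackTrace(); 
}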

This checks whether a BOM (U+FEFF) is present. Simply reading everything into a String would be easier:

String xml = new String(data, StandardCharsets.UTF_8); 
xml = xml.replaceFirst("^\uFEFF", "");
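Once the content is a String without the leading U+FEFF, it can be handed to SAX as a character stream. The sketch below assumes UTF-8 input and merely counts elements; SaxFromString and countElements are illustrative names, not anything from the question.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; assumes data is UTF-8 encoded XML.
final class SaxFromString {

    /** Parses the bytes with SAX and returns the number of elements seen. */
    static int countElements(byte[] data) throws Exception {
        String xml = new String(data, StandardCharsets.UTF_8);
        // Drop the BOM character so the document starts with the XML declaration.
        if (xml.startsWith("\uFEFF")) {
            xml = xml.substring(1);
        }
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                count[0]++;
            }
        };
        // With a character stream the parser does no byte decoding of its own.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return count[0];
    }
}
```

Because the parser receives characters rather than bytes here, the decoding done in new String(...) is what decides the result, so it has to use the file's real encoding.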

Using a Charset instead of a String encoding name means one exception fewer to catch: UnsupportedEncodingException (an IOException).

Detecting the XML encoding:

// ISO-8859-1 maps every byte to a char, so this first decoding never fails 
String xml = new String(data, StandardCharsets.ISO_8859_1); 
String encoding = xml.replaceFirst(
     "(?s)^.*<\\?xml.*encoding=([\"'])([\\w-]+)\\1.*\\?>.*$", 
     "$2"); 

if (encoding.equals(xml)) { 
    // No match, nothing was replaced: fall back to UTF-8 
    encoding = "UTF-8"; 
} 
xml = new String(data, encoding); 
xml = xml.replaceFirst("^\uFEFF", "");
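A slightly more defensive variant of the same idea, offered only as a sketch: check for a BOM first, then fall back to the declared encoding, then to UTF-8. The names XmlDecoder, detect and decode are invented here, only the UTF-8 and UTF-16 BOMs are recognised, and a UTF-16 file without a BOM would not be detected.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating BOM-then-declaration detection.
final class XmlDecoder {

    // Matches the encoding pseudo-attribute of an XML declaration.
    private static final Pattern DECLARED = Pattern.compile(
            "<\\?xml[^>]*?encoding\\s*=\\s*([\"'])([\\w.-]+)\\1");

    static Charset detect(byte[] d) {
        if (d.length >= 3 && (d[0] & 0xFF) == 0xEF && (d[1] & 0xFF) == 0xBB && (d[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (d.length >= 2 && (d[0] & 0xFF) == 0xFE && (d[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (d.length >= 2 && (d[0] & 0xFF) == 0xFF && (d[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        // No BOM: inspect the declaration in the first bytes, read as ISO-8859-1.
        String head = new String(d, 0, Math.min(d.length, 200), StandardCharsets.ISO_8859_1);
        Matcher m = DECLARED.matcher(head);
        // Charset.forName throws an unchecked exception for unknown names.
        return m.find() ? Charset.forName(m.group(2)) : StandardCharsets.UTF_8;
    }

    /** Decodes the bytes with the detected charset and drops a leading BOM character. */
    static String decode(byte[] data) {
        String xml = new String(data, detect(data));
        return xml.startsWith("\uFEFF") ? xml.substring(1) : xml;
    }
}
```

One caveat: the decoded text still carries its original encoding="..." declaration. That is harmless when the String is parsed through a Reader, but if the text is written back to disk as UTF-8 the declaration should be updated as well, otherwise a parser reading the bytes may trust the stale value.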

2014-01-27 14:44:15

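The BOM is not the problem; removing it always works. The main problem is the encoding: I read the bytes with .readAllBytes() and then try to save them as UTF-8. The source file can have any encoding, but in the end it has to be UTF-8. –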

Added a version that uses the encoding declared in the XML. –

This "(?s)^.*<\\?xml.*encoding=(\"'])(\w+)\\1.*\\?>.*$", "$2"); has a stray '"' in the encoding part, is missing a backslash, and forgot the '-': it doesn't work. –
