Changing encoding in Java

I am writing a function that should detect the charset in use and then convert the text to UTF-8. I am using juniversalchardet, which is a Java port of Mozilla's universalchardet.
This is my code:

private List<List<String>> setProperEncoding(List<List<String>> input) { 
    try { 

     // Detect used charset 
     UniversalDetector detector = new UniversalDetector(null); 

     int position = 0; 
     while ((position < input.size()) & (!detector.isDone())) { 
      String row = null; 
      for (String cell : input.get(position)) { 
       row += cell; 
      } 
      byte[] bytes = row.getBytes(); 
      detector.handleData(bytes, 0, bytes.length); 
      position++; 
     } 
     detector.dataEnd(); 

     Charset charset = Charset.forName(detector.getDetectedCharset()); 
     Charset utf8 = Charset.forName("UTF-8"); 
     System.out.println("Detected charset: " + charset); 

     // rewrite input using proper charset 
     List<List<String>> newLines = new ArrayList<List<String>>(); 
     for (List<String> row : input) { 
      List<String> newRow = new ArrayList<String>(); 
      for (String cell : row) { 
       //newRow.add(new String(cell.getBytes(charset))); 
       ByteBuffer bb = ByteBuffer.wrap(cell.getBytes(charset)); 
       CharBuffer cb = charset.decode(bb); 
       bb = utf8.encode(cb); 
       newRow.add(new String(bb.array())); 
      } 
      newLines.add(newRow); 
     } 

     return newLines; 

    } catch (Exception e) { 
     e.printStackTrace(); 
     return input; 
    } 
}

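My problem is that when I read a file containing, for example, Polish letters such as ł, ą, ć and similar, those letters are replaced by ? and other strange things. What am I doing wrong?

EDIT: I compile with Eclipse.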

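The method parameter is the result of reading a MultipartFile. I simply use a FileInputStream to get every line and then split each line by some separator (it is prepared for xls, xlsx and csv files). Nothing special.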

2013-07-16 Pierwola

How do you compile your code? Eclipse? Command prompt? Ant? Maven? – VirtualTroll

Once you have the characters in a `String`, they are already characters, not bytes. – gaborsch

What is the source of `input`? Please show your code for that. – gaborsch

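First of all, your data exists somewhere in binary form. For simplicity, I assume it comes from an InputStream.

You want to write the output as UTF-8 text, which I assume can go to an OutputStream.

I would suggest creating an AutoDetectInputStream: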
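To illustrate why the charset has to be applied to the raw bytes, here is a minimal sketch. It assumes a hypothetical file stored in ISO-8859-2; the class name `DecodeOnceSketch` and the sample text are made up for the illustration. It shows that once bytes have been decoded with the wrong charset, re-encoding the resulting `String` does not bring the letters back.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeOnceSketch {
    // Decode raw bytes with the charset they were written in.
    static String decode(byte[] raw, Charset source) {
        return new String(raw, source);
    }

    public static void main(String[] args) {
        // Assumption: the file was saved as ISO-8859-2 (a common Polish encoding).
        Charset latin2 = Charset.forName("ISO-8859-2");
        // The letters ł, ą, ć as such a file would store them.
        byte[] raw = "\u0142\u0105\u0107".getBytes(latin2);

        // Correct: apply the source charset to the bytes.
        String good = decode(raw, latin2);
        // Wrong: these bytes are not valid UTF-8, so they become replacement characters.
        String bad = decode(raw, StandardCharsets.UTF_8);
        // Round-tripping the damaged String cannot restore the lost letters.
        String stillBad = new String(bad.getBytes(latin2), latin2);

        System.out.println(good.equals("\u0142\u0105\u0107")); // true
        System.out.println(bad.equals(good));                  // false
        System.out.println(stillBad.equals(good));             // false
    }
}
```

This is why the detection and decoding need to happen where the bytes are read, not on the `List<List<String>>` built afterwards.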

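import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import org.mozilla.universalchardet.UniversalDetector;

public class AutoDetectInputStream extends InputStream {
    private final InputStream is;
    private final byte[] sampleData = new byte[4096];
    private final int sampleLen;
    private int sampleIndex = 0;

    public AutoDetectInputStream(InputStream is) throws IOException {
        this.is = is;
        // pre-read the sample used for detection (read() returns -1 on an empty stream)
        sampleLen = Math.max(is.read(sampleData), 0);
    }

    public Charset getCharset() {
        // detect the charset from the sample
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(sampleData, 0, sampleLen);
        detector.dataEnd();
        // getDetectedCharset() returns a charset name, or null if nothing was detected
        String name = detector.getDetectedCharset();
        // assumption: fall back to the platform default when detection fails
        return name != null ? Charset.forName(name) : Charset.defaultCharset();
    }

    @Override
    public int read() throws IOException {
        // replay the sample first, then continue with the underlying stream
        if (sampleIndex < sampleLen) {
            // mask to 0..255 so bytes >= 0x80 are not returned as negative values
            return sampleData[sampleIndex++] & 0xFF;
        }
        return is.read();
    }

    @Override
    public void close() throws IOException {
        is.close();
    }
}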

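The second task is simple: once decoded, the text is held in Java as characters (a String is UTF-16 internally, not UTF-8), so you only need an OutputStreamWriter to encode it as UTF-8. So, here is your code: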

// open the input through the detecting stream
InputStream is = new FileInputStream("in.txt");
AutoDetectInputStream detector = new AutoDetectInputStream(is);
Charset charset = detector.getCharset();
// use the detected charset to decode the bytes into characters;
// BufferedReader lets us read line by line
BufferedReader rdr = new BufferedReader(new InputStreamReader(detector, charset));

// open the output to write to
OutputStream os = new FileOutputStream("out.txt");
Writer utf8Writer = new OutputStreamWriter(os, Charset.forName("UTF-8"));

// copy the whole file
String line;
while ((line = rdr.readLine()) != null) {
    utf8Writer.append(line);
    // readLine() strips the line terminator, so write one back
    utf8Writer.append('\n');
}

// close streams (close() also flushes the writer)
rdr.close();
utf8Writer.close();

So, in the end you get the whole txt file transcoded to UTF-8.
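If the original line terminators should survive unchanged, copying character buffers instead of lines is one option. The following is only a sketch under assumptions: the source charset is already known (here ISO-8859-2 for the demo), and `TranscodeSketch` and `transcode` are names invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeSketch {
    // Decodes 'in' with the given source charset and writes the characters to 'out' as UTF-8.
    // Both streams are closed when the copy finishes.
    static void transcode(InputStream in, Charset source, OutputStream out) throws IOException {
        try (Reader reader = new InputStreamReader(in, source);
             Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int n;
            // Copy raw characters, so "\r\n" and "\n" pass through untouched.
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption for the demo: the input bytes are ISO-8859-2.
        Charset latin2 = Charset.forName("ISO-8859-2");
        String text = "za\u017c\u00f3\u0142\u0107\r\nok\n";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(text.getBytes(latin2)), latin2, out);
        String roundTrip = new String(out.toByteArray(), StandardCharsets.UTF_8);
        System.out.println(roundTrip.equals(text)); // true
    }
}
```

The try-with-resources block closes and flushes both ends even if the copy fails partway through.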

Note that the sample buffer should be large enough to give the UniversalDetector sufficient data to work with.
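A related point: a single `InputStream.read(byte[])` call is allowed to return fewer bytes than the buffer holds, so the sample may end up smaller than intended. A possible way to fill it is sketched below; `SampleReaderSketch` and `readSample` are hypothetical names, and the "trickle" stream exists only to simulate short reads.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class SampleReaderSketch {
    // Reads up to maxBytes from the stream, looping because one read() may return a short count.
    static byte[] readSample(InputStream in, int maxBytes) throws IOException {
        byte[] sample = new byte[maxBytes];
        int total = 0;
        while (total < maxBytes) {
            int n = in.read(sample, total, maxBytes - total);
            if (n == -1) {
                break; // end of stream before the sample was full
            }
            total += n;
        }
        // Trim to the number of bytes actually read.
        return Arrays.copyOf(sample, total);
    }

    public static void main(String[] args) throws IOException {
        // A stream that hands out at most 3 bytes per call, to mimic short reads.
        InputStream trickle = new ByteArrayInputStream(new byte[10]) {
            @Override
            public synchronized int read(byte[] b, int off, int len) {
                return super.read(b, off, Math.min(len, 3));
            }
        };
        System.out.println(readSample(trickle, 8).length); // 8
    }
}
```

The returned array could then be passed to the detector and replayed to the reader, in the same way the `sampleData` buffer is used above.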

2013-07-16 16:41:18 gaborsch

Works perfectly! Thank you! You are the best! Even more, you are the very best! – Pierwola

@Pierwola :D :D Thanks, I am always glad to see that I could help someone, and that they appreciate it :) – gaborsch

It works, but my text gets converted to “ћонгол”лсын≈р？нхийл？гч“улгарт？ рийн2223жил“. Most letters are correct, but some are wrong. The language is Mongolian. Any reply is welcome :D – Enxtur
