2011-11-25 58 views
2

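I have a Java application running that fetches its data via XML, but occasionally some of the data contains some kind of control code. Control code 0x6 causes an XML error: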

An invalid XML character (Unicode: 0x6) was found in the CDATA section. 
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x6) was found in  the CDATA section. 
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source) 
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source) 
    at domain.Main.processLogFromUrl(Main.java:342) 
    at domain.Main.<init>(Main.java:67) 
    at domain.Main.main(Main.java:577) 

Can anyone explain what exactly this control code is? I can't find much information about it.

Thanks in advance.

+0

See Wikipedia: http://en.wikipedia.org/wiki/Acknowledge_character – rekire

+0

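Java is not at fault. Your XML source is broken, and you need to talk to whoever is responsible for creating it to get it fixed. Background on a similar question: http://stackoverflow.com/questions/2622552/parsing-unicode-character-0x2-using-xml1-1 – bobince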

+0

If you aren't expecting a Unicode character, and UTF-8 is what you normally get, then who is putting Unicode characters into the response? – djangofan

Answers

0

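Obviously, why you are getting this character depends on what the data is meant to be. (Apparently it's ACK, but that is an odd thing to find in a file...) The important point, however, is that it makes the XML invalid - you simply cannot represent that character in XML.

From the XML 1.0 spec, section 2.2:

Character Range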

/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */ 
Char  ::= #x9 | #xA | #xD | [#x20-#xD7FF] 
        | [#xE000-#xFFFD] | [#x10000-#x10FFFF] 

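Note how this excludes every Unicode value below U+0020 other than U+0009 (tab), U+000A (line feed) and U+000D (carriage return).

If you have any influence over the data being returned, you should change it to return valid XML. If not, you will have to do some preprocessing on it before parsing it as XML. What you want to do with the unwanted control characters depends on what they mean in your situation.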
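
As a rough sketch of that kind of preprocessing, assuming the payload is already available as a String and that simply dropping the offending characters is acceptable for your data, the Char production quoted above can be turned into a predicate and used as a filter. The class and method names are made up for illustration.

```java
final class XmlCharCleaner {

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the text with every non-Char code point dropped. */
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Walk by code point so surrogate pairs stay together.
            int cp = text.codePointAt(i);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // An ACK (0x6) inside a CDATA section, as in the question.
        String dirty = "<log><![CDATA[ok" + (char) 0x6 + "done]]></log>";
        System.out.println(stripInvalidXmlChars(dirty));
        // prints <log><![CDATA[okdone]]></log>
    }
}
```

If the control characters carry meaning, you would replace them with some marker of your own instead of dropping them.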

-2

Try declaring XML version 1.1:

<?xml version="1.1"?> 
+0

That won't help. Control characters are restricted in XML 1.1 as well. – jasso

1

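You need to write a FilterInputStream that filters the data before the SAX parser gets it. It must either remove or re-encode the bad data.

Apache has a super-flexible example. You may prefer to put together a simpler one.

Here is one of mine. It cleans up other things, but I'm sure it would be a good start.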

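import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.logging.Logger;

/* Cleans up often very bad xml.
 *
 * 1. Strips leading white space.
 * 2. Recodes &pound; etc to &#...;.
 * 3. Recodes lone & as &amp;.
 *
 * NB: XML.hash(...) and Separator are helpers from my own code base and are
 * not shown here, so supply your own equivalents.
 */
public class XMLInputStream extends FilterInputStream {

    // Assumed to be a java.util.logging logger; the original declaration was not posted.
    private static final Logger log = Logger.getLogger(XMLInputStream.class.getName());
    private static final int MIN_LENGTH = 2;
    // Everything we've read.
    StringBuilder red = new StringBuilder();
    // Data I have pushed back.
    StringBuilder pushBack = new StringBuilder();
    // How much we've given them.
    int given = 0;
    // How much we've read.
    int pulled = 0;

    public XMLInputStream(InputStream in) {
        super(in);
    }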

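    public int length() {
        // NB: This is a Troll length (i.e. it goes 1, 2, many) so 2 actually means "at least 2"
        try {
            StringBuilder s = read(MIN_LENGTH);
            // Put the peeked characters back at the FRONT so the order is preserved
            // even when pushBack still holds later characters.
            pushBack.insert(0, s);
            return s.length();
        } catch (IOException ex) {
            // Logger.warning(String) cannot take a Throwable; log(Level, String, Throwable) can.
            log.log(java.util.logging.Level.WARNING, "Oops ", ex);
        }
        return 0;
    }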

    private StringBuilder read(int n) throws IOException {
        // Input stream finished?
        boolean eof = false;
        // Read that many.
        StringBuilder s = new StringBuilder(n);
        while (s.length() < n && !eof) {
            // Always get from the pushBack buffer.
            if (pushBack.length() == 0) {
                // Read something from the stream into pushBack.
                eof = readIntoPushBack();
            }

            // Pushback only contains deliverable codes.
            if (pushBack.length() > 0) {
                // Grab one character
                s.append(pushBack.charAt(0));
                // Remove it from pushBack
                pushBack.deleteCharAt(0);
            }
        }
        return s;
    }

    // Returns true at eof.
    // Might not actually push back anything but usually will.
    private boolean readIntoPushBack() throws IOException {
        // File finished?
        boolean eof = false;
        // Next char.
        int ch = in.read();
        if (ch >= 0) {
            // Discard whitespace at start?
            if (!(pulled == 0 && isWhiteSpace(ch))) {
                // Good code.
                pulled += 1;
                // Parse out the &stuff;
                if (ch == '&') {
                    // Process the &
                    readAmpersand();
                } else {
                    // Not an '&', just append.
                    pushBack.append((char) ch);
                }
            }
        } else {
            // Hit end of file.
            eof = true;
        }
        return eof;
    }

    // Deal with an ampersand in the stream.
    private void readAmpersand() throws IOException {
        // Read the whole word, up to and including the ;
        StringBuilder reference = new StringBuilder();
        int ch;
        // Should end in a ';'
        for (ch = in.read(); isAlphaNumeric(ch); ch = in.read()) {
            reference.append((char) ch);
        }
        // Did we tidily finish?
        if (ch == ';') {
            // Yes! Translate it into a &#nnn; code.
            String code = XML.hash(reference);
            if (code != null) {
                // Keep it.
                pushBack.append(code);
            } else {
                throw new IOException("Invalid/Unknown reference '&" + reference + ";'");
            }
        } else {
            // Did not terminate properly!
            // Perhaps an & on its own or a malformed reference.
            // Either way, escape the &
            pushBack.append("&amp;").append(reference);
            // Keep the terminating character unless we hit end of stream (-1).
            if (ch >= 0) {
                pushBack.append((char) ch);
            }
        }
    }

    private void given(CharSequence s, int wanted, int got) {
        // Keep track of what we've given them.
        red.append(s);
        given += got;
        log.finer("Given: [" + wanted + "," + got + "]-" + s);
    }

    @Override
    public int read() throws IOException {
        StringBuilder s = read(1);
        // Count what was actually delivered (0 at end of stream).
        given(s, 1, s.length());
        return s.length() > 0 ? s.charAt(0) : -1;
    }

    @Override
    public int read(byte[] data, int offset, int length) throws IOException {
        // The InputStream contract says a zero-length request returns 0, not -1.
        if (length == 0) {
            return 0;
        }
        int n = 0;
        StringBuilder s = read(length);
        for (int i = 0; i < Math.min(length, s.length()); i++) {
            data[offset + i] = (byte) s.charAt(i);
            n += 1;
        }
        given(s, length, n);
        return n > 0 ? n : -1;
    }

    @Override
    public String toString() {
        String s = red.toString();
        String h = "";
        // Hex dump the small ones.
        if (s.length() < 8) {
            Separator sep = new Separator(" ");
            for (int i = 0; i < s.length(); i++) {
                h += sep.sep() + Integer.toHexString(s.charAt(i));
            }
        }
        return "[" + given + "]-\"" + s + "\"" + (h.length() > 0 ? " (" + h + ")" : "");
    }

    private boolean isWhiteSpace(int ch) {
        switch (ch) {
            case ' ':
            case '\r':
            case '\n':
            case '\t':
                return true;
        }
        return false;
    }

    private boolean isAlphaNumeric(int ch) {
        return ('a' <= ch && ch <= 'z')
                || ('A' <= ch && ch <= 'Z')
                || ('0' <= ch && ch <= '9');
    }
}
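
For the specific problem in the question, a much simpler filter may be enough. The sketch below is not the class above but a stripped-down alternative: it drops every byte below 0x20 other than tab, line feed and carriage return. It assumes the stream is UTF-8 or another ASCII-compatible encoding (in UTF-8 those byte values never occur inside a multi-byte sequence); it would corrupt UTF-16. The class name is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Hypothetical filter: removes C0 control bytes that XML 1.0 forbids. */
class ControlCharFilterInputStream extends FilterInputStream {

    ControlCharFilterInputStream(InputStream in) {
        super(in);
    }

    private static boolean allowed(int b) {
        return b >= 0x20 || b == 0x09 || b == 0x0A || b == 0x0D;
    }

    @Override
    public int read() throws IOException {
        int b;
        do {
            b = super.read();
        } while (b >= 0 && !allowed(b)); // skip bad bytes, stop at -1
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int kept;
        do {
            int n = super.read(buf, off, len);
            if (n < 0) {
                return -1;
            }
            // Compact the allowed bytes to the front of the region just read.
            kept = 0;
            for (int i = 0; i < n; i++) {
                byte b = buf[off + i];
                if (allowed(b & 0xFF)) {
                    buf[off + kept++] = b;
                }
            }
        } while (kept == 0); // everything was filtered out, so read again
        return kept;
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = ("<log><![CDATA[ok" + (char) 0x6 + "done]]></log>")
                .getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ControlCharFilterInputStream(new ByteArrayInputStream(xml))) {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
            System.out.println(doc.getDocumentElement().getTextContent()); // prints okdone
        }
    }
}
```

Wrapping the stream this way means the parser never sees the 0x6 byte, so the call that failed in the question's stack trace should go through unchanged.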