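Multibyte characters - pattern matching

I am reading a Shift-JIS encoded XML file and storing it in a ByteBuffer, then converting it to a String and trying to find the start and the end of the document with Pattern & Matcher. From those two positions I try to write the buffer to a file. It works when there are no multibyte characters. If there is a multibyte character, I lose some text at the end, because the end value is a little off.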

static final Pattern startPattern = Pattern.compile("<\\?xml "); 
static final Pattern endPattern = Pattern.compile("</doc>\n"); 

public static void main(String[] args) throws Exception {
    File f = new File("20121114000606JA.xml");
    FileInputStream fis = new FileInputStream(f);
    FileChannel fci = fis.getChannel();
    ByteBuffer data_buffer = ByteBuffer.allocate(65536);
    while (true) {
        int read = fci.read(data_buffer);
        if (read == -1)
            break;
    }

    ByteBuffer cbytes = data_buffer.duplicate();
    cbytes.flip();
    Charset data_charset = Charset.forName("UTF-8");
    String request = data_charset.decode(cbytes).toString();

    Matcher start = startPattern.matcher(request);
    if (start.find()) {
        Matcher end = endPattern.matcher(request);

        if (end.find()) {

            int i0 = start.start();
            int i1 = end.end();

            String str = request.substring(i0, i1);

            String filename = "test.xml";
            FileChannel fc = new FileOutputStream(new File(filename), false).getChannel();

            data_buffer.position(i0);
            data_buffer.limit(i1 - i0);

            long offset = fc.position();
            long sz = fc.write(data_buffer);

            fc.close();
        }
    }
    System.out.println("OK");
}

Source

2012-11-15 Vjy

If you are reading Shift-JIS encoded XML, why are you decoding the data as UTF-8? –

Using the string indices i0 and i1 as byte positions:

data_buffer.position(i0); 
data_buffer.limit(i1 - i0);

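is wrong. Because UTF-8 does not give a unique encoding (ĉ can be written as the two characters c + combining diacritical mark ^), converting back and forth between characters and bytes is not only expensive but also error-prone (in corner cases with particular data). Write the string through a Writer with an explicit encoding instead: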

BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(
     new File(filename)), "UTF-8"));

Or use a CharBuffer, which implements CharSequence.

Instead of writing to the FileChannel fc:

BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(
     new File(filename)), "UTF-8")); 
try { 
    out.write(str); 
} finally { 
    out.close(); 
}

The CharBuffer version would need more rewriting and would touch the pattern matching as well.

Source
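To see why a string index cannot be reused as a byte position, here is a minimal sketch. The class name, the helper byteOffset and the sample document are made up for illustration, and it assumes the JVM provides the Shift_JIS charset. It converts a char index into a byte offset by re-encoding the prefix of the string, which only matches the original file if decoding was lossless.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndexVersusOffset {
    // Assumption: the Shift_JIS charset is available in this JVM.
    static final Charset SJIS = Charset.forName("Shift_JIS");

    /** Number of bytes that the first charIndex chars of text occupy once encoded with cs. */
    static int byteOffset(String text, int charIndex, Charset cs) {
        return text.substring(0, charIndex).getBytes(cs).length;
    }

    public static void main(String[] args) {
        // Hypothetical sample: three two-byte characters inside <doc>.
        String xml = "<?xml version=\"1.0\"?><doc>\u65e5\u672c\u8a9e</doc>\n";
        byte[] raw = xml.getBytes(SJIS);

        // Decode with the same charset the bytes were written in.
        String request = SJIS.decode(ByteBuffer.wrap(raw)).toString();

        Matcher end = Pattern.compile("</doc>\n").matcher(request);
        if (!end.find()) {
            throw new IllegalStateException("end pattern not found");
        }
        int i1 = end.end(); // index in chars, not in bytes

        System.out.println("char index:  " + i1);
        System.out.println("byte offset: " + byteOffset(request, i1, SJIS));
        System.out.println("raw length:  " + raw.length);
    }
}
```

Each two-byte character before the match end pushes the byte offset one further ahead of the char index, which is why text goes missing at the end when the char index is used as the buffer limit.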

2012-11-15 15:14:28

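You're on the right track in that there is a discrepancy between string indices and byte positions, but your remark about 'ĉ' being written as the two characters 'c' + combining diacritic '^' points in completely the wrong direction. The problem is that some characters are represented as multiple bytes, not that some grapheme clusters are represented as multiple characters. – ruakh

I don't know how to go about using BufferedWriter or CharBuffer, please help – Vjy

@ruakh You are right, my comment was confusing and somewhat off. I meant to say that converting bytes to Unicode and then back to bytes does not necessarily give the same bytes, and vice versa. –

Your problem here seems to be with the decoding of your byte buffer. You are decoding a Shift-JIS ByteBuffer with the UTF-8 Charset. You need to change it to the Shift-JIS Charset. See the list of supported character encodings.

Although I don't have a Shift-JIS file to test with, you should try changing the Charset.forName line: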

Charset data_charset = Charset.forName("Shift_JIS");
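As a rough illustration of what the charset choice changes, the sketch below decodes the same Shift-JIS bytes twice. The class name, the decode helper and the sample text are hypothetical, and it assumes the Shift_JIS charset is installed. It only shows the decoding step and does not address the byte positions used later in the question's code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

class DecodeComparison {
    /** Decodes bytes with the named charset; malformed input becomes U+FFFD. */
    static String decode(byte[] bytes, String charsetName) {
        return Charset.forName(charsetName).decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample text, written with escapes to stay source-encoding neutral.
        String original = "<p>\u65e5\u672c\u8a9e</p>";
        byte[] sjisBytes = original.getBytes(Charset.forName("Shift_JIS"));

        String asSjis = decode(sjisBytes, "Shift_JIS");
        String asUtf8 = decode(sjisBytes, "UTF-8");

        System.out.println("Shift_JIS round trip intact: " + asSjis.equals(original));
        System.out.println("UTF-8 decode intact:         " + asUtf8.equals(original));
        System.out.println("chars (Shift_JIS / UTF-8):   " + asSjis.length() + " / " + asUtf8.length());
    }
}
```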

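Also, your regex logic is a little off. I would not use a second matcher, because that restarts the search from the beginning and could give you a reversed range. Instead, take the position of the current match and then switch the pattern your matcher uses: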

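Matcher matcher = startPattern.matcher(request);
if (matcher.find()) {
    int i0 = matcher.start();
    // Switch patterns on the same matcher; its position in the input is kept,
    // so the next find() continues after the start match.
    matcher.usePattern(endPattern);

    if (matcher.find()) {
        int i1 = matcher.end();
        // ... use request.substring(i0, i1) here ...
    }
}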

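Shift-JIS is a variable-width encoding (one or two bytes per character), but once the bytes are decoded with the right charset, each of those characters maps cleanly to a char in Java's UTF-16 strings. That should let you match the document with a single pattern (such as "START.*END", compiled with Pattern.DOTALL if the text spans several lines) and just use a group to get your data.

Source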
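A minimal sketch of the single-pattern idea, reusing the start and end markers from the question. The class name, the extract helper and the reluctant quantifier are illustrative choices rather than part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SinglePatternExtract {
    // DOTALL lets '.' match line breaks; '.*?' stops at the first "</doc>\n".
    static final Pattern DOC =
            Pattern.compile("(<\\?xml .*?</doc>\n)", Pattern.DOTALL);

    /** Returns the first complete document in request, or null if none is found. */
    static String extract(String request) {
        Matcher m = DOC.matcher(request);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input with leading and trailing noise around the document.
        String request = "junk<?xml version=\"1.0\"?>\n<doc>\u65e5\u672c\u8a9e</doc>\ntrailer";
        String doc = extract(request);
        System.out.println(doc == null ? "no match" : "matched " + doc.length() + " chars");
    }
}
```

The result is a String, so it should be written out through a Writer with an explicit encoding rather than by slicing the original byte buffer.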

2012-11-15 17:54:45

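Changing the charset to Shift_JIS did not work, but good catch on the matcher. Thanks – Vjy

That's odd. I wrote a test case with a regex and a string containing 2-byte UTF-8 characters, and it worked fine. Can you post an input file and the start/end patterns that are known not to work? –

File uploaded @ http://dict.pricetweet.us/20121114000606JA.xml. I updated the question with the patterns. Thanks a lot for your help – Vjy

To transcode this file properly, you should use Java's XML APIs. There are several ways to do this; below is a solution that uses the javax.xml.transform package. First, we really do need the djnml-1.0b.dtd file referenced in the document (in case it contains entity references). Since that file is missing, this solution uses a DTD generated from the supplied input with Trang: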

<?xml encoding="UTF-8"?> 

<!ELEMENT doc (djnml)> 
<!ATTLIST doc 
    xmlns CDATA #FIXED '' 
    destination NMTOKEN #REQUIRED 
    distId NMTOKEN #REQUIRED 
    md5 CDATA #REQUIRED 
    msize CDATA #REQUIRED 
    sysId NMTOKEN #REQUIRED 
    transmission-date NMTOKEN #REQUIRED> 

<!ELEMENT djnml (head,body)> 
<!ATTLIST djnml 
    xmlns CDATA #FIXED '' 
    docdate CDATA #REQUIRED 
    product NMTOKEN #REQUIRED 
    publisher NMTOKEN #REQUIRED 
    seq CDATA #REQUIRED 
    xml:lang NMTOKEN #REQUIRED> 

<!ELEMENT head (copyright,docdata)> 
<!ATTLIST head 
    xmlns CDATA #FIXED ''> 

<!ELEMENT body (headline,text)> 
<!ATTLIST body 
    xmlns CDATA #FIXED ''> 

<!ELEMENT copyright EMPTY> 
<!ATTLIST copyright 
    xmlns CDATA #FIXED '' 
    holder CDATA #REQUIRED 
    year CDATA #REQUIRED> 

<!ELEMENT docdata (djn)> 
<!ATTLIST docdata 
    xmlns CDATA #FIXED ''> 

<!ELEMENT headline (#PCDATA)> 
<!ATTLIST headline 
    xmlns CDATA #FIXED '' 
    brand-display NMTOKEN #REQUIRED 
    prefix CDATA #REQUIRED> 

<!ELEMENT text (pre,p+)> 
<!ATTLIST text 
    xmlns CDATA #FIXED ''> 

<!ELEMENT djn (djn-newswires)> 
<!ATTLIST djn 
    xmlns CDATA #FIXED ''> 

<!ELEMENT pre EMPTY> 
<!ATTLIST pre 
    xmlns CDATA #FIXED ''> 

<!ELEMENT p (#PCDATA)> 
<!ATTLIST p 
    xmlns CDATA #FIXED ''> 

<!ELEMENT djn-newswires (djn-press-cutout,djn-urgency,djn-mdata)> 
<!ATTLIST djn-newswires 
    xmlns CDATA #FIXED '' 
    news-source NMTOKEN #REQUIRED 
    origin NMTOKEN #REQUIRED 
    service-id NMTOKEN #REQUIRED> 

<!ELEMENT djn-press-cutout EMPTY> 
<!ATTLIST djn-press-cutout 
    xmlns CDATA #FIXED ''> 

<!ELEMENT djn-urgency (#PCDATA)> 
<!ATTLIST djn-urgency 
    xmlns CDATA #FIXED ''> 

<!ELEMENT djn-mdata (djn-coding)> 
<!ATTLIST djn-mdata 
    xmlns CDATA #FIXED '' 
    accession-number CDATA #REQUIRED 
    brand NMTOKEN #REQUIRED 
    display-date NMTOKEN #REQUIRED 
    hot NMTOKEN #REQUIRED 
    original-source NMTOKEN #REQUIRED 
    page-citation CDATA #REQUIRED 
    retention NMTOKEN #REQUIRED 
    temp-perm NMTOKEN #REQUIRED> 

<!ELEMENT djn-coding (djn-company,djn-isin,djn-industry,djn-subject, 
         djn-market,djn-product,djn-geo)> 
<!ATTLIST djn-coding 
    xmlns CDATA #FIXED ''> 

<!ELEMENT djn-company (c)> 
<!ATTLIST djn-company 
    xmlns CDATA #FIXED ''> 

<!ELEMENT djn-isin (c)> 
<!ATTLIST djn-isin 
    xmlns CDATA #FIXED ''> 

<!ELEMENT djn-industry (c)+> 
<!ATTLIST djn-industry 
    xmlns CDATA #FIXED ''> 

<!ELEMENT djn-subject (c)+> 
<!ATTLIST djn-subject 
    xmlns CDATA #FIXED ''> 

<!ELEMENT djn-market (c)+> 
<!ATTLIST djn-market 
    xmlns CDATA #FIXED ''> 

<!ELEMENT djn-product (c)+> 
<!ATTLIST djn-product 
    xmlns CDATA #FIXED ''> 

<!ELEMENT djn-geo (c)+> 
<!ATTLIST djn-geo 
    xmlns CDATA #FIXED ''> 

<!ELEMENT c (#PCDATA)> 
<!ATTLIST c 
    xmlns CDATA #FIXED ''>

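After writing this file to "djnml-1.0b.dtd", we need to create an identity transform using XSLT. You could do this with the newTransformer() method on TransformerFactory, but the result of that transform is not well specified. Using XSLT produces cleaner results. We will use this file as our identity transform: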

<?xml version="1.0" encoding="UTF-8"?> 
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> 

    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes" omit-xml-declaration="no"/> 

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

Save the above XSLT file as "identity.xsl". Now that we have our DTD and our identity transform, we can transcode the file with this code:

import java.io.Closeable; 
import java.io.File; 
import java.io.FileInputStream; 
import java.io.FileNotFoundException; 
import java.io.FileOutputStream; 
import java.io.IOException; 
import java.io.InputStream; 
import java.io.OutputStream; 
import java.util.ArrayList; 
import java.util.List; 

import javax.xml.transform.Templates; 
import javax.xml.transform.Transformer; 
import javax.xml.transform.TransformerException; 
import javax.xml.transform.TransformerFactory; 
import javax.xml.transform.sax.SAXSource; 
import javax.xml.transform.stream.StreamResult; 
import javax.xml.transform.stream.StreamSource; 

import org.xml.sax.EntityResolver; 
import org.xml.sax.InputSource; 
import org.xml.sax.SAXException; 
import org.xml.sax.XMLReader; 
import org.xml.sax.helpers.XMLReaderFactory; 

... 

File inFile = new File("20121114000606JA.xml"); 
File outputFile = new File("test.xml"); 
final File dtdFile = new File("djnml-1.0b.dtd"); 
File identityFile = new File("identity.xsl"); 

final List<Closeable> closeables = new ArrayList<Closeable>(); 
try { 
    // We are going to use a SAXSource for input, so that we can specify the 
    // location of the DTD with an EntityResolver. 
    InputStream in = new FileInputStream(inFile); 
    closeables.add(in); 
    InputSource fileSource = new InputSource(); 
    fileSource.setByteStream(in); 
    fileSource.setSystemId(inFile.toURI().toString()); 

    SAXSource source = new SAXSource(); 
    XMLReader reader = XMLReaderFactory.createXMLReader(); 
    reader.setEntityResolver(new EntityResolver() {
        public InputSource resolveEntity(String publicId, String systemId)
                throws SAXException, IOException {
            if (systemId != null && systemId.endsWith("/djnml-1.0b.dtd")) {
                InputStream dtdIn = new FileInputStream(dtdFile);
                closeables.add(dtdIn);

                InputSource inputSource = new InputSource();
                inputSource.setByteStream(dtdIn);
                inputSource.setEncoding("UTF-8");

                return inputSource;
            }
            return null;
        }
    });

    source.setXMLReader(reader); 
    source.setInputSource(fileSource); 

    // Now we need to create a StreamResult. 
    OutputStream out = new FileOutputStream(outputFile); 
    closeables.add(out); 
    StreamResult result = new StreamResult(); 
    result.setOutputStream(out); 
    result.setSystemId(outputFile); 

    // Create a templates object for the identity transform. If you are going 
    // to transform a lot of documents, you should do this once and 
    // reuse the Templates object. 
    InputStream identityIn = new FileInputStream(identityFile); 
    closeables.add(identityIn); 
    StreamSource identitySource = new StreamSource(); 
    identitySource.setSystemId(identityFile); 
    identitySource.setInputStream(identityIn); 
    TransformerFactory factory = TransformerFactory.newInstance(); 
    Templates templates = factory.newTemplates(identitySource); 

    // Finally we need to create the transformer and do the transformation. 
    Transformer transformer = templates.newTransformer(); 
    transformer.transform(source, result); 

} finally { 
    // Some older XML processors are bad at cleaning up input and output streams, 
    // so we will do this manually. 
    for (Closeable closeable : closeables) {
        if (closeable != null) {
            try {
                closeable.close();
            } catch (Exception e) {
                // Ignore failures while closing.
            }
        }
    }
}
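For comparison, the stylesheet-free route mentioned earlier (TransformerFactory.newTransformer()) might be sketched as follows. This is an illustration under assumptions rather than the answer's method: the class name, the toUtf8 helper and the in-memory sample document are made up, the sample has no DOCTYPE so no DTD has to be resolved, and the exact serialized form depends on the XML processor in use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class DefaultIdentityTranscode {
    /** Parses xmlBytes (encoding taken from its XML declaration) and re-serializes as UTF-8. */
    static byte[] toUtf8(byte[] xmlBytes) throws TransformerException {
        // newTransformer() with no arguments yields an identity transformer.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(
                new StreamSource(new ByteArrayInputStream(xmlBytes)),
                new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical Shift_JIS input; the parser reads the encoding from the declaration.
        String xml = "<?xml version=\"1.0\" encoding=\"Shift_JIS\"?>"
                + "<doc>\u65e5\u672c\u8a9e</doc>";
        byte[] sjisBytes = xml.getBytes(Charset.forName("Shift_JIS"));

        byte[] utf8Bytes = toUtf8(sjisBytes);
        System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8));
    }
}
```

Because the parser handles the bytes and the serializer handles the output encoding, no character index is ever used as a byte offset in this approach.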

Source

2012-11-16 10:07:04
