2014-02-10

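Split XML into smaller XML files of a specified size

I am very new to XML, and the bad news is that I have an XML file with the following structure: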

<record> 
    <record_id>200</record_id> 
    <record_rows> 
     <record_row>some text</record_row> 
     ................................. 
    </record_rows> 
</record> 

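The number of record rows differs from record to record, so every record has a different size. My task is to split the file (larger than 1 GB) into separate XML files of a specified size. Which parser is best for this? I also think I need some record-selection strategy to get close to the target size, but I cannot think of one, given the size of the input file and the unpredictable size of the next record.

You are my only hope, my friends. How would you approach this problem?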


Does the size have to be exact? (And do such files need to be _valid_ XML?)


The files should be as close as possible to the specified size, but it does not have to be exact. The files should be valid XML. – StackExploded


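"Which parser" is a matter of opinion, and so is "how would you actually do it"... but my own suggestion would be to modify the standard SAX read-and-write-back example so that, each time it exits a record element, it checks the length of the output document and, if that is getting too close to the limit, terminates that file and starts a new one. – keshlam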

Answer


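Assuming no single record row exceeds the desired size of one file, you can use a SAX parser to read the file sequentially, counting the characters read and storing the data read so far in a buffer. When the character count reaches a value close to your size limit, the handler creates a new file containing only the rows read so far, resets the buffer and the character count, and continues reading the next batch until the limit is reached again, and so on. In the end you have a set of files of roughly the same size (except the last one, which may be smaller) that together contain the same data.

To use the SAX parser you need an executable class containing the code below (PATH is relative to the directory where the application is run):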
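The splitting rule described above can be tried in isolation before any XML is involved. The following is only a minimal sketch: `GreedyChunker` and `chunk` are hypothetical names, and plain strings stand in for record rows. It closes a chunk as soon as the running character count reaches the limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the original answer.
class GreedyChunker {

    // Groups items into chunks, closing a chunk once its total length reaches maxSize.
    static List<List<String>> chunk(List<String> items, long maxSize) {
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long size = 0;
        for (String item : items) {
            current.add(item);
            size += item.length();
            if (size >= maxSize) { // limit reached: close this chunk and start a new one
                chunks.add(current);
                current = new ArrayList<>();
                size = 0;
            }
        }
        if (!current.isEmpty()) {
            chunks.add(current); // the last chunk may be smaller than the others
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("aaaa", "bbb", "cccccc", "dd", "e");
        System.out.println(chunk(rows, 6)); // prints [[aaaa, bbb], [cccccc], [dd, e]]
    }
}
```

Because the check runs after an item is added, a chunk can exceed the limit by at most the length of its last item minus one, which is why the result is only approximately the target size.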

import java.io.*; 
import javax.xml.parsers.*; 
import org.xml.sax.*; 

public class SAXReader { 

    // Directory containing the input file, relative to where the application is run
    public static final String PATH = "src/main/resources";

    public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        SAXParser sp = spf.newSAXParser();
        XMLReader reader = sp.getXMLReader();
        reader.setContentHandler(new DataSaxHandler()); // this class still has to be implemented
        // try-with-resources closes the input stream even if parsing fails
        try (InputStream in = new FileInputStream(new File(PATH, "data.xml"))) {
            reader.parse(new InputSource(in));
        }
    }
} 

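The code expects your XML file at src/main/resources/data.xml. You may want to change that.

If the split files are to be well-formed XML, each of them also needs a root element, and they should probably keep information such as record_id so that you can tell which record they came from. I added an attribute, part, that holds the sequence number of the file fragment. The generated files look like this: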

data_part_1.xml

<record part='1'><record_id>200</record_id><record_rows><record_row>...</record_row><record_row>...</record_row> ... <record_row>...</record_row></record_rows></record> 

data_part_2.xml

<record part='2'><record_id>200</record_id><record_rows><record_row>...</record_row><record_row>...</record_row> ... <record_row>...</record_row></record_rows></record> 

...

data_part_n.xml

<record part='n'><record_id>200</record_id><record_rows><record_row>...</record_row><record_row>...</record_row><record_row>...</record_row><record_row>...</record_row></record_rows></record> 

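where 'n' is the number of files created.

The SAX ContentHandler implementation that produces this result is shown below. You may want to change the DIRECTORY and MAX_SIZE constants: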
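As a rough illustration of that fragment layout, the sketch below builds one fragment as a string and checks that it parses. `FragmentBuilder`, `build`, `escape` and `isWellFormed` are hypothetical names chosen for this example. The escaping step is an assumption about what the row text may contain: a SAX parser hands text to the handler unescaped, so `&` and `<` have to be escaped again before they are written back out.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of the original answer.
class FragmentBuilder {

    // Escapes characters that must not appear literally in XML text content.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Wraps the rows in the root element, with the part number and the record id.
    static String build(int part, String recordId, List<String> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("<record part='").append(part).append("'><record_id>")
          .append(escape(recordId)).append("</record_id><record_rows>");
        for (String row : rows) {
            sb.append("<record_row>").append(escape(row)).append("</record_row>");
        }
        return sb.append("</record_rows></record>").toString();
    }

    // Returns true if the text parses as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String xml = build(1, "200", Arrays.asList("a < b", "R&D"));
        System.out.println(xml);
        System.out.println(isWellFormed(xml));
    }
}
```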

import java.io.*;
import java.nio.charset.StandardCharsets;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;

class DataSaxHandler extends DefaultHandler {

    // Change this to the directory where the files will be stored
    public static final String DIRECTORY = "target/results";

    // Change this to the approximate size of the resulting files (in characters)
    public static final long MAX_SIZE = 1024;

    public static final long TAG_CHAR_SIZE = 5; // "<></>"

    // counts number of files created
    private int fileCount = 0;

    // counts characters to decide where to split file
    private long charCount = 0;
    // data line buffer (is reset when the file is split)
    private StringBuilder recordRowDataLines = new StringBuilder();

    // temporary variables used for the parser events
    private String currentElement = null;
    private String currentRecordId = null;
    // text of the current element; characters() may deliver it in several chunks
    private final StringBuilder currentText = new StringBuilder();

    @Override
    public void startDocument() throws SAXException {
        File dir = new File(DIRECTORY);
        // mkdirs() also creates missing parent directories such as "target"
        if (!dir.exists() && !dir.mkdirs()) {
            throw new SAXException("Cannot create directory " + dir);
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        currentElement = qName;
        currentText.setLength(0); // start collecting the text of the new element
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        try {
            if (qName.equals("record_id")) {
                currentRecordId = escape(currentText.toString());
            } else if (qName.equals("record_row")) { // one row finished - save in buffer & calculate size so far
                String row = escape(currentText.toString());
                charCount += row.length() + tagSize("record_row");
                recordRowDataLines.append("<record_row>")
                        .append(row)
                        .append("</record_row>");
                if (charCount >= MAX_SIZE) { // if max size was reached, save what was read so far in a new file
                    saveFragment();
                }
            } else if (qName.equals("record_rows") && recordRowDataLines.length() > 0) {
                saveFragment(); // no more rows in this record - save the remaining ones, if any
            }
        } catch (IOException ex) {
            throw new SAXException(ex);
        }
        currentElement = null;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // append instead of overwrite: the parser may split one text node into several calls
        if ("record_id".equals(currentElement) || "record_row".equals(currentElement)) {
            currentText.append(ch, start, length);
        }
    }

    public long tagSize(String tagName) {
        return TAG_CHAR_SIZE + tagName.length() * 2; // size of the opening and closing tags
    }

    // The parser delivers text unescaped; escape it again so the output stays well-formed
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /**
     * Saves a new file containing approximately MAX_SIZE in chars
     */
    public void saveFragment() throws IOException {
        ++fileCount;
        File fragment = new File(DIRECTORY, "data_part_" + fileCount + ".xml");
        // UTF-8 is the default encoding of an XML file without a declaration
        try (Writer out = new OutputStreamWriter(new FileOutputStream(fragment), StandardCharsets.UTF_8)) {
            out.write("<record part='" + fileCount + "'><record_id>" + currentRecordId + "</record_id>");
            out.write("<record_rows>");
            out.append(recordRowDataLines);
            out.write("</record_rows></record>");
        }

        // reset fragment data - record buffer and char count
        recordRowDataLines = new StringBuilder();
        charCount = 0;
    }

}
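
Once the handler has run, it may be worth confirming that every fragment is well-formed and that no rows were lost. The following is a minimal sketch under stated assumptions: `FragmentCheck` and `countRows` are hypothetical names, and the default directory is the `target/results` value used above. It parses every `.xml` file in a directory and counts the `record_row` elements, so the total can be compared with the row count of the input file.

```java
import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of the original answer.
class FragmentCheck {

    // Counts record_row elements in all .xml files of dir.
    // Throws a SAXException if any file is not well-formed.
    static long countRows(File dir) throws Exception {
        File[] files = dir.listFiles((d, name) -> name.endsWith(".xml"));
        if (files == null) {
            throw new IllegalArgumentException("Not a directory: " + dir);
        }
        final long[] rows = {0};
        DefaultHandler counter = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals("record_row")) {
                    rows[0]++;
                }
            }
        };
        SAXParserFactory factory = SAXParserFactory.newInstance();
        for (File file : files) {
            factory.newSAXParser().parse(file, counter); // a fresh parser for each file
        }
        return rows[0];
    }

    public static void main(String[] args) throws Exception {
        File dir = new File(args.length > 0 ? args[0] : "target/results");
        System.out.println(countRows(dir) + " rows found in " + dir);
    }
}
```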