Apache PDFBox - getting java.lang.OutOfMemoryError when using split(PDDocument document)

I am trying to split a document of a decent 300 pages using the Apache PDFBox API v2.0.2. I am using the following code to split the PDF file into single pages:

    PDDocument document = PDDocument.load(inputFile);
    Splitter splitter = new Splitter();
    List<PDDocument> splittedDocuments = splitter.split(document); // Exception happens here

I receive the following exception:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

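This indicates that the GC is spending too much time clearing the heap, which is not justified by the amount of memory it reclaims.

There are numerous JVM tuning methods that can work around this, but all of them only treat the symptom and not the real problem.

One last note: I am using JDK 6, so the new Java 8 Consumer is not an option in my case. Thanks.

Edit:

This is not a duplicate of http://stackoverflow.com/questions/37771252/splitting-a-pdf-results-in-very-large-pdf-documents-with-pdfbox-2-0-2 because: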

 
1. I do not have the size problem mentioned in the aforementioned
    topic. I am slicing a 270-page, 13.8MB PDF file, and after slicing
    each slice averages 80KB, for a total size of 30.7MB.
2. The split throws the exception before it even returns the split parts.

I found that the split gets through as long as I do not pass the whole file at once, but instead pass it in "batches" of 20-30 pages each.

Source

2016-07-04 WiredCoder

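Known bug; use 2.0.1 until this is fixed.

Have you tried the previous version, as Tilman suggested?

I have a restriction on the version number @GeorgeGarchagudashvili – WiredCoder

PDFBox keeps the parts produced by the split operation on the heap as objects of type PDDocument, so the heap fills up quickly. Even if you call close() after every round of the loop, the GC cannot reclaim heap space as fast as it is being filled.

One option in 2.0.2 is to break the split operation into batches, where each batch is a relatively manageable chunk (10 to 40 pages):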

import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;

public void execute() {
    File inputFile = new File("path/to/the/file.pdf");
    PDDocument document = null;
    try {
        document = PDDocument.load(inputFile);

        int totalPages = document.getNumberOfPages();
        int batchSize = 50;
        int batch = 1;
        // Splitter page numbers are 1-based and the end page is included,
        // so each batch covers start..start + batchSize - 1 (no overlap).
        for (int start = 1; start <= totalPages; start += batchSize) {
            int end = Math.min(start + batchSize - 1, totalPages);
            System.out.println("Batch: " + batch++ + " start: " + start + " end: " + end);
            split(document, start, end);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // close the source document
        if (document != null) {
            try {
                document.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

private void split(PDDocument document, int start, int end) throws IOException {
    Splitter splitter = new Splitter();
    splitter.setStartPage(start);
    splitter.setEndPage(end);
    List<PDDocument> splittedDocuments = splitter.split(document);
    // Config is the author's own helper (not shown here)
    String outputPath = Config.INSTANCE.getProperty("outputPath");
    String title = document.getDocumentInformation().getTitle();

    for (int index = 0; index < splittedDocuments.size(); index++) {
        PDDocument splittedDocument = splittedDocuments.get(index);
        try {
            // start + index is the 1-based page number in the source document
            splittedDocument.save(new File(outputPath, title + (start + index) + ".pdf"));
        } finally {
            // release each single-page document as soon as it is saved
            splittedDocument.close();
        }
    }
}
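
The batch boundaries are easy to get wrong by one, because Splitter.setStartPage and Splitter.setEndPage take 1-based page numbers and the end page is included in the range. As a rough sketch, the range arithmetic can be checked on its own without PDFBox. BatchRanges and ranges are hypothetical names used only for illustration; they are not part of PDFBox, and the code avoids Java 7+ syntax so it also fits a JDK 6 setup.

```java
import java.util.ArrayList;
import java.util.List;

class BatchRanges {

    /**
     * Returns {start, end} pairs, 1-based and inclusive on both ends,
     * that together cover pages 1..totalPages without overlap.
     */
    static List<int[]> ranges(int totalPages, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<int[]> result = new ArrayList<int[]>();
        for (int start = 1; start <= totalPages; start += batchSize) {
            // the last batch may be shorter than batchSize
            int end = Math.min(start + batchSize - 1, totalPages);
            result.add(new int[] { start, end });
        }
        return result;
    }

    public static void main(String[] args) {
        // 270 pages in batches of 50, as in the question's document
        for (int[] range : ranges(270, 50)) {
            System.out.println("start: " + range[0] + " end: " + range[1]);
        }
    }
}
```

Each pair can be passed straight to setStartPage and setEndPage. When the page count is an exact multiple of the batch size, no empty trailing batch is produced.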

Source

2016-07-10 17:23:28 WiredCoder
