2011-09-20 112 views

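I am trying to parse PDF files with Apache Tika after upgrading PDFBox to version 1.6.0, and I have started getting the error below for several PDF files. The error occurs while parsing the binary file. Any suggestions?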

java.io.IOException: expected='endstream' actual='' [email protected] 
     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:439) 
     at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:552) 
     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:184) 
     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1088) 
     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1053) 
     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74) 
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197) 
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197) 
     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) 
     at org.apache.tika.Tika.parseToString(Tika.java:357) 
     at edu.uci.ics.crawler4j.crawler.BinaryParser.parse(BinaryParser.java:37) 
     at edu.uci.ics.crawler4j.crawler.WebCrawler.handleBinary(WebCrawler.java:223) 
     at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:461) 
     at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:129) 
     at java.lang.Thread.run(Thread.java:662) 
    WARN [Crawler 2] Did not found XRef object at specified startxref position 0 

Here is my code.

 if (page.isBinary()) { 
         handleBinary(page, curURL); 
        } 
     ------------------------------------------------------------------------------- 

      public int handleBinary(Page page, WebURL curURL) { 
        try { 
         binaryParser.parse(page.getBinaryData()); 
         page.setText(binaryParser.getText()); 
         handleMetaData(page, binaryParser.getMetaData()); 



         //System.out.println(" pdf url " +page.getWebURL().getURL()); 
         //System.out.println("Text" +page.getText()); 
        } catch (Exception e) { 
         // TODO: handle exception 
        } 
        return PROCESS_OK; 
       } 

 public class BinaryParser { 

      private String text; 
      private Map<String, String> metaData; 

      private Tika tika; 

      public BinaryParser() { 
       tika = new Tika(); 
      } 

      public void parse(byte[] data) { 
       InputStream is = null; 
       try { 
        is = new ByteArrayInputStream(data); 
        text = null; 
        Metadata md = new Metadata(); 
        metaData = new HashMap<String, String>(); 
        text = tika.parseToString(is, md).trim(); 
        processMetaData(md); 
       } catch (Exception e) { 
        e.printStackTrace(); 
       } finally { 
        IOUtils.closeQuietly(is); 
       } 
      } 

      public String getText() { 
       return text; 
      } 

      public void setText(String text) { 
       this.text = text; 
      } 


      private void processMetaData(Metadata md){ 
       if ((getMetaData() == null) || (!getMetaData().isEmpty())) { 
        setMetaData(new HashMap<String, String>()); 
       } 
       for (String name : md.names()){ 
        getMetaData().put(name.toLowerCase(), md.get(name)); 
       } 
      } 

      public Map<String, String> getMetaData() { 
       return metaData; 
      } 

      public void setMetaData(Map<String, String> metaData) { 
       this.metaData = metaData; 
      } 

     } 

public class Page { 

     private WebURL url; 

     private String html; 

     // Data for textual content 
     private String text; 

     private String title; 

     private String keywords; 
     private String authors; 
     private String description; 
     private String contentType; 
     private String contentEncoding; 

     // binary data (e.g, image content) 
     // It's null for html pages 
     private byte[] binaryData; 

     private List<WebURL> urls; 

     private ByteBuffer bBuf; 

     private final static String defaultEncoding = Configurations 
       .getStringProperty("crawler.default_encoding", "UTF-8"); 

     public boolean load(final InputStream in, final int totalsize, 
       final boolean isBinary) { 
      if (totalsize > 0) { 
       this.bBuf = ByteBuffer.allocate(totalsize + 1024); 
      } else { 
       this.bBuf = ByteBuffer.allocate(PageFetcher.MAX_DOWNLOAD_SIZE); 
      } 
      final byte[] b = new byte[1024]; 
      int len; 
      double finished = 0; 
      try { 
       while ((len = in.read(b)) != -1) { 
        if (finished + b.length > this.bBuf.capacity()) { 
         break; 
        } 
        this.bBuf.put(b, 0, len); 
        finished += len; 
       } 
      } catch (final BufferOverflowException boe) { 
       System.out.println("Page size exceeds maximum allowed."); 
       return false; 
      } catch (final Exception e) { 
       System.err.println(e.getMessage()); 
       return false; 
      } 

      this.bBuf.flip(); 
      if (isBinary) { 
       binaryData = new byte[bBuf.limit()]; 
       bBuf.get(binaryData); 
      } else { 
       this.html = ""; 
       this.html += Charset.forName(defaultEncoding).decode(this.bBuf); 
       this.bBuf.clear(); 
       if (this.html.length() == 0) { 
        return false; 
       } 
      } 
      return true; 
     } 
    public boolean isBinary() { 
     return binaryData != null; 
    } 

    public byte[] getBinaryData() { 
     return binaryData; 
    } 

Can these PDF files be opened by other means? This error looks like it could be caused by a corrupted PDF – Gagravarr


@Gagravarr, yes, I can open all of these PDF files... they are not corrupted! What else could be wrong? – ferhan

Answer

Are you sure you are not accidentally truncating the PDF file when you load it into the binary buffer of the Page class?
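
One rough way to test for truncation, offered as a sketch rather than as part of the original code: a complete PDF normally carries a %%EOF marker in its final bytes, so inspecting the tail of the downloaded byte array can hint at whether the download was cut short. The class and method names (PdfTruncationCheck, hasEofMarker) and the 1024-byte search window are assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PdfTruncationCheck {

    // Hypothetical helper: returns true if "%%EOF" appears in the last 1024 bytes.
    static boolean hasEofMarker(byte[] data) {
        if (data == null) {
            return false;
        }
        int from = Math.max(0, data.length - 1024);
        // ISO-8859-1 maps every byte to exactly one char, so no bytes are lost.
        String tail = new String(data, from, data.length - from, StandardCharsets.ISO_8859_1);
        return tail.contains("%%EOF");
    }

    public static void main(String[] args) {
        byte[] complete = "%PDF-1.4\nbody\nstartxref\n0\n%%EOF\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] truncated = Arrays.copyOf(complete, complete.length - 10);

        System.out.println(hasEofMarker(complete));  // true
        System.out.println(hasEofMarker(truncated)); // false
    }
}
```

In handleBinary, a check like this could run on page.getBinaryData() before the bytes are handed to Tika. A false result would point at the download step rather than at PDFBox.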

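Your Page.load() method has several potential problems. First, finished + b.length > this.bBuf.capacity() should be finished + len > this.bBuf.capacity(), because read() may return fewer than b.length bytes. Second, are you sure the totalsize argument you pass in is accurate? If it is too small, the loop stops before the whole document has been read. Finally, the document may simply be larger than the MAX_DOWNLOAD_SIZE limit.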
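
A minimal sketch of a read loop that applies the corrected check, assuming it is acceptable to fail loudly instead of silently stopping at the limit. BoundedReader, readFully and maxBytes are hypothetical names, not part of crawler4j. The loop grows a ByteArrayOutputStream instead of relying on a pre-sized ByteBuffer, so an inaccurate size estimate cannot cut the data short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class BoundedReader {

    // Reads the whole stream, or throws if it would exceed maxBytes.
    static byte[] readFully(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        long total = 0;
        int len;
        while ((len = in.read(chunk)) != -1) {
            // Compare against the bytes actually read (len), not chunk.length.
            if (total + len > maxBytes) {
                throw new IOException("Content exceeds limit of " + maxBytes + " bytes");
            }
            out.write(chunk, 0, len);
            total += len;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] source = new byte[2500];
        byte[] copy = readFully(new ByteArrayInputStream(source), 4096);
        System.out.println(copy.length); // 2500
    }
}
```

Throwing on overflow is a design choice for this sketch: a document that does not fit is reported as a failure, so a partial PDF is never passed on to the parser.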


Is there a solution to this problem? – vinaykumar