Reading a .doc file in Java using POI

Hi, I am trying to read the text from .doc and .docx files. For .doc files I am doing this:

package test;

import java.io.File;
import java.io.FileInputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class ReadFile {
    public static void main(String[] args) {
        File file = new File("C:\\Users\\rijo\\Downloads\\r.doc");
        // try-with-resources closes the stream even if POI throws
        try (FileInputStream fis = new FileInputStream(file)) {
            HWPFDocument document = new HWPFDocument(fis);
            WordExtractor extractor = new WordExtractor(document);
            String fileData = extractor.getText();
            System.out.println(fileData);
        } catch (Exception exep) {
            // Print the failure instead of silently swallowing it
            exep.printStackTrace();
        }
    }
}

But this gives me an org/apache/poi/OldFileFormatException exception.

Any idea how to fix this?

I also need to read .docx and PDF files. Is there a good way to read all of these file types?

Source

2013-10-14 Rijo Joseph

Which version of POI are you using? – Paolo

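If you look at the javadoc for OldFileFormatException, you can see that it is the

"Base class of all the exceptions that POI throws in the event that it's given a file that's older than currently supported."

This means that the r.doc you are using is not supported by HWPFDocument. It may be that it only supports the latest format (docx has been around for quite some time too; I am not sure whether Apache POI supports the doc format in HWPFDocument).

Source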

2013-10-14 10:58:32 SudoRahul

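I tried with a .docx file but got the same exception.. do you know any other way to read all .doc, .docx and .pdf files? –

Using the following jars (in case the version numbers play a role here):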

dom4j-1.7-20060614 
poi-3.9-20121203 
poi-ooxml-3.9-20121203 
poi-ooxml-schemas-3.9-20121203 
poi-scratchpad-3.9-20121203 
xmlbeans-2.4.0

I knocked this up:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class SO {
    public static void main(String[] args) {

        // Alternate between the two to check what works.
        //String filePath = "D:\\Users\\username\\Desktop\\Doc1.docx";
        String filePath = "D:\\Users\\username\\Desktop\\Bob.doc";

        // try-with-resources closes the stream even if POI throws
        try (FileInputStream fis = new FileInputStream(new File(filePath))) {
            if (filePath.toLowerCase().endsWith(".docx")) { // is a docx
                XWPFDocument doc = new XWPFDocument(fis);
                XWPFWordExtractor extract = new XWPFWordExtractor(doc);
                System.out.println(extract.getText());
            } else { // is not a docx
                HWPFDocument doc = new HWPFDocument(fis);
                WordExtractor extractor = new WordExtractor(doc);
                System.out.println(extractor.getText());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This lets me read the text from both .docx and .doc files. If it does not work on your PC, you may have a problem with the external jars you are using.
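A quick way to rule out a jar problem is to ask the JVM whether it can see the POI classes at all. The slash-separated name in the question (org/apache/poi/OldFileFormatException) is the form the JVM uses when it reports a class it could not load, so this may be worth checking first. The sketch below is only an illustration: the class name ClasspathCheck and the helper isOnClasspath are made up for this example, and the jar named in each comment is where I would expect that class to live.

```java
public class ClasspathCheck {

    // Returns true if the named class can be located by this class's loader.
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] needed = {
            "org.apache.poi.OldFileFormatException",       // expected in the poi jar
            "org.apache.poi.hwpf.HWPFDocument",            // expected in poi-scratchpad
            "org.apache.poi.xwpf.usermodel.XWPFDocument"   // expected in poi-ooxml
        };
        for (String name : needed) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it with the same classpath as the failing program. Anything reported as MISSING points at a jar that is absent or not being picked up.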

Good luck though :)

Source

2013-10-14 13:08:45 Levenal

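@RijoJoseph I have updated my answer based on your earlier comment. – Levenal

I don't know why you would use WordExtractor just to get the text from a .doc. For me, a single method call was enough: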

import org.apache.poi.hwpf.HWPFDocument; 
... 
File fin = new File(yourFilePath); 
FileInputStream fis = new FileInputStream(fin); 
HWPFDocument doc = new HWPFDocument(fis); 
String text = doc.getDocumentText(); 
System.out.println(text); 
...

To work with .PDF files, use another Apache library: pdfbox.
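Since the question asks about .doc, .docx and .pdf together, one option is to look at the first few bytes of the file instead of its extension before deciding which library to hand it to. This is a minimal sketch using only the JDK. FormatSniffer, Kind, sniff and sniffFile are names invented for this example. Note that the signatures only identify the container: an OLE2 header also matches .xls and .ppt, and a ZIP header matches any zip archive, not just .docx.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class FormatSniffer {

    public enum Kind { OLE2, ZIP, PDF, UNKNOWN }

    // OLE2 compound document header (used by binary .doc, .xls, .ppt)
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header (used by .docx and every other zip archive)
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};
    // PDF header
    private static final byte[] PDF_MAGIC = {'%', 'P', 'D', 'F'};

    // Classifies a file from its leading bytes.
    public static Kind sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return Kind.OLE2;
        if (startsWith(head, ZIP_MAGIC)) return Kind.ZIP;
        if (startsWith(head, PDF_MAGIC)) return Kind.PDF;
        return Kind.UNKNOWN;
    }

    // Reads up to 8 bytes from the file and classifies them.
    public static Kind sniffFile(String path) throws IOException {
        byte[] head = new byte[8];
        int n = 0;
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) break; // file is shorter than 8 bytes
                n += r;
            }
        }
        return sniff(Arrays.copyOf(head, n));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            System.out.println(args[0] + " -> " + sniffFile(args[0]));
        }
    }
}
```

With the result in hand you could route OLE2 to HWPFDocument, ZIP to XWPFDocument and PDF to pdfbox, falling back to an error message for UNKNOWN.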

Source

2015-10-27 09:57:52
