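How do I get text from a text box in an MS Word document using Apache POI?

I want to extract information that was written inside a text box in an MS Word document. I am using Apache POI to parse the Word document. How can I get the text from a text box with Apache POI?

Currently I iterate over all the paragraph objects, but that paragraph list does not include the content of text boxes, so this information is missing from my output.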

For example:

paragraph in plain text 

**<some information in text box>** 

one more paragraph in plain text

What I want to extract:

<para>paragraph in plain text</para> 

<text_box>some information in text box</text_box> 

<para>one more paragraph in plain text</para>

What I currently get:

paragraph in plain text

one more paragraph in plain text

Does anyone know how to extract information from a text box using Apache POI?

Source

2011-03-28 Shekhar

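Which format: doc or docx? – JasonPlutext 2011-03-30 11:25:56

@plutext, doc format to begin with, but later I will need to do the same for docx and rtf. – Shekhar 2011-03-31 10:44:51

You could consider using JODConverter + LibreOffice to convert all three formats to docx, and then use POI (or docx4j) to extract the text box content from the docx. That way you don't have to worry about the binary format or about parsing rtf. – JasonPlutext 2011-03-31 12:07:15

This worked for me: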

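import javax.xml.namespace.QName;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.xmlbeans.XmlException;
import org.apache.xmlbeans.XmlObject;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;

private void printContentsOfTextBox(XWPFParagraph paragraph) {

    // Select the content of every wps text box nested inside this paragraph.
    // The query is one string; Java string literals cannot span lines.
    XmlObject[] textBoxObjects = paragraph.getCTP().selectPath(
        "declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' "
        + "declare namespace wps='http://schemas.microsoft.com/office/word/2010/wordprocessingShape' "
        + ".//*/wps:txbx/w:txbxContent");

    for (int i = 0; i < textBoxObjects.length; i++) {
        try {
            // The direct w:p children are the paragraphs inside the text box.
            XmlObject[] paraObjects = textBoxObjects[i].selectChildren(
                new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "p"));

            for (int j = 0; j < paraObjects.length; j++) {
                // Re-parse the fragment as a CTP so it can be wrapped in an XWPFParagraph.
                XWPFParagraph embeddedPara = new XWPFParagraph(
                    CTP.Factory.parse(paraObjects[j].xmlText()), paragraph.getBody());
                // Here you have your paragraph.
                System.out.println(embeddedPara.getText());
            }
        } catch (XmlException e) {
            // The fragment could not be parsed as a paragraph; log it or handle as needed.
            System.err.println("Could not parse text box paragraph: " + e.getMessage());
        }
    }
}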

Source

2014-09-16 19:49:42 Chinmay

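Update: it turns out that not all text boxes follow the pattern in the example above. Here is another one: textBoxObjects.addAll(Arrays.asList(paragraph.getCTP().selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' declare namespace v='urn:schemas-microsoft-com:vml' .//*/v:textbox/w:txbxContent"))); I could not find a complete list of text box patterns in the OOXML schema definition, so if anyone has one, please share. – Chinmay 2014-09-18 17:15:21

If you want to get the text from text boxes in a docx file (using POI 3.10-FINAL), here is sample code: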
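For anyone who wants to check what the two patterns above have in common without pulling in POI: both end in w:txbxContent, so a rough sketch using only the JDK can look for that element directly. This is not POI's API. The class name TextBoxSketch and its method names are made up for illustration, the sample XML in main is heavily simplified compared to what Word writes, and the docx helper assumes the main document part is stored as word/document.xml.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class, not part of POI; it uses only the JDK.
public class TextBoxSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of every w:p nested inside any w:txbxContent element,
    // whichever container (wps:txbx, v:textbox, ...) wraps it.
    static List<String> textBoxParagraphs(InputStream documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for namespace-based lookups
        Document doc = factory.newDocumentBuilder().parse(documentXml);

        List<String> result = new ArrayList<>();
        NodeList boxes = doc.getElementsByTagNameNS(W_NS, "txbxContent");
        for (int i = 0; i < boxes.getLength(); i++) {
            NodeList paras = ((Element) boxes.item(i)).getElementsByTagNameNS(W_NS, "p");
            for (int j = 0; j < paras.getLength(); j++) {
                // Concatenate all w:t text nodes of this paragraph.
                NodeList texts = ((Element) paras.item(j)).getElementsByTagNameNS(W_NS, "t");
                StringBuilder text = new StringBuilder();
                for (int k = 0; k < texts.getLength(); k++) {
                    text.append(texts.item(k).getTextContent());
                }
                result.add(text.toString());
            }
        }
        return result;
    }

    // Assumption: the main document part is stored as word/document.xml.
    static List<String> textBoxParagraphsFromDocx(String docxPath) throws Exception {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IllegalArgumentException("no word/document.xml in " + docxPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return textBoxParagraphs(in);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simplified stand-in for document.xml; real files wrap the text box
        // in several more drawing elements.
        String xml = "<w:document xmlns:w='" + W_NS + "' "
            + "xmlns:wps='http://schemas.microsoft.com/office/word/2010/wordprocessingShape'>"
            + "<w:body>"
            + "<w:p><w:r><w:t>paragraph in plain text</w:t></w:r></w:p>"
            + "<w:p><w:r><wps:txbx><w:txbxContent><w:p><w:r>"
            + "<w:t>some information in text box</w:t>"
            + "</w:r></w:p></w:txbxContent></wps:txbx></w:r></w:p>"
            + "</w:body></w:document>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        for (String line : textBoxParagraphs(in)) {
            System.out.println("<text_box>" + line + "</text_box>");
        }
    }
}
```

One caveat: a document that stores the same text box in both the wps and the VML form would have its text returned twice by this sketch (and by running both XPath patterns), so some de-duplication may be needed.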

FileInputStream fileInputStream = new FileInputStream(inputFile); 
XWPFDocument document = new XWPFDocument(OPCPackage.open(fileInputStream)); 
for (XWPFParagraph xwpfParagraph : document.getParagraphs()) { 
    String text = xwpfParagraph.getParagraphText(); //here is where you receive text from textbox 
}

Alternatively, you can iterate over each XWPFRun in the XWPFParagraph and call its toString() method. The result is the same.

Source

2014-04-23 09:16:45

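To extract all text matches from Word doc and docx files for crgrep, I used the Apache Tika source as a reference for how the Apache POI APIs should be used correctly. This is useful if you want to use POI directly rather than depend on Tika.

For Word .docx files, take a look at this Tika class: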

org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator

If you ignore the XHTMLContentHandler and the formatting code, you can see how to navigate an XWPFDocument correctly using POI. For .doc files, this class is helpful:

org.apache.tika.parser.microsoft.WordExtractor

Both are in tika-parsers-1.x.jar. An easy way to browse the Tika code through your Maven dependencies is to temporarily add Tika to your pom.xml, for example:

<dependency> 
    <groupId>org.apache.tika</groupId> 
    <artifactId>tika-parsers</artifactId> 
    <version>1.7</version> 
</dependency>

Then let your IDE resolve the attached sources and step into the classes above.

Source

2015-03-14 06:47:28 Craig
