Font problem when parsing PDF to text with PDFBox, FontBox, etc.

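I am using the PDFBox API to extract text from a PDF.
My program works and does extract text from the PDF, but the text in the PDF is set in CDAC-GISTSurekh (a Hindi font), while the output of my program is not in that font: it comes out in Mangal.
The output does not even match the text in the PDF.
I downloaded the same font, CDAC-GISTSurekh, and installed it on my computer, but the output still comes out in Mangal.
Is there any way to change the font of the output while parsing?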

Thanks for any help.

The code I have written:

 


    import java.io.File; 
    import java.io.FileInputStream; 
    import java.io.IOException; 
    import org.apache.pdfbox.cos.COSDocument; 
    import org.apache.pdfbox.pdfparser.PDFParser; 
    import org.apache.pdfbox.pdmodel.PDDocument; 
    import org.apache.pdfbox.util.PDFTextStripper; 

    public class PDFTextParser { 
     static String pdftoText(String fileName) { 
      PDFParser parser; 
      String parsedText = null; 
      PDFTextStripper pdfStripper = null; 
      PDDocument pdDoc = null; 
      COSDocument cosDoc = null; 
      File file = new File(fileName); 
      if (!file.isFile()) { 
       System.out.println("File " + fileName + " does not exist."); 
       return null; 
      } 
      try { 
       parser = new PDFParser(new FileInputStream(file)); 
      } catch (IOException e) { 
       System.out.println("Unable to open PDF Parser. " + e.getMessage()); 
       return null; 
      } 
      try { 
       parser.parse(); 
       cosDoc = parser.getDocument(); 
       pdfStripper = new PDFTextStripper(); 
       pdDoc = new PDDocument(cosDoc); 
       pdfStripper.setStartPage(1); 
       pdfStripper.setEndPage(5); 
       parsedText = pdfStripper.getText(pdDoc); 
      } catch (Exception e) { 
       e.printStackTrace(); 
       System.out.println("An exception occurred in parsing the PDF Document. " + e.getMessage()); 
      } finally { 
       try { 
        if (cosDoc != null) 
         cosDoc.close(); 
        if (pdDoc != null) 
         pdDoc.close(); 
       } catch (Exception e) { 
        e.printStackTrace(); 
       } 
      } 
      return parsedText; 
     } 
     public static void main(String args[]){ 
      System.out.println(pdftoText("J:\\Users\\Shantanu\\Documents\\NetBeansProjects\\Pdf\\src\\PDfman\\A0410001.pdf")); 
     } 
    }

Source

2011-09-17 Shantanu

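Are you reading a voter ID list? If so, one thing I found is that the text is stored as images, which makes it very hard to parse. I am trying to do the same thing. Have you succeeded in parsing it?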

When you create a new PDFTextStripper object, use the following syntax and specify the encoding for it.

    PDFTextStripper pdfStripper = new PDFTextStripper("ISO-XXXX");

where ISO-XXXX is the character encoding used in the PDF.
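
As a side illustration of why the choice of encoding matters for Hindi text, here is a minimal sketch that uses only the Java standard library and makes no PDFBox calls. The class name `EncodingCheck`, its helper methods, and the sample word are assumptions made for this example. It encodes a Devanagari string with two charsets and decodes it again: UTF-8 round-trips the text, while ISO-8859-1 cannot represent Devanagari and substitutes `?` for each character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    // Encode the text with the given charset, then decode the bytes again.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    // True if the text is non-empty and every code point is in the Devanagari block.
    static boolean isDevanagari(String text) {
        return !text.isEmpty() && text.codePoints().allMatch(
            cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.DEVANAGARI);
    }

    public static void main(String[] args) {
        // Illustrative sample only: the word "Hindi" written in Devanagari.
        String sample = "\u0939\u093F\u0928\u094D\u0926\u0940";
        System.out.println("UTF-8 round-trip intact: "
            + sample.equals(roundTrip(sample, StandardCharsets.UTF_8)));
        System.out.println("ISO-8859-1 round-trip intact: "
            + sample.equals(roundTrip(sample, StandardCharsets.ISO_8859_1)));
        System.out.println("Sample is Devanagari: " + isDevanagari(sample));
    }
}
```

If the extracted string passes a check like `isDevanagari`, the characters themselves are Unicode Hindi text, and an encoding that cannot represent them would only lose information.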

Source

2012-08-19 01:20:49 Yonkee

Where did you find that code? Is there a way to find out which ISO encoding the PDF was saved with?

@Yonkee There is no such constructor that takes an argument – varpekv
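
Whichever `PDFTextStripper` constructors a given PDFBox version offers, the encoding can also be fixed at the point where the extracted string is written out. The following is a minimal sketch under stated assumptions: `text` stands in for whatever `getText` returned, and the class name `Utf8Output`, its helpers, and the temporary file are hypothetical. A Java `String` holds Unicode characters but no font information, so the font seen afterwards (Mangal, for example) is typically chosen by the program that displays the file, not by the parser.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8Output {
    // Write the text to the target file as UTF-8, replacing any existing content.
    static void writeUtf8(Path target, String text) throws IOException {
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    // Read the whole file back, decoding its bytes as UTF-8.
    static String readUtf8(Path source) throws IOException {
        return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the extracted text (assumption, not real PDF output).
        String text = "\u0939\u093F\u0928\u094D\u0926\u0940";
        Path file = Files.createTempFile("extracted", ".txt");
        writeUtf8(file, text);
        System.out.println("Text survived the file round-trip: "
            + text.equals(readUtf8(file)));
        Files.delete(file);
    }
}
```

Opening the resulting file in an editor and selecting a different Devanagari font there changes how it looks without changing the stored characters.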
