2015-06-24 48 views
0

PDF text extraction with Python 3.4

The text in the PDF file is real text, not a scanned image. PDFMiner does not support Python 3. Is there another solution?

+0

How about PyPDF2 (https://github.com/mstamy2/PyPDF2)?

+1

There is a 3k version of the PDFMiner library: https://pypi.python.org/pypi/pdfminer3k

Answers

2

There is also the pdfminer2 fork, which supports Python 3.4 and can be installed with pip3: https://github.com/metachris/pdfminer

This thread helped me patch something together:

from io import BytesIO, StringIO
from urllib.request import urlopen

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def readPDF(pdfFile):
    """Return the text of every page in the binary file object pdfFile."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages
    for page in PDFPage.get_pages(pdfFile, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    device.close()
    textstr = retstr.getvalue()
    retstr.close()
    return textstr


if __name__ == "__main__":
    # For a local file: scrape = open("../warandpeace/chapter1.pdf", 'rb')
    scrape = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")  # for external files
    pdfFile = BytesIO(scrape.read())
    outputString = readPDF(pdfFile)
    print(outputString)
    pdfFile.close()
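
The main block above switches between a local file and a URL by commenting lines in and out. A minimal sketch of a helper that handles both cases and rejects input that is not a PDF is shown here. The names `open_pdf` and `looks_like_pdf` are hypothetical, and the 1024-byte window for the `%PDF-` marker is a lenient assumption, not something the answer specifies.

```python
from io import BytesIO
from urllib.request import urlopen


def looks_like_pdf(data):
    """Return True if the '%PDF-' marker appears near the start of the bytes."""
    # Assumption: accept the marker anywhere in the first 1024 bytes.
    return b"%PDF-" in data[:1024]


def open_pdf(source):
    """Load a PDF from a URL or a local path into an in-memory BytesIO."""
    if source.startswith(("http://", "https://")):
        with urlopen(source) as response:
            data = response.read()
    else:
        with open(source, "rb") as handle:
            data = handle.read()
    if not looks_like_pdf(data):
        raise ValueError("%r does not look like a PDF file" % source)
    return BytesIO(data)
```

With this in place, `readPDF(open_pdf("../warandpeace/chapter1.pdf"))` and `readPDF(open_pdf("http://pythonscraping.com/pages/warandpeace/chapter1.pdf"))` would go through the same code path.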
+0

What parameters should be used to export an HTML file?
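
One possible approach to the HTML question, as a sketch only: swap `TextConverter` for `HTMLConverter` from the same `pdfminer.converter` module. This assumes a pdfminer.six-style API, where `HTMLConverter` encodes its output when a `codec` is given and therefore needs a binary buffer. The pdfminer2 fork mentioned in the answer may behave differently. The function name `pdf_to_html` is hypothetical.

```python
from io import BytesIO


def pdf_to_html(pdf_stream):
    """Render a binary PDF stream to an HTML string (sketch, see assumptions)."""
    # Imported lazily so this module still loads where pdfminer is not installed.
    from pdfminer.converter import HTMLConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    out = BytesIO()  # binary buffer, because output is encoded with the codec
    device = HTMLConverter(rsrcmgr, out, codec="utf-8", laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(pdf_stream):
        interpreter.process_page(page)
    device.close()  # writes the closing HTML markup
    return out.getvalue().decode("utf-8")
```

It would be called with a binary file object, for example the `BytesIO` built in the answer above, and the returned string can then be written to an `.html` file.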