2014-01-12 41 views

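How to extract text from a PDF uploaded to Google App Engine using PyPDF2?

Is there any way to extract the text and documentInfo from a PDF file uploaded through Google App Engine? I want to use PyPDF2, and my code looks like this: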

pdf_file = self.request.POST['file'].file 
pdf_reader = pypdf.PdfFileReader(pdf_file) 

This gives me the following error:

Traceback (most recent call last): 
.... 
    File "/myrepo/myproj/main.py", line 154, in post 
    pdf_text = pypdf.PdfFileReader(pdf_file) 
    File "lib/PyPDF2/pdf.py", line 649, in __init__ 
    self.read(stream) 
    File "lib/PyPDF2/pdf.py", line 1100, in read 
    raise utils.PdfReadError, "EOF marker not found" 
PdfReadError: EOF marker not found 

It raises this error for every file, even for files that can be read successfully from disk via open(filename, 'r').

What am I missing? Thanks in advance!

Answer

The solution is to use get_uploads from blobstore_handlers.BlobstoreUploadHandler:

from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers
from cStringIO import StringIO
import PyPDF2

class UploadHandler(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        # get_uploads returns the BlobInfo records for the 'file' form field
        upload_files = self.get_uploads('file')
        blob_info = upload_files[0]
        # Read the whole blob into memory so PyPDF2 gets a seekable stream
        blob_reader = blobstore.BlobReader(blob_info)
        blob_content = StringIO(blob_reader.read())
        pdf_info = PyPDF2.PdfFileReader(blob_content)
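
Once the reader opens without error, the text and documentInfo asked about in the question can be pulled out of it. What follows is only a minimal sketch, assuming the PyPDF2 1.x API used in this thread (getDocumentInfo, getNumPages, getPage, extractText). The helper names to_seekable, has_eof_marker, and extract_pdf are hypothetical, and the %%EOF check is just a quick diagnostic for the uploaded bytes, not a reproduction of what PyPDF2 does internally.

```python
import io


def to_seekable(raw_bytes):
    """Wrap the uploaded bytes in an in-memory, seekable binary stream."""
    return io.BytesIO(raw_bytes)


def has_eof_marker(stream, tail_size=1024):
    """Report whether %%EOF appears in the last tail_size bytes of the stream."""
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    stream.seek(max(0, size - tail_size))
    tail = stream.read()
    stream.seek(0)  # rewind so a PDF reader can start from the beginning
    return b"%%EOF" in tail


def extract_pdf(stream):
    """Return (document info, list of page texts), assuming the PyPDF2 1.x API."""
    import PyPDF2  # imported here so the helpers above work without PyPDF2

    reader = PyPDF2.PdfFileReader(stream)
    info = reader.getDocumentInfo()
    pages = [reader.getPage(i).extractText() for i in range(reader.getNumPages())]
    return info, pages
```

In the handler above this could be used as stream = to_seekable(blob_reader.read()), followed by extract_pdf(stream) when has_eof_marker(stream) is true. If the check fails, the bytes reaching the handler are probably not the complete PDF, which would be consistent with the "EOF marker not found" error.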