PyPDF2 - Unable to get past a large corrupt file

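I am checking a file system for corrupt PDFs. The test I am running covers nearly 200k PDFs. Smaller corrupt files seem to be flagged correctly, but I ran into a large 15 MB corrupt file and the code just hangs indefinitely. I tried setting strict to False with no luck. It appears that this was the initial problem. Rather than using threads and setting a timeout (which I have tried in the past with little success), I am hoping there is an alternative.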

import os
from time import gmtime, strftime

import PyPDF2  # PyPDF2 1.x API (PdfFileReader, PyPDF2.utils)

path = input("Enter folder path of PDF files:")
count = 1
# warndest expects a writable file object, so the warning file is opened here
with open(r'c:\pdf_check\log.txt', 'w') as t, \
        open(r'c:\test\warning.txt', 'w') as warn_file:
    for dirpath, dnames, fnames in os.walk(path):
        for file in fnames:
            print(count)
            count = count + 1
            if file.endswith(".pdf"):
                file = os.path.join(dirpath, file)
                try:
                    # The second positional parameter is strict, not a file mode,
                    # so passing 'rb' there left strict switched on
                    PyPDF2.PdfFileReader(file, strict=False, warndest=warn_file)
                except PyPDF2.utils.PdfReadError:
                    curdate = strftime("%Y-%m-%d %H:%M:%S", gmtime())
                    t.write(str(curdate) + " - " + file + " - fail\n")
                else:
                    pass
                    #curdate = strftime("%Y-%m-%d %H:%M:%S", gmtime())
                    #t.write(str(curdate) + " - " + file + " - pass\n")

Source

2017-10-12 HMan06

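It looks like there is an issue with PyPDF2. I was not able to get it working; however, I used pdfrw, which did not stall at that point and went through a few hundred thousand documents without a problem.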
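For illustration only, a scan along those lines might look like the sketch below. It is not the code that was actually run. The function name find_unreadable_pdfs and its reader parameter are hypothetical; by default the sketch assumes the third-party pdfrw package is installed and uses its PdfReader, and it treats any exception raised while parsing as a failure.

```python
import os


def find_unreadable_pdfs(root, reader=None):
    """Walk root and return the paths of PDFs that the reader fails to parse."""
    if reader is None:
        # Assumption: pdfrw is installed (pip install pdfrw)
        from pdfrw import PdfReader as reader
    failed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(".pdf"):
                continue
            full_path = os.path.join(dirpath, name)
            try:
                reader(full_path)
            except Exception:
                # Broad on purpose: any parse error counts as a failed file
                failed.append(full_path)
    return failed
```

Keeping the reader as a parameter makes it possible to swap in a different parser without touching the directory walk.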

Source

2017-10-13 20:50:42 HMan06
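If a parser still hangs on a particular file, one option that avoids threads is to run each check in a separate Python process and let the standard library enforce the timeout, since a child process can be killed while a stuck thread cannot. The following is a minimal sketch under stated assumptions: check_with_timeout and CHECK_SNIPPET are hypothetical names, the default snippet assumes PyPDF2 1.x is installed in the same interpreter, and the 60 second limit is arbitrary.

```python
import subprocess
import sys

# Hypothetical snippet run in the child process; swap in whichever parser you use.
CHECK_SNIPPET = "import sys, PyPDF2; PyPDF2.PdfFileReader(sys.argv[1], strict=False)"


def check_with_timeout(pdf_path, snippet=CHECK_SNIPPET, timeout=60):
    """Return 'pass', 'fail', or 'timeout' for a single file."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", snippet, pdf_path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # the child is killed once this many seconds pass
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    # An uncaught exception in the child gives a non-zero exit code
    return "pass" if result.returncode == 0 else "fail"
```

Starting an interpreter per file is slow across 200k files, so this probably fits best as a fallback for large files or for files that have already stalled a normal run.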
