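Python script to iterate through PDFs in a directory and find a matching line

Currently all of my reports are emailed to me as PDFs. I have set Outlook to automatically download these files to a directory every day. Sometimes these PDFs contain no data and only the line "There is no data to present that matches the selection criteria". I want to create a Python program that goes through each PDF file in that directory, opens it, and looks for those words. If the file contains the phrase, delete that particular PDF; if it doesn't, do nothing. With help from reddit I have pieced together the code below: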

import PyPDF2
import os

directory = 'C:\\Users\\jmoorehead\\Desktop\\A2IReports\\'
for file in os.listdir(directory):
    if not file.endswith(".pdf"):
        continue
    with open("{}/{}".format(directory,file), 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        pageObj = pdfReader.getPage(0)
        if "There is no data to present that matches the selection criteria" in pageObj.extractText():
            print("{} was removed.".format(file))
            os.remove(file)

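I tested with 3 files, one of which contains the matching phrase. It fails no matter how the files are named or what order they are in. I have also tested it with a single file in the directory, named 3.pdf. Below is the error I get.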

FileNotFoundError: [WinError 2] The system cannot find the file specified: '3.pdf'

This would greatly reduce my workload and is a good learning example for a newbie like me. All help/criticism welcome.

Source

2017-06-14 user3487244

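You have a forward slash rather than a backslash: '{}/{}' – jsmiao

Building file paths by string substitution often leads to typos like this. Try using 'os.path.join(path, *paths)', documented here: https://docs.python.org/2/library/os.path.html – jsmiao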
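To make the suggestion above concrete, here is a minimal sketch comparing the two ways of building the path. It uses `ntpath`, which is what `os.path` resolves to on Windows, so the result is the same on any platform. The variable names `formatted` and `joined` are only illustrative. Note that, as the follow-up comments show, the same error persisted after this change, so the separator was probably not the actual cause.

```python
import ntpath  # the Windows flavour of os.path, importable on any platform

directory = "C:\\Users\\jmoorehead\\Desktop\\A2IReports\\"
name = "3.pdf"

# String formatting inserts "/" even though the directory already ends in "\".
formatted = "{}/{}".format(directory, name)

# join() only adds a separator when the left-hand part does not end in one.
joined = ntpath.join(directory, name)

print(formatted)  # C:\Users\jmoorehead\Desktop\A2IReports\/3.pdf
print(joined)     # C:\Users\jmoorehead\Desktop\A2IReports\3.pdf
```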

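Here is my new code -> [link] https://repl.it/Ilkx/0 It gives a new error message, which may be progress. The error is 'TypeError: expected str, bytes or os.PathLike object, not module'. I'm sure it's because I don't know what I'm doing. – user3487244

See below: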

import PyPDF2
import os

directory = 'C:\\Users\\jmoorehead\\Desktop\\A2IReports\\'
for file in os.listdir(directory):
    if not file.endswith(".pdf"):
        continue
    with open(os.path.join(directory,file), 'rb') as pdfFileObj: # Changes here
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        pageObj = pdfReader.getPage(0)
        if "There is no data to present that matches the selection criteria" in pageObj.extractText():
            print("{} was removed.".format(file))
            os.remove(file)

Source

2017-06-14 20:04:53 jsmiao

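This produces the error "FileNotFoundError: [WinError 2] The system cannot find the file specified: '3.pdf'" – user3487244

It looks like you need to give 'os.remove(file)' the full file path. Try 'os.remove(os.path.join(directory, file))' and see if it works. – jsmiao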
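The reason the full path is needed: `os.listdir` returns bare file names, and a bare name passed to `os.remove` is looked up relative to the current working directory, not the directory that was listed. A small sketch using a temporary directory and a placeholder file (both hypothetical stand-ins for the report folder and a real PDF):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    # Stand-in for a downloaded report; not a real PDF.
    with open(os.path.join(directory, "3.pdf"), "wb") as handle:
        handle.write(b"placeholder")

    names = os.listdir(directory)  # bare names only, e.g. ['3.pdf']
    full_paths = [os.path.join(directory, name) for name in names]

    # The full path points at the file regardless of the working directory.
    full_path_found = os.path.exists(full_paths[0])

    # Removing by full path succeeds; removing by bare name would only
    # work if the script happened to run from inside `directory`.
    os.remove(full_paths[0])
    still_there = os.path.exists(full_paths[0])
```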

Getting closer! "PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\jmoorehead\\Desktop\\A2IReports\\3.pdf'" – user3487244
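This last error most likely comes from calling `os.remove` while the script's own `with open(...)` block still holds the file open; on Windows a file that is in use cannot be deleted. One possible fix, sketched below, is to read the text inside the `with` block and delete only after the block has closed the file. The names `PHRASE`, `first_page_text`, `remove_empty_reports` and the `extract_text` parameter are made up for this sketch. The PyPDF2 calls are the same ones used in the thread; newer PyPDF2 releases renamed them, so adjust for your installed version.

```python
import os

PHRASE = "There is no data to present that matches the selection criteria"


def first_page_text(path):
    """Return the text of the first page; the file is closed on return."""
    import PyPDF2  # same legacy calls as in the thread

    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        return reader.getPage(0).extractText()


def remove_empty_reports(directory, extract_text=first_page_text):
    """Delete each PDF in `directory` whose first page contains PHRASE.

    `extract_text` is a hypothetical hook so the text extraction can be
    swapped out; it defaults to the PyPDF2-based helper above.
    """
    removed = []
    for name in os.listdir(directory):
        if not name.endswith(".pdf"):
            continue
        path = os.path.join(directory, name)
        # extract_text has already closed the file when it returns,
        # so the delete below does not hit an open handle.
        if PHRASE in extract_text(path):
            os.remove(path)  # full path, not the bare name
            print("{} was removed.".format(name))
            removed.append(name)
    return removed
```

It would be called as `remove_empty_reports('C:\\Users\\jmoorehead\\Desktop\\A2IReports\\')`. Returning the list of removed names is a design choice for this sketch, so the result can be logged or checked.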
