How do I use the multiprocessing module correctly in Python?

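I have 110 PDF files that I'm trying to extract images from. Once the images are extracted, I'd like to remove any duplicates and delete images smaller than 4KB. My code for doing that looks like this: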

import md5
import os
import shutil
import sys
from glob import glob
from multiprocessing import Pool
from subprocess import call

import pandas as pd
from PIL import Image

def extract_images_from_file(pdf_file):
    # Use the PDF's base name as the image root passed to pdfimages.
    file_name = os.path.splitext(os.path.basename(pdf_file))[0]
    call(["pdfimages", "-png", pdf_file, file_name])
    os.remove(pdf_file)

def dedup_images():
    os.mkdir("unique_images")
    md5_library = []
    images = glob("*.png")
    print "Deleting images smaller than 4KB and generating the MD5 hash values for all other images..."
    for image in images:
        if os.path.getsize(image) <= 4000:
            os.remove(image)
        else:
            # Hash the pixel data (first three channels of every pixel).
            m = md5.new()
            image_data = list(Image.open(image).getdata())
            image_string = "".join(["".join([str(tpl[0]), str(tpl[1]), str(tpl[2])]) for tpl in image_data])
            m.update(image_string)
            md5_library.append([image, m.digest()])
    headers = ['image_file', 'md5']
    dat = pd.DataFrame(md5_library, columns=headers).sort(['md5'])
    dat.drop_duplicates(subset="md5", inplace=True)

    print "Extracting the unique images."
    unique_images = dat.image_file.tolist()
    for image in unique_images:
        old_file = image
        new_file = "unique_images\\" + image
        shutil.copy(old_file, new_file)

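This process can take a while, so I've started dabbling in multithreading. Feel free to interpret that however you like, since I'll admit I don't know what I'm doing. I figured the image extraction would be easy to parallelize, but not the deduplication, since there is a lot of I/O going on and I don't know how to approach it. So here is my attempt at parallel processing: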

if __name__ == '__main__':
    filepath = sys.argv[1]
    folder_name = os.getcwd() + "\\all_images\\"
    if not os.path.exists(folder_name):
        os.mkdir(folder_name)
    pdfs = glob("*.pdf")
    print "Copying all PDFs to the images folder..."
    for pdf in pdfs:
        shutil.copy(pdf, ".\\all_images\\")
    os.chdir("all_images")
    pool = Pool(processes=8)
    print "Extracting images from PDFs..."
    pool.map(extract_images_from_file, pdfs)
    print "Extracting unique images into a new folder..."
    dedup_images()
    print "All images have been extracted and deduped."

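Everything seems to work fine while the images are being extracted, but then it all goes haywire. So here are my questions:

1) Am I setting up the parallel processes correctly?
2) Does it keep trying to use all 8 processors for dedup_images()?
3) Is there anything I'm missing and/or not doing correctly?

Thanks in advance!

EDIT: Here is what I mean by "haywire". The errors start with a bunch of lines like this: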

I/O Error: Couldn't open image If/iOl eE r'rSourb:p oICe/onOua l EdNrner'wot r Y:oo prCekon u Cliodmunan'gttey of1pi0e 
l2ne1 1i'4mS auogbiepl o2fefinrlaee e [email protected]'egSwmu abYipolor ekcn oaCm o Nupentwt y1Y -o18r16k11 8.C1po4nu gn3't4 
y7 5160120821143 3p4t7I 9/49O-8 88E78r81r.3op rnp:gt ' C 
3o-u3l6d0n.'ptn go'p 
en image file 'Ia/ ON eEwr rYoorr:k CCIoo/uuOln dtEnyr' rt1o 0ro2:p1 e1Cn4o uiolmidalng2'eft r m ' 
ai gpceoo emfn iapl teN e1'w-S 8uY6bo2pr.okpe nnCgao' u 
Nnetwy Y1o0r2k8 1C4o u3n4t7y9 918181881134 3p4t7 536-1306211.3p npgt' 
4-879.png' 
I/O Error: CoulId/nO' tE rorpoern: iCmoaugled nf'itl eo p'eub piomeangae fNielwe Y'oSrukb pCooeunnat yN e1w0 2Y8o1r 
4k 3C4o7u9n9t8y8 811032 1p1t4 3o-i3l622f pt 1-863.png'

Then it turns into several more readable lines like this:

I/O Error: Couldn't open image file 'pt 1-864.png' 
I/O Error: Couldn't open image file 'pt 1-865.png' 
I/O Error: Couldn't open image file 'pt 1-866.png' 
I/O Error: Couldn't open image file 'pt 1-867.png'

This repeats for a while, with the error text going back and forth between garbled and readable.

Finally, it gets to this point:

Deleting images smaller than 4KB and generating the MD5 hash values for all other images... 
Extracting unique images into a new folder...

This suggests that the code picks back up and carries on with the process. What could be going wrong?

Source

2015-10-02 brittenb

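This looks OK to me. Can you be more specific about what "going haywire" means? – strubbly

@strubbly I've added the error output above. – brittenb

"I've started dabbling in multithreading. Feel free to interpret that however you like, since I'll admit I don't know what I'm doing." You and everyone else who starts out with concurrency. –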

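Your code is basically fine.

The garbled text is all the different processes trying to write their own copy of the I/O Error message to the console at the same time, so the messages end up interleaved. The error message itself is generated by the pdfimages command, probably because two instances running at the same time conflict with each other, perhaps over temporary files, or because both use the same file name, or something similar.

Try using a different image root for each separate PDF file.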
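One way to give every PDF its own image root is sketched below. This is only an illustration written for Python 3, not the code the asker ended up with: the helper name `unique_image_root` and the use of `uuid4` for the suffix are assumptions, and any suffix that is unique per file would serve the same purpose. The `pdfimages` invocation is the one from the question.

```python
import os
import subprocess
import uuid


def unique_image_root(pdf_file):
    """Build an image root that is very unlikely to be reused by another PDF.

    Hypothetical helper: the base name of the PDF plus a short random suffix.
    """
    base = os.path.splitext(os.path.basename(pdf_file))[0]
    # uuid4 is an assumption; any per-file unique suffix would do.
    return "{}_{}".format(base, uuid.uuid4().hex[:8])


def extract_images_from_file(pdf_file):
    """Run pdfimages with a root unique to this PDF, then remove the PDF."""
    root = unique_image_root(pdf_file)
    # Same pdfimages call as in the question; only the root differs.
    subprocess.call(["pdfimages", "-png", pdf_file, root])
    os.remove(pdf_file)
    return root
```

Because the root is built inside the worker function, it can still be passed to `pool.map(extract_images_from_file, pdfs)` exactly as in the question.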

Source

2015-10-02 22:32:36 strubbly

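I accepted this answer because it effectively solved the problem I was having. I appended a random 3-character alphanumeric code to the root name and it completely eliminated the problem. Thanks! – brittenb

Cool. You're doing fine with multiprocessing. Just remember that the things you call need to be able to run alongside each other; they can conflict when they share resources such as directories or files. – strubbly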

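1) Yes. Pool.map takes a function of one argument and then a list, each element of which is passed as the argument to that function.
2) No. As you've written it, everything runs in the original process except the body of extract_images_from_file(). Also, I'd point out that you are using 8 processes, not 8 processors. If you happen to have an 8-core Intel CPU with hyperthreading enabled, you can run 16 processes at the same time.
3) This looks fine to me, except that if extract_images_from_file() raises an exception, it will blow up your entire Pool, which is probably not what you want. To prevent that, you can wrap that block in a try/except.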
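The try/except suggestion in the last point can be sketched as follows, in Python 3. When a worker raises, `Pool.map` re-raises the exception in the parent and the results of the other items are not returned, so the sketch catches the error inside the worker and reports it as data. The names `risky_task`, `safe_task` and `run` are made up for this illustration, and `risky_task` is only a stand-in for `extract_images_from_file()`.

```python
from multiprocessing import Pool


def risky_task(n):
    """Stand-in for extract_images_from_file(); fails on one input."""
    if n == 3:
        raise ValueError("cannot process item 3")
    return n * n


def safe_task(n):
    """Catch errors inside the worker so one failure does not abort map()."""
    try:
        return (n, risky_task(n), None)
    except Exception as exc:
        # Report the failure as data instead of letting it propagate.
        return (n, None, str(exc))


def run(items, processes=4):
    """Map safe_task over items using a pool of worker processes."""
    with Pool(processes=processes) as pool:
        return pool.map(safe_task, items)


if __name__ == "__main__":
    for item, result, error in run(range(6)):
        if error is None:
            print("item", item, "->", result)
        else:
            print("item", item, "failed:", error)
```

Only `safe_task` runs in the worker processes; `run` and the loop over the results stay in the original process, which matches the second point above.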

What is the nature of the "haywire" behavior you're dealing with? Can we see the exception text?

Source

2015-10-02 15:56:17 user2993124

I've added the error output to the question. – brittenb
