Captcha preprocessing and solving with OpenCV and pytesseract

I am trying to write Python code that preprocesses an image and recognizes its text with Tesseract-OCR. My goal is to reliably solve captchas of this form.

Original captcha and result of each preprocessing step

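Steps so far

Greyscale and threshold the image
Enhance the image with PIL
Convert to TIF and scale to >300px
Feed it to Tesseract-OCR (whitelisting all uppercase letters)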

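However, I still get a rather incorrect reading (EPQ M Q). What other preprocessing steps can I take to improve accuracy? My code and other captchas of a similar nature are attached below.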

Code

import cv2 
import pytesseract 
from PIL import Image, ImageEnhance, ImageFilter 
def binarize_image_using_opencv(captcha_path, binary_image_path='input-black-n-white.jpg'): 
    im_gray = cv2.imread(captcha_path, cv2.IMREAD_GRAYSCALE) 
    # Fixed global threshold at 85; cv2.threshold returns (threshold_used, image) 
    im_bw = cv2.threshold(im_gray, 85, 255, cv2.THRESH_BINARY)[1] 
    cv2.imwrite(binary_image_path, im_bw) 

    return binary_image_path 

def preprocess_image_using_opencv(captcha_path): 
    bin_image_path = binarize_image_using_opencv(captcha_path) 

    im_bin = Image.open(bin_image_path) 
    basewidth = 300 # in pixels 
    wpercent = (basewidth/float(im_bin.size[0])) 
    hsize = int((float(im_bin.size[1])*float(wpercent))) 
    big = im_bin.resize((basewidth, hsize), Image.NEAREST) 

    # save the bigger image as TIF (older Tesseract builds expected TIF; current ones also read PNG/JPEG) 
    tif_file = "input-NEAREST.tif" 
    big.save(tif_file) 

    return tif_file 

def get_captcha_text_from_captcha_image(captcha_path): 

    # Preprocess the image before OCR 
    tif_file = preprocess_image_using_opencv(captcha_path) 



get_captcha_text_from_captcha_image("path/captcha.png") 

im = Image.open("input-NEAREST.tif") # the second one 
im = im.filter(ImageFilter.MedianFilter()) 
enhancer = ImageEnhance.Contrast(im) 
im = enhancer.enhance(2) 
im = im.convert('1') 
im.save('captchafinal.tif') 
# Tesseract 4+ spells the page segmentation flag --psm; 3.x used -psm
text = pytesseract.image_to_string(
    Image.open('captchafinal.tif'),
    config="-c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ --psm 6")
print(text)

Source

2017-08-14 Simon Holloway

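The main problem comes from the different orientations of the letters, not from the preprocessing stage. You did the usual preprocessing, which should work well, but you could replace the fixed thresholding with adaptive thresholding to make your program more robust to differences in image brightness.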

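I ran into the same problem when using tesseract for car license plate recognition. From that experience I learned that tesseract is very sensitive to the orientation of the text in an image. Tesseract recognizes letters well when the text in the image is horizontal. The closer the text is to horizontal, the better the result.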

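So you have to create an algorithm that detects each letter in your captcha image, detects its orientation and rotates it to make it horizontal, then preprocesses it, then runs tesseract on this rotated, horizontal image and stores the output in your result string. Then move on to the next letter, repeat the same process, and append the tesseract output to your result string. You will also need an image transformation function to rotate your letters, and you have to think about how to find the corners of each detected letter. Maybe this project will help you, because they rotate the text in an image to improve tesseract's results.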

Source

2017-08-14 18:54:25

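Thanks for the ideas. Creating an algorithm to detect each letter seems like a very daunting task. Any suggestions for existing tools that do this? Do you think a deep learning technique would do better in my case?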

@SimonHolloway, to find each letter, try any of the image segmentation methods available in OpenCV, such as the watershed algorithm: http://docs.opencv.org/3.2.0/d3/db4/tutorial_py_watershed.html
