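Captcha preprocessing and solving with OpenCV and pytesseract
I am trying to write Python code that preprocesses captcha images and recognizes them with Tesseract-OCR. My goal is to reliably solve captchas of this form.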
Original captcha and result of each preprocessing step
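My current steps
Convert the image to grayscale and threshold it
Enhance the image with PIL
Convert to TIF and scale to 300 px or wider
Feed it to Tesseract-OCR (whitelisting all uppercase letters)
However, I still get a rather incorrect reading (EPQ M Q). What other preprocessing steps can I take to improve accuracy? My code and other captchas of a similar nature are attached below.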
similar captchas I want to solve
Code
import cv2
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter

def binarize_image_using_opencv(captcha_path, binary_image_path='input-black-n-white.jpg'):
    im_gray = cv2.imread(captcha_path, cv2.IMREAD_GRAYSCALE)
    (thresh, im_bw) = cv2.threshold(im_gray, 85, 255, cv2.THRESH_BINARY)
    # thresh is simply the 85 passed in above, so this second call reproduces
    # im_bw; tune the value to suit the captcha
    im_bw = cv2.threshold(im_gray, thresh, 255, cv2.THRESH_BINARY)[1]
    cv2.imwrite(binary_image_path, im_bw)
    return binary_image_path

def preprocess_image_using_opencv(captcha_path):
    bin_image_path = binarize_image_using_opencv(captcha_path)
    im_bin = Image.open(bin_image_path)
    basewidth = 300  # in pixels
    wpercent = (basewidth / float(im_bin.size[0]))
    hsize = int((float(im_bin.size[1]) * float(wpercent)))
    big = im_bin.resize((basewidth, hsize), Image.NEAREST)
    # tesseract-ocr only works with TIF so save the bigger image in that format
    tif_file = "input-NEAREST.tif"
    big.save(tif_file)
    return tif_file

def get_captcha_text_from_captcha_image(captcha_path):
    # Preprocess the image before OCR
    tif_file = preprocess_image_using_opencv(captcha_path)

get_captcha_text_from_captcha_image("path/captcha.png")

im = Image.open("input-NEAREST.tif") # the second one
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('captchafinal.tif')
# Tesseract 4 and later spell the page segmentation flag as --psm
text = pytesseract.image_to_string(
    Image.open('captchafinal.tif'),
    config="-c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ -psm 6")
print(text)
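Thanks for the ideas. Creating an algorithm to detect each letter seems like a pretty hard task. Any suggestions for existing tools that do this? Do you think a deep learning technique would do better in my case?
@SimonHolloway, to find each letter, try any of the image segmentation methods available in OpenCV, such as the watershed algorithm: http://docs.opencv.org/3.2.0/d3/db4/tutorial_py_watershed.html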