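How to detect text areas in a document image?

I have a document image, for example a scanned page from a newspaper or magazine. I want to remove all or most of the text and keep the pictures in the document. Does anyone know how to detect the text areas in the document? An example is below. Thanks in advance!

Example image: https://www.mathworks.com/matlabcentral/answers/uploaded_files/21044/6ce011abjw1elr8moiof7j20jg0w9jyt.jpg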

2014-11-16 kim

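The usual pattern for object recognition will work here: threshold, detect regions, filter the regions, then do whatever you need with the regions that remain.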

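Thresholding is easy here. The background is pure white (or can be filtered to be pure white), so anything greater than 0 in the inverted grayscale image is either text or a picture. Regions can then be detected within this thresholded binary image.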
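If the background is not pure white, one possible way to get the same kind of binary image is to estimate the background level and threshold on the distance from it. The sketch below is not part of the original answer. It assumes the background is the most frequent gray level on the page; `bgLevel` and `tol` are hypothetical names and values, and a small synthetic page stands in for a real scan so the snippet runs on its own.

```matlab
% Synthetic stand-in for a scanned page: mid-gray background (level 200)
% with a small dark glyph and a larger bright block.
% For a real page, use gray = rgb2gray(img) instead.
gray = 200 * ones(60, 80, 'uint8');
gray(10:14, 10:12) = 30;      % small dark "letter"
gray(30:50, 40:70) = 255;     % bright block on the gray background

% Assumption: the background is the single most frequent gray level.
bgLevel = double(mode(gray(:)));

% Foreground = pixels that differ from the background by more than a
% tolerance (hypothetical value; tune it for real scans).
tol = 25;
threshImg = abs(double(gray) - bgLevel) > tol;
```

The resulting `threshImg` plays the same role as the thresholded binary image described above, whether the content is darker or lighter than the background.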

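To filter the regions, we only need to identify what makes text different from pictures. Each letter is its own region, so text regions are small, while pictures are comparatively large regions. Assuming no picture anywhere on the page is about the size of a single letter, filtering by region area with an appropriate threshold will pull out all the pictures and drop all the text. If some pictures are that small, other filtering criteria (saturation, hue variance, ...) can be used.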
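The saturation criterion mentioned above could look roughly like the sketch below. It is an illustration under stated assumptions, not the answer's own code: the page is synthetic, `MinSat` is a hypothetical threshold, and `regionprops` requires the Image Processing Toolbox. Passing the saturation channel as the intensity image makes `MeanIntensity` the mean saturation of each region.

```matlab
% Synthetic RGB page: white background, a small gray "letter" and a
% larger reddish "photo". Replace with the real page image.
img = 255 * ones(60, 80, 3, 'uint8');
img(10:14, 10:12, :) = 40;        % gray glyph, saturation 0
img(30:50, 40:70, 1) = 200;       % reddish block, saturation 0.8
img(30:50, 40:70, 2) = 40;
img(30:50, 40:70, 3) = 40;

mask = any(img < 255, 3);         % everything that is not pure white
hsvImg = rgb2hsv(img);            % channels are doubles in the range 0..1
sat = hsvImg(:, :, 2);            % saturation channel

% Measure area and mean saturation of every connected region.
regs = regionprops(mask, sat, 'Area', 'MeanIntensity');

MinSat = 0.2;                     % hypothetical threshold, tune per scan
isPicture = [regs.MeanIntensity] > MinSat;
```

Black text on white paper has almost no saturation, so a color picture can pass this test even when it is too small to pass an area threshold. A grayscale picture would fail it, so in practice the two criteria would probably be combined.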

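Once the regions have been filtered by the area and saturation criteria, a new image can be created by copying the pixels of the original image that fall within the bounding boxes of the surviving regions into a blank image.

MATLAB implementation: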

%%%%%%%%%%%% 
% Set these values depending on your input image 

img = imread('https://www.mathworks.com/matlabcentral/answers/uploaded_files/21044/6ce011abjw1elr8moiof7j20jg0w9jyt.jpg'); 

MinArea = 2000; % Minimum area to consider, in pixels 
%%%%%%%%% 
% End User inputs 

gsImg = 255 - rgb2gray(img); % convert to grayscale (and invert 'cause that's how I think) 
threshImg = gsImg > graythresh(gsImg)*max(gsImg(:)); % Threshold automatically 

% Detect connected regions and measure their area and bounding box 
regs = regionprops(threshImg, 'BoundingBox', 'Area'); 

% Keep only the regions that pass the area threshold 
regKeep = false(length(regs), 1); 
for k = 1:length(regs) 

    regKeep(k) = (regs(k).Area > MinArea); 

end 

regs(~regKeep) = []; % Delete those regions that don't pass qualifications for image 

% Make a new blank image to hold the passed regions 
newImg = 255*ones(size(img), 'uint8'); 

for k = 1:length(regs) 

    boxHere = regs(k).BoundingBox; % [x y width height]; x and y lie on pixel edges (e.g. 0.5) 
    % Convert to integer pixel indices. Using ceil on the corner avoids an 
    % index of 0 for regions that touch the top or left image border. 
    rows = ceil(boxHere(2)):(ceil(boxHere(2)) + boxHere(4) - 1); 
    cols = ceil(boxHere(1)):(ceil(boxHere(1)) + boxHere(3) - 1); 
    % Insert pixels within bounding box from original image into the new 
    % image 
    newImg(rows, cols, :) = img(rows, cols, :); 

end 

% Display 
figure() 
image(newImg);

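As you can see in the picture linked below, it does what is needed: everything except the pictures and the header has been removed. The good news is that, if you are working on newspapers starting from the front page, this works well for both color and grayscale images.

Result: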

http://imgur.com/vEmpavY,dd172fr#1


2014-11-17 00:17:19 Staus

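Thank you for your answer! I – kim

Thank you for your answer! This is quite impressive! As you said, it works for a pure white background. What if the background and text are not white and black? For example, a case where the background is green and the text is white. Any information and code would be highly appreciated. I happened to come across a case where your code fails. May I email it to you so you can take a look? The image is copyrighted, so I did not post it here. May I have your email address? Thank you very much! :) – kim

Is there any way to handle other background colors? Thanks! – kim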
