High-resolution images from a PDF

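I'm working on a project in which I need to extract one TIFF per page from multi-page PDFs. The PDFs contain only images, one image per page (I believe they were produced by some kind of copier/scanner, but I have not confirmed this). The TIFFs are then used to create other derivative versions of the documents, so the higher the resolution, the better.

I've found two recipes. Each has helpful aspects, but neither is ideal. I'm hoping someone can help me tweak one of them, or suggest a third option.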

Recipe 1, pdfimages and ImageMagick:

First do:

$ pdfimages $MY_PDF.pdf foo

This produces several .pbm files (named foo-000.pbm, foo-001.pbm, etc.).

Then, for each *.pbm, do:

$ convert $each -resize 3200x3200\> -quality 100 $new_name.tif

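Pro: the extracted images are a healthy 3300+ pixels on the long dimension (the resize, which only shrinks images larger than 3200x3200, is just there to normalize everything).

Con: the orientation of the pages is lost; they come out rotated in different directions (they follow a logical pattern, so they probably reflect the orientation in which they were fed into the scanner?).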
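The two steps of recipe 1 could be combined into one batch function, roughly as sketched here. This assumes bitonal scans (so pdfimages writes .pbm files, as described above); the function names tiff_name_for and extract_pages are made up for illustration, and the sketch does nothing about the orientation problem.

```shell
#!/bin/bash
# Sketch only: batch form of recipe 1. The function names are illustrative;
# they are not part of pdfimages or ImageMagick.

# Map an extracted image name to its TIFF name: foo-000.pbm -> foo-000.tif
tiff_name_for() {
    printf '%s.tif\n' "${1%.pbm}"
}

# Extract every page image from a PDF, then normalize each one to a TIFF.
# Usage: extract_pages input.pdf output_prefix
extract_pages() {
    local pdf="$1" prefix="$2" pbm
    pdfimages "$pdf" "$prefix" || return 1
    for pbm in "$prefix"-*.pbm; do
        [ -e "$pbm" ] || continue   # the glob matched nothing
        # '3200x3200>' only shrinks images larger than 3200 px on a side.
        convert "$pbm" -resize '3200x3200>' -quality 100 "$(tiff_name_for "$pbm")"
    done
}
```

Calling extract_pages my.pdf foo would then leave foo-000.tif, foo-001.tif, and so on next to the extracted .pbm files.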

Recipe 2, ImageMagick alone:

convert +adjoin $MY_PDF.pdf pages.tif

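This gives me one TIFF per page (pages-0.tif, pages-1.tif, etc.).

Pro: the orientation is preserved!

Con: the long dimension of the resulting files is < 800 pixels, which is too small to be useful, and it looks as though some compression has been applied.

How can I get rid of the scaling of the image stream in the PDF but keep the orientation? Is there some more magick in ImageMagick that I'm missing? Something else entirely?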

2012-01-11 JStroop

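Are you open to using a non-free solution? – BitBank 2012-01-12 00:35:16

Possibly. It would need to have an API (no GUI) and be reasonably easy to integrate; I'm dealing with tens of thousands of documents. What do you have in mind? – JStroop 2012-01-12 03:03:23

Write to me with the details and I'll see if I can help ([email protected]). – BitBank 2012-01-12 03:28:57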

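Sorry for the noise on this old topic, but Google brought me here as one of the top results and others may need this too, so I thought I'd post the solution to the original poster's question that I found here: http://robfelty.com/2008/03/11/convert-pdf-to-png-with-imagemagick

In short: you have to tell ImageMagick at what density it should scan the PDF.

So convert -density 600x600 foo.pdf foo.png tells ImageMagick to treat the PDF as having a resolution of 600 dpi, and thus to output a larger PNG. In my case, the resulting foo.png was 5000x6600 px in size. You can optionally add -resize 3000x3000, or whatever size you need, and it will be scaled down.

Note that as long as your PDF contains only vector images or text, the density can be set as high as you need. If the PDF contains rasterized images, it won't look good if you set the density higher than the dpi of those images, surprise! :)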
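The relationship between density and output size is simple arithmetic, sketched here. PDF page sizes are measured in units of 1/72 inch, and ImageMagick uses a density of 72 dpi when none is given. The US Letter page size (612x792 units) is an assumption about the scans, and the function names are made up for illustration; ImageMagick's own rounding may differ by a pixel.

```shell
#!/bin/bash
# Sketch only: relate PDF page size, -density, and output pixels.
# Assumes page sizes in PDF units of 1/72 inch; function names are illustrative.

# Approximate pixel length of a page edge rendered at a given density.
# Usage: pixels_at_density POINTS DPI
pixels_at_density() {
    echo $(( ($1 * $2 + 36) / 72 ))   # integer division, rounded to nearest
}

# Density needed for a page edge to come out at roughly TARGET pixels.
# Usage: density_for_pixels POINTS TARGET_PIXELS
density_for_pixels() {
    echo $(( ($2 * 72 + $1 / 2) / $1 ))
}

pixels_at_density 792 72      # long edge of US Letter at 72 dpi: 792
pixels_at_density 792 600     # same edge at -density 600: 6600
density_for_pixels 792 3300   # density for a 3300 px long edge: 300
```

Under that page-size assumption, a 72 dpi default would be consistent with the < 800 pixel output described in the question, and a density of about 300 would match the roughly 3300 px images that pdfimages extracts.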

Chris

2013-01-02 09:22:06 Betagan

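Awesome, thanks! I never did get an answer to this. For completeness, here is my final recipe for making single-page TIFFs, normalizing the size, and converting to grayscale: 'convert +adjoin -density 300x300 -depth 8 -resize 3200x3200\> in.pdf out_prefix.tif' – JStroop 2013-01-02 14:36:35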

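I'd like to share my solution. It may not work for everyone, but since no other approach has turned up, it may help others. I ended up going with the first option from my question, which was to use pdfimages to get large images that are rotated every which way. I then found a way to use OCR and word counts to guess the orientation, which took me from an (estimated) 25% rotated correctly to above 90%.

The flow is as follows:

Use pdfimages (apt-get install poppler-utils) to get a set of pbm files (not shown below).
For each file:
1. Make four versions, rotated 0, 90, 180, and 270 degrees (I refer to them in my code as "north", "east", "south", and "west").
2. OCR each one. The two with the lowest word counts are likely to be the right-side-up and upside-down versions. This was over 99% accurate in the set of images I have processed so far.
3. For the two with the lowest word counts, run the OCR output through a spell checker. The file with the fewest spelling errors (i.e. the most recognizable words) is likely to be the correct one. For my set this was about 93% accurate (up from 25%), based on a sample of 500.

YMMV. My files are bitonal and highly textual. The source images average 3300 px on the long side. I can't speak for grayscale or color, or for files with a lot of images. Most of my source PDFs are bad scans of old photocopies, so the accuracy might be even better with cleaner files. Using -despeckle during the rotation made no difference and slowed things down considerably (~5x). I chose ocrad for speed rather than accuracy, since I only need rough numbers and throw the OCR away. Re: performance, my nothing-special Linux desktop machine can run the whole script at about 2-3 files per second.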

Here is a simple bash script that implements this:

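#!/bin/bash
# Rotates a .pbm file in place to its most likely upright orientation.
# Usage: pass a .pbm file as the only argument.
file=$1

# Private scratch directory for the rotated copies and OCR output.
TMP=$(mktemp -d /tmp/rotation-calc.XXXXXX) || exit 1

# Dependencies (Debian/Ubuntu package names):
# convert: sudo apt-get install imagemagick
# ocrad:   sudo apt-get install ocrad
# aspell:  sudo apt-get install aspell aspell-en
ASPELL="/usr/bin/aspell"
AWK="/usr/bin/awk"
BASENAME="/usr/bin/basename"
CONVERT="/usr/bin/convert"
DIRNAME="/usr/bin/dirname"
HEAD="/usr/bin/head"
OCRAD="/usr/bin/ocrad"
SORT="/usr/bin/sort"
WC="/usr/bin/wc"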

# Make copies in all four orientations (the source file is north; copy it to
# keep the naming uniform).
file_name=$($BASENAME "$file")
north_file="$TMP/$file_name-north"
east_file="$TMP/$file_name-east"
south_file="$TMP/$file_name-south"
west_file="$TMP/$file_name-west"

cp "$file" "$north_file"
$CONVERT "$file" -rotate 90 "$east_file"
$CONVERT "$file" -rotate 180 "$south_file"
$CONVERT "$file" -rotate 270 "$west_file"

# OCR each copy (the text file is the image path with ".txt" appended).
north_text="$north_file.txt"
east_text="$east_file.txt"
south_text="$south_file.txt"
west_text="$west_file.txt"

$OCRAD -f -F utf8 "$north_file" -o "$north_text"
$OCRAD -f -F utf8 "$east_file" -o "$east_text"
$OCRAD -f -F utf8 "$south_file" -o "$south_text"
$OCRAD -f -F utf8 "$west_file" -o "$west_text"

# Get the word count for each text file (fewest 'words' == least whitespace
# junk resulting from vertical lines of text that should be horizontal).
# Each row is: <word count> <text file> <image file>.
wc_table="$TMP/wc_table"
echo "$($WC -w "$north_text") $north_file" > "$wc_table"
echo "$($WC -w "$east_text") $east_file" >> "$wc_table"
echo "$($WC -w "$south_text") $south_file" >> "$wc_table"
echo "$($WC -w "$west_text") $west_file" >> "$wc_table"

# Take the two lowest counts; these are likely right side up and upside down,
# but generally too close to call beyond that.
bottom_two_wc_table="$TMP/bottom_two_wc_table"
$SORT -n "$wc_table" | $HEAD -n 2 > "$bottom_two_wc_table"

# Spellcheck. The lowest number of misspelled words is most likely the
# correct orientation. Each row gains a leading <misspelled count> column.
# (Field splitting assumes the paths contain no whitespace.)
misspelled_words_table="$TMP/misspelled_words_table"
while read -r record; do
    txt=$(echo "$record" | $AWK '{ print $2 }')
    misspelled_word_count=$($ASPELL -l en list < "$txt" | $WC -w)
    echo "$misspelled_word_count $record" >> "$misspelled_words_table"
done < "$bottom_two_wc_table"

# Sort, then overwrite the input file with the winning rotation.
winner=$($SORT -n "$misspelled_words_table" | $HEAD -n 1)
rotated_file=$(echo "$winner" | $AWK '{ print $4 }')

mv "$rotated_file" "$file"

# Clean up.
if [ -d "$TMP" ]; then
    rm -r "$TMP"
fi
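
Since the script handles a single file, a small driver is needed to process a whole directory of pdfimages output. The following is a minimal sketch of one, not part of the script above: rotate_all is a hypothetical name, and it assumes the script has been saved somewhere executable and that the .pbm files sit in one directory. It runs the files one after another, which also avoids any clash over the script's scratch directory.

```shell
#!/bin/bash
# Sketch only: apply a per-file rotation script to every .pbm in a directory.
# "rotate_all" and the script path passed to it are hypothetical names.

# Usage: rotate_all /path/to/rotation-script /path/to/pbm-dir
rotate_all() {
    local script="$1" dir="$2" pbm failures=0
    for pbm in "$dir"/*.pbm; do
        [ -e "$pbm" ] || continue          # directory had no .pbm files
        # One file at a time: the script rewrites its argument in place.
        if ! "$script" "$pbm"; then
            echo "rotation failed: $pbm" >&2
            failures=$((failures + 1))
        fi
    done
    return $(( failures > 0 ))             # non-zero if any file failed
}
```

Failures are reported on stderr and do not stop the loop, so one unreadable scan does not block the rest of a large batch.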

2012-03-19 21:31:43 JStroop
