2013-12-11

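Cannot extract data from an embedded PDF (Ruby)

I am trying to extract text from a PDF that is embedded in a web page. I tried the pdf-reader gem, but I get a parse error: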

`find_first_xref_offset': PDF does not contain EOF marker (PDF::Reader::MalformedPDFError) 
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader/xref.rb:99:in `load_offsets' 
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader/xref.rb:60:in `initialize' 
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader/object_hash.rb:44:in `new' 
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader/object_hash.rb:44:in `initialize' 
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader.rb:117:in `new' 
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader.rb:117:in `initialize' 
from role.rb:5:in `new' 
from role.rb:5:in `<main>' 

this is the file

Does anyone know how I can solve this? Is there a better gem for this job?

Thanks

Answers


I found the following while Googling your problem. It may give you an approach you can use to solve it:

################################################################# 
# Extract text from a PDF file 
# This scraper takes about 2 minutes to run and no output 
# appears until the end. 
################################################################# 
# This scraper uses the pdf-reader gem. 
# Documentation is at https://github.com/yob/pdf-reader#readme 
# If you have problems you can ask for help at http://groups.google.com/group/pdf-reader 
require 'pdf-reader' 
require 'open-uri' 

########## This section contains the callback code that processes the PDF file contents ###### 
class PageTextReceiver
  attr_accessor :content, :page_counter

  def initialize
    @content = []
    @page_counter = 0
  end

  # Called when page parsing starts
  def begin_page(arg = nil)
    @page_counter += 1
    @content << ""
  end

  # record text that is drawn on the page
  def show_text(string, *params)
    @content.last << string
  end

  # there's a few text callbacks, so make sure we process them all
  alias :super_show_text :show_text
  alias :move_to_next_line_and_show_text :show_text
  alias :set_spacing_next_line_show_text :show_text

  # this final text callback takes slightly different arguments
  def show_text_with_positioning(*params)
    params = params.first
    params.each { |str| show_text(str) if str.kind_of?(String) }
  end
end
################ End of TextReceiver ############################# 

# If you don't have two minutes to wait you might prefer this 
# smaller pdf 
# pdf = open('http://www.hmrc.gov.uk/factsheets/import-export.pdf') 
# pdf = open('http://www.madingley.org/uploaded/Hansard_08.07.2010.pdf') 
pdf = open('http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf') 

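####### Instantiate the receiver and the reader 
# Note: PDF::Reader.new without arguments followed by #parse is the older
# callback API, deprecated since pdf-reader 1.0; newer code usually calls
# PDF::Reader.new(io) and then page.walk(receiver) for each page.
receiver = PageTextReceiver.new 
pdf_reader = PDF::Reader.new 
####### Now you just need to make the call to parse... 
pdf_reader.parse(pdf, receiver) 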
####### ...and do whatever you want with the text. 
####### This just outputs it. 
receiver.content.each {|r| puts r.strip} 
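
The error message says the parser found no EOF marker, so it may be worth checking whether the bytes you downloaded are a PDF at all; a URL that displays an embedded PDF can return an HTML page instead. Below is a minimal sketch using only the standard library. `looks_like_pdf?` is a hypothetical helper name, not part of pdf-reader, and the 1024-byte windows are my assumption about where the `%PDF-` header and `%%EOF` trailer usually sit.

```ruby
# Hypothetical helper (not part of pdf-reader): a rough check that a byte
# string looks like a complete PDF before it is handed to a parser.
def looks_like_pdf?(bytes)
  data = bytes.to_s.b  # binary copy, so byte offsets are reliable

  # Assumption: the header appears within the first 1024 bytes and the
  # trailer marker within the last 1024 bytes.
  head = data.byteslice(0, 1024) || ""
  tail = data.bytesize > 1024 ? data.byteslice(-1024, 1024) : data

  head.include?("%PDF-") && tail.include?("%%EOF")
end

html = "<html><body><embed src=\"doc.pdf\"></body></html>"
pdf  = "%PDF-1.4\n1 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n"

puts looks_like_pdf?(html)  # prints false
puts looks_like_pdf?(pdf)   # prints true
```

If the check fails for the file you saved locally (for example `looks_like_pdf?(File.binread('saved.pdf'))`, where `saved.pdf` is a placeholder name), the problem is probably in the download step and not in the gem.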

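I still have the same problem. I tried both accessing the file directly via its URL and downloading the PDF and reading it locally. [This is the file](http://www.tesoreria.cl/portal/portlets/imprimirAR/printAR.do?rutrol=32807514010&t=C&formulario=30&folio=3287514413&vcto=2013-11-30) – felipecamposclarke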
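
If the saved file turns out to be HTML and not a PDF, the real document may be referenced from an `<embed>`, `<iframe>` or `<object>` tag on that page. Here is a rough sketch of how those targets could be collected, assuming the page uses one of those tags with a quoted attribute; whether the page linked above does so is an assumption. `embedded_sources` is a hypothetical helper, and the regular expressions are a shortcut, not a real HTML parser.

```ruby
require 'cgi'
require 'uri'

# Hypothetical helper: collect the targets of <embed>/<iframe> (src) and
# <object> (data) tags and resolve them against the URL of the page.
def embedded_sources(html, page_url)
  patterns = [
    /<(?:embed|iframe)\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']/i,
    /<object\b[^>]*?\bdata\s*=\s*["']([^"']+)["']/i
  ]

  targets = patterns.flat_map { |pattern| html.scan(pattern).flatten }

  # Decode entities such as &amp; and turn relative paths into absolute URLs.
  targets.map { |target| URI.join(page_url, CGI.unescapeHTML(target)).to_s }.uniq
end

page = '<iframe width="100" src="/docs/a.pdf?id=1&amp;t=C"></iframe>'
p embedded_sources(page, 'http://example.com/portal/view.do?x=1')
```

Each URL it returns can then be downloaded and passed to the reader in place of the page URL. Note that `URI.join` raises an error for targets containing characters such as spaces, so real pages may need extra cleanup.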