Python: urlretrieve PDF download

I am using urllib's urlretrieve() function in Python to try to fetch some PDFs from websites. It has (at least for me) stopped working and is downloading corrupted data (15 KB instead of 164 KB).

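I have tested this with several PDFs, all without success (e.g. random.pdf). I can't seem to get it to work, and I need to be able to download PDFs for the project I'm working on.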

Here is an example of the kind of code I am using to download the PDFs (and to extract their text with pdftotext.exe):

def get_html(url): # gets html of page from Internet
    import os
    import urllib2
    import urllib
    from subprocess import call
    f_name = url.split('/')[-2] # get file name (url must end with '/')
    try:
        if f_name.split('.')[-1] == 'pdf': # file type
            urllib.urlretrieve(url, os.getcwd() + '\\' + f_name)
            call([os.getcwd() + '\\pdftotext.exe', os.getcwd() + '\\' + f_name]) # use xpdf to output .txt file
            return open(os.getcwd() + '\\' + f_name.split('.')[0] + '.txt').read()
        else:
            return urllib2.urlopen(url).read()
    except:
        print 'bad link: ' + url
        return ""

I'm a novice programmer, so any input would be great! Thanks

2013-02-03 hisroar

I would suggest trying requests. It is a really nice library that hides all of the implementation behind a simple API.

>>> import requests 
>>> req = requests.get("http://www.mathworks.com/moler/random.pdf") 
>>> len(req.content) 
167633 
>>> req.headers 
{'content-length': '167633', 'accept-ranges': 'bytes', 'server': 'Apache/2.2.3 (Red Hat) mod_jk/1.2.31 PHP/5.3.13 Phusion_Passenger/3.0.9 mod_perl/2.0.4 Perl/v5.8.8', 'last-modified': 'Fri, 15 Feb 2008 17:11:12 GMT', 'connection': 'keep-alive', 'etag': '"30863b-28ed1-446357e3d4c00"', 'date': 'Sun, 03 Feb 2013 05:53:21 GMT', 'content-type': 'application/pdf'}
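
The headers above include a content type of application/pdf, which suggests a simple guard before saving anything. The following is only a sketch: the function name save_pdf_response is made up for illustration, it is written for Python 3, and it assumes the object passed in behaves like a requests response (with status_code, headers and content attributes).

```python
def save_pdf_response(resp, path):
    """Write resp.content to path, but only if the response looks like a PDF.

    `resp` is assumed to behave like a requests.Response object: it needs
    .status_code, .headers and .content. The function name is hypothetical.
    """
    if resp.status_code != 200:
        raise ValueError("unexpected HTTP status: %s" % resp.status_code)

    content_type = resp.headers.get("content-type", "")
    if not content_type.lower().startswith("application/pdf"):
        raise ValueError("expected a PDF, got content type %r" % content_type)

    # PDF data is binary, so the file has to be opened in binary mode.
    with open(path, "wb") as out:
        out.write(resp.content)
    return len(resp.content)
```

With requests installed, this could be called as save_pdf_response(requests.get(url), "random.pdf"). Raising an error instead of writing the file makes a wrong URL show up immediately rather than as a damaged PDF later on.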

By the way, the reason you are only getting a 15 KB download is that your URL is wrong. It should be

http://www.mathworks.com/moler/random.pdf

but you are requesting

http://www.mathworks.com/moler/random.pdf/ 

>>> import requests 
>>> c = requests.get("http://www.mathworks.com/moler/random.pdf/") 
>>> len(c.content) 
14390
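
The trailing slash is presumably only there because get_html takes the file name from url.split('/')[-2]. One way to avoid needing it is to take the file name from the URL's path instead. This is a minimal sketch for Python 3 (in Python 2 the same function lives in the urlparse module); the helper names filename_from_url and is_pdf_url are made up for illustration.

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last component of the URL's path, ignoring any query string."""
    path = urlparse(url).path
    return posixpath.basename(path)


def is_pdf_url(url):
    """Return True if the URL's file name ends in .pdf (case-insensitive)."""
    return filename_from_url(url).lower().endswith(".pdf")
```

Here filename_from_url("http://www.mathworks.com/moler/random.pdf") gives "random.pdf" with no trailing slash on the URL. A URL that does end in a slash yields an empty file name, so it is not treated as a PDF.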

2013-02-03 05:54:32 sberry

Wow, that seems really strange. Thanks for telling me about requests. – hisroar

To write the file to disk:

with open("out.pdf", "wb") as myfile:  # binary mode, since the PDF content is bytes
    myfile.write(req.content)
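
A quick sanity check after saving is to confirm that the file starts with the PDF signature, since a PDF file normally begins with the bytes %PDF-. This is a minimal sketch for Python 3, and the helper name looks_like_pdf is made up for illustration. It does not validate the rest of the file.

```python
def looks_like_pdf(path):
    """Return True if the file at path starts with the PDF signature."""
    # Read in binary mode so the bytes are compared exactly as stored.
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"
```

If looks_like_pdf("out.pdf") returns False, the saved data is probably not a PDF at all (for example an HTML page returned for a wrong URL), or the bytes were altered while being written.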

2015-06-27 19:08:45 user1767754

I tried this, and what I get is an unreadable .pdf. Any ideas? –

Maybe it's a bit late, but you can try this: just write the content to a new file and read it with textract, since without that I got unwanted text containing '#$'.

import requests 
import textract 
url = "The url which downloads the file" 
response = requests.get(url) 
with open('./document.pdf', 'wb') as fw: 
    fw.write(response.content) 
text = textract.process("./document.pdf") 
print('Result: ', text)

2017-05-30 10:21:19 Arjunsingh
