2013-04-18 89 views
1

I am using Python 3.3.1. I have written a function called download_file() that downloads files and saves them to disk. Why does it not work properly for text files?

#!/usr/bin/python3 
# -*- coding: utf8 -*- 

import datetime 
import os 
import urllib.error 
import urllib.request 


def download_file(*urls, download_location=os.getcwd(), debugging=False): 
    """Downloads the files provided as multiple url arguments. 

    Provide the url for files to be downloaded as strings. Separate the 
    files to be downloaded by a comma. 

    The function would download the files and save it in the folder 
    provided as keyword-argument for download_location. If 
    download_location is not provided, then the file would be saved in 
    the current working directory. Folder for download_location would be 
    created if it doesn't already exist. Do not worry about trailing 
    slash at the end for download_location. The code would take care of
    it for you.

    If the download encounters an error it would alert about it and 
    provide the information about the Error Code and Error Reason (if 
    received from the server). 

    Normal Usage: 
    >>> download_file('http://localhost/index.html', 
         'http://localhost/info.php') 
    >>> download_file('http://localhost/index.html', 
         'http://localhost/info.php', 
         download_location='/home/aditya/Download/test') 
    >>> download_file('http://localhost/index.html', 
         'http://localhost/info.php', 
         download_location='/home/aditya/Download/test/') 

    In Debug Mode, files are not downloaded, neither there is any 
    attempt to establish the connection with the server. It just prints 
    out the filename and its url that would have been attempted to be 
    downloaded in Normal Mode. 

    By Default, Debug Mode is inactive. In order to activate it, we 
    need to supply a keyword-argument as 'debugging=True', like: 
    >>> download_file('http://localhost/index.html', 
         'http://localhost/info.php', 
         debugging=True) 
    >>> download_file('http://localhost/index.html', 
         'http://localhost/info.php', 
         download_location='/home/aditya/Download/test', 
         debugging=True) 

    """ 
    # Append a trailing slash at the end of download_location if not 
    # already present 
    if download_location[-1] != '/':
        download_location = download_location + '/'

    # Create the folder for download_location if not already present 
    os.makedirs(download_location, exist_ok=True) 

    # Other variables 
    time_format = '%Y-%b-%d %H:%M:%S' # '2000-Jan-01 22:10:00' 

    # "Request Headers" information for the file to be downloaded 
    accept = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' 
    accept_encoding = 'gzip, deflate' 
    accept_language = 'en-US,en;q=0.5' 
    connection = 'keep-alive' 
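    # Adjacent string literals are joined without the stray indentation
    # that a backslash continuation inside the quotes would embed.
    user_agent = ('Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) '
                  'Gecko/20100101 Firefox/20.0')
    headers = {'Accept': accept,
               'Accept-Encoding': accept_encoding,
               'Accept-Language': accept_language,
               'Connection': connection,
               'User-Agent': user_agent,
               }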

    # Loop through all the files to be downloaded 
    for url in urls:
        filename = os.path.basename(url)
        if not debugging:
            try:
                request_sent = urllib.request.Request(url, None, headers)
                response_received = urllib.request.urlopen(request_sent)
            except urllib.error.URLError as error_encountered:
                print(datetime.datetime.now().strftime(time_format),
                      ':', filename, '- The file could not be downloaded.')
                if hasattr(error_encountered, 'code'):
                    print(' ' * 22, 'Error Code -', error_encountered.code)
                if hasattr(error_encountered, 'reason'):
                    print(' ' * 22, 'Reason -', error_encountered.reason)
            else:
                read_response = response_received.read()
                output_file = download_location + filename
                with open(output_file, 'wb') as downloaded_file:
                    downloaded_file.write(read_response)
                print(datetime.datetime.now().strftime(time_format),
                      ':', filename, '- Downloaded successfully.')
        else:
            print(datetime.datetime.now().strftime(time_format),
                  ': Debugging :', filename, 'would be downloaded from :\n',
                  ' ' * 21, url)

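This function works for downloading PDFs, images and other formats, but it has trouble with text documents such as HTML files. I suspect the problem has something to do with line endings at this line: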

with open(output_file, 'wb') as downloaded_file: 

So I tried opening the file in wt mode, and also in plain w mode. Neither solves the problem.

The other possible cause was encoding, so I also included this as the second line of the script:

# -*- coding: utf8 -*- 

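But it still does not work. What could be the problem, and how can I make the function work for both text and binary files?

Example of what does not work:

>>> download_file("http://docs.python.org/3/tutorial/index.html")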

When I open the downloaded file in gedit, it is displayed as:

[image: in gedit]

Similarly, when opened in Firefox:

[image: in firefox]

+1

What exactly is the problem/error? –

+0

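@StephaneRolland: It does not give any error. But when I open the document in a text editor, the editor reports a problem with the encoding. I will upload the images in a moment. – Aditya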

+0

Which text editor? –

Answers

2

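The file you are downloading was sent with gzip encoding - you can see this by running zcat index.html, which displays the downloaded file correctly. In your code, you will probably want to add something like: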

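import zlib  # add alongside the other imports at the top of the script

if response_received.headers.get('Content-Encoding') == 'gzip':
    # 16 + zlib.MAX_WBITS makes zlib expect a gzip header and trailer
    read_response = zlib.decompress(read_response, 16 + zlib.MAX_WBITS)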
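
To try the decompression step without touching the network, here is a minimal sketch. The helper name decode_body and the sample HTML bytes are made up for illustration; they are not part of the question's code. It also handles deflate, since the request headers advertise it, on the assumption that a server may send either zlib-wrapped or raw deflate data.

```python
import gzip
import zlib


def decode_body(body, content_encoding):
    """Return body decompressed according to a Content-Encoding value.

    decode_body is a hypothetical helper, not part of urllib.
    """
    encoding = (content_encoding or '').strip().lower()
    if encoding == 'gzip':
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    if encoding == 'deflate':
        try:
            # Most servers send zlib-wrapped deflate data.
            return zlib.decompress(body)
        except zlib.error:
            # Some send a raw deflate stream with no zlib header.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # No (or unknown) encoding: return the bytes unchanged.
    return body


# Simulate what a server would send for a gzip-encoded HTML page.
original = b'<html><body>Hello</body></html>'
compressed = gzip.compress(original)

print(decode_body(compressed, 'gzip') == original)  # round trip succeeds
print(decode_body(original, None) == original)      # unencoded body untouched
```

In download_file(), the call would presumably look like read_response = decode_body(read_response, response_received.headers.get('Content-Encoding')), placed before the bytes are written to disk in 'wb' mode.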

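Edit:

Well, I cannot say why it works on Windows (unfortunately I do not have a Windows machine to test on), but if you post a dump of the response (i.e. convert the response object to a string), that might provide some insight. Presumably the server chose not to send the page gzip-encoded in that case, but given that this code is quite explicit about its headers, I am not sure what would be different.

It is worth mentioning that your headers explicitly state that gzip and deflate are allowed (see accept_encoding). If you remove that header, you should not need to worry about decompressing the response at all.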
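
As a rough sketch of that second option, the snippet below builds the request without the Accept-Encoding header and inspects it without sending anything. The helper name without_accept_encoding and the shortened header values are assumptions for illustration only, and whether a given server then replies uncompressed is up to the server.

```python
import urllib.request


def without_accept_encoding(headers):
    """Return a copy of headers with any Accept-Encoding entry dropped.

    Hypothetical helper; header names are compared case-insensitively.
    """
    return {name: value for name, value in headers.items()
            if name.lower() != 'accept-encoding'}


# Shortened stand-in for the headers dict built in download_file().
headers = {'Accept': 'text/html,*/*;q=0.8',
           'Accept-Encoding': 'gzip, deflate',
           'Accept-Language': 'en-US,en;q=0.5',
           'User-Agent': 'Mozilla/5.0 (illustrative value)',
           }

url = 'http://docs.python.org/3/tutorial/index.html'
request = urllib.request.Request(url, None, without_accept_encoding(headers))

# Inspect what would be sent; no connection is opened here.
sent = {name.lower(): value for name, value in request.header_items()}
print(sorted(sent))
```

Passing this request to urllib.request.urlopen() should then yield a body that can be written to disk as-is, provided the server honours the absence of the header.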

+0

How do you explain that it works perfectly on a computer running Windows 7? –

+0

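This works. However, I would still like an explanation of why it works on Windows. Meanwhile, I will try other permutations and combinations, such as changing the headers and other things. – Aditya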