2016-03-22

How to automatically download zip files from a website

I need to automatically download zipped files from a website that does not have unique URL addresses. The data is under the related download links on the right-hand side of the page. I have no experience with Python or any scripting, so I need a tool a beginner can use. I would also like to know whether the automation can include unzipping the files.

I would appreciate any assistance/advice.

http://phmsa.dot.gov/pipeline/library/data-stats/distribution-transmission-and-gathering-lng-and-liquid-annual-data


Hi, welcome to Stack Overflow. You should improve your question by showing some effort and providing more details. Please read [How to Ask](http://stackoverflow.com/help/how-to-ask). Questions asking for software recommendations are off-topic here. However, you could check whether [Flashget](http://www.flashget.com/) serves your purpose. – iled

Answer


You should look at BeautifulSoup and requests as your starting point. I would write a script that runs once a day and checks the page for new zip file links.

import requests

from bs4 import BeautifulSoup

url = 'http://phmsa.dot.gov/pipeline/library/data-stats/distribution-transmission-and-gathering-lng-and-liquid-annual-data'

# grab the landing page and parse it
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# collect every link on the page and keep only the ones pointing at zip files
all_hrefs = soup.find_all('a')
all_links = [link.get('href') for link in all_hrefs]
zip_files = [dl for dl in all_links if dl and '.zip' in dl]

This will get you a list of all the zip files on that main landing page (assuming the extension is always lowercase). I would save that information into a SQLite database, or even just a plain text file with one zip file per line. Then when you run the script, it grabs the links using the code above, opens the database (or text file) and compares, to see whether there is anything new in it.
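For the plain text file variant, a minimal sketch could look like this (seen_links.txt is just a placeholder name, and zip_files is the list built above):

import os

seen_path = 'seen_links.txt'  # placeholder name for the tracking file

# load the links recorded on previous runs (empty set on the first run)
if os.path.exists(seen_path):
    with open(seen_path) as f:
        seen = set(line.strip() for line in f)
else:
    seen = set()

# anything not already recorded is new
new_zip_files = [link for link in zip_files if link not in seen]

# save everything we have seen so far, one link per line
with open(seen_path, 'w') as f:
    for link in zip_files:
        f.write(link + '\n')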

If it finds a new link, then you can download it with the wonderful requests library. You would need something like this:

import os
import requests

root = 'http://phmsa.dot.gov/'
download_folder = '/path/to/download/zip/files/'

for zip_file in zip_files:
    full_url = root + zip_file
    r = requests.get(full_url)
    zip_filename = os.path.basename(zip_file)
    dl_path = os.path.join(download_folder, zip_filename)
    with open(dl_path, 'wb') as z_file:
        z_file.write(r.content)
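One thing to watch out for (an assumption on my part, not checked against the page): if the scraped hrefs turn out to be absolute paths or full URLs rather than plain relative ones, urljoin is a safer way to build the download URL than simple concatenation:

from urllib.parse import urljoin  # on Python 2 this lives in the urlparse module

# handles relative paths, absolute paths and full URLs alike
full_url = urljoin(root, zip_file)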

Here is a full example that will simply download all of the zip files on the page each time you run it:

import os 
import requests 

from bs4 import BeautifulSoup 

url = 'http://phmsa.dot.gov/pipeline/library/data-stats/distribution-transmission-and-gathering-lng-and-liquid-annual-data' 
root = 'http://phmsa.dot.gov/' 

r = requests.get(url) 
soup = BeautifulSoup(r.text, 'html.parser') 

all_hrefs = soup.find_all('a') 
all_links = [link.get('href') for link in all_hrefs] 
zip_files = [dl for dl in all_links if dl and '.zip' in dl] 
download_folder = '/home/mdriscoll/Downloads/zip_files' 

if not os.path.exists(download_folder): 
    os.makedirs(download_folder) 

for zip_file in zip_files: 
    full_url = root + zip_file 
    r = requests.get(full_url) 
    zip_filename = os.path.basename(zip_file) 
    dl_path = os.path.join(download_folder, zip_filename) 
    with open(dl_path, 'wb') as z_file:
        z_file.write(r.content)

Update #2 - added unzipping functionality:

import os 
import requests 
import zipfile 

from bs4 import BeautifulSoup 

url = 'http://phmsa.dot.gov/pipeline/library/data-stats/distribution-transmission-and-gathering-lng-and-liquid-annual-data' 
root = 'http://phmsa.dot.gov/' 

r = requests.get(url) 
soup = BeautifulSoup(r.text, 'html.parser') 

all_hrefs = soup.find_all('a') 
all_links = [link.get('href') for link in all_hrefs] 
zip_files = [dl for dl in all_links if dl and '.zip' in dl] 
download_folder = '/home/mdriscoll/Downloads/zip_files' 

if not os.path.exists(download_folder): 
    os.makedirs(download_folder) 

for zip_file in zip_files:
    full_url = root + zip_file
    zip_filename = os.path.basename(zip_file)
    dl_path = os.path.join(download_folder, zip_filename)
    if os.path.exists(dl_path):
        # you have already downloaded this file, so skip it
        continue

    tries = 0
    while tries < 3:
        r = requests.get(full_url)
        with open(dl_path, 'wb') as z_file:
            z_file.write(r.content)

        # unzip the file
        extract_dir = os.path.splitext(os.path.basename(zip_file))[0]
        try:
            z = zipfile.ZipFile(dl_path)
            z.extractall(os.path.join(download_folder, extract_dir))
            break
        except zipfile.BadZipfile:
            # the file didn't download correctly, so try again
            # this is also a good place to log the error
            pass
        tries += 1

I found in my testing that occasionally a file wouldn't download correctly and I would get a BadZipFile error, so I added some code that tries up to 3 times before moving on to the next file.


OK, I was able to download and install Python on my computer, along with 'requests' and 'BeautifulSoup'. Since I'm new to Python, I installed 'PyCharm Edu' to run the code. I tried running both scripts with PyCharm Edu, but I didn't get any output. I was able to find the URL of the zip file (see below). Can you show me how to get the file to download automatically? Also, can the script automatically unzip the files? http://www.phmsa.dot.gov/staticfiles/PHMSA/DownloadableFiles/Pipeline2data/annual_hazardous_liquid_2010_present.zip – Gloria


I added a full example that downloads all of the zip files, but it doesn't do any checking to see whether you have already downloaded them. That should be easy to add. Python has a 'zipfile' module that can unzip them - https://docs.python.org/2/library/zipfile.html –


Mike, thank you for the script. It works perfectly. I'll look at the documentation to unzip the files. – Gloria