2016-03-22

How to automatically download zip files from a website

I need to automatically download zipped files from a website that does not have unique URL addresses. The data is under the related download links on the right-hand side of the page. I have no experience with Python or any scripting, so I need a tool a beginner can use. I would also like to know whether the automation can include unzipping the files.

I would appreciate any assistance/advice.

http://phmsa.dot.gov/pipeline/library/data-stats/distribution-transmission-and-gathering-lng-and-liquid-annual-data


Hi, welcome to Stack Overflow. You should improve your question by showing some effort and providing more details. Please read [How to Ask](http://stackoverflow.com/help/how-to-ask). Questions asking for software recommendations are off-topic here. However, you could check whether [Flashget](http://www.flashget.com/) serves your purpose. – iled

Answer


You should look at BeautifulSoup and requests as your starting point. I would write a script that runs once a day and checks the page for new zip file links.

import requests

from bs4 import BeautifulSoup

url = 'http://phmsa.dot.gov/pipeline/library/data-stats/distribution-transmission-and-gathering-lng-and-liquid-annual-data'

# grab the landing page and parse it
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# collect every link on the page and keep only the ones pointing at zip files
all_hrefs = soup.find_all('a')
all_links = [link.get('href') for link in all_hrefs]
zip_files = [dl for dl in all_links if dl and '.zip' in dl]

This will get you a list of all the zip files on that main landing page (assuming the extension is always lowercase). I would save that information into a SQLite database, or even just a plain text file with one zip file per line. Then when you run the script, it grabs the links using the code above, opens the database (or text file) and compares, to see whether there is anything new in it.
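For the plain text file variant, a minimal sketch could look like this (seen_links.txt is just a placeholder name, and zip_files is the list built above):

import os

seen_path = 'seen_links.txt'  # placeholder name for the tracking file

# load the links recorded on previous runs (empty set on the first run)
if os.path.exists(seen_path):
    with open(seen_path) as f:
        seen = set(line.strip() for line in f)
else:
    seen = set()

# anything not already recorded is new
new_zip_files = [link for link in zip_files if link not in seen]

# save everything we have seen so far, one link per line
with open(seen_path, 'w') as f:
    for link in zip_files:
        f.write(link + '\n')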

If it finds a new link, then you can download it with the wonderful requests library. You would need something like this:

import os
import requests

root = 'http://phmsa.dot.gov/'
download_folder = '/path/to/download/zip/files/'

for zip_file in zip_files:
    full_url = root + zip_file
    r = requests.get(full_url)
    zip_filename = os.path.basename(zip_file)
    dl_path = os.path.join(download_folder, zip_filename)
    with open(dl_path, 'wb') as z_file:
        z_file.write(r.content)
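One thing to watch out for (an assumption on my part, not checked against the page): if the scraped hrefs turn out to be absolute paths or full URLs rather than plain relative ones, urljoin is a safer way to build the download URL than simple concatenation:

from urllib.parse import urljoin  # on Python 2 this lives in the urlparse module

# handles relative paths, absolute paths and full URLs alike
full_url = urljoin(root, zip_file)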

Here is a full example that will simply download all of the zip files on the page each time you run it:

import os 
import requests 

from bs4 import BeautifulSoup 

url = 'http://phmsa.dot.gov/pipeline/library/data-stats/distribution-transmission-and-gathering-lng-and-liquid-annual-data' 
root = 'http://phmsa.dot.gov/' 

r = requests.get(url) 
soup = BeautifulSoup(r.text, 'html.parser') 

all_hrefs = soup.find_all('a') 
all_links = [link.get('href') for link in all_hrefs] 
zip_files = [dl for dl in all_links if dl and '.zip' in dl] 
download_folder = '/home/mdriscoll/Downloads/zip_files' 

if not os.path.exists(download_folder): 
    os.makedirs(download_folder) 

for zip_file in zip_files: 
    full_url = root + zip_file 
    r = requests.get(full_url) 
    zip_filename = os.path.basename(zip_file) 
    dl_path = os.path.join(download_folder, zip_filename) 
    with open(dl_path, 'wb') as z_file:
        z_file.write(r.content)

Update #2 - added unzipping functionality:

import os 
import requests 
import zipfile 

from bs4 import BeautifulSoup 

url = 'http://phmsa.dot.gov/pipeline/library/data-stats/distribution-transmission-and-gathering-lng-and-liquid-annual-data' 
root = 'http://phmsa.dot.gov/' 

r = requests.get(url) 
soup = BeautifulSoup(r.text, 'html.parser') 

all_hrefs = soup.find_all('a') 
all_links = [link.get('href') for link in all_hrefs] 
zip_files = [dl for dl in all_links if dl and '.zip' in dl] 
download_folder = '/home/mdriscoll/Downloads/zip_files' 

if not os.path.exists(download_folder): 
    os.makedirs(download_folder) 

for zip_file in zip_files:
    full_url = root + zip_file
    zip_filename = os.path.basename(zip_file)
    dl_path = os.path.join(download_folder, zip_filename)
    if os.path.exists(dl_path):
        # you have already downloaded this file, so skip it
        continue

    tries = 0
    while tries < 3:
        r = requests.get(full_url)
        with open(dl_path, 'wb') as z_file:
            z_file.write(r.content)

        # unzip the file
        extract_dir = os.path.splitext(os.path.basename(zip_file))[0]
        try:
            z = zipfile.ZipFile(dl_path)
            z.extractall(os.path.join(download_folder, extract_dir))
            break
        except zipfile.BadZipfile:
            # the file didn't download correctly, so try again
            # this is also a good place to log the error
            pass
        tries += 1

I found in my testing that occasionally a file wouldn't download correctly and I would get a BadZipFile error, so I added some code that tries up to 3 times before moving on to the next file.


OK, I was able to download and install Python on my computer, along with 'requests' and 'BeautifulSoup'. Since I'm new to Python, I installed 'PyCharm Edu' to run the code. I tried running both scripts with PyCharm Edu, but I didn't get any output. I was able to find the URL of the zip file (see below). Can you show me how to get the file to download automatically? Also, can the script automatically unzip the files? http://www.phmsa.dot.gov/staticfiles/PHMSA/DownloadableFiles/Pipeline2data/annual_hazardous_liquid_2010_present.zip – Gloria


I added a full example that downloads all of the zip files, but it doesn't do any checking to see whether you have already downloaded them. That should be easy to add. Python has a 'zipfile' module that can unzip them - https://docs.python.org/2/library/zipfile.html –


Mike, thank you for the script. It works perfectly. I'll look at the documentation to unzip the files. – Gloria