What is the fastest and easiest way to download all the images from a website?

What is the fastest and easiest way to download all the images from a website? More specifically, http://www.cycustom.com/large/.

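I'm thinking of something along the lines of wget or curl. To clarify: first (and foremost), I currently don't know how to accomplish this task. Second, I'm interested in seeing whether wget or curl offers the easier-to-understand solution. Thanks.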

--- UPDATE @sarnold ---

Thanks for your reply. I thought that would do the trick too. However, it didn't. Here is the command's output:

wget --mirror --no-parent http://www.cycustom.com/large/ 
--2012-01-10 18:19:36-- http://www.cycustom.com/large/ 
Resolving www.cycustom.com... 64.244.61.237 
Connecting to www.cycustom.com|64.244.61.237|:80... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: unspecified [text/html] 
Saving to: `www.cycustom.com/large/index.html' 

    [ <=>                                                         ] 188,795  504K/s in 0.4s  

Last-modified header missing -- time-stamps turned off. 
2012-01-10 18:19:37 (504 KB/s) - `www.cycustom.com/large/index.html' saved [188795] 

Loading robots.txt; please ignore errors. 
--2012-01-10 18:19:37-- http://www.cycustom.com/robots.txt 
Connecting to www.cycustom.com|64.244.61.237|:80... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: 174 [text/plain] 
Saving to: `www.cycustom.com/robots.txt' 

100%[======================================================================================================================================================================================================================================>] 174   --.-K/s in 0s  

2012-01-10 18:19:37 (36.6 MB/s) - `www.cycustom.com/robots.txt' saved [174/174] 

FINISHED --2012-01-10 18:19:37-- 
Downloaded: 2 files, 185K in 0.4s (505 KB/s)

Here is a picture of the files that were created: https://img.skitch.com/20120111-nputrm7hy83r7bct33midhdp6d.jpg

My objective is to have a folder full of image files. The following command did not achieve this objective.

wget --mirror --no-parent http://www.cycustom.com/large/

Source

2012-01-11 John Erck

@sarnold [Here is a picture of the index.html file that was created, w/ some notes](https://img.skitch.com/20120111-1uapp8upbq6qmtrwsqsiygg62i.jpg) – 2012-01-11 02:40:06

wget --mirror --no-parent http://www.example.com/large/

The --no-parent option prevents it from slurping up the entire website.

Ah, I see they've put up a robots.txt asking robots not to download photos from that directory:

$ curl http://www.cycustom.com/robots.txt 
User-agent: * 
Disallow: /admin/ 
Disallow: /css/ 
Disallow: /flash/ 
Disallow: /large/ 
Disallow: /pdfs/ 
Disallow: /scripts/ 
Disallow: /small/ 
Disallow: /stats/ 
Disallow: /temp/ 
$

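wget(1) does not document any way to ignore robots.txt, and I've never found an easy way to do the equivalent of --mirror in curl(1). If you want to keep using wget(1), you would need to insert an HTTP proxy in the middle that returns 404 for GET /robots.txt requests.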

I think it's easier to change approach. Since I wanted more experience using Nokogiri, here's what I came up with:

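#!/usr/bin/ruby
require 'open-uri'
require 'nokogiri'

# Fetch and parse the directory listing page.
# URI.open is the open-uri entry point on Ruby 2.5+; bare open() no longer accepts URLs on Ruby 3.
doc = Nokogiri::HTML(URI.open("http://www.cycustom.com/large/"))

# Every file is linked from a table cell; save each link whose name contains "jpg".
doc.css('tr > td > a').each do |link|
  name = link['href']
  next unless name && name.match(/jpg/)
  File.open(name, "wb") do |out|
    # .read returns the response body; writing the IO object itself would not save the image.
    out.write(URI.open("http://www.cycustom.com/large/" + name).read)
  end
end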

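This is only a quick and dirty script; embedding the URL twice is a bit ugly. So if this is intended for long-term production use, clean it up first, or figure out how to use rsync(1) instead.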
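
One way that cleanup might look, as a rough sketch rather than a drop-in replacement: take the base URL once from the command line and resolve every link against it. To stay free of the Nokogiri dependency, this version swaps the CSS selector for a plain regex scan over href attributes, which assumes the listing page is simple enough for that to work. The helper names extract_hrefs, image_urls and download are made up for illustration. It also keeps only names ending in .jpg or .jpeg, which is stricter than the /jpg/ test above.

```ruby
#!/usr/bin/ruby
require 'open-uri'
require 'uri'

# Pull every href value out of an HTML string with a simple regex.
# Assumption: the page is a plain listing; this is not a real HTML parser.
def extract_hrefs(html)
  html.scan(/href\s*=\s*["']([^"']+)["']/i).flatten
end

# Keep only links ending in .jpg/.jpeg and resolve them against base_url.
def image_urls(hrefs, base_url)
  hrefs.select { |href| href.match?(/\.jpe?g\z/i) }
       .map { |href| URI.join(base_url, href).to_s }
end

# Fetch one URL and write its body to a file named after the last path segment.
def download(url, dir = ".")
  target = File.join(dir, File.basename(URI(url).path))
  URI.open(url) do |remote|
    File.open(target, "wb") { |out| out.write(remote.read) }
  end
  target
end

# Only touch the network when a base URL is passed on the command line.
if ARGV[0]
  base_url = ARGV[0]
  listing = URI.open(base_url, &:read)
  image_urls(extract_hrefs(listing), base_url).each do |url|
    puts "saved #{download(url)}"
  end
end
```

Saved under any file name, say fetch_images.rb, it would be run as `ruby fetch_images.rb http://www.cycustom.com/large/` and writes the images into the current directory. Links containing spaces or other characters that are not valid in a URI may make URI.join raise, so a real listing might need escaping first.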

Source

2012-01-11 00:31:02 sarnold

Edited the original question to include the results of your suggestion – 2012-01-11 02:41:39

The robots.txt file can be ignored by adding the following option:

-e robots=off

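I also recommend adding an option that slows down the download, to limit the load on the server. For example, this option waits 30 seconds between one file and the next: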

--wait 30

Source

2013-05-29 08:55:44 Andrea
