How do I use the --accept-regex option to download a website with wget?

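I want to download an archive of my website, 3dsforums.com, using wget, but there are millions of pages I don't want to download. I'm trying to tell wget to fetch only pages whose URLs match certain patterns, but I'm running into some roadblocks.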

As an example, this is a URL I want to download:

http://3dsforums.com/forumdisplay.php?f=46

...so I tried using the --accept-regex option:

wget -mkEpnp --accept-regex "(forumdisplay\.php\?f=(\d+)$)" http://3dsforums.com

But this only downloads the home page of the website.

The only command that has remotely worked so far is the following:

wget -mkEpnp --accept-regex "(\w+\.php$)" http://3dsforums.com

This produces the following output:

Downloaded 9 files, 215K in 0.1s (1.72 MB/s) 
Converting links in 3dsforums.com/faq.php.html... 16-19 
Converting links in 3dsforums.com/index.html... 8-88 
Converting links in 3dsforums.com/sendmessage.php.html... 14-15 
Converting links in 3dsforums.com/register.php.html... 13-14 
Converting links in 3dsforums.com/showgroups.php.html... 14-29 
Converting links in 3dsforums.com/index.php.html... 16-80 
Converting links in 3dsforums.com/calendar.php.html... 17-145 
Converting links in 3dsforums.com/memberlist.php.html... 14-99 
Converting links in 3dsforums.com/search.php.html... 15-16 
Converted links in 9 files in 0.009 seconds.

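Is there something wrong with my regular expressions? Or am I misunderstanding how the --accept-regex option is used? I've been trying all sorts of variations today, but I still haven't grasped what the actual problem is.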

Source

2017-05-27 David Turnbull

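By default wget uses POSIX regular expressions, in which the \d class is written as [[:digit:]] and the \w class as [[:alnum:]_] (some engines also accept [[:word:]], but it is not a standard POSIX class). Also, why all the grouping? If your wget is compiled with PCRE support, you can make your life easier and write it as: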

wget -mkEpnp --regex-type pcre --accept-regex "forumdisplay\.php\?f=\d+$" http://3dsforums.com
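Before starting a long mirror run, a pattern can be tried against a few sample URLs locally. The following is only a minimal sketch: it uses grep -E as a stand-in for wget's default POSIX matching, and posix_match is a hypothetical helper name, not part of wget.

```shell
# Hypothetical helper: succeed if URL $2 matches POSIX extended regex $1.
posix_match() {
    printf '%s\n' "$2" | grep -Eq -e "$1"
}

# POSIX spelling of the forum-listing pattern: [[:digit:]] instead of \d.
pattern='forumdisplay\.php\?f=[[:digit:]]+$'

# Report which sample URLs the pattern would accept.
for url in \
    'http://3dsforums.com/forumdisplay.php?f=46' \
    'http://3dsforums.com/faq.php'
do
    if posix_match "$pattern" "$url"; then
        echo "accept  $url"
    else
        echo "reject  $url"
    fi
done
```

grep's regex engine is not guaranteed to behave identically to the one wget was built with, so this is a sanity check rather than proof.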

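But... that won't work either, because your forum software automatically creates session IDs (s=<session_id>) and injects them into all links, so you need to account for those as well: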

wget -mkEpnp --regex-type pcre --accept-regex "forumdisplay\.php\?(s=.*)?f=\d+(s=.*)?$" http://3dsforums.com

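The only problem is that your files will now be saved with the session ID in their names, so you'll have to add one more step once wget is done: bulk-rename all the files that have a session ID in their names. You could probably do that by piping wget to sed, but I'll leave that to you :)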
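One way the rename step might look is sketched below. It assumes the mirror lands in a directory named 3dsforums.com, that the session parameter shows up in file names as s=<alphanumeric id> joined to other parameters by &, and strip_session is a hypothetical helper name. Treat it as a starting point, not a tested recipe for this forum.

```shell
# Hypothetical helper: drop an "s=<id>" session parameter from a file name.
strip_session() {
    printf '%s\n' "$1" | sed -E 's/s=[[:alnum:]]+&//; s/&s=[[:alnum:]]+//'
}

# Assumed location of the mirror; pass another directory as the first argument.
mirror_dir=${1:-3dsforums.com}

if [ -d "$mirror_dir" ]; then
    # Rename every regular file whose name contains a session parameter.
    find "$mirror_dir" -type f -name '*s=*' | while IFS= read -r path; do
        dir=$(dirname "$path")
        new=$(strip_session "$(basename "$path")")
        # mv -n refuses to overwrite a file that already has the target name.
        [ "$path" = "$dir/$new" ] || mv -n -- "$path" "$dir/$new"
    done
fi
```

Links that -k already rewrote inside the saved pages would still point at the old file names, so those would probably need a similar substitution.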

If your wget doesn't support PCRE, this pattern will end up being rather long, but let's hope it does...
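For a wget build without PCRE, one possible POSIX spelling is sketched here. It assumes the session ID appears as an s=<id> query parameter either before or after f=<number>, separated by &; that layout is a guess about the forum's URLs, and url_accepted is a hypothetical helper that uses grep -E to try the pattern locally.

```shell
# POSIX ERE spelling of a session-aware pattern (assumed parameter layout).
posix_pattern='forumdisplay\.php\?(s=[^&]*&)?f=[[:digit:]]+(&s=[^&]*)?$'

# Hypothetical helper: succeed if the URL in $1 matches the pattern.
url_accepted() {
    printf '%s\n' "$1" | grep -Eq -e "$posix_pattern"
}

# Try the pattern on URLs with and without a session parameter.
for url in \
    'http://3dsforums.com/forumdisplay.php?f=46' \
    'http://3dsforums.com/forumdisplay.php?s=0123abcd&f=46' \
    'http://3dsforums.com/forumdisplay.php?f=46&s=0123abcd' \
    'http://3dsforums.com/showthread.php?t=1'
do
    if url_accepted "$url"; then
        echo "accept  $url"
    else
        echo "reject  $url"
    fi
done
```

If the checks look right, the same string could be passed to wget as --accept-regex "$posix_pattern" without the --regex-type option.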

Source

2017-05-27 01:58:15 zwer
