2015-10-03 46 views
2

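Mirroring an entire website and saving its links in a txt file: is it possible to use wget's mirroring to collect all the links of an entire website and save them in a txt file?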

If it is possible, how is it done? If not, is there another way to do this?

Edit:

I tried running this command:

wget -r --spider example.com 

and got this result:

Spider mode enabled. Check if remote file exists. 
--2015-10-03 21:11:54-- http://example.com/ 
Resolving example.com... 93.184.216.34, 2606:2800:220:1:248:1893:25c8:1946 
Connecting to example.com|93.184.216.34|:80... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: 1270 (1.2K) [text/html] 
Remote file exists and could contain links to other resources -- retrieving. 

--2015-10-03 21:11:54-- http://example.com/ 
Reusing existing connection to example.com:80. 
HTTP request sent, awaiting response... 200 OK 
Length: 1270 (1.2K) [text/html] 
Saving to: 'example.com/index.html' 

100%[=====================================================================================================>] 1,270  --.-K/s in 0s  

2015-10-03 21:11:54 (93.2 MB/s) - 'example.com/index.html' saved [1270/1270] 

Removing example.com/index.html. 

Found no broken links. 

FINISHED --2015-10-03 21:11:54-- 
Total wall clock time: 0.3s 
Downloaded: 1 files, 1.2K in 0s (93.2 MB/s) 

(Yes, I also tried using other websites with more internal links) 
+0

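Yes, that is how it is supposed to work. The actual website "example.com" has no internal links, so it just returns itself. Try a site that links to other pages within the same site and you should get more. Do you also want links to *external* sites? If so, the Python script from @Randomazer may be a better option. – seumasmac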

+0

Actually, there is a similar question at http://stackoverflow.com/questions/2804467/spider-a-website-and-return-urls-only which may be useful. – seumasmac

+0

Thank you very much! That helped! – user1878980

Answers

0

Yes, use wget's --spider option. A command like:

wget -r --spider example.com 

will follow all links down to a depth of 5 (the default). You can then capture the output in a file, perhaps cleaning it up as you go. Something like:

wget -r --spider example.com 2>&1 | grep "http://" | cut -f 4 -d " " >> weblinks.txt 

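will put just the links into the weblinks.txt file (if your version of wget produces slightly different output, you may need to tweak that command a little).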
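If the grep/cut pipeline turns out to be brittle, the same cleanup can be sketched in Python. This is only a minimal sketch: it assumes the wget log has been saved to a file or string, and that request lines look like the `--2015-10-03 21:11:54-- http://example.com/` lines shown in the question. The name `extract_urls` is made up for illustration.

```python
import re

# Request lines in the wget log look like:
#   --2015-10-03 21:11:54--  http://example.com/
# (assumption: this timestamp format; other wget versions may differ)
REQUEST_LINE = re.compile(
    r"^--\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}--\s+(\S+)", re.MULTILINE
)


def extract_urls(log_text):
    """Return the URLs wget requested, in first-seen order, without duplicates."""
    seen = {}
    for url in REQUEST_LINE.findall(log_text):
        seen.setdefault(url, None)
    return list(seen)


if __name__ == "__main__":
    # Assumes the log was saved first, e.g. with: wget -r --spider example.com 2> wget.log
    with open("wget.log") as log, open("weblinks.txt", "w") as out:
        for url in extract_urls(log.read()):
            out.write(url + "\n")
```

Because wget requests each page twice in spider mode (once to check it, once to retrieve it), dropping duplicates here keeps each URL only once in weblinks.txt.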

+0

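OK, thanks. I tried copying the command you wrote, but it didn't work. It created a weblinks.txt file, but the only thing saved in it was http://www.example.com (I tried other websites too). Maybe I need to tweak it; the problem is that I don't know how. – user1878980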

+0

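Can you run the first command and see what output it gives? Note that the only way it can discover other pages is by following the links on the page you give it. If that page has no links to any other pages, it won't find anything else. – seumasmac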
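To make the point about link-following concrete, here is a rough sketch of what a recursive spider does: fetch a page, pull out its links, and queue the ones on the same host until a depth limit is reached. It is not how wget is implemented; `LinkParser`, `fetch`, and `crawl` are hypothetical names, and the sketch ignores robots.txt, redirects, and errors.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page as text (hypothetical helper, no error handling)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_depth=5, fetch_page=fetch):
    """Breadth-first crawl that stays on the start URL's host."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links from pages at the depth limit
        parser = LinkParser()
        parser.feed(fetch_page(url))
        for href in parser.links:
            # Resolve relative links and drop #fragments
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return sorted(seen)
```

A page with no links to other pages on the same host, such as example.com, yields only the start URL, which matches the output shown in the question.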

+0

It is difficult to add detail in these comments, so you might find it easier to update your question with the details of what you have tried. – seumasmac

0

Or use Python. For example:

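import re
import urllib.request  # Python 3; the Python 2 equivalent was urllib.urlopen

def do_page(url):
    # Fetch the page and decode the bytes so the regex can run on text
    with urllib.request.urlopen(url) as f:
        html = f.read().decode('utf-8', errors='replace')
    # Match single-quoted links that start with the site URL and end in .html;
    # re.escape keeps the dots in the URL from acting as wildcards
    pattern = r"'{}[^']*\.html'".format(re.escape(url))
    return re.findall(pattern, html)

if __name__ == '__main__':
    url = 'http://thehackernews.com/'
    hits = do_page(url)
    # Write one link per line
    with open('links.txt', 'w') as f1:
        for hit in hits:
            f1.write(hit + '\n')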

Output:

'http://thehackernews.com/2015/10/adblock-extension.html' 
'http://thehackernews.com/p/authors.html' 
'http://thehackernews.com/2015/10/adblock-extension.html' 
'http://thehackernews.com/2015/10/adblock-extension.html' 
'http://thehackernews.com/2015/10/adblock-extension.html' 
'http://thehackernews.com/2015/10/adblock-extension.html' 
'http://thehackernews.com/2015/10/adblock-extension.html' 
'http://thehackernews.com/2015/10/adblock-extension.html' 
'http://thehackernews.com/2015/10/data-breach-hacking.html' 
'http://thehackernews.com/p/authors.html' 
'http://thehackernews.com/2015/10/data-breach-hacking.html' 
'http://thehackernews.com/2015/10/data-breach-hacking.html' 
'http://thehackernews.com/2015/10/data-breach-hacking.html' 
'http://thehackernews.com/2015/10/data-breach-hacking.html' 
'http://thehackernews.com/2015/10/data-breach-hacking.html' 
'http://thehackernews.com/2015/10/data-breach-hacking.html' 
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html' 
'http://thehackernews.com/p/authors.html' 
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html' 
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html' 
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html' 
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html' 
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html' 
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html' 
'http://thehackernews.com/2015/10/experian-tmobile-hack.html' 
'http://thehackernews.com/p/authors.html' 
'http://thehackernews.com/2015/10/experian-tmobile-hack.html' 
'http://thehackernews.com/2015/10/experian-tmobile-hack.html' 
'http://thehackernews.com/2015/10/experian-tmobile-hack.html' 
'http://thehackernews.com/2015/10/experian-tmobile-hack.html' 
'http://thehackernews.com/2015/10/experian-tmobile-hack.html' 
'http://thehackernews.com/2015/10/experian-tmobile-hack.html' 
'http://thehackernews.com/2015/10/buy-google-domain.html' 
'http://thehackernews.com/p/authors.html' 
'http://thehackernews.com/2015/10/buy-google-domain.html' 
'http://thehackernews.com/2015/10/buy-google-domain.html' 
'http://thehackernews.com/2015/10/buy-google-domain.html' 
'http://thehackernews.com/2015/10/buy-google-domain.html' 
'http://thehackernews.com/2015/10/buy-google-domain.html' 
'http://thehackernews.com/2015/10/buy-google-domain.html' 
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html' 
'http://thehackernews.com/p/authors.html' 
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html' 
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html' 
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html' 
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html' 
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html' 
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html' 
'http://thehackernews.com/2015/09/digital-india-facebook.html' 
'http://thehackernews.com/2015/09/digital-india-facebook.html' 
'http://thehackernews.com/2015/10/buy-google-domain.html' 
'http://thehackernews.com/2015/10/buy-google-domain.html' 
'http://thehackernews.com/2015/09/winrar-vulnerability.html' 
'http://thehackernews.com/2015/09/winrar-vulnerability.html' 
'http://thehackernews.com/2015/09/chip-mini-computer.html' 
'http://thehackernews.com/2015/09/chip-mini-computer.html' 
'http://thehackernews.com/2015/09/edward-snowden-twitter.html' 
'http://thehackernews.com/2015/09/edward-snowden-twitter.html' 
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html' 
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html' 
'http://thehackernews.com/2015/09/quantum-teleportation-data.html' 
'http://thehackernews.com/2015/09/quantum-teleportation-data.html' 
'http://thehackernews.com/2015/09/iOS-lockscreen-hack.html' 
'http://thehackernews.com/2015/09/iOS-lockscreen-hack.html' 
'http://thehackernews.com/2015/09/xor-ddos-attack.html' 
'http://thehackernews.com/2015/09/xor-ddos-attack.html' 
'http://thehackernews.com/2015/09/truecrypt-encryption-software.html' 
'http://thehackernews.com/2015/09/truecrypt-encryption-software.html'
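The output above repeats most links several times, because the same URL appears in more than one place on the page. A small sketch of one way to drop the repeats before saving, keeping the order of first appearance; `unique_in_order` and `save_links` are hypothetical helper names, not part of the original script.

```python
def unique_in_order(hits):
    """Drop repeated links while keeping the order of first appearance."""
    # dict keys preserve insertion order in Python 3.7+
    return list(dict.fromkeys(hits))


def save_links(hits, path="links.txt"):
    """Write each unique link on its own line."""
    with open(path, "w") as f:
        for link in unique_in_order(hits):
            f.write(link + "\n")
```

Calling `save_links(hits)` in place of the write loop would leave one line per distinct link in links.txt.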