Get all available links in a website using Python?

Is there any way to use Python to get all the links in a website, not just those in a single web page? I tried this code, but it only gives me the links in one page.

import urllib2 
import re 

#connect to a URL 
website = urllib2.urlopen('http://www.example.com/') 

#read html code 
html = website.read() 

#use re.findall to get all the links 
links = re.findall('"((http|ftp)s?://.*?)"', html) 

print links

Source

2016-02-29 Mohamed Elsharkawey

What do you mean by "all the links in a website, not just in a web page"? Do you mean every link contained in any HTML page stored on www.example.com? – syntonym

Yes, that is what I mean. –

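You can't do that. You may not even have access to all of the HTML pages. However, you can recursively visit the links you collect (if they also point to www.example.com or are relative links) and gather all the links from there. That still may not be "all links": for example, if no page links to example.com/jfifjfi, you will never reach that page. – syntonym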
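To illustrate the point about relative links and links that point back to the same site, here is a minimal sketch using Python 3's standard `urllib.parse`. The helper names `resolve_link` and `same_site`, and the sample page URL, are assumptions made for this example and are not part of the thread.

```python
from urllib.parse import urljoin, urldefrag, urlparse


def resolve_link(page_url, href):
    """Turn an href found on page_url into an absolute URL without a #fragment."""
    absolute = urljoin(page_url, href)
    return urldefrag(absolute).url


def same_site(url, site_host="www.example.com"):
    """Return True if url points to the host being crawled (hypothetical default)."""
    return urlparse(url).hostname == site_host


page = "http://www.example.com/docs/index.html"  # assumed example page
for href in ["about.html", "/contact", "https://en.wikipedia.org/"]:
    absolute = resolve_link(page, href)
    print(absolute, same_site(absolute))
```

Resolving every link against the page it was found on means relative and absolute links can be handled the same way afterwards.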

Recursively visit the links you have collected and scrape those pages as well:

import urllib2
import re

stack = ['http://www.example.com/']
results = []

while len(stack) > 0:

    url = stack.pop()
    # connect to a URL
    website = urllib2.urlopen(url)

    # read html code
    html = website.read()

    # use re.findall to get all the links
    # you should not only gather links with http/ftps but also relative links
    # you could use beautiful soup for that (if you want <a> links)
    # the inner group is non-capturing, so findall returns URL strings, not tuples
    links = re.findall(r'"((?:http|ftp)s?://.*?)"', html)

    # is_not_relative_link also has to be written
    results.extend([link for link in links if is_not_relative_link(link)])

    for link in links:
        if link_is_valid(link):  # this function has to be written
            stack.append(link)

Source

2016-02-29 14:44:09 syntonym
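The code comments in this answer suggest collecting `<a>` links with an HTML parser rather than a regular expression. As a rough sketch of that idea, the following uses `html.parser` from the Python 3 standard library instead of Beautiful Soup; the class name `LinkCollector`, the function `extract_hrefs`, and the sample HTML are assumptions for illustration only.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return all <a href="..."> values in the order they appear."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.hrefs


sample = '<p><a href="/about">About</a> <a href="http://www.example.com/x">X</a> <a name="n">none</a></p>'
print(extract_hrefs(sample))
```

Unlike the regular expression, this also picks up relative links such as `/about`, which still need to be resolved against the page URL before they can be fetched.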

if link_is_valid(link): # this function has to be written NameError: name 'link_is_valid' is not defined –

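Yes. That is why I wrote "# this function has to be written" as a comment. You have to check a) whether you have already visited the link, b) whether you even want to visit it (i.e. does it link to a page on "example.com" that you want to visit, or does it link to e.g. Wikipedia), and c) whether you can access it (right now you are collecting ftp links, and I don't think urllib2 can handle them?). – syntonym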
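The three checks a), b) and c) could be sketched roughly as follows. This is one possible reading written for Python 3, not the commenter's own implementation: the constant `SITE_HOST`, the `visited` set, and the choice to record a link as visited as soon as it is accepted are all assumptions.

```python
from urllib.parse import urlparse

SITE_HOST = "www.example.com"  # assumption: the one host we want to crawl
visited = set()                # assumption: links already accepted for crawling


def link_is_valid(link):
    """Return True only for new http(s) links on SITE_HOST, and remember them."""
    parts = urlparse(link)
    if link in visited:                        # a) already seen
        return False
    if parts.hostname != SITE_HOST:            # b) points to another site
        return False
    if parts.scheme not in ("http", "https"):  # c) e.g. ftp links are skipped
        return False
    visited.add(link)  # side effect: mark as seen so it is queued only once
    return True
```

With this sketch the start URL should be added to `visited` before the loop begins, otherwise it can be queued a second time when a page links back to it.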
