My web crawler isn't looping to get all the links - using a foo function (Python)

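I'm creating a web crawler, and as a first step I need to crawl a website and extract all of its links, but my code never loops through the extracted links. I tried using append, but the result was a list of lists. I'm now trying to use foo instead, and I get an error. Any help would be appreciated. Thanks.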

from urllib2 import urlopen 

import re 

def get_all_urls(url): 

    get_content = urlopen(url).read() 
    url_list = [] 

    find_url = re.compile(r'a\s?href="(.*)">') 
    url_list_temp = find_url.findall(get_content) 
    for i in url_list_temp: 
     url_temp = url_list_temp.pop() 
     source = 'http://blablabla/' 
     url = source + url_temp 
     url_list.append(url) 
    #print url_list 
    return url_list 


def web_crawler(seed): 

    tocrawl = [seed] 
    crawled = [] 

    i = 0 

    while i < len(tocrawl): 
     page = tocrawl.pop() 
     if page not in crawled: 
      #tocrawl.append(get_all_urls(page)) 
      foo = (get_all_urls(page)) 
      tocrawl = foo 
      crawled.append(page) 
     if not tocrawl: 
      break 
    print crawled 
    return crawled

Source

2013-10-25 user2918712

First of all, it is a bad idea to parse HTML with regular expressions, for all the reasons listed:

here: Python regular expression for HTML parsing (BeautifulSoup)
here: Python regex to match HTML
here: regexp python with parsing html page
and so on.

You should use an HTML parser for the job. Python provides one in its standard library, HTMLParser, but you can also use BeautifulSoup or even lxml. I tend to prefer BeautifulSoup because it has a nice API.
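To illustrate the standard-library route, here is a minimal sketch of a link extractor. It assumes Python 3, where the module is named html.parser (in Python 2 it is HTMLParser). The names LinkExtractor and extract_links and the sample HTML are made up for this example, not taken from the question.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    # Hypothetical helper: parse a string of HTML and return its link targets
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


sample = '<p><a href="page1.html">one</a> and <a class="x" href="page2.html">two</a></p>'
print(extract_links(sample))
```

Unlike the greedy (.*) pattern in the question, this does not merge two links that sit on the same line, and it still finds an href that is not the first attribute of the tag.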

Now, back to your problem: you are modifying the list you are iterating over:

for i in url_list_temp: 
    url_temp = url_list_temp.pop() 
    source = 'http://blablabla/' 
    ...

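This is bad because it is like sawing off the branch you are sitting on. Besides, you do not seem to need this removal at all, since there is no condition deciding whether a URL should be removed or not.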
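A small sketch of what goes wrong, and of one possible alternative. The file names are placeholders invented for the demonstration; the 'http://blablabla/' prefix is the one from the question.

```python
urls = ["a.html", "b.html", "c.html", "d.html"]

# Pattern from the question: pop() shrinks the list while the loop walks it,
# so the loop stops once its index passes the shrinking length.
seen = []
remaining = list(urls)
for _ in remaining:
    seen.append(remaining.pop())
print(seen)  # only half of the entries were processed

# One alternative: build the new list without mutating the one being read
source = "http://blablabla/"
full_urls = [source + u for u in urls]
print(full_urls)
```

With four entries, the first loop handles only two of them, which is why links seem to go missing.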

Finally, you get an error after using append because, as you said, it creates a list of lists. You should use extend instead.

>>> l1 = [1, 2, 3] 
>>> l2 = [4, 5, 6] 
>>> l1.append(l2) 
>>> l1 
[1, 2, 3, [4, 5, 6]] 
>>> l1 = [1, 2, 3] 
>>> l1.extend(l2) 
>>> l1 
[1, 2, 3, 4, 5, 6]
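Applied to the crawl loop, extend could be used roughly as follows. This is only a sketch: FAKE_SITE and get_links are hypothetical stand-ins for real pages and for get_all_urls(), so that the example runs without network access.

```python
# Hypothetical site: each page name maps to the links found on that page
FAKE_SITE = {
    "index": ["about", "blog"],
    "about": ["index"],
    "blog": ["post1", "post2", "index"],
    "post1": [],
    "post2": ["post1"],
}


def get_links(page):
    # Stand-in for get_all_urls(); a real version would download and parse the page
    return FAKE_SITE.get(page, [])


def web_crawler(seed):
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            # extend adds each link individually, so tocrawl stays a flat list
            tocrawl.extend(get_links(page))
            crawled.append(page)
    return crawled


print(web_crawler("index"))
```

Looping on `while tocrawl:` also removes the need for the counter i and the explicit break, and the `page not in crawled` check keeps pages that link back to each other from being visited twice.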

NB: have a look at http://www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup/ for more help with scraping using BeautifulSoup.

Source

2013-10-25 07:22:20
