Optimizing a Python script with multithreading

Hello! I wrote a small web crawler, but I am new to multithreading and do not know how to optimize it. My code is:

import hashlib

alreadySeenURLs = dict() # fingerprints of already seen URLs, mapped to the URLs
candidates = set() # the set of URL candidates to crawl

def initializeCandidates(url):

    # gets page with urllib2
    page = getPage(url)

    # parses page with BeautifulSoup
    parsedPage = getParsedPage(page)

    # returns all links from the parsed page as a set
    initialURLsFromRoot = getLinksFromParsedPage(parsedPage)

    return initialURLsFromRoot

def updateCandidates(oldCandidates, newCandidates):
    return oldCandidates.union(newCandidates)

candidates = initializeCandidates(rootURL)

# Pop URLs one at a time. A "for url in candidates" loop would keep
# iterating over the initial set only, because candidates is rebound below.
while candidates:

    url = candidates.pop()
    print(len(candidates))

    # fingerprint of URL
    fp = hashlib.sha1(url.encode("utf-8")).hexdigest()

    # checking whether url is in alreadySeenURLs
    if fp in alreadySeenURLs:
        continue

    alreadySeenURLs[fp] = url

    # do some processing
    print(url)

    page = getPage(url)
    parsedPage = getParsedPage(page, fix=True)
    newCandidates = getLinksFromParsedPage(parsedPage)

    candidates = updateCandidates(candidates, newCandidates)

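As you can see, it takes one URL from candidates at a time. I want to make this script multithreaded so that it can take at least N URLs from candidates at once and do the work on them. Can anyone guide me? Any links or suggestions?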

Source

2012-05-23 torayeff

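There are plenty of tutorials on threading; just Google "python threading tutorial". Tutorial on Threads Programming with Python (https://users.info.unicaen.fr/~fmaurel/documents/envrac/python/PyThreads.pdf) is a good one for absolute beginners. – taskinoor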

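You can get started with these two links:

A basic reference for threading in Python: http://docs.python.org/library/threading.html
A tutorial in which they actually implement a multithreaded URL crawler in Python: http://www.ibm.com/developerworks/aix/library/au-threadingpython/

Besides, there is already a crawler for Python: http://scrapy.org/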
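
To make the idea concrete, here is a minimal sketch of one common way to structure such a crawler: a shared thread-safe queue plus N worker threads. It is written for Python 3 and is not taken from either link above. The names crawl, fetch_links, frontier and num_workers are assumptions made for illustration; fetch_links stands in for the getPage, getParsedPage and getLinksFromParsedPage chain from the question.

```python
import hashlib
import queue
import threading


def crawl(root_url, fetch_links, num_workers=4):
    """Crawl from root_url with num_workers threads; return the URLs seen.

    fetch_links(url) is a hypothetical stand-in for the question's
    getPage + getParsedPage + getLinksFromParsedPage chain and must
    return an iterable of URLs.
    """
    frontier = queue.Queue()      # thread-safe replacement for `candidates`
    seen = {}                     # fingerprint -> URL, like alreadySeenURLs
    seen_lock = threading.Lock()  # guards the check-then-insert on `seen`

    def enqueue(url):
        # Queue a URL only if its fingerprint has not been recorded yet.
        fp = hashlib.sha1(url.encode("utf-8")).hexdigest()
        with seen_lock:
            if fp in seen:
                return
            seen[fp] = url
        frontier.put(url)

    def worker():
        while True:
            url = frontier.get()
            if url is None:       # sentinel: no more work for this thread
                frontier.task_done()
                return
            try:
                for link in fetch_links(url):
                    enqueue(link)
            except Exception as exc:
                # A page that fails should not kill the worker thread.
                print("failed to fetch %s: %s" % (url, exc))
            finally:
                frontier.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    enqueue(root_url)
    frontier.join()               # returns once every queued URL is processed

    for _ in threads:
        frontier.put(None)        # one sentinel per worker
    for t in threads:
        t.join()

    return set(seen.values())


if __name__ == "__main__":
    # Tiny in-memory "web" so the sketch runs without network access.
    graph = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["a"]}
    print(sorted(crawl("a", lambda url: graph.get(url, []), num_workers=3)))
```

Unlike the loop in the question, this sketch marks a URL as seen when it is queued rather than when it is processed, so the same URL is never queued twice. Threads help here because the work is mostly waiting on network I/O, during which CPython releases the global interpreter lock.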

Source

2012-05-23 14:59:39 betabandido
