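Optimizing a Python script with multithreading

Hello! I wrote a small web crawler function, but I am new to multithreading and have not been able to optimize it. My code is: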

import hashlib

alreadySeenURLs = dict()  # maps fingerprint -> URL for every URL already crawled
candidates = set()  # the set of candidate URLs still to crawl

def initializeCandidates(url): 

    #gets page with urllib2 
    page = getPage(url) 

    #parses page with BeautifulSoup 
    parsedPage = getParsedPage(page) 

    # function which returns all links from the parsed page as a set
    initialURLsFromRoot = getLinksFromParsedPage(parsedPage) 

    return initialURLsFromRoot 

def updateCandidates(oldCandidates, newCandidates): 
    return oldCandidates.union(newCandidates) 

candidates = initializeCandidates(rootURL) 

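# Pop URLs until the set is empty. A "for" loop over candidates would only
# visit the initial set, because candidates is rebound inside the loop.
while candidates:
    url = candidates.pop()

    print(len(candidates))

    # fingerprint of the URL (encoded to bytes before hashing)
    fp = hashlib.sha1(url.encode("utf-8")).hexdigest()

    # skip URLs that have already been crawled
    if fp in alreadySeenURLs:
        continue

    alreadySeenURLs[fp] = url

    # do some processing
    print(url)

    page = getPage(url)
    parsedPage = getParsedPage(page, fix=True)
    newCandidates = getLinksFromParsedPage(parsedPage)

    candidates = updateCandidates(candidates, newCandidates)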

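As you can see, it takes one URL from candidates at a time. I would like to make this script multithreaded so that it can take at least N URLs from candidates at once and process them in parallel. Can anyone guide me? Any links or suggestions?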

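There are plenty of tutorials on threading; just Google "python threading tutorial". Tutorial on Threads Programming with Python (https://users.info.unicaen.fr/~fmaurel/documents/envrac/python/PyThreads.pdf) is a very good tutorial for absolute beginners. – taskinoor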

Answers
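
One way to approach what the question asks for (up to N URLs being fetched at the same time) is a thread pool. The following is only a minimal sketch in Python 3, not a drop-in replacement for the original script. It assumes a hypothetical `get_links(url)` callable that stands in for the question's `getPage` / `getParsedPage` / `getLinksFromParsedPage` chain, and the names `crawl`, `fingerprint`, `num_workers`, `max_pages`, `FAKE_SITE` and `fake_get_links` are made up for illustration. Only the main thread touches the `seen` dictionary, so no lock is needed; the worker threads do nothing but fetch and parse.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def fingerprint(url):
    """SHA-1 fingerprint of a URL, as used in the question."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def crawl(root_url, get_links, num_workers=4, max_pages=100):
    """Crawl from root_url with at most num_workers fetches running at once.

    get_links(url) is a hypothetical stand-in for the question's helper
    functions; it must return an iterable of URLs found on that page.
    Returns a dict mapping fingerprint -> URL for every URL scheduled.
    """
    seen = {fingerprint(root_url): root_url}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        pending = {pool.submit(get_links, root_url)}
        while pending:
            # Block until at least one fetch has finished.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                try:
                    links = future.result()
                except Exception:
                    continue  # a page that failed contributes no links
                for link in links:
                    fp = fingerprint(link)
                    if fp in seen or len(seen) >= max_pages:
                        continue
                    seen[fp] = link
                    pending.add(pool.submit(get_links, link))
    return seen


# A tiny in-memory "site" (made up) so the sketch runs without network access.
FAKE_SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def fake_get_links(url):
    return FAKE_SITE.get(url, [])


if __name__ == "__main__":
    result = crawl("http://example.com/", fake_get_links, num_workers=3)
    print(sorted(result.values()))
```

Threads are a reasonable fit here because the work is dominated by waiting on the network, and Python releases the global interpreter lock during blocking I/O. The `max_pages` cap is an added assumption to keep the crawl bounded; the original loop has no such limit.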