2013-08-02

Python web crawler: print each step in a separate list

I built a web crawler with the help of a tutorial. It fetches all the links from a given URL, and you can pass a number corresponding to the depth in steps (link distance). Right now, when you set that number in scraperOut = scraper(url, 3) (currently 3), the crawler goes three steps deep and appends all the links to one and the same list. My question is: how can I modify the code so that each step's links go into its own list instead of everything into one list, so that I could, for example, print out just the second step's list? The whole code looks like this:

import urllib 
import re 
import time 
from threading import Thread 
import MySQLdb 
import mechanize 
import readability 
from bs4 import BeautifulSoup 
from readability.readability import Document 
import urlparse 

url = "http://www.adbnews.com/area51/" 

def scraper(root, steps):
    urls = [root]
    visited = [root]
    counter = 0
    while counter < steps:
        step_url = scrapeStep(urls)
        urls = []
        for u in step_url:
            if u not in visited:
                urls.append(u)
                visited.append(u)
        counter += 1

    return visited

def scrapeStep(root):
    result_urls = []
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.addheaders = [('User-agent', 'Firefox')]

    for url in root:
        try:
            br.open(url)
            for link in br.links():
                newurl = urlparse.urljoin(link.base_url, link.url)
                result_urls.append(newurl)
        except:
            print "error"
    return result_urls

d = {} 
threadlist = [] 

def getReadableArticle(url): 
    br = mechanize.Browser() 
    br.set_handle_robots(False) 
    br.addheaders = [('User-agent', 'Firefox')] 

    html = br.open(url).read() 

    readable_article = Document(html).summary() 
    readable_title = Document(html).short_title() 

    soup = BeautifulSoup(readable_article) 

    final_article = soup.text 

    links = soup.findAll('img', src=True) 

    return readable_title 
    return final_article 

def dungalo(urls): 
    article_text = getReadableArticle(urls)[0] 
    d[urls] = article_text 

def getMultiHtml(urlsList):
    for urlsl in urlsList:
        try:
            t = Thread(target=dungalo, args=(urlsl,))
            threadlist.append(t)
            t.start()
        except:
            nnn = True

    for g in threadlist:
        g.join()

    return d


scraperOut = scraper(url,3) 

for s in scraperOut: 
    print s 

#print scraperOut 
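A minimal sketch of the change being asked about: keep the same breadth-first walk as scraper(), but collect each step's newly found links in its own sub-list. A hypothetical get_links callable stands in for scrapeStep() and its mechanize calls, so the sketch runs without network access:

```python
def scraper_by_step(root, steps, get_links):
    # Same traversal as scraper(), but instead of one flat `visited`
    # list we keep one list of newly discovered links per step.
    urls = [root]
    visited = set([root])
    step_lists = [[root]]              # step 0: the starting URL itself
    for _ in range(steps):
        new_urls = []
        for u in urls:
            for link in get_links(u):  # would be scrapeStep([u]) in the real code
                if link not in visited:
                    visited.add(link)
                    new_urls.append(link)
        step_lists.append(new_urls)
        urls = new_urls
    return step_lists

# Hypothetical link graph standing in for real pages
graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
}
per_step = scraper_by_step("a", 2, lambda u: graph.get(u, []))
print(per_step[2])                     # ['d', 'e'] -- just the second step's links
```

With this shape, sum(per_step, []) reproduces the single flat list the original scraper() returned, so nothing downstream has to change.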

Answer

I think that if you change the part of your code that reads:

return readable_title 
    return final_article 

to read:

print readable_title 
    return final_article 

you will get much of what you asked for and have a better chance of your code working! Note: with the original code you will never return final_article, because readable_title is returned first.
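An alternative is to return both values as a tuple, which would also make the getReadableArticle(urls)[0] indexing in dungalo() yield the title rather than the first character of a string. A sketch, with simple string operations standing in for Document(html).short_title() and Document(html).summary() so it runs without mechanize/readability installed:

```python
def getReadableArticle(html):
    # Stand-ins for Document(html).short_title() / .summary():
    readable_title = html.splitlines()[0].strip()
    final_article = "\n".join(html.splitlines()[1:]).strip()
    # Two consecutive `return` statements never both execute;
    # returning a tuple hands back both values in one go.
    return readable_title, final_article

title, article = getReadableArticle("My Title\nBody of the article.")
print(title)    # My Title
print(article)  # Body of the article.
```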

Thanks, I have changed that, but it wasn't my main question – dzordz