2012-05-31

I need some help with a text-cloud program I'm working on. I realize this is homework, but I've gotten a long way on my own and have now been stuck for hours. The part I'm stuck on is the web crawler. The program is supposed to open a page, collect all the words from that page, and sort them by frequency. Then it should open any links on that page, gather the words on those pages, and so on. The depth is controlled by a global variable DEPTH. In the end, it should combine the words from all the pages into a single text cloud.

I'd like to use a recursive call in one function to keep opening links until the depth is reached. The import statement at the top brings in a single function, getHTML(URL), which returns a tuple containing the text of the page and any links found on the page.
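For anyone who doesn't have hmc_urllib, here is a rough standard-library stand-in for the interface I'm assuming: getHTML(url) returns a (text, links) tuple, with the page text as one string and the links as absolute URLs. It's only a sketch, not the real course module.

import re
from urllib.request import urlopen
from urllib.parse import urljoin

def getHTML(url):
    """Rough stand-in for hmc_urllib.getHTML (a guess at its interface):
    returns (text, links)."""
    html = urlopen(url).read().decode('utf-8', errors='replace')
    # Absolute URLs for every href attribute on the page
    links = [urljoin(url, href) for href in re.findall(r'href="([^"]+)"', html)]
    # Crude tag stripping; the real module presumably does better
    text = re.sub(r'<[^>]+>', ' ', html)
    return text, links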

Here's my code so far. Every function works as intended except getRecursiveURLs(url, DEPTH) and makeWords(i). I'm also not 100% sure about the counter(List) function at the bottom.

from hmc_urllib import getHTML 

MAXWORDS = 50 
DEPTH = 2 

all_links = [] 

def getURL(): 
    """Asks the user for a URL""" 

    URL = input('Please enter a URL: ') 

    #all_links.append(URL) 

    return makeListOfWords(URL), getRecursiveURLs(URL, DEPTH) 


def getRecursiveURLs(url, DEPTH): 
    """Opens up all links and adds them to the global all_links list, 
    if they're not in all_links already""" 

    s = getHTML(url) 
    links = s[1] 
    if DEPTH > 0: 
        for i in links: 
            getRecursiveURLs(i, DEPTH - 1) 
            if i not in all_links: 
                all_links.append(i) 
                #print('This is all_links in the IF', all_links) 
                makeWords(i)  #getRecursiveURLs(i, DEPTH - 1) 
            #elif i in all_links: 
            #    print('This is all_links in the ELIF', all_links) 
            #    makeWords(i)  #getRecursiveURLs(i, DEPTH - 1) 
    #print('All_links at the end', all_links) 
    return all_links 





def makeWords(i): 
    """Take all_links and create a dictionary for each page. 
    Then, create a final dictionary of all the words on all pages.""" 

    for i in all_links: 
        FinalDict = makeListOfWords(i) 
        #print(all_links) 
    return FinalDict 


def makeListOfWords(URL): 
    """Gets the text from a webpage and puts the words into a list""" 

    text = getHTML(str(URL)) 
    L = text[0].split() 
    return cleaner(L) 


def cleaner(L): 
    """Cleans the text of punctuation and removes words if they are in the stop list.""" 

    stopList = ['', 'a', 'i', 'the', 'and', 'an', 'in', 'with', 'for', 
                'it', 'am', 'at', 'on', 'of', 'to', 'is', 'so', 'too', 
                'my', 'but', 'are', 'very', 'here', 'even', 'from', 
                'them', 'then', 'than', 'this', 'that', 'though'] 

    x = [dePunc(c) for c in L] 

    # Filter with a comprehension: calling x.remove(c) while iterating 
    # over x skips the element that follows each removal. 
    x = [c for c in x if c not in stopList] 

    a = [stemmer(c) for c in x] 

    return counter(a) 


def dePunc(rawword): 
    """ de-punctuationifies the input string """ 

    L = [ c for c in rawword if 'A' <= c <= 'Z' or 'a' <= c <= 'z' ] 
    word = ''.join(L) 
    return word 


def stemmer(word): 
    """Stems the words""" 

    # List of endings 
    endings = ['ed', 'es', 's', 'ly', 'ing', 'er', 'ers'] 

    # Words shorter than five letters can't be split into stem + suffix 
    # safely (word[-5] below would raise an IndexError). 
    if len(word) < 5: 
        return word 

    # 3-letter suffix WITH a doubled consonant, e.g. spammers -> spam 
    if word[-3:] in endings and word[-4] == word[-5]: 
        return word[:-4] 

    # 3-letter suffix WITHOUT a doubled consonant, e.g. players -> play 
    elif word[-3:] in endings and word[-4] != word[-5]: 
        return word[:-3] 

    # 2-letter suffix WITH a doubled consonant, e.g. spammed -> spam 
    elif word[-2:] in endings and word[-3] == word[-4]: 
        return word[:-3] 

    # 2-letter suffix WITHOUT a doubled consonant, e.g. played -> play 
    elif word[-2:] in endings and word[-3] != word[-4]: 
        return word[:-2] 

    # If the word isn't inflected, return it as-is. 
    else: 
        return word 

def counter(List): 
    """Creates a dictionary of words and their frequencies, 'sorts' them, 
    and prints them from most to least frequent""" 

    freq = {} 
    result = {} 

    # Assign a frequency to each word 
    for item in List: 
        freq[item] = freq.get(item, 0) + 1 

    # 'Sort' the dictionary by frequency 
    for i in sorted(freq, key=freq.get, reverse=True): 
        if len(result) < MAXWORDS: 
            print(i, '(', freq[i], ')', sep='') 
            result[i] = freq[i] 
    return result 

There are plenty of tutorials on how to crawl websites, for example: http://ms4py.org/2010/4/10/python-search-engine-crawler-part-1/. – schlamar


What constraints do you have to work within? I'd suggest crawling with a queue and threads rather than recursion; see the sketch below. –
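A minimal sketch of that queue-based, breadth-first approach, assuming the same getHTML(url) -> (text, links) interface as above (single-threaded here; a thread pool could consume the same queue):

from collections import deque

def crawl_bfs(start_url, depth):
    """Breadth-first crawl to a fixed depth, using a queue instead of
    recursion. Assumes getHTML(url) returns (text, links)."""
    words = []
    visited = {start_url}
    queue = deque([(start_url, depth)])
    while queue:
        url, remaining = queue.popleft()
        text, links = getHTML(url)
        words.extend(text.split())
        if remaining > 0:
            for link in links:
                if link not in visited:
                    visited.add(link)
                    queue.append((link, remaining - 1))
    return words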


Also, what exactly is wrong with getRecursiveURLs()? –

Answer


It isn't entirely clear from the assignment, but from what I can gather you want to visit every page up to the given depth once and only once. You also want to pull all the words off all the pages and process the aggregate result. The snippet below is what you're looking for, but it's untested (I don't have hmc_urllib). all_links, makeWords and makeListOfWords have been removed; the rest of the code stays the same.

visited_links = [] 

def getURL(): 
    url = input('Please enter a URL: ') 
    word_list = getRecursiveURLs(url, DEPTH) 
    return cleaner(word_list) # this prints the word count for all pages 

def getRecursiveURLs(url, DEPTH): 
    text, links = getHTML(url) 
    visited_links.append(url) 
    returned_word_list = text.split() 
    #cleaner(text.split()) # this prints the word count for the current page 

    if DEPTH > 0: 
        for link in links: 
            if link not in visited_links: 
                returned_word_list += getRecursiveURLs(link, DEPTH - 1) 
    return returned_word_list 

Once you've cleaned and stemmed the words, you can use the following functions to build the word-count dictionary and to print the word counts, respectively:

def counter(words): 
    """ 
    Example Input: ['spam', 'egg', 'egg', 'egg', 'spam', 'spam', 'egg', 'egg'] 
    Example Output: {'spam': 3, 'egg': 5} 
    """ 
    return dict((word, words.count(word)) for word in set(words)) 

def print_count(word_count, word_max): 
    """ 
    Example Input: {'spam': 3, 'egg': 5} 
    Prints up to word_max words, sorted by frequency 
    """ 
    for word in sorted(word_count, key=word_count.get, reverse=True)[:word_max]: 
        print(word, '(', word_count[word], ')', sep='') 
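For example, wiring the two together (the word list here is made up):

words = ['spam', 'egg', 'egg', 'egg', 'spam', 'spam', 'egg', 'egg']
word_count = counter(words)        # {'spam': 3, 'egg': 5}
print_count(word_count, MAXWORDS)  # prints egg(5), then spam(3)

Note that words.count(word) rescans the whole list once per distinct word; collections.Counter(words) does the same counting in a single pass.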

Thanks for the reply! The code you gave me correctly prints each word and its frequency for each page, but the dictionary that gets created only contains the words of one page at a time. I need a final dictionary with the words from all the pages as keys and their frequencies as values; that would let me sort them the way I need. Right now it returns something like this: spam(8) page(1) love(1). That's the first page. The next page gives: stem(4) page(2) these(2), and so on. The final result needs to be spam(8) stem(4) page(3) these(2) love(1) –
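If you do end up with one count dictionary per page, one way to get that final dictionary is to merge them. A sketch, where page_counts stands for a hypothetical list of the per-page results:

def merge_counts(page_counts):
    """Merge per-page {word: count} dictionaries into one total dict.
    page_counts is a hypothetical list of the per-page results."""
    total = {}
    for counts in page_counts:
        for word, n in counts.items():
            total[word] = total.get(word, 0) + n
    return total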


I'm not sure why it isn't working for you. By the time you reach 'return cleaner(global_word_list)', 'global_word_list' should contain all the words from all the pages. I've read your code several times, and there's no obvious reason for the behavior you're describing. Have you made any modifications to cleaner or counter? Also, you shouldn't use List as a parameter name for counter: list is a built-in Python type, and shadowing that name can lead to unexpected behavior. –


Sorry for all the confusing comments. I found the problem. First, the line global_word_list += text.split() raised an error saying the local variable was referenced before assignment, so I changed it to global_word_list.append(text.split()). The problem with split is that it creates a list, so when the dictionary is built it sees one list per page's text rather than individual words. I need to figure out how to flatten it into a single list. –
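For the record, the usual fix there is extend rather than append, so the per-page word lists are flattened into one. (append and extend work without a global declaration because they are method calls, while += rebinds the name and so triggers the "referenced before assignment" error inside a function.)

global_word_list = []                              # one flat list for all pages
for page_text in ['spam egg', 'egg spam spam']:    # made-up page texts
    global_word_list.extend(page_text.split())     # extend flattens;
    # .append(page_text.split()) would nest one sub-list per page instead
print(global_word_list)   # ['spam', 'egg', 'egg', 'spam', 'spam']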