2014-02-28

Removing stop words: the following program works with one URL but fails with another

I want to remove stop words from a web page. With FEED_URL = 'http://feeds.feedburner.com/oreilly/radar/atom' the program runs successfully, but when I change the URL it raises an error.

import os
import sys
import json
import feedparser
from BeautifulSoup import BeautifulStoneSoup
from nltk import clean_html

FEED_URL = 'http://feeds.feedburner.com/oreilly/radar/atom'

def cleanHtml(html):
    return BeautifulStoneSoup(clean_html(html),
        convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

fp = feedparser.parse(FEED_URL)

print "Fetched %s entries from '%s'" % (len(fp.entries[0].title), fp.feed.title)
#print "Fetched %s entries from '%s'" % (len(fp.entries[0])

# Collect the title, cleaned content and link of every entry
blog_posts = []
for e in fp.entries:
    blog_posts.append({'title': e.title,
                       'content': cleanHtml(e.content[0].value),
                       'link': e.links[0].href})

# Write the collected posts out as JSON
out_file = os.path.join('resources', 'ch05-webpages', 'feed.json')
f = open(out_file, 'w')
f.write(json.dumps(blog_posts, indent=1))
f.close()
print ('Wrote output file to %s' % (f.name,))

However, when I change the URL as follows, I get an error:

 FEED_URL = 'http://www.thehindu.com' 

Error:

 IndexError        Traceback (most recent call last) 
    <ipython-input-1-b80b4061a360> in <module>() 
    14 fp = feedparser.parse(FEED_URL) 
    15 
    ---> 16 print "Fetched %s entries from '%s'" % (len(fp.entries[0].title), fp.feed.title) 
    17 #print "Fetched %s entries from '%s'" % (len(fp.entries[0]) 
    18 

    IndexError: list index out of range 

Can anyone help me solve this problem?

Answers
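A likely cause, offered as an assumption rather than a confirmed diagnosis: 'http://www.thehindu.com' is the site's HTML home page, not an RSS/Atom feed, so feedparser.parse probably returns a result whose entries list is empty. Indexing fp.entries[0] on an empty list is exactly what raises "IndexError: list index out of range". The sketch below checks for entries before using them. The names describe_feed and main are hypothetical helpers, not part of the original program or of feedparser, and the code is written to run on both Python 2 and Python 3.

```python
def describe_feed(parsed):
    """Return a one-line summary of a parsed feed.

    `parsed` is assumed to look like the result of feedparser.parse():
    it has an `entries` list and a dict-like `feed`.
    Raises ValueError instead of IndexError when there are no entries.
    """
    entries = getattr(parsed, 'entries', [])
    if not entries:
        # Indexing entries[0] at this point is what caused the IndexError.
        raise ValueError('no feed entries found; is the URL an RSS/Atom feed?')
    # The feed title may be missing, so fall back to a placeholder.
    title = parsed.feed.get('title', '(untitled feed)')
    return "Fetched %d entries from '%s'" % (len(entries), title)


def main(url):
    # feedparser is a third-party package; it is imported here so that
    # describe_feed itself has no dependencies.
    import feedparser
    print(describe_feed(feedparser.parse(url)))
```

Calling main(FEED_URL) with the O'Reilly Radar feed URL should print the summary line, while a URL that does not point at a feed should fail with the clearer ValueError. If that is what happens, the fix is to use the site's actual RSS/Atom feed address in place of its home page.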