Removing stop words

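I tried the following program, and it works correctly.

I want to remove stop words from a web page. With FEED_URL = 'http://feeds.feedburner.com/oreilly/radar/atom' the program runs successfully, but when I change the URL it gives an error.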

import os
import sys
import json
import feedparser
from BeautifulSoup import BeautifulStoneSoup
from nltk import clean_html

FEED_URL = 'http://feeds.feedburner.com/oreilly/radar/atom'

# Strip HTML markup and convert HTML entities to plain text
def cleanHtml(html):
    return BeautifulStoneSoup(clean_html(html),
        convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

# Download and parse the feed
fp = feedparser.parse(FEED_URL)

# Note: fp.entries[0] assumes the feed contains at least one entry
print "Fetched %s entries from '%s'" % (len(fp.entries[0].title), fp.feed.title)
#print "Fetched %s entries from '%s'" % (len(fp.entries[0])

# Collect the title, cleaned content and link of every entry
blog_posts = []
for e in fp.entries:
    blog_posts.append({'title': e.title,
                       'content': cleanHtml(e.content[0].value),
                       'link': e.links[0].href})

# Write all posts to a JSON file once the loop has finished
out_file = os.path.join('resources', 'ch05-webpages', 'feed.json')
f = open(out_file, 'w')
f.write(json.dumps(blog_posts, indent=1))
f.close()
print ('Wrote output file to %s' % (f.name,))

However, when I change the URL as follows, I get an error:

 FEED_URL = 'http://www.thehindu.com'

Error:

 IndexError        Traceback (most recent call last) 
    <ipython-input-1-b80b4061a360> in <module>() 
    14 fp = feedparser.parse(FEED_URL) 
    15 
    ---> 16 print "Fetched %s entries from '%s'" % (len(fp.entries[0].title), fp.feed.title) 
    17 #print "Fetched %s entries from '%s'" % (len(fp.entries[0]) 
    18 

    IndexError: list index out of range

Can anyone help me solve this problem?

2014-02-28 Prush

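It looks like the feed URL you are using is incorrect: http://www.thehindu.com is the site's home page, not an RSS or Atom feed.

Try: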

FEED_URL = 'http://www.thehindu.com/?service=rss'

For other feeds, see: http://www.thehindu.com/navigation/?type=rss
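
To illustrate why the original URL fails: when feedparser is pointed at an ordinary HTML page rather than a feed, the parse result usually has an empty entries list, so fp.entries[0] raises IndexError. The following is a minimal sketch of a guard against that case, not part of the original program. The helper names describe_feed and fetch_summary are hypothetical, and the sketch assumes the parse result behaves like a dict with 'entries' and 'feed' keys, as feedparser's result does.

```python
def describe_feed(parsed, url):
    """Return a summary line for a parsed feed.

    `parsed` is assumed to be dict-like with 'entries' and 'feed' keys
    (hypothetical helper; feedparser's parse result has this shape).
    """
    entries = parsed.get('entries', [])
    if not entries:
        # An HTML page or an invalid feed yields no entries
        raise ValueError(
            "No feed entries found at %r; is it an RSS/Atom URL?" % (url,))
    title = parsed.get('feed', {}).get('title', '(untitled)')
    return "Fetched %d entries from '%s'" % (len(entries), title)


def fetch_summary(url):
    """Download and summarize a feed (requires the third-party feedparser)."""
    import feedparser  # imported here so describe_feed works without it
    return describe_feed(feedparser.parse(url), url)
```

Calling fetch_summary(FEED_URL) before the loop over fp.entries would then report a clear message for a non-feed URL instead of an IndexError.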

2014-04-20 09:11:46 shantanoo
