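Emptying a list on each loop iteration in Python?
I am trying to loop over multiple posts on Reddit, go through each linked article, extract its top location entity (by filtering for the highest relevance score), and then add that entity to the list master_locations: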
from __future__ import print_function
from alchemyapi import AlchemyAPI
import json
import urllib2
from bs4 import BeautifulSoup

alchemyapi = AlchemyAPI()
reddit_url = 'http://www.reddit.com/r/worldnews'
urls = []
locations = []
relevance = []
master_locations = []

def get_all_links(page):
    html = urllib2.urlopen(page).read()
    soup = BeautifulSoup(html)
    for a in soup.find_all('a', 'title may-blank ', href=True):
        urls.append(a['href'])
        run_alchemy_entity_per_link(a['href'])

def run_alchemy_entity_per_link(articleurl):
    response = alchemyapi.entities('url', articleurl)
    if response['status'] == 'OK':
        for entity in response['entities']:
            if entity['type'] in entity == 'Country' or entity['type'] == 'Region' or entity['type'] == 'City' or entity['type'] == 'StateOrCountry' or entity['type'] == 'Continent':
                if entity.get('disambiguated'):
                    locations.append(entity['disambiguated']['name'])
                    relevance.append(entity['relevance'])
                else:
                    locations.append(entity['text'])
                    relevance.append(entity['relevance'])
            else:
                locations.append('No Location')
                relevance.append('0')
        max_pos = relevance.index(max(relevance)) # get nth position of the highest relevancy score
        master_locations.append(locations[max_pos]) #Use n to get nth position of location and store that location name to master_locations
        del locations[0] # RESET LIST
        del relevance[0] # RESET LIST
    else:
        print('Error in entity extraction call: ', response['statusInfo'])

get_all_links('http://www.reddit.com/r/worldnews') # Gets all URLs per article, then analyzes entity

for item in master_locations:
    print(item)
But for some reason I think the lists locations and relevance are not being reset. Am I doing something wrong?
The printed result is:
Holland
Holland
Beirut
Beirut
Beirut
Beirut
Beirut
Beirut
Beirut
Beirut
Beirut
Beirut
Beirut
Beirut
Mogadishu
Mogadishu
Mogadishu
Mogadishu
Mogadishu
Mogadishu
Mogadishu
Mogadishu
Johor Bahru
(Possibly because the lists are not being cleared.)
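One likely cause is that `del locations[0]` removes only the first element of the list rather than emptying it, so entries from earlier articles stay behind and can keep winning the `max` lookup. The sketch below reproduces that pattern on made-up data. The function `pick_top_locations`, the `batches` list, and the numeric scores are hypothetical stand-ins for the per-article API results, not part of the original code.

```python
def pick_top_locations(batches, clear_all):
    """Pick the highest-scoring name per batch using shared lists.

    batches   -- hypothetical stand-in for per-article entity results,
                 each a list of (name, score) pairs
    clear_all -- True: empty the lists after each batch (del lst[:])
                 False: drop only the first element (del lst[0]),
                 as in the question's code
    """
    locations, relevance, picked = [], [], []
    for batch in batches:
        for name, score in batch:
            locations.append(name)
            relevance.append(score)
        best = relevance.index(max(relevance))
        picked.append(locations[best])
        if clear_all:
            del locations[:]   # removes every element, list object is kept
            del relevance[:]
        else:
            del locations[0]   # removes only the element at index 0
            del relevance[0]
    return picked


# Made-up data: three "articles" with one clear top location each.
batches = [
    [("Paris", 0.4), ("Holland", 0.9)],
    [("Beirut", 0.5)],
    [("Mogadishu", 0.7)],
]

print(pick_top_locations(batches, clear_all=False))  # stale entries leak through
print(pick_top_locations(batches, clear_all=True))   # one fresh result per batch
```

With `clear_all=False` the second batch still reports Holland, because the leftover 0.9 score outranks Beirut's 0.5, which resembles the repeated names in the output above. Creating `locations` and `relevance` inside the function that processes each article, instead of at module level, would avoid the need to reset them at all.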
I have downvoted because this is a long piece of code, most of it irrelevant, that could have been simplified a lot. http://sscce.org/ – Davidmh 2014-09-06 10:05:46