2014-09-06 27 views
-1

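Emptying a list on each loop iteration in Python?

I am trying to loop through multiple articles on Reddit, go through each article, extract the most relevant entity (by filtering for the highest relevance score), and then add it to the list master_locations: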

from __future__ import print_function 
from alchemyapi import AlchemyAPI 
import json 
import urllib2 
from bs4 import BeautifulSoup 

alchemyapi = AlchemyAPI() 
reddit_url = 'http://www.reddit.com/r/worldnews' 
urls = [] 
locations = [] 
relevance = [] 
master_locations = [] 

def get_all_links(page): 
    html = urllib2.urlopen(page).read() 
    soup = BeautifulSoup(html) 
    for a in soup.find_all('a', 'title may-blank ', href=True): 
     urls.append(a['href']) 
     run_alchemy_entity_per_link(a['href']) 

def run_alchemy_entity_per_link(articleurl): 
    response = alchemyapi.entities('url', articleurl) 
    if response['status'] == 'OK': 
     for entity in response['entities']: 
      if entity['type'] in ('Country', 'Region', 'City', 'StateOrCountry', 'Continent'): 
       if entity.get('disambiguated'): 
        locations.append(entity['disambiguated']['name']) 
        relevance.append(entity['relevance']) 
       else: 
        locations.append(entity['text']) 
        relevance.append(entity['relevance'])   
      else: 
       locations.append('No Location') 
       relevance.append('0') 
     max_pos = relevance.index(max(relevance)) # get nth position of the highest relevancy score 
     master_locations.append(locations[max_pos]) #Use n to get nth position of location and store that location name to master_locations 
     del locations[0] # RESET LIST 
     del relevance[0] # RESET LIST 
    else: 
     print('Error in entity extraction call: ', response['statusInfo']) 

get_all_links('http://www.reddit.com/r/worldnews') # Gets all URLs per article, then analyzes entity 

for item in master_locations: 
    print(item) 

But I think that, for some reason, the lists locations and relevance are not being reset. Am I doing this wrong?

Printing master_locations gives:

Holland 
Holland 
Beirut 
Beirut 
Beirut 
Beirut 
Beirut 
Beirut 
Beirut 
Beirut 
Beirut 
Beirut 
Beirut 
Beirut 
Mogadishu 
Mogadishu 
Mogadishu 
Mogadishu 
Mogadishu 
Mogadishu 
Mogadishu 
Mogadishu 
Johor Bahru 

(Probably because the lists are not being cleared.)

+0

I have downvoted because this is a long piece of code, mostly irrelevant to the problem, that could have been simplified a lot. http://sscce.org/ – Davidmh 2014-09-06 10:05:46

Answers

0

del list[0] deletes only the first item in the list.

To delete all items, use either of the following:

del list[:] 

list[:] = [] 
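
As a minimal sketch of the difference (the list name items and its contents are made up for illustration), the following compares deleting index 0 with clearing the whole list in place:

```python
# Hypothetical sample data, for illustration only.
items = ['Holland', 'Beirut', 'Mogadishu']

del items[0]                    # removes only the first element
after_del_first = list(items)   # snapshot: two elements remain

alias = items                   # second reference to the same list object
del items[:]                    # empties the list in place
after_clear = list(items)       # snapshot: nothing remains

# In-place clearing is visible through every reference to the list.
alias_is_empty = (alias == [])

items.extend(['a', 'b'])
items[:] = []                   # slice assignment also empties the list in place
```

Both forms modify the existing list object rather than rebinding the name, which matters when the list is a module-level global that a function mutates, as in the question.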
+0

I tried changing the resets to 'locations[:] = []' and 'relevance[:] = []', but now I get a 'ValueError: max() arg is an empty sequence' error. – 2014-09-06 09:33:25

+0

@PhillipeDongwooHan, guard the two lines before the 'del' statements with 'if relevance:'. – falsetru 2014-09-06 09:35:07

+0

Thanks! That fixed it! But could you briefly explain why this works? Why is the if condition needed? – 2014-09-06 09:51:19
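
One likely explanation, sketched under the assumption that some articles return no entities at all: in that case the for loop body never runs, relevance stays empty once it is properly cleared, and max() on an empty list raises ValueError. With 'del relevance[0]', leftover items from earlier articles usually kept the list non-empty, which hid the problem. The helper name top_location and the sample values are hypothetical.

```python
def top_location(locations, relevance):
    """Return the location with the highest relevance, or None if there is none."""
    if not relevance:            # an empty list is falsy: nothing to pick from
        return None
    max_pos = relevance.index(max(relevance))
    return locations[max_pos]

# Without the guard, max() fails on an empty list.
try:
    max([])
    raised = False
except ValueError:
    raised = True

# Made-up sample values for illustration.
best = top_location(['Holland', 'Beirut'], [0.3, 0.9])
```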

0

In your case, don't reuse the lists; just create new ones:

from __future__ import print_function 
from alchemyapi import AlchemyAPI 
import json 
import urllib2 
from bs4 import BeautifulSoup 

alchemyapi = AlchemyAPI() 
reddit_url = 'http://www.reddit.com/r/worldnews' 

def get_all_links(page): 
    html = urllib2.urlopen(page).read() 
    soup = BeautifulSoup(html) 
    urls = [] 
    master_locations = [] 
    for a in soup.find_all('a', 'title may-blank ', href=True): 
     urls.append(a['href']) 
     master_locations.append(run_alchemy_entity_per_link(a['href'])) 
    return urls, master_locations 

def run_alchemy_entity_per_link(articleurl): 
    response = alchemyapi.entities('url', articleurl) 
    if response['status'] != 'OK': 
     print('Error in entity extraction call: ', response['statusInfo']) 
     return 
    locations_with_relevance = [] 
    for entity in response['entities']: 
     if entity['type'] in ('Country', 'Region', 'City', 'StateOrCountry', 'Continent'): 
      if entity.get('disambiguated'): 
       location = entity['disambiguated']['name'] 
      else: 
       location = entity['text'] 
      locations_with_relevance.append((float(entity['relevance']), location)) # relevance is fractional, so int() would fail or truncate 
     else: 
      locations_with_relevance.append((0, 'No Location')) 
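    if not locations_with_relevance: # no entities at all: max() would raise ValueError 
     return 'No Location' 
    return max(locations_with_relevance)[1] 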

def main(): 
    _urls, master_locations = get_all_links(reddit_url) # Gets all URLs per article, then analyzes entity 

    for item in master_locations: 
     print(item) 

if __name__ == '__main__': 
    main() 

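When you have several related values to store per item, put them into a tuple and put the tuples into a single list, rather than keeping two or more separate lists.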
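
A small sketch of that idea, with made-up sample values: tuples compare element by element, so max() over (relevance, location) pairs picks the pair with the highest relevance without any index bookkeeping.

```python
# Made-up sample data: (relevance, location) pairs for one article.
locations_with_relevance = [
    (0.33, 'Holland'),
    (0.91, 'Beirut'),
    (0.0, 'No Location'),
]

# Tuples are compared by their first element first, so this finds the
# pair with the highest relevance score.
best_relevance, best_location = max(locations_with_relevance)
print(best_location)
```

If two entries share the same relevance, the comparison falls through to the second element, so the tie is broken by comparing the location names.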

+0

Hmm.. I tried running your code and I get 'TypeError: 'list' object is not callable'? – 2014-09-06 09:32:09

+0

@PhillipeDongwooHan: Corrected. In any case, the point is more to look at the code and spot the differences. – Daniel 2014-09-06 10:03:10