2017-05-01 59 views
-1

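I want to do some simple processing on Twitter data: counting the most frequent words that occur in the dataset. List index out of range when reading a CSV file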

However, I keep getting the following error on line 45:

IndexError Traceback (most recent call last) <ipython-input 346-f03e745247f4> in <module>() 
43 for line in f: 
44 parts = re.split("^\d+\s", line) 
45 tweet = re.split("\s(Status)", parts[-1])[10] 
46 tweet = tweet.replace("\\n"," ") 
47 terms_all = [term for term in process_tweet(tweet)] 
IndexError: list index out of range 

I have added my complete code below for review. Could someone please advise?

import codecs 
import re 
from collections import Counter 
from nltk.corpus import stopwords 

word_counter = Counter() 

def punctuation_symbols(): 
    return [".", "", "$","%","&",";",":","-","&amp;","?"] 

def is_rt_marker(word): 
    if word == "b\"rt" or word == "b'rt" or word == "rt": 
     return True 
    return False 

def strip_quotes(word): 
    if word.endswith("\""): 
     word = word[0:-1] 
    if word.startswith("\""): 
     word = word[1:] 
    return word 

def process_tweet(tweet): 
    keep = [] 
    for word in tweet.split(" "): 
     word = word.lower() 
     word = strip_quotes(word) 
     if len(word) == 0: 
      continue 
     if word.startswith("https"): 
      continue 
     if word in stopwords.words('english'): 
      continue 
     if word in punctuation_symbols(): 
      continue 
     if is_rt_marker(word): 
      continue 
     keep.append(word) 
    return keep 

with codecs.open("C:\\Users\\XXXXX\\Desktop\\USA_TWEETS-out.csv", "r", encoding="utf-8") as f: 
    n = 0 
    for line in f: 
     parts = re.split("^\d+\s", line) 
     tweet = re.split("\s(Status)", parts[1])[0] 
     tweet = tweet.replace("\\n"," ") 
     terms_all = [term for term in process_tweet(tweet)] 
     word_counter.update(terms_all) 

     n += 1 
     if n == 50: 
      break 

print(word_counter.most_common(10)) 
+1

The traceback you shared refers to different code than what you pasted below it. Specifically, 'tweet = re.split("\s(Status)", parts[-1])[10]' versus 'tweet = re.split("\s(Status)", parts[1])[0]'. Can you clarify? – etemple1

+0

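@etemple1: Apologies, the indices should indeed be 1 and 0. I tried different combinations, and the traceback was generated by an earlier iteration. Any idea why [1] and [0] would not work? Also, to clarify: does n = 0 set the index, and does [1] correctly define where the line starts? –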

+0

By the way, '[term for term in process_tweet(tweet)]' is equivalent to 'list(process_tweet(tweet))', which in your case is equivalent to just 'process_tweet(tweet)'. – 9000

Answer

-1
parts = re.split("^\d+\s", line) 
tweet = re.split("\s(Status)", parts[1])[0] 

These are most likely the problematic lines.

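You are assuming that parts was actually split and contains more than one element. The split may fail to find the pattern anywhere in line, in which case parts equals [line] and parts[1] raises an IndexError.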
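To illustrate the point, here is a small sketch of how re.split behaves with the same pattern on two made-up lines (the sample strings are assumptions, not taken from the actual CSV file):

```python
import re

pattern = r"^\d+\s"

# A line that starts with digits followed by whitespace is split in two:
# an empty string before the match and the remainder after it.
matched = re.split(pattern, "123 some tweet text")
print(matched)

# A line without that prefix is returned whole, as a one-element list,
# so indexing it with [1] would raise IndexError.
unmatched = re.split(pattern, "no leading number here")
print(unmatched)
```

Any header row, blank line, or continuation of a multi-line tweet in the file would fall into the second case.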

Add a check before the second line. Print the value of line to get a better idea of what is going on.
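A minimal sketch of such a check might look like the following. The helper name extract_tweet is hypothetical, and the choice to skip unexpected lines (rather than stop) is an assumption about what you want:

```python
import re

PREFIX = re.compile(r"^\d+\s")
STATUS = re.compile(r"\s(Status)")

def extract_tweet(line):
    """Return the tweet text, or None if the line lacks the numeric prefix."""
    parts = PREFIX.split(line)
    if len(parts) < 2:
        # The pattern did not match; show the raw line so it can be inspected.
        print("Skipping unexpected line:", repr(line))
        return None
    # Keep everything before " Status" and turn literal "\n" sequences into spaces.
    tweet = STATUS.split(parts[1])[0]
    return tweet.replace("\\n", " ")
```

Inside the loop you could then call tweet = extract_tweet(line) and continue when it returns None, before passing the tweet to process_tweet.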