List index out of range when reading a CSV file


I want to do some simple processing of Twitter data: I want to count the most frequent words in the dataset.

However, I keep getting the following error on line 45:

IndexError Traceback (most recent call last) <ipython-input 346-f03e745247f4> in <module>() 
43 for line in f: 
44 parts = re.split("^\d+\s", line) 
45 tweet = re.split("\s(Status)", parts[-1])[10] 
46 tweet = tweet.replace("\\n"," ") 
47 terms_all = [term for term in process_tweet(tweet)] 
IndexError: list index out of range

I have added my full code for review. Could someone please advise?

import codecs 
import re 
from collections import Counter 
from nltk.corpus import stopwords 

word_counter = Counter() 

def punctuation_symbols(): 
    return [".", "", "$","%","&",";",":","-","&amp;","?"] 

def is_rt_marker(word):
    if word == "b\"rt" or word == "b'rt" or word == "rt":
        return True
    return False

def strip_quotes(word):
    if word.endswith(""):
        word = word[0:-1]
    if word.startswith(""):
        word = word[1:]
    return word

def process_tweet(tweet):
    keep = []
    for word in tweet.split(" "):
        word = word.lower()
        word = strip_quotes(word)
        if len(word) == 0:
            continue
        if word.startswith("https"):
            continue
        if word in stopwords.words('english'):
            continue
        if word in punctuation_symbols():
            continue
        if is_rt_marker(word):
            continue
        keep.append(word)
    return keep

with codecs.open("C:\\Users\\XXXXX\\Desktop\\USA_TWEETS-out.csv", "r", encoding="utf-8") as f:
    n = 0
    for line in f:
        parts = re.split("^\d+\s", line)
        tweet = re.split("\s(Status)", parts[1])[0]
        tweet = tweet.replace("\\n"," ")
        terms_all = [term for term in process_tweet(tweet)]
        word_counter.update(terms_all)

        n += 1
        if n == 50:
            break

print(word_counter.most_common(10))

Source

2017-05-01 Ankhit Sharma

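The traceback you shared refers to different code from what you pasted below it. In particular, 'tweet = re.split("\s(Status)", parts[-1])[10]' versus 'tweet = re.split("\s(Status)", parts[1])[0]'. Can you clarify? – etemple1

@etemple1: Apologies, it should be [1] and [0]. I tried different combinations, and the traceback was generated by an earlier iteration. Any idea why [1], [0] wouldn't work? Also, could you clarify whether n = 0 sets the index, and whether [1] correctly defines where the line starts? –

By the way, '[term for term in process_tweet(tweet)]' is equivalent to 'list(process_tweet(tweet))', which in your case is equivalent to just 'process_tweet(tweet)'. – 9000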


parts = re.split("^\d+\s", line) 
tweet = re.split("\s(Status)", parts[1])[0]

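These are most likely the problematic lines.

You are assuming that parts really was split and has more than one element. The split may not find the separator pattern in line, in which case parts is equal to [line]. Then parts[1] fails with an IndexError.

Add a check before the second line. Print the value of line to get a better idea of what is going on.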
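To see this in isolation, here is a small sketch. The two sample strings are made up for illustration and are not taken from the asker's file; only the regular expression comes from the question.

```python
import re

# Hypothetical sample lines: one starts with a numeric id, one does not.
line_with_id = "12 hello world Status 200"
line_without_id = "hello world Status 200"

# The pattern matches at the start, so the split yields two elements.
parts_ok = re.split(r"^\d+\s", line_with_id)
print(parts_ok)   # ['', 'hello world Status 200']

# No match: re.split returns the whole line as a single-element list.
parts_bad = re.split(r"^\d+\s", line_without_id)
print(parts_bad)  # ['hello world Status 200']

try:
    parts_bad[1]
except IndexError as exc:
    print("IndexError:", exc)
```

Any line in the CSV that does not begin with digits followed by whitespace (a blank line, a header, or the continuation of a multi-line tweet, for example) would behave like the second case.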
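One way such a check might look is sketched below. The helper name extract_tweet and the sample lines are hypothetical, and skipping unmatched lines is only one possible policy; you may prefer to log or repair them instead.

```python
import re

LEADING_ID = re.compile(r"^\d+\s")      # same pattern as in the question
STATUS_MARKER = re.compile(r"\s(Status)")

def extract_tweet(line):
    """Return the tweet text from one line, or None if the line has no leading numeric id."""
    parts = LEADING_ID.split(line, maxsplit=1)
    if len(parts) < 2:
        # The split found nothing, so parts == [line]; do not index parts[1].
        return None
    tweet = STATUS_MARKER.split(parts[1])[0]
    return tweet.replace("\\n", " ")

# Hypothetical input lines, standing in for the lines read from the CSV file.
sample_lines = [
    "1 first tweet\\nsecond half Status 200",
    "continuation without an id",
    "",
]

for line in sample_lines:
    tweet = extract_tweet(line)
    if tweet is None:
        # Printing the raw line shows what the file actually contains.
        print("skipping unexpected line:", repr(line))
        continue
    print(tweet)
```

In the original loop, the same guard would go right after the first re.split call, before parts[1] is used.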

Source

2017-05-01 18:36:04 9000
