List index out of range when reading CSV file

I want to do some simple processing of Twitter data, in which I count the most frequent words occurring in the dataset. However, I keep getting the following error about line 45:
IndexError Traceback (most recent call last) <ipython-input-346-f03e745247f4> in <module>()
43 for line in f:
44 parts = re.split("^\d+\s", line)
45 tweet = re.split("\s(Status)", parts[-1])[10]
46 tweet = tweet.replace("\\n"," ")
47 terms_all = [term for term in process_tweet(tweet)]
IndexError: list index out of range
I have included my full code below for review; can someone please advise?
import codecs
import re
from collections import Counter
from nltk.corpus import stopwords

word_counter = Counter()

def punctuation_symbols():
    return [".", ",", "$", "%", "&", ";", ":", "-", "&", "?"]

def is_rt_marker(word):
    if word == "b\"rt" or word == "b'rt" or word == "rt":
        return True
    return False

def strip_quotes(word):
    if word.endswith('"'):
        word = word[0:-1]
    if word.startswith('"'):
        word = word[1:]
    return word

def process_tweet(tweet):
    keep = []
    for word in tweet.split(" "):
        word = word.lower()
        word = strip_quotes(word)
        if len(word) == 0:
            continue
        if word.startswith("https"):
            continue
        if word in stopwords.words('english'):
            continue
        if word in punctuation_symbols():
            continue
        if is_rt_marker(word):
            continue
        keep.append(word)
    return keep

with codecs.open("C:\\Users\\XXXXX\\Desktop\\USA_TWEETS-out.csv", "r", encoding="utf-8") as f:
    n = 0
    for line in f:
        parts = re.split("^\d+\s", line)
        tweet = re.split("\s(Status)", parts[1])[0]
        tweet = tweet.replace("\\n", " ")
        terms_all = [term for term in process_tweet(tweet)]
        word_counter.update(terms_all)
        n += 1
        if n == 50:
            break

print(word_counter.most_common(10))
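Since the CSV itself isn't shown, the following is only a sketch of one plausible failure mode: when a line does not begin with digits followed by whitespace (a header row, for instance), `re.split("^\d+\s", line)` returns the whole line as a single-element list, so `parts[1]` raises an IndexError. The sample line below is invented for illustration.

```python
import re

# A line that begins with a row number splits into two parts:
parts_ok = re.split(r"^\d+\s", "12 b'rt example tweet' Status(...)")
# parts_ok == ['', "b'rt example tweet' Status(...)"]

# A line that does not (e.g. a header row) comes back unchanged as a
# single-element list, so parts[1] would raise IndexError:
parts_bad = re.split(r"^\d+\s", "tweet_id,text")
# len(parts_bad) == 1

# Guarding the indexing avoids the crash:
for parts in (parts_ok, parts_bad):
    if len(parts) > 1:
        tweet = re.split(r"\s(Status)", parts[1])[0]
```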
The traceback you shared references different code from what you pasted below it. Specifically, `tweet = re.split("\s(Status)", parts[-1])[10]` versus `tweet = re.split("\s(Status)", parts[1])[0]`. Can you clarify? – etemple1
@etemple1: Apologies, that should also be [1], [0]. I tried different combinations, and the traceback was generated for an earlier iteration. Any idea why [1], [0] doesn't work? Also to clarify: does n = 0 set the index, and does [1] define that the line starts correctly? –
By the way, `[term for term in process_tweet(tweet)]` is equivalent to `list(process_tweet(tweet))`, which in your case is equivalent to just `process_tweet(tweet)`. – 9000
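To illustrate 9000's point, a minimal sketch using a simplified stand-in for `process_tweet` (the real function filters stopwords and punctuation as well):

```python
def process_tweet(tweet):
    # simplified stand-in for the question's process_tweet
    return [w.lower() for w in tweet.split(" ") if w]

tweet = "Hello Twitter World"
a = [term for term in process_tweet(tweet)]  # the question's spelling
b = list(process_tweet(tweet))               # equivalent
c = process_tweet(tweet)                     # already a list, no copy needed
assert a == b == c == ["hello", "twitter", "world"]
```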