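Python: Preprocessing text

I am trying to preprocess a string with a lemmatizer and then remove punctuation and digits. I am using the code below to do this. I don't get any errors, but the text is not preprocessed properly: only the stop words are removed, while lemmatization does not work and the punctuation and digits remain.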

from nltk.stem import WordNetLemmatizer 
import string 
import nltk 
tweets = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34." 
lemmatizer = WordNetLemmatizer() 
tweets = lemmatizer.lemmatize(tweets) 
data=[] 
stop_words = set(nltk.corpus.stopwords.words('english')) 
words = nltk.word_tokenize(tweets) 
words = [i for i in words if i not in stop_words] 
data.append(' '.join(words)) 
corpus = " ".join(str(x) for x in data) 
p = string.punctuation 
d = string.digits 
table = str.maketrans(p, len(p) * " ") 
corpus.translate(table) 
table = str.maketrans(d, len(d) * " ") 
corpus.translate(table) 
print(corpus)

The final output I get is:

This beautiful day16~ . I ; working exercise45.^^^45 text34 .

And the expected output should be:

This beautiful day I work exercise text

Source

2017-10-16 Alex

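I would use a regular expression to get rid of the noise before calling the lemmatizer. –

Thanks for the suggestion. But shouldn't the code above work the way I expect? I have used the same code before and it worked fine, so I don't know why it isn't working this time. – Alex

No, your current approach will not work, because you have to pass one word at a time to the lemmatizer/stemmer. These functions expect single words, so they have no way of knowing that your string should be interpreted as a sentence.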

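import re
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
__stop_words = set(nltk.corpus.stopwords.words('english'))
def clean(tweet):
    # Lowercase, then strip punctuation and digits in one pass
    cleaned_tweet = re.sub(r'([^\w\s]|\d)+', '', tweet.lower())
    # Lemmatize word by word (as verbs, 'v'), skipping stop words
    return ' '.join([lemmatizer.lemmatize(i, 'v') for i in cleaned_tweet.split() if i not in __stop_words])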
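To see why calling the lemmatizer on the whole string changes nothing, here is a minimal standard-library sketch. `toy_lemmatize` and `TOY_LEMMAS` are hypothetical stand-ins, not part of NLTK. They only imitate the relevant behaviour of a word-level lemmatizer, which looks up one token and returns it unchanged when it has no entry for it.

```python
# Hypothetical stand-in for a word-level lemmatizer: it looks up ONE token
# and returns the input unchanged when the token is unknown.
TOY_LEMMAS = {"working": "work", "is": "be", "am": "be"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

sentence = "i am working on an exercise"

# Whole sentence treated as a single "word": no entry matches, nothing changes.
whole = toy_lemmatize(sentence)

# One token at a time: each word gets its own lookup.
per_word = " ".join(toy_lemmatize(w) for w in sentence.split())

print(whole)     # i am working on an exercise
print(per_word)  # i be work on an exercise
```

The same pattern applies to the real lemmatizer: split first, then call it once per token.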

Alternatively, you can use a PorterStemmer, which does much the same job as lemmatisation, but without context.

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

And call the stemmer like this:

stemmer.stem(i)

Source

2017-10-16 21:43:59

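Hey, could you also tell me how to preprocess the text if it is a dataframe column? I want to remove all punctuation and digits, lemmatize the text, and remove stop words from every row of one dataframe column. – Alex

@Ritika Define a function, then pass it to df.apply ... –

Thanks a lot :) – Alex

I think this is what you are looking for, but, as the commenter noted, do this before calling the lemmatizer.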
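A rough sketch of that suggestion, assuming pandas is installed and the text lives in a column named `tweet` (the column names, the sample rows, and the tiny `STOP_WORDS` set are all made up for illustration). It uses `Series.apply` on the single column, and leaves the lemmatizer call as a comment so that the example runs without NLTK data.

```python
import re
import pandas as pd

# Small illustrative stop-word set (an assumption); NLTK's English list
# could be swapped in here.
STOP_WORDS = {"is", "a", "i", "am", "on", "an", "this"}

def clean_text(text):
    # Keep letters and whitespace only, then lowercase.
    letters_only = re.sub(r"[^A-Za-z\s]", "", text).lower()
    # Drop stop words; a per-word lemmatizer call would wrap `w` here.
    return " ".join(w for w in letters_only.split() if w not in STOP_WORDS)

df = pd.DataFrame({"tweet": [
    "This is a beautiful day16~.",
    "I am; working on an exercise45.^^^45 text34.",
]})

# Series.apply calls clean_text once per row of the column.
df["clean"] = df["tweet"].apply(clean_text)
print(df["clean"].tolist())  # ['beautiful day', 'working exercise text']
```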

>>> import re
>>> s = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
>>> s = re.sub(r'[^A-Za-z ]', '', s)
>>> print(s)
This is a beautiful day I am working on an exercise text
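A related detail that may explain why punctuation and digits survive in the question's code: Python strings are immutable, so `str.translate` returns a new string, and the question calls `corpus.translate(table)` without assigning the result. The sketch below uses only the standard library. The sample string is taken from the question's output, and deleting the characters (rather than mapping them to spaces) is a choice made here for brevity.

```python
import string

corpus = "day16~ . I ; working exercise45.^^^45 text34 ."

# Third argument of str.maketrans lists characters to delete.
table = str.maketrans("", "", string.punctuation + string.digits)

# translate() returns a NEW string; this call's result is thrown away.
corpus.translate(table)
unchanged = corpus  # still contains digits and punctuation

# Keep the result, then collapse the leftover runs of spaces.
cleaned = " ".join(corpus.translate(table).split())

print(cleaned)  # day I working exercise text
```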

Source

2017-10-16 21:37:15
