Here is a small example using ngrams from nltk. Hope it helps:
import pandas as pd
from nltk.util import ngrams
from nltk import word_tokenize  # needs NLTK's Punkt tokenizer data, e.g. nltk.download('punkt')

# Creating test dataframe
df = pd.DataFrame({'text': ['my first sentence',
                            'this is the second sentence',
                            'third sent of the dataframe']})
print(df)
Input dataframe:
text
0 my first sentence
1 this is the second sentence
2 third sent of the dataframe
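Now we can use ngrams together with word_tokenize to build bigrams and trigrams, and apply it to each row of the dataframe. For bigrams, we pass the value 2 to the ngrams function along with the tokenized words, and for trigrams we pass the value 3. The result returned by ngrams is a generator, so it is converted to a list. For each row, the lists of bigrams and trigrams are stored in separate columns.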
df['bigram'] = df['text'].apply(lambda row: list(ngrams(word_tokenize(row), 2)))
df['trigram'] = df['text'].apply(lambda row: list(ngrams(word_tokenize(row), 3)))
print(df)
Result:
text \
0 my first sentence
1 this is the second sentence
2 third sent of the dataframe
bigram \
0 [(my, first), (first, sentence)]
1 [(this, is), (is, the), (the, second), (second, sentence)]
2 [(third, sent), (sent, of), (of, the), (the, dataframe)]
trigram
0 [(my, first, sentence)]
1 [(this, is, the), (is, the, second), (the, second, sentence)]
2 [(third, sent, of), (sent, of, the), (of, the, dataframe)]
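If it helps to see why the list(...) call is needed, here is a rough pure-Python sketch of what an n-gram generator does. Note that simple_ngrams is a hypothetical helper written only for illustration, not NLTK's actual implementation, and a plain whitespace split stands in for word_tokenize so the snippet runs without NLTK installed:

```python
from itertools import islice


def simple_ngrams(tokens, n):
    """Return a lazy iterator of n-token tuples (illustrative helper, not NLTK's code)."""
    tokens = list(tokens)
    # Build n views of the token list, each shifted by one more position,
    # then zip them so every tuple holds n consecutive tokens.
    return zip(*(islice(tokens, i, None) for i in range(n)))


# Assumption: str.split() is used here in place of word_tokenize.
tokens = "this is the second sentence".split()

bigrams = simple_ngrams(tokens, 2)
print(list(bigrams))  # [('this', 'is'), ('is', 'the'), ('the', 'second'), ('second', 'sentence')]
print(list(bigrams))  # [] because the iterator is already exhausted
```

Like the generator returned by ngrams, the iterator can only be consumed once, which is why the answer wraps each call in list(...) before storing it in a dataframe column.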
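How about you 1) don't post pictures, 2) don't post links to pictures, 3) much less links to pictures of _excel_ data. –
And read: http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples –
There is an 'ngrams' function in nltk that does this easily; it takes an argument for the number of words you want to group together. – kev8484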