Text mining with the tm-package - word stemming

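I am doing some text mining in R with the tm-package. Everything works very smoothly. However, one problem occurs after stemming (http://en.wikipedia.org/wiki/Stemming). Obviously, there are some words that share the same stem, but it is important that they are not "thrown together" (because those words mean different things).

For an example, see the 4 texts below. Here you cannot use "lecturer" and "lecture" (or "association" and "associate") interchangeably. However, this is exactly what happens in step 4.

Is there an elegant way to handle some cases/words manually (e.g. so that "lecturer" and "lecture" are kept as two different things)?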

library(tm) 

texts <- c("i am member of the XYZ association", 
"apply for our open associate position", 
"xyz memorial lecture takes place on wednesday", 
"vote for the most popular lecturer") 

# Step 1: Create corpus 
corpus <- Corpus(DataframeSource(data.frame(texts))) 

# Step 2: Keep a copy of corpus to use later as a dictionary for stem completion 
corpus.copy <- corpus 

# Step 3: Stem words in the corpus 
corpus.temp <- tm_map(corpus, stemDocument, language = "english") 

inspect(corpus.temp) 

# Step 4: Complete the stems to their original form 
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy) 

inspect(corpus.final)

Source

2013-04-17 majom

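That is the point of stemming. You do it to get the root words. If you want to keep the differences, don't stem. –

I know. But is there an elegant way to change this in some cases? – majom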

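I'm not 100% sure what you're after, and I don't totally get how tm_map works. If I understand correctly, the following works. As I understand it, you want to supply a list of words that should not be stemmed. I'm using the qdap package mostly because I'm lazy and it has a function, mgsub, that I like.

Note that I got frustrated using mgsub with tm_map because it kept throwing an error, so I just used lapply instead.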

texts <- c("i am member of the XYZ association", 
    "apply for our open associate position", 
    "xyz memorial lecture takes place on wednesday", 
    "vote for the most popular lecturer") 

library(tm) 
# Step 1: Create corpus 
corpus.copy <- corpus <- Corpus(DataframeSource(data.frame(texts))) 

library(qdap) 
# Step 2: words to retain and their identifier keys 
retain <- c("lecturer", "lecture") 
replace <- paste(seq_len(length(retain)), "SPECIAL_WORD", sep="_") 

# Step 3: sub the words you want to retain with identifier keys 
corpus[seq_len(length(corpus))] <- lapply(corpus, mgsub, pattern=retain, replacement=replace) 

# Step 4: Stem it 
corpus.temp <- tm_map(corpus, stemDocument, language = "english") 

# Step 5: reverse -> sub the identifier keys with the words you want to retain 
corpus.temp[seq_len(length(corpus.temp))] <- lapply(corpus.temp, mgsub, pattern=replace, replacement=retain) 

inspect(corpus)  #inspect the pieces for the folks playing along at home 
inspect(corpus.copy) 
inspect(corpus.temp) 

# Step 6: complete the stem 
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy) 
inspect(corpus.final)

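Basically it works by:

subbing out the supplied "NO STEM" words for unique identifier keys (mgsub)
then stemming (using stemDocument)
next reversing it and subbing the identifier keys back to the "NO STEM" words (mgsub)
last, completing the stems (stemCompletion)

Here is the output: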

## >  inspect(corpus.final) 
## A corpus with 4 text documents 
## 
## The metadata consists of 2 tag-value pairs and a data frame 
## Available tags are: 
## create_date creator 
## Available variables in the data frame are: 
## MetaID 
## 
## $`1` 
## i am member of the XYZ associate 
## 
## $`2` 
## for our open associate position 
## 
## $`3` 
## xyz memorial lecture takes place on wednesday 
## 
## $`4` 
## vote for the most popular lecturer
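
The same goal can also be reached without identifier keys, by skipping the protected words during stemming instead of masking and restoring them. The following is only a minimal sketch of that variant, not the approach used in the answer above: it works on plain character vectors rather than a tm corpus, it assumes the SnowballC package is installed, and the helper name stem_except is made up for illustration.

```r
library(SnowballC)

# Hypothetical helper: stem every whitespace-separated token of `text`
# except the ones listed in `retain` (compared in lower case).
# Assumption: tokens carry no attached punctuation.
stem_except <- function(text, retain, language = "english") {
  tokens <- strsplit(text, " ", fixed = TRUE)[[1]]
  keep <- tolower(tokens) %in% retain
  if (any(!keep)) {
    tokens[!keep] <- wordStem(tokens[!keep], language = language)
  }
  paste(tokens, collapse = " ")
}

texts <- c("i am member of the XYZ association",
           "apply for our open associate position",
           "xyz memorial lecture takes place on wednesday",
           "vote for the most popular lecturer")

# Words that must survive stemming unchanged
retain <- c("lecturer", "lecture")

stemmed <- vapply(texts, stem_except, character(1),
                  retain = retain, USE.NAMES = FALSE)
print(stemmed)
```

Because "lecture" and "lecturer" never pass through the stemmer, they need no restoring afterwards. All other words are still reduced to their stems, so a completion step would still be needed if full words are wanted in the result.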

Source

2013-04-18 00:01:27

Thanks for your help. Works great. – majom

You can also use the following package for stemming words: https://cran.r-project.org/web/packages/SnowballC/SnowballC.pdf.

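You just need to use the function wordStem, passing it the vector of words to be stemmed and the language you are dealing with. To find the exact language string to use, you can call getStemLanguages, which returns all the available options.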
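
As a small illustration of those two calls, here is a sketch that assumes SnowballC is installed and uses the four problem words from the question as input. It also shows why the question arises: each pair collapses to a single stem.

```r
library(SnowballC)

# Language strings accepted by the installed SnowballC version
print(getStemLanguages())

# The problem words from the question
words <- c("lecture", "lecturer", "association", "associate")
stems <- wordStem(words, language = "english")

# Show each word next to its stem
print(data.frame(word = words, stem = stems))
```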

Kind regards

Source

2017-07-04 02:06:51 brunoazev
