2017-05-05 53 views

Extracting n-grams with R

I'm trying to extract 3-grams from the Nirvana lyrics below, so I'm using the ngramrr package for this.

require(ngramrr) 
require(tm) 
require(magrittr) 

nirvana <- c("hello hello hello how low", "hello hello hello how low", 
      "hello hello hello how low", "hello hello hello", 
      "with the lights out", "it's less dangerous", "here we are now", "entertain us", 
      "i feel stupid", "and contagious", "here we are now", "entertain us", 
      "a mulatto", "an albino", "a mosquito", "my libido", "yeah", "hey yay") 

ngramrr(nirvana[1], ngmax = 3) 

Corpus(VectorSource(nirvana)) 

I get this result:

[1] "hello"    "hello"    "hello"    "how"    "low"    "hello hello"  "hello hello"  
[8] "hello how"   "how low"   "hello hello hello" "hello hello how" "hello how low" 

I would like to know how to build a TermDocumentMatrix whose terms are this list of n-grams (up to trigrams).

Thanks


I would use 'quanteda' and convert to 'tm' format: 'nirvana %>% tokens(ngrams = 1:3) %>% dfm %>% convert(to = "tm")'


@amatsuo_net Thank you. Could you help me with an R example?


@Cath Thanks ;)

Answer


My comment above was almost complete, but here it is in full:

require(quanteda) # provides tokens(), dfm(), and convert()

nirvana %>% tokens(ngrams = 1:3) %>% # generate 1- to 3-gram tokens
    dfm %>% # build a document-feature matrix
    convert(to = "tm") %>% # convert to tm's DocumentTermMatrix
    t # transpose it into a TermDocumentMatrix
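
The pipeline above relies on the ngrams argument of tokens(), which, as far as I know, newer quanteda releases no longer accept. Here is a minimal sketch of the same idea using tokens_ngrams() instead. It assumes quanteda 2.0 or later with tm installed, and it uses a shortened copy of the lyrics vector. The concatenator = " " setting is my own choice, so that the terms look like the ngramrr output rather than being joined by the default underscore.

```r
library(quanteda) # assumption: quanteda >= 2.0
library(tm)       # convert(to = "tm") needs tm installed

# Shortened copy of the lyrics vector from the question
nirvana <- c("hello hello hello how low", "with the lights out",
             "here we are now", "entertain us")

toks <- tokens(nirvana)                                  # split into word tokens
toks <- tokens_ngrams(toks, n = 1:3, concatenator = " ") # unigrams to trigrams
dtm  <- convert(dfm(toks), to = "tm")                    # tm DocumentTermMatrix
tdm  <- t(dtm)                                           # terms in rows, documents in columns

inspect(tdm)
```

If only trigrams are wanted as terms, passing n = 3 instead of n = 1:3 should restrict the matrix accordingly.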