2014-04-27 25 views
1

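For example, I have the element "computer" in a vector. I need to get a vector consisting of "c", "o", "m", "p", "u", "t", "e", "r". How can I split an element of a vector into its individual letters?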

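The second part of my question is optional. How can I create a vector containing combinations of the letters from that element, where the letters in each combination appear only in the order they have in the original word? For example, I want the vector to contain "puter" or "mpu", not something like "tumpo".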

Answers

1

The first part of the question is easy:

splits <- unlist(strsplit("computer",split="")) 

> splits 
[1] "c" "o" "m" "p" "u" "t" "e" "r" 
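
Since the question mentions an element inside a vector, it may help to see what happens with more than one word. The sketch below assumes a hypothetical input vector `words`; `strsplit` returns a list with one character vector per element, so `unlist` is only needed when a single flat vector is wanted.

```r
# Hypothetical input vector; only "computer" comes from the question.
words <- c("computer", "mouse")

# strsplit is vectorized: one list entry per input element.
letters_by_word <- strsplit(words, split = "")
names(letters_by_word) <- words

# Pick out the letters of a single element by name.
mouse_letters <- letters_by_word[["mouse"]]
print(mouse_letters)
```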

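For the second part, you can use the following code, which builds every run of consecutive letters, from length 1 up to the whole word: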

subseqs <-
    unlist(
        # x is the substring length, y is the starting position
        lapply(seq_along(splits), FUN = function(x) {
            lapply(seq_len(length(splits) + 1 - x), FUN = function(y) {
                paste(splits[y:(y + x - 1)], collapse = "")
            })
        })
    )
> subseqs 
[1] "c"  "o"  "m"  "p"  "u"  "t"  "e"  
[8] "r"  "co"  "om"  "mp"  "pu"  "ut"  "te"  
[15] "er"  "com"  "omp"  "mpu"  "put"  "ute"  "ter"  
[22] "comp"  "ompu"  "mput"  "pute"  "uter"  "compu" "omput" 
[29] "mpute" "puter" "comput" "ompute" "mputer" "compute" "omputer" 
[36] "computer" 
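
As an alternative sketch (not part of the answer above), the same set of consecutive substrings can be produced without nested `lapply` calls by relying on the vectorized base function `substring`. The function name `all_substrings` is made up for this illustration.

```r
# Hypothetical helper: all runs of consecutive letters in a single word,
# ordered by length and then by starting position.
all_substrings <- function(word) {
  n <- nchar(word)
  counts <- n - seq_len(n) + 1          # how many substrings of each length
  lens   <- rep(seq_len(n), times = counts)
  starts <- sequence(counts)            # 1..n, 1..(n-1), ..., 1
  substring(word, starts, starts + lens - 1)
}

res <- all_substrings("computer")
print(res)
```

For a word of n letters this yields n(n+1)/2 substrings, which is 36 for "computer".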
3

You can use

strsplit("computer", "\\b") 

and

library("RWeka") 
gsub(" ", "", 
    NGramTokenizer(paste(strsplit("computer", "\\b")[[1]], collapse=" "), 
        Weka_control(min=2, 
           max=5)), 
    fixed=TRUE) 
# [1] "compu" "omput" "mpute" "puter" "comp" 
# [6] "ompu" "mput" "pute" "uter" "com" 
# [11] "omp" "mpu" "put" "ute" "ter" 
# [16] "co"  "om" "mp" "pu" "ut" 
# [21] "te" "er" 

to create n-grams with 2 <= n <= 5.

0

Combinations of three consecutive letters:

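x <- strsplit("computer", "\\b")[[1]]   # strsplit returns a list; take the character vector
y <- combn(seq_along(x), 3); m <- match(1:6, y[1,])   # first column starting at each index 1..6
combn(x, 3)[, m]                        # each column holds three consecutive letters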
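
If strings are wanted rather than a matrix with one letter per cell, a minimal sketch along the same lines is to paste each window of three letters together directly. This is a simpler stand-in for the `combn` approach, not the answer's own method, and it splits with `split = ""` for clarity.

```r
x <- strsplit("computer", split = "")[[1]]

# One window per valid starting position: 1 .. length(x) - 2.
starts <- seq_len(length(x) - 2)
trigrams <- vapply(starts,
                   function(i) paste(x[i:(i + 2)], collapse = ""),
                   character(1))
print(trigrams)
```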

