2015-06-28 25 views
3

How do I remove non-alphabetic characters and convert all letters to lowercase in R?

In the following string:

"I may opt for a yam for Amy, May, and Tommy." 

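how can I remove the non-alphabetic characters, convert all letters to lowercase, and sort the letters within each word in R?

I am also trying to sort the words in the sentence and remove duplicates.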

+3

Can you show us what you have tried (http://mattgemmell.com/what-have-you-tried/) so far? – zero323

+1

Can you provide an example string and the expected output? To convert to lowercase, just use 'tolower'. – Molx

+2

"Sort the letters within each word"? – hrbrmstr

Answers

4
str <- "I may opt for a yam for Amy, May, and Tommy." 

## Clean the words (just keep letters and convert to lowercase) 
words <- strsplit(tolower(gsub("[^A-Za-z ]", "", str)), " ")[[1]] 

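## Split the words into characters and sort them 
## (simplify = FALSE always returns a list, even if all words have the same length) 
sortedWords <- sapply(words, function(word) sort(unlist(strsplit(word, ""))), simplify = FALSE) 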

## Join the sorted letters back together 
sapply(sortedWords, paste, collapse="") 

# i  may  opt  for  a  yam  for  amy  may  and 
# "i" "amy" "opt" "for"  "a" "amy" "for" "amy" "amy" "adn" 
# tommy 
# "mmoty" 

## If you want to convert the result back to a single string 
## (collapse avoids passing the words as named arguments to paste) 
paste(sapply(sortedWords, paste, collapse=""), collapse=" ") 
# [1] "i amy opt for a amy for amy amy adn mmoty" 
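
For reuse, the steps above can be wrapped in one function. This is only a minimal base-R sketch: the name `sort_letters` is hypothetical and not part of the original answer, and it assumes that words are separated by spaces and that only ASCII letters should be kept.

```r
# Hypothetical helper (not from the original answer): clean a sentence,
# then sort the letters inside each word.
sort_letters <- function(x) {
  # keep ASCII letters and spaces only, then convert to lowercase
  cleaned <- tolower(gsub("[^A-Za-z ]", "", x))
  # split on runs of spaces and drop any empty tokens
  words <- strsplit(cleaned, " +")[[1]]
  words <- words[nzchar(words)]
  # vapply guarantees exactly one string per word, whatever the word lengths
  vapply(words,
         function(w) paste(sort(strsplit(w, "")[[1]]), collapse = ""),
         character(1), USE.NAMES = FALSE)
}

sorted <- sort_letters("I may opt for a yam for Amy, May, and Tommy.")
paste(sorted, collapse = " ")   # back to a single string
sort(unique(sorted))            # sorted words with duplicates removed
```

The last line covers the second part of the question (sorting the words and dropping duplicates); apply `sort(unique(...))` to the cleaned words instead if the letters inside each word should stay in their original order.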
4

stringr lets you work with all character sets in R at C speed, and magrittr lets you use the pipe idiom, which suits this task well:

library(stringr) 
library(magrittr) 

txt <- "I may opt for a yam for Amy, May, and Tommy." 

txt %>% 
    str_to_lower %>%           # lowercase 
    str_replace_all("[[:punct:][:digit:][:cntrl:]]", "") %>% # only alpha 
    str_replace_all("[[:space:]]+", " ") %>%     # single spaces 
    str_split(" ") %>%           # tokenize 
    extract2(1) %>%            # str_split returns a list 
    sort %>%             # sort 
    unique              # unique words 

    ## [1] "a"  "amy" "and" "for" "i"  "may" "opt" "tommy" "yam" 
+0

I wonder whether 'txt %>% str_to_lower %>% str_replace_all("[^ [:alpha:]']", "") %>% str_split(" +") %>% extract2(1) %>% sort %>% unique' might be more streamlined. –

5

You can use stringi:

library(stringi) 
unique(stri_sort(stri_trans_tolower(stri_extract_all_words(txt, simplify = TRUE)))) 

which gives:

## [1] "a"  "amy" "and" "for" "i"  "may" "opt" "tommy" "yam" 

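Update

As mentioned by @DavidArenburg, I overlooked the "sort the letters within the words" part of your question. You did not provide the desired output and no immediate application comes to mind, but assuming you want to find out which words have a matching counterpart (a string distance of 0):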

library(magrittr)   # provides %>% 
library(stringdist) # provides stringdistmatrix() 

unique(stri_sort(stri_trans_tolower(stri_extract_all_words(txt, simplify = TRUE)))) %>% 
    stringdistmatrix(., ., useNames = "strings", method = "qgram") %>% 

#  a amy and for i may opt tommy yam 
# a  0 2 2 4 2 2 4  6 2 
# amy 2 0 4 6 4 0 6  4 0 
# and 2 4 0 6 4 4 6  8 4 
# for 4 6 6 0 4 6 4  6 6 
# i  2 4 4 4 0 4 4  6 4 
# may 2 0 4 6 4 0 6  4 0 
# opt 4 6 6 4 4 6 0  4 6 
# tommy 6 4 8 6 6 4 4  0 4 
# yam 2 0 4 6 4 0 6  4 0 

    apply(., 1, function(x) sum(x == 0, na.rm=TRUE)) 

# a amy and for  i may opt tommy yam 
# 1  3  1  1  1  3  1  1  3 

Words with more than one 0 in their row ("amy", "may", "yam") have a scrambled counterpart, i.e. an anagram, among the other words.
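
A distance matrix is not the only way to find such groups. As a sketch of a different technique (not what the code above does), each word can be keyed by its sorted letters in base R, so that words sharing a key are anagrams of one another. The variable names `key` and `groups` are illustrative only.

```r
words <- c("a", "amy", "and", "for", "i", "may", "opt", "tommy", "yam")

# Key each word by its letters in sorted order, e.g. "may" -> "amy"
key <- vapply(strsplit(words, ""),
              function(ch) paste(sort(ch), collapse = ""),
              character(1))

# Words that share a key are anagrams of one another
groups <- split(words, key)

# Keep only the groups with more than one member
groups[lengths(groups) > 1]
```

This does one pass over the words instead of comparing every pair, which may matter for longer texts.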

+1

These days I tend to use 'stringr' because it uses 'stringi' under the hood, but that 'stri_extract_all_words' function looks very handy. I may have to go back to using 'stringi'. – hrbrmstr

+1

Yes. 'stringr' is simpler, but I find 'stringi' more flexible. –

+0

@hrbrmstr I think you both overlooked the "*sort the letters within each word*" part –

4

The qdap package, which I maintain, has a bag_o_words function that works well for this:

txt <- "I may opt for a yam for Amy, May, and Tommy." 

library(qdap) 

unique(sort(bag_o_words(txt))) 

## [1] "a"  "amy" "and" "for" "i"  "may" "opt" "tommy" "yam" 