How do I remove non-alphabetic characters and convert all letters to lowercase in R?

Given the following string:

"I may opt for a yam for Amy, May, and Tommy."

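How can I remove the non-alphabetic characters, convert all letters to lowercase, and sort the letters within each word in R?

I am also trying to sort the words in the sentence and remove duplicates.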


2015-06-28 Yanyan

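Can you tell us [what you have tried](http://mattgemmell.com/what-have-you-tried/) so far? – zero323

Can you provide an example string and the expected output? To convert to lowercase, just use 'tolower'. – Molx

"Sort the letters in each word"? – hrbrmstr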

str <- "I may opt for a yam for Amy, May, and Tommy." 

## Clean the words (just keep letters and convert to lowercase) 
words <- strsplit(tolower(gsub("[^A-Za-z ]", "", str)), " ")[[1]] 

## split the words into characters and sort them 
sortedWords <- sapply(words, function(word) sort(unlist(strsplit(word, "")))) 

## Join the sorted letters back together 
sapply(sortedWords, paste, collapse="") 

#       i     may     opt     for       a     yam     for     amy     may     and 
#     "i"   "amy"   "opt"   "for"     "a"   "amy"   "for"   "amy"   "amy"   "adn" 
#   tommy 
# "mmoty" 

## If you want to convert result back to string 
do.call(paste, lapply(sortedWords, paste, collapse="")) 
# [1] "i amy opt for a amy for amy amy adn mmoty"
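
The same steps can also be wrapped in a small function. This is only a sketch, not part of the answer above: `sort_letters` is a hypothetical name, the pattern `[^A-Za-z ]` assumes ASCII-only input, and `method = "radix"` is used so that letters are ordered the same way regardless of locale.

```r
# Hypothetical helper (not from the original answer):
# clean a sentence, then sort the letters inside each word.
sort_letters <- function(x) {
  cleaned <- tolower(gsub("[^A-Za-z ]", "", x))   # keep ASCII letters and spaces only
  words <- strsplit(cleaned, " +")[[1]]            # split on runs of spaces
  words <- words[nzchar(words)]                    # drop empty tokens
  vapply(strsplit(words, ""),                      # one character vector per word
         function(chars) paste(sort(chars, method = "radix"), collapse = ""),
         character(1))
}

sorted <- sort_letters("I may opt for a yam for Amy, May, and Tommy.")
paste(sorted, collapse = " ")
# [1] "i amy opt for a amy for amy amy adn mmoty"
```

Using `vapply` with `character(1)` guarantees a plain character vector, and `paste(..., collapse = " ")` joins the words without relying on `do.call`.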


2015-06-28 02:07:21 jenesaisquoi

stringr lets you work with all character sets in R at C speed, and magrittr lets you use the pipe idiom, which suits what you need:

library(stringr) 
library(magrittr) 

txt <- "I may opt for a yam for Amy, May, and Tommy." 

txt %>% 
    str_to_lower %>%           # lowercase 
    str_replace_all("[[:punct:][:digit:][:cntrl:]]", "") %>% # only alpha 
    str_replace_all("[[:space:]]+", " ") %>%     # single spaces 
    str_split(" ") %>%           # tokenize 
    extract2(1) %>%            # str_split returns a list 
    sort %>%             # sort 
    unique              # unique words 

    ## [1] "a"  "amy" "and" "for" "i"  "may" "opt" "tommy" "yam"


2015-06-28 03:04:12 hrbrmstr

I wonder whether 'txt %>% str_to_lower %>% str_replace_all("[^ [:alpha:]']", "") %>% str_split(" +") %>% extract2(1) %>% sort %>% unique' might be more streamlined. –

You can use stringi:

library(stringi) 
unique(stri_sort(stri_trans_tolower(stri_extract_all_words(txt, simplify = TRUE))))

Which gives:

## [1] "a"  "amy" "and" "for" "i"  "may" "opt" "tommy" "yam"
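
For comparison, the same word list can be obtained without any packages. This is a minimal base R sketch under the assumption that the text is ASCII-only (the pattern `[A-Za-z]+` will not match accented letters, unlike the stringr and stringi versions); the variable names are illustrative.

```r
txt <- "I may opt for a yam for Amy, May, and Tommy."

# Extract runs of ASCII letters, lowercase them, then deduplicate and sort.
words <- tolower(regmatches(txt, gregexpr("[A-Za-z]+", txt))[[1]])
unique_words <- sort(unique(words), method = "radix")  # radix sort uses C-locale ordering
unique_words
# [1] "a"     "amy"   "and"   "for"   "i"     "may"   "opt"   "tommy" "yam"
```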

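Update

As mentioned by @DavidArenburg, I overlooked the "sort the letters within words" part of your question. You did not provide the desired output and no immediate application comes to mind, but assuming you want to find out which words have a matching counterpart (a string distance of 0):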

library(stringdist)   # provides stringdistmatrix() 
library(magrittr)     # provides %>% 

unique(stri_sort(stri_trans_tolower(stri_extract_all_words(txt, simplify = TRUE)))) %>% 
    stringdistmatrix(., ., useNames = "strings", method = "qgram") %>% 

# Intermediate result (q-gram distance matrix): 
#       a amy and for i may opt tommy yam 
# a     0   2   2   4 2   2   4     6   2 
# amy   2   0   4   6 4   0   6     4   0 
# and   2   4   0   6 4   4   6     8   4 
# for   4   6   6   0 4   6   4     6   6 
# i     2   4   4   4 0   4   4     6   4 
# may   2   0   4   6 4   0   6     4   0 
# opt   4   6   6   4 4   6   0     4   6 
# tommy 6   4   8   6 6   4   4     0   4 
# yam   2   0   4   6 4   0   6     4   0 

    apply(., 1, function(x) sum(x == 0, na.rm=TRUE))   # count the zeros in each row 

#     a   amy   and   for     i   may   opt tommy   yam 
#     1     3     1     1     1     3     1     1     3

Words with more than one 0 in their row ("amy", "may", "yam") have a scrambled counterpart.
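
A q-gram distance of 0 with single-character q-grams means two words contain exactly the same letters, so the same groups can also be found by comparing sorted-letter signatures. The following is a base R sketch of that alternative technique, not the method used in this answer; the word list is copied from the output above and the variable names are illustrative.

```r
words <- c("a", "amy", "and", "for", "i", "may", "opt", "tommy", "yam")

# Signature = the word's letters in sorted order; anagrams share a signature.
signature <- vapply(strsplit(words, ""),
                    function(chars) paste(sort(chars, method = "radix"), collapse = ""),
                    character(1))

# Group words by signature and keep only groups with more than one member.
groups <- split(words, signature)
anagram_groups <- Filter(function(g) length(g) > 1, groups)
anagram_groups
# $amy
# [1] "amy" "may" "yam"
```

This avoids building the full distance matrix, which grows quadratically with the number of words.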


2015-06-28 11:18:22

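These days I tend to use 'stringr', since it uses 'stringi' under the hood, but that 'stri_extract_all_words' function looks really handy. I may have to go back to using 'stringi'. – hrbrmstr

Yes. 'stringr' is simpler, but I find 'stringi' more flexible. –

@hrbrmstr I think you both overlooked the "*sort the letters within each word*" part –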

The qdap package, which I maintain, has a bag_o_words function that works well for this:

txt <- "I may opt for a yam for Amy, May, and Tommy." 

library(qdap) 

unique(sort(bag_o_words(txt))) 

## [1] "a"  "amy" "and" "for" "i"  "may" "opt" "tommy" "yam"


2015-06-28 12:26:49
