R: Find all matches of a name


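I have been working on this problem for a class and eventually got the answers I needed for the quiz. I am fairly new to R, though, and it took me several hours to figure out. My task was to find all occurrences of the names Jurgis, Ona and Chicago in The Jungle.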

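Problem: I wasted a lot of time using gsub to strip punctuation, and then realized that some elements hold two words: an element like "Jurgis read" gets condensed into "Jurgisread" and is not picked up in the count. Then there is "Jurgis's" being condensed into "Jurgiss", and the same for Ona and Chicago.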
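A small illustration of the problem described above may help. It is only a sketch on made-up tokens, not the book text: the sample strings and the names `tokens`, `glued`, `spaced` and `words` are assumptions for illustration.

```r
# Hypothetical tokens, as they might look after splitting on spaces
tokens <- c("Jurgis--read", "Jurgis's", "Ona,", "Chicago.")

# Deleting punctuation outright glues neighbouring pieces together,
# so "Jurgis--read" and "Jurgis's" no longer equal "Jurgis"
glued <- gsub("[[:punct:]]", "", tokens)

# Replacing punctuation with a space and re-splitting keeps words apart
spaced <- gsub("[[:punct:]]", " ", tokens)
words <- unlist(strsplit(spaced, "[[:space:]]+"))
words <- words[words != ""]  # drop any empty strings left by the split

# Exact, case-insensitive count of one name
sum(toupper(words) == "JURGIS")
```

With the space replacement, the possessive "Jurgis's" contributes a "Jurgis" plus a stray "s", so whether possessives should count toward the total is a decision to make explicitly.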

Wanted: some tips on how to handle these kinds of files better in the future.

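What I did: I was given the first two lines of code, which split the elements on the spaces they came with. Then I picked the punctuation marks I wanted to remove. Once I had removed what I thought were all the common ones, replacing them with spaces, I split the elements again. Finally, I used table() and forced all the words to upper case.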

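# Read the text and split the body lines on single spaces (the two lines I was given)
theJungle <- readLines("http://www.gutenberg.org/files/140/140.txt")
theJungleList <- unlist(strsplit(theJungle[47:13872], " "))

# Split again on any whitespace character
splitJungle1 <- unlist(strsplit(theJungleList, "[[:space:]]", fixed = FALSE,
                                perl = FALSE, useBytes = FALSE))

# Replace the selected punctuation marks with a space
remPunctuation <- gsub("-|'|,|:|;|\\.|\\*|\\(|\"|!|\\?", " ", splitJungle1)

# Re-split so that words previously joined by punctuation become separate elements
splitJungle2 <- unlist(strsplit(remPunctuation, "[[:space:]]", fixed = FALSE,
                                perl = FALSE, useBytes = FALSE))

# Count exact, case-insensitive matches (the TRUE column is the count)
table(toupper(splitJungle2) == "JURGIS")
table(toupper(splitJungle2) == "ONA")
table(toupper(splitJungle2) == "CHICAGO")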

Thanks!


Source

2017-05-02 Melissa Perez

See: Why is "Can someone help me?" not an actual question? (http://meta.stackoverflow.com/q/284236) – EJoshuaS

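If this is for a class, you should probably use whatever techniques you are being taught. If you are just interested in text analysis in R, you might consider using tidy data principles and the tidytext package. Finding word frequencies is a pretty quick thing to do in this way of working.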

library(dplyr) 
library(tidytext) 
library(stringr) 

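theJungle <- readLines("http://www.gutenberg.org/files/140/140.txt")

# One row per word; unnest_tokens() lowercases and strips punctuation by default
# (tibble() replaces the deprecated data_frame())
jungle_df <- tibble(text = theJungle) %>%
    unnest_tokens(word, text)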

What are the most common words in the text?

jungle_df %>% 
    count(word, sort = TRUE) 

#> # A tibble: 10,349 × 2
#>     word     n
#>    <chr> <int>
#> 1    the  9114
#> 2    and  7350
#> 3     of  4484
#> 4     to  4270
#> 5      a  4217
#> 6     he  3312
#> 7    was  3056
#> 8     in  2570
#> 9     it  2318
#> 10   had  2234
#> # ... with 10,339 more rows

How often do the specific names you are looking for appear?

jungle_df %>% 
    count(word) %>% 
    filter(str_detect(word, "^jurgis|^ona|^chicago")) 

#> # A tibble: 6 × 2
#>        word     n
#>       <chr> <int>
#> 1   chicago    68
#> 2 chicago's     4
#> 3    jurgis  1098
#> 4  jurgis's    19
#> 5       ona   200
#> 6     ona's    25
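If only base R string functions are allowed (the question uses gsub and strsplit), a roughly equivalent count could be sketched as follows. This is a sketch on made-up sample lines rather than the book; `sample_text` and `count_name` are hypothetical names, and folding the possessive form into each name's count is an assumption about what should count as an occurrence.

```r
# Made-up sample lines standing in for the book text
sample_text <- c("Jurgis read the paper; Ona's sister left Chicago.",
                 "\"Jurgis!\" cried Ona. Jurgis's hands shook.")

# Count whole-word, case-insensitive matches of a name,
# optionally followed by a possessive 's
count_name <- function(name, text) {
  pattern <- paste0("\\b", name, "('s)?\\b")
  hits <- gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE)
  length(unlist(regmatches(text, hits)))
}

# Named vector with one count per name
name_counts <- sapply(c("Jurgis", "Ona", "Chicago"), count_name,
                      text = sample_text)
name_counts
```

The `\\b` word boundaries keep "ona" from matching inside longer words such as "national", and matching against the raw lines avoids having to decide in advance which punctuation to strip.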

Source

2017-05-03 00:07:18

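Wow, thank you. I will be interested in this later on, but this is for a class. This week's main topic is string manipulation, so we have not used tidytext yet, but it is a handy package to know about.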

Great answer - a very simple use of "count"! – griffmer
