2016-11-13 112 views
0

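How to check if a string contains a specific word in R

I have Simpsons data from kaggle.com that contains the title of each episode. I want to count how many times each character's name is used in the titles. I can find exact words in the titles, but when I look for Homer, my code misses words such as Homers. Is there a way to do this?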

Example data and my code:

text <- 'title 
Homer\'s Night Out 
Krusty Gets Busted 
Bart Gets an "F" 
Two Cars in Every Garage and Three Eyes on Every Fish 
Dead Putting Society 
Bart the Daredevil 
Bart Gets Hit by a Car 
Homer vs. Lisa and the 8th Commandment 
Oh Brother, Where Art Thou? 
Old Money 
Lisa\'s Substitute 
Blood Feud 
Mr. Lisa Goes to Washington 
Bart the Murderer 
Like Father, Like Clown 
Saturdays of Thunder 
Burns Verkaufen der Kraftwerk 
Radio Bart 
Bart the Lover 
Separate Vocations 
Colonel Homer' 

simpsons <- read.csv(text = text, stringsAsFactors = FALSE) 

library(stringr) 

titlewords <- paste(simpsons$title, collapse = " ") 
words <- c('Homer') 
titlewords <- gsub("[[:punct:]]", "", titlewords) 
HomerCount <- str_count(titlewords, paste(words, collapse=" ")) 
HomerCount 
+0

Possible duplicate of [Selecting rows where a column has a string like 'hsa..' (partial string match)](http://stackoverflow.com/questions/13043928/selecting-rows-where-a-column-has-a-string-like-hsa-partial-string-match)

+1

Don't you just want `sum(grepl('Homer', simpsons$title))`? – rawr

+0

And to count the matches within each string: `sapply(gregexpr("Homer", simpsons$title), function(x) sum(x > 0))`.
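The two one-liners suggested in the comments answer slightly different questions: `grepl` reports whether a title contains the name at all, while `gregexpr` locates every occurrence inside each title. A minimal base R sketch of the difference follows. It assumes a shortened `titles` vector copied by hand from the sample data (rather than the `simpsons` data frame) and uses `fixed = TRUE` to treat the name as a literal string.

```r
# A few titles copied from the question's sample data (shortened for illustration)
titles <- c("Homer's Night Out",
            "Krusty Gets Busted",
            "Homer vs. Lisa and the 8th Commandment",
            "Lisa's Substitute",
            "Colonel Homer")

# Number of titles that contain "Homer" anywhere, so "Homer's" is matched too
n_titles <- sum(grepl("Homer", titles, fixed = TRUE))

# Number of occurrences inside each title;
# gregexpr() returns -1 for a title with no match, hence the `> 0` test
per_title <- sapply(gregexpr("Homer", titles, fixed = TRUE),
                    function(m) sum(m > 0))

n_titles
per_title
```

Because both calls do substring matching, possessives such as "Homer's" are counted without any extra handling.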

Answers

0

As an alternative to the good suggestions in the comments, you can also use tidytext:

library(tidytext) 
library(dplyr) 

text <- 'title 
Homer\'s Night Out 
Krusty Gets Busted 
Bart Gets an "F" 
Two Cars in Every Garage and Three Eyes on Every Fish 
Dead Putting Society 
Bart the Daredevil 
Bart Gets Hit by a Car 
Homer vs. Lisa and the 8th Commandment 
Oh Brother, Where Art Thou? 
Old Money 
Lisa\'s Substitute 
Blood Feud 
Mr. Lisa Goes to Washington 
Bart the Murderer 
Like Father, Like Clown 
Saturdays of Thunder 
Burns Verkaufen der Kraftwerk 
Radio Bart 
Bart the Lover 
Separate Vocations 
Colonel Homer' 

simpsons <- read.csv(text = text, stringsAsFactors = FALSE) 

# Number of homers 
simpsons %>% 
    unnest_tokens(word, title) %>% 
    summarize(count = sum(grepl("homer", word))) 

# Lines location of homers 
simpsons %>% 
    unnest_tokens(word, title) %>% 
    mutate(lines = rownames(.)) %>% 
    filter(grepl("homer", word)) 
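Since the question asks about character names in general, the same substring idea can be applied to several names at once. The following is only a rough base R sketch: the `characters` vector and the shortened `titles` vector are assumptions made for illustration, and the matching is case-insensitive substring matching, not word-level tokenization as in the tidytext code.

```r
# A subset of the sample titles, typed in by hand for illustration
titles <- c("Homer's Night Out",
            "Krusty Gets Busted",
            "Bart Gets an \"F\"",
            "Homer vs. Lisa and the 8th Commandment",
            "Lisa's Substitute",
            "Mr. Lisa Goes to Washington",
            "Radio Bart",
            "Colonel Homer")

# Hypothetical list of character names to look for
characters <- c("Homer", "Bart", "Lisa", "Krusty")

# For each name, count the titles in which it appears (case-insensitive);
# vapply() keeps the names, so the result is a named integer vector
counts <- vapply(characters,
                 function(name) sum(grepl(name, titles, ignore.case = TRUE)),
                 integer(1))

counts
```

Names that contain regular expression metacharacters would need escaping or `fixed = TRUE` (which cannot be combined with `ignore.case = TRUE`).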