2016-09-30 53 views
1

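Filter a data frame based on exact matching values between two columns

I have a data frame with two columns. One column contains sentences, the other contains words. For example: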

words sentences 
loose Loose connection several times a day on my tablet. 
loud People don't speak loud or clear enough to hear voicemails 
vice I strongly advice you to fix this issue 
advice I strongly advice you to fix this issue 

Now I want to filter this data frame so that I only get the rows where the word exactly matches a word in the sentence:

words sentences 
loose Loose connection several times a day on my tablet. 
loud People don't speak loud or clear enough to hear voicemails 
advice I strongly advice you to fix this issue 

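The word "vice" is not an exact match (it only occurs inside "advice"), so that row must be removed. I have nearly 20k rows in the data frame. Can someone suggest which method to use for this task so that I don't lose too much performance?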

Answers

2

You can try something like the following:

df[apply(df, 1, function(x) tolower(x[1]) %in% tolower(unlist(strsplit(x[2], split='\\s+')))),] 

df 
    words            sentences 
1 loose  Loose connection several times a day on my tablet. 
2 loud People don't speak loud or clear enough to hear voicemails 
4 advice   I strongly advice you to fix this issue 
+0

This approach is faster than using str_detect, so I am accepting this answer. – Venu
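The speed claim is easy to check on your own data. Below is a rough timing sketch, not a rigorous benchmark. It assumes the question's four rows repeated to about 20k rows as stand-in data, and the names `big`, `split_match` and `regex_match` are made up for illustration. The regex variant uses base R `grepl` rather than `str_detect`, so it only approximates the comparison made in the comment.

```r
# Stand-in data: the four example rows repeated 5000 times (an assumption,
# real data will have more varied sentences).
base_df <- data.frame(
  words = c("loose", "loud", "vice", "advice"),
  sentences = c("Loose connection several times a day on my tablet.",
                "People don't speak loud or clear enough to hear voicemails",
                "I strongly advice you to fix this issue",
                "I strongly advice you to fix this issue"),
  stringsAsFactors = FALSE
)
big <- base_df[rep(seq_len(nrow(base_df)), times = 5000), ]
rownames(big) <- NULL

# Approach from the answer above: split each sentence on whitespace and
# test whether the word is one of the tokens.
split_match <- function(d) {
  apply(d, 1, function(x) {
    tolower(x[1]) %in% tolower(unlist(strsplit(x[2], split = "\\s+")))
  })
}

# Word-boundary regex, one pattern per row (grepl takes a single pattern,
# so mapply pairs each word with its own sentence).
regex_match <- function(d) {
  mapply(function(w, s) grepl(paste0("\\b", w, "\\b"), s,
                              ignore.case = TRUE, perl = TRUE),
         d$words, d$sentences, USE.NAMES = FALSE)
}

t_split <- system.time(keep_split <- split_match(big))
t_regex <- system.time(keep_regex <- regex_match(big))

print(t_split["elapsed"])
print(t_regex["elapsed"])
```

Note that the two approaches are not equivalent on all inputs: splitting on whitespace keeps punctuation attached to the token, so the word "tablet" would not match the token "tablet." whereas the word-boundary regex would match it.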

1

The simplest way is to use the stringr package:

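library(stringr) 

df <- data.frame(words=c("went","zero", "vice"), sent=c("a man went to the park","one minus one is 0","any advice?"), stringsAsFactors=FALSE) 

# pad both columns with spaces so that only whole words can match 
df$words <- paste0(" ",df$words," ") 
df$sent <- paste0(" ",df$sent," ") 

# str_detect is vectorised over both arguments: row i is tested against word i 
df$match <- str_detect(df$sent,df$words) 

# keep the matching rows and drop the helper column 
df.res <- df[df$match,] 
df.res$match <- NULL 
df.res 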
+0

This does not give the desired output for the OP's data. – Jaap

+0

Edited. How about now? –

+0

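It works now, but it is certainly no longer the simplest solution. In addition, the contents of the "sent" column have been changed, which is not what the OP intended. – Jaap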
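One way to address the last comment is to do the space padding on temporary vectors, so the data frame columns stay untouched. This is only a sketch of that idea, not the answerer's code: it uses base R `grepl` with `fixed = TRUE` instead of `str_detect`, and `padded_words`, `padded_sent` and `keep` are names invented here.

```r
df <- data.frame(
  words = c("went", "zero", "vice"),
  sent = c("a man went to the park", "one minus one is 0", "any advice?"),
  stringsAsFactors = FALSE
)

# Pad copies of the columns; df itself is not modified.
padded_words <- paste0(" ", df$words, " ")
padded_sent  <- paste0(" ", df$sent, " ")

# grepl accepts a single pattern, so mapply pairs each padded word with
# its own padded sentence. fixed = TRUE treats the word as a literal string.
keep <- mapply(function(w, s) grepl(w, s, fixed = TRUE),
               padded_words, padded_sent, USE.NAMES = FALSE)

df.res <- df[keep, ]
print(df.res)
```

Like the padded version above, this only recognises words delimited by spaces, so a word directly followed by punctuation (for example "advice" in "any advice?") is not matched.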

3

Using:

library(stringi) 
df[stri_detect_regex(tolower(df$sentences), paste0('\\b',df$words,'\\b')),] 

gives:

words             sentences 
1 loose   Loose connection several times a day on my tablet. 
2 loud People don't speak loud or clear enough to hear voicemails 
4 advice     I strongly advice you to fix this issue 

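Explanation:

  • Convert the capital letters in the sentences to lowercase with tolower.
  • Create a vector of regular expressions with paste0 by wrapping the words in words in word boundaries (\\b).
  • Use stri_detect_regex from the stringi package to see which rows contain a match, resulting in a logical vector of TRUE & FALSE values.
  • Subset with the logical vector.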
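
The four steps can also be run one at a time to inspect the intermediate values. The sketch below does that under one stated substitution: it uses base R `grepl` (paired per row through `mapply`) as a stand-in for `stri_detect_regex`, so that it runs without extra packages. The variable names `lower_sentences`, `patterns` and `has_match` are illustrative only.

```r
df <- data.frame(
  words = c("loose", "loud", "vice", "advice"),
  sentences = c("Loose connection several times a day on my tablet.",
                "People don't speak loud or clear enough to hear voicemails",
                "I strongly advice you to fix this issue",
                "I strongly advice you to fix this issue"),
  stringsAsFactors = FALSE
)

# Step 1: lowercase the sentences so that "Loose" can match "loose".
lower_sentences <- tolower(df$sentences)

# Step 2: one regular expression per row, e.g. "\\bvice\\b".
patterns <- paste0("\\b", df$words, "\\b")

# Step 3: test pattern i against sentence i (base R stand-in for
# stri_detect_regex, which is vectorised over both arguments).
has_match <- mapply(function(p, s) grepl(p, s, perl = TRUE),
                    patterns, lower_sentences, USE.NAMES = FALSE)

# Step 4: subset the original data frame with the logical vector.
result <- df[has_match, ]

print(has_match)
print(result)
```

Because the words are pasted into a regular expression unchanged, a word containing regex metacharacters (such as "." or "+") would need escaping first; the example data does not contain any.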

As an alternative, you can also use str_detect from the stringr package (which is actually a wrapper around the stringi package):

library(stringr) 
df[str_detect(tolower(df$sentences), paste0('\\b',df$words,'\\b')),] 

Used data:

df <- structure(list(words = c("loose", "loud", "vice", "advice"), 
        sentences = c("Loose connection several times a day on my tablet.", 
            "People don't speak loud or clear enough to hear voicemails", 
            "I strongly advice you to fix this issue", "I strongly advice you to fix this issue")), 
       .Names = c("words", "sentences"), class = "data.frame", row.names = c(NA, -4L)) 