Extract only the relevant comments from a list of comments

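Continuing my exploration of text analysis, I have hit another roadblock. I understand the logic, but I don't know how to do this in R. Here is what I want to do: I have two CSV files: (1) one containing 10,000 comments, and (2) one containing a list of words. I want to select all the comments that contain any of the words in the second CSV. How can I do this?

For example: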

**CSV 1:** 
this is a sample set 
the comments are not real 
this is a random set of words 
hope this helps the problem case 
thankyou for helping out 
i have learned a lot here 
feel free to comment 

**CSV 2** 
sample 
set 
comment 

**Expected output:** 
this is a sample set 
the comments are not real 
this is a random set of words 
feel free to comment

Please note: different forms of a word also count; for example, both "comment" and "comments" should be matched.

Source

2016-05-25 eclairs

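The two files are the list of comments and the list of words, respectively. – eclairs

Can you make your example reproducible? – Sotos

We can use grep after collapsing the elements of the second dataset into a single pattern with paste.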

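# Read the words (whitespace-separated) and the comments (one per line)
v1 <- scan("file2.csv", what = "")
lines1 <- readLines("file1.csv")
# Join the words into one alternation pattern and keep the lines that match it
grep(paste(v1, collapse = "|"), lines1, value = TRUE)
#[1] "this is a sample set"          "the comments are not real"
#[3] "this is a random set of words" "feel free to comment"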
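A self-contained sketch of the same idea, with the example data inlined as vectors so that it runs without the CSV files. The names `comments`, `words` and `pattern` are illustrative and not part of the original answer. Because the pattern is matched as a substring, "comment" also hits "comments", but "set" would equally hit "reset". The second pattern is one possible tightening, assuming that only whole words with an optional plural "s" should count.

```r
comments <- c(
  "this is a sample set",
  "the comments are not real",
  "this is a random set of words",
  "hope this helps the problem case",
  "thankyou for helping out",
  "i have learned a lot here",
  "feel free to comment"
)
words <- c("sample", "set", "comment")

# Substring matching: "sample|set|comment"
loose <- grep(paste(words, collapse = "|"), comments, value = TRUE)

# Assumed stricter rule: whole words only, optional trailing "s"
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")s?\\b")
strict <- grep(pattern, comments, value = TRUE)

# The two rules differ on words that merely contain a listed word
grepl(paste(words, collapse = "|"), "please reset it")  # substring hit
grepl(pattern, "please reset it")                       # no whole-word hit
```

If the word list can contain regex metacharacters such as "." or "+", they would need escaping first, or a fixed-string approach would be the safer choice.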

Source

2016-05-25 09:39:57 akrun

First, create two objects called lines and words.to.match from your files. You could do it like this:

lines <- read.csv('csv1.csv', stringsAsFactors=FALSE)[[1]]
words.to.match <- read.csv('csv2.csv', stringsAsFactors=FALSE)[[1]]
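One thing worth checking when reading the files this way: read.csv treats the first line as a header by default, so a file without a header row loses its first entry. The sketch below demonstrates this on a temporary file standing in for csv2.csv; the names `tmp`, `with.header` and `no.header` are illustrative, and whether the real files have a header row is not stated in the question.

```r
# Write the three example words to a temporary file (stand-in for csv2.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(c("sample", "set", "comment"), tmp)

# Default header = TRUE: the first line is consumed as the column name
with.header <- read.csv(tmp, stringsAsFactors = FALSE)[[1]]

# header = FALSE keeps every line as data
no.header <- read.csv(tmp, header = FALSE, stringsAsFactors = FALSE)[[1]]

length(with.header)  # 2: "sample" became the column name
length(no.header)    # 3: all words kept

unlink(tmp)
```

For the comments file, readLines may be the safer choice, since a comment that contains a comma would be split across columns by read.csv.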

Say they look like this:

lines <- c(
    'this is a sample set', 
    'the comments are not real', 
    'this is a random set of words', 
    'hope this helps the problem case', 
    'thankyou for helping out', 
    'i have learned a lot here', 
    'feel free to comment' 
) 
words.to.match <- c('sample', 'set', 'comment')

Then you can compute the matches with two nested *apply functions:

matches <- mapply(
    function(words, line)
        any(sapply(words, grepl, line, fixed=TRUE)),
    list(words.to.match),
    lines
)
matched.lines <- lines[which(matches)]

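What is going on here? I use mapply to apply the function to each line in lines, with words.to.match as the other argument. Note that list(words.to.match) has length 1, so this argument is simply recycled for every call. Then, inside the mapply function, I call sapply to check whether any of the words matches the line (I test each match with grepl).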
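The recycling described above can also be spelled out with mapply's MoreArgs argument, which passes the same value to every call. This is a minimal sketch of that variant, not the answer's own code; the helper name `line.has.word` is hypothetical.

```r
lines <- c(
  "this is a sample set",
  "the comments are not real",
  "this is a random set of words",
  "hope this helps the problem case",
  "thankyou for helping out",
  "i have learned a lot here",
  "feel free to comment"
)
words.to.match <- c("sample", "set", "comment")

# Hypothetical helper: TRUE if any word occurs in the line as a literal substring
line.has.word <- function(line, words) {
  any(vapply(words, grepl, logical(1), x = line, fixed = TRUE))
}

# One call per line; the whole word vector is handed to each call via MoreArgs
matches <- mapply(line.has.word, lines,
                  MoreArgs = list(words = words.to.match),
                  USE.NAMES = FALSE)
matched.lines <- lines[matches]
```

Using vapply instead of sapply fixes the result type to a single logical per word, which makes the inner step a little more predictable.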

This is not necessarily the most efficient solution, but I find it easier to understand. Another way to compute matches is:

matches <- lapply(words.to.match, grepl, lines, fixed=TRUE)
matches <- do.call("rbind", matches)
matches <- apply(matches, 2, any)

I don't like this solution, because it needs a do.call("rbind", ...), which is a bit hacky.

Source
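If the do.call step is the objection, one way around it is to combine the per-word results with Reduce and the element-wise OR operator. This is a sketch of an alternative, not part of the original answer; the name `per.word` is illustrative.

```r
lines <- c(
  "this is a sample set",
  "the comments are not real",
  "this is a random set of words",
  "hope this helps the problem case",
  "thankyou for helping out",
  "i have learned a lot here",
  "feel free to comment"
)
words.to.match <- c("sample", "set", "comment")

# One logical vector per word, each as long as lines
per.word <- lapply(words.to.match, grepl, lines, fixed = TRUE)

# Element-wise OR across the words: TRUE where at least one word matched
matches <- Reduce(`|`, per.word)
matched.lines <- lines[matches]
```

Here grepl is vectorised over all the lines at once, so there is one call per word rather than one call per line and word.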

2016-05-25 09:55:01 bogdata
