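Select rows based on values in different columns in R

Three columns of my data.frame contain subjects. I want to subset this data.frame by subject. For example, if I want a data.frame for the subject "apple", a row should be selected if the word "apple" appears in any of the three columns.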
doc <- c("blabla1", "blabla2", "blabla3", "blabla4")
subj.1 <- c("apple", "prune", "coconut", "berry")
subj.2 <- c("coconut", "apple", "cherry", "banana and prune")
subj.3 <- c("berry", "banana", "apple and berry", "pear")
subjects <- c("apple", "prune", "coconut", "berry", "cherry", "pear", "banana")
mydf <- data.frame(doc, subj.1, subj.2, subj.3, stringsAsFactors=FALSE)
mydf
#       doc  subj.1           subj.2          subj.3
# 1 blabla1   apple          coconut           berry
# 2 blabla2   prune            apple          banana
# 3 blabla3 coconut           cherry apple and berry
# 4 blabla4   berry banana and prune            pear
The output for the subject "apple" should look like this:
#       doc  subj.1  subj.2          subj.3
# 1 blabla1   apple coconut           berry
# 2 blabla2   prune   apple          banana
# 3 blabla3 coconut  cherry apple and berry
EDIT1: In addition, say I have about 200 different subjects and want 200 different data.frames, one per subject. How can I do that?
I tried a loop:
mylist <- vector('list', length(subjects))
for (i in seq_along(subjects)) {
  pattern <- subjects[i]
  # keep a row if the subject appears in any of the three columns
  filter <- grepl(pattern, mydf$subj.1, ignore.case = TRUE) |
    grepl(pattern, mydf$subj.2, ignore.case = TRUE) |
    grepl(pattern, mydf$subj.3, ignore.case = TRUE)
  subDF <- mydf[filter, ]
  mylist[[i]] <- subDF
}
but I get an error:
Error in grepl(pattern, ignore.case = T, panel$SUBJECT.1) :
invalid regular expression 'C++ PROGRAMMING', reason 'Invalid use of repetition operators'
EDIT2: Oh, I see. In the original data.frame, one of the subjects is "C++ PROGRAMMING". Could the "++" be causing the error?
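The "++" is very likely the cause: `grepl()` treats its pattern as a regular expression, where `+` is a repetition operator, so "C++" is not a valid pattern. A minimal sketch of one possible workaround, assuming literal (non-regex) matching is what is wanted: pass `fixed = TRUE` and lower-case both sides, since `ignore.case` is ignored when `fixed = TRUE`. The vector `x` is made up for illustration and is not from the original data.

```r
# Hypothetical example values; not taken from the original data
x <- c("C++ PROGRAMMING", "Java programming", "intro to c++")

# As a regular expression, "C++" is invalid ('+' is a repetition operator),
# so a call like this one fails:
# grepl("C++ PROGRAMMING", x, ignore.case = TRUE)

# Literal matching instead; lower-casing both sides keeps it case-insensitive
hits <- grepl(tolower("C++"), tolower(x), fixed = TRUE)
hits
```

Escaping the metacharacters in each subject would be another option, but literal matching avoids having to know which characters are special.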
Here is another way to create the 'filter' variable when you have many columns: 'filter <- apply(sapply(mydf[, 2:4], grepl, pattern = "apple", ignore.case = TRUE), 1, any)'. Just change '2:4' to whichever columns you want to search. – MrFlick
@MrFlick, I was thinking more along the lines of 'unique(unlist(lapply(mydf[-1], function(x) grep("apple", x))))'. – A5C1D2H2I1M1N2O1R2T1
@AnandaMahto Good suggestion. It seems more efficient in my benchmarks, and it avoids the double *apply. 'lapply' benchmarks: http://www.r-fiddle.org/#/fiddle?id=mGlxBYaJ&version=1 – MrFlick
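Putting the comments together, here is a rough sketch of how one data.frame per subject could be built for the example data. It is not the exact code from either commenter: the helper `rows_with_subject` and the variable `cols` are hypothetical names introduced for illustration, and literal, case-insensitive substring matching is an assumption, chosen so that subjects such as "C++ PROGRAMMING" do not break the pattern.

```r
# Example data from the question
doc    <- c("blabla1", "blabla2", "blabla3", "blabla4")
subj.1 <- c("apple", "prune", "coconut", "berry")
subj.2 <- c("coconut", "apple", "cherry", "banana and prune")
subj.3 <- c("berry", "banana", "apple and berry", "pear")
subjects <- c("apple", "prune", "coconut", "berry", "cherry", "pear", "banana")
mydf <- data.frame(doc, subj.1, subj.2, subj.3, stringsAsFactors = FALSE)

# Hypothetical helper: rows of `df` where `subject` occurs literally
# (case-insensitively) in at least one of the columns named in `cols`
rows_with_subject <- function(df, subject, cols) {
  hits <- lapply(df[cols], function(x) {
    grepl(tolower(subject), tolower(x), fixed = TRUE)
  })
  # Combine the per-column logical vectors with element-wise OR
  df[Reduce(`|`, hits), , drop = FALSE]
}

cols <- c("subj.1", "subj.2", "subj.3")

# One data.frame per subject, stored in a list named by subject
mylist <- lapply(subjects, function(s) rows_with_subject(mydf, s, cols))
names(mylist) <- subjects

mylist[["apple"]]
```

Keeping the results in a named list, rather than creating 200 separate variables, lets each subset be looked up by subject name. Note that this is substring matching, so "apple" would also match a value such as "pineapple".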