How to add a column to a data.table in R based on a string in another column?

I want to add columns to a data.table based on a string in another column. Here is my data and the approach I have tried:

 
                        Params 
1:         { clientID : 459; time : 1386868908703; version : 6} 
2: { clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001} 
3:             { clientID : 988; time : 1388939739771} 
4: { clientID : 459; id : 52a9ec00b73cbf0b210057e9; time : 1386868810519; user : 459001} 
5:             { clientID : 459; time : 1388090530634}

Code to create this table:

library(data.table)

DT = data.table(Params=c(
    "{ clientID : 459; time : 1386868908703; version : 6}",
    "{ clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001}",
    "{ clientID : 988; time : 1388939739771}",
    "{ clientID : 459; id : 52a9ec00b73cbf0b210057e9; time : 1386868810519; user : 459001}",
    "{ clientID : 459; time : 1388090530634}"))

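I want to parse the text in the "Params" column and create new columns based on it. For example, I would like a new column named "user" that holds only the number after "user :" in the Params string. The added column should look like this: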

 
                        Params   user 
1:         { clientID : 459; time : 1386868908703; version : 6} NA 
2: { clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001} 459001 
3:             { clientID : 988; time : 1388939739771} NA 
4: { clientID : 459; id : 52a9ec00b73cbf0b210057e9; time : 1386868810519; user : 459001} 459001 
5:             { clientID : 459; time : 1388090530634} NA

I created the following function to do the parsing (in this case for "user"):

myparse <- function(searchterm, s) { 
    s <-gsub("{","",s, fixed = TRUE) 
    s <-gsub(" ","",s, fixed = TRUE) 
    s <-gsub("}","",s, fixed = TRUE) 
    s <-strsplit(s, '[;:]') 
    s <-unlist(s) 
    if (length(s[which(s==searchterm)])>0) {s[which(s==searchterm)+1]} else {NA} 
}

Then I use the following call to add the column:

DT <- transform(DT, user = myparse("user", Params))

This works for "time", which is contained in all rows, but not for "user", which is contained in only two rows. The following error is returned:

Error in data.table(list(Params = c("{ clientID : 459; time : 1386868908703; version : 6}", : 
    argument 2 (nrow 2) cannot be recycled without remainder to match longest nrow (5)

How can I solve this? Thanks!

Source

2014-01-22 Miriam

Here is a way to do this with regular expressions:

myparse <- function(searchterm, s) { 
    res <- rep(NA_character_, length(s)) # NA vector 
    idx <- grepl(searchterm, s) # index for strings including the search term 
    pattern <- paste0(".*", searchterm, " : ([^;}]+)[;}].*") # regex pattern 
    res[idx] <- sub(pattern, "\\1", s[idx]) # extract target string 
    return(res) 
}

You can use this function to add a new column, e.g., for user:

DT[, user := myparse("user", Params)]

The new column contains NA for rows that have no user field:

DT[, user] 
# [1] NA  "459001" NA  "459001" NA
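Because this version returns one value per row (with NA where the field is missing), the same call can fill several columns at once. A minimal sketch, assuming the field names "clientID", "user" and "version" are the ones of interest and using a shortened copy of the example data; the `fields` vector is only an illustrative name:

```r
library(data.table)

# Regex-based parser from the answer above: one result per input string
myparse <- function(searchterm, s) {
    res <- rep(NA_character_, length(s))                      # NA vector
    idx <- grepl(searchterm, s)                               # rows containing the term
    pattern <- paste0(".*", searchterm, " : ([^;}]+)[;}].*")  # regex pattern
    res[idx] <- sub(pattern, "\\1", s[idx])                   # extract target string
    res
}

# Shortened copy of the example data (first three rows)
DT <- data.table(Params = c(
    "{ clientID : 459; time : 1386868908703; version : 6}",
    "{ clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001}",
    "{ clientID : 988; time : 1388939739771}"
))

# Hypothetical choice of fields; each one becomes a character column
fields <- c("clientID", "user", "version")
DT[, (fields) := lapply(fields, myparse, s = Params)]
```

All extracted values are character strings, so a conversion such as as.integer() would still be needed for numeric use.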

Source

2014-01-22 12:18:46

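Thank you very much. This works for the data I provided. How would I adjust the regular expression to allow strings like "{ clientID : 461; time : 1386770861254; type : new; newUser : 461002}", which contain entries like "type : new"? – Miriam

@Miriam What should the result be for this example, "type : new" or "new"? –

The column should be named "type" with the value "new" (like user: "459001" above). – Miriam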
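For the "type : new" case from the comments: the pattern in the answer does not restrict the value to digits, so it should already return "new" for searchterm "type". One thing to watch is that a search term such as "User" would also match inside "newUser". A possible stricter variant is sketched below; `extract_field` is a hypothetical helper name, and it assumes keys are plain alphanumeric strings (no regex metacharacters) that always follow "{" or ";":

```r
# Hypothetical helper: extract the value of `key` from strings such as
# "{ a : 1; b : x}". The key must directly follow "{" or ";" (plus spaces),
# so "User" does not match inside "newUser".
extract_field <- function(key, s) {
    pattern <- paste0("[{;] *", key, " *: *([^;}]*[^;} ]) *[;}]")
    matches <- regmatches(s, regexec(pattern, s))
    # regmatches() yields character(0) when there is no match: map that to NA
    vapply(matches,
           function(m) if (length(m) >= 2) m[2] else NA_character_,
           character(1))
}

x <- "{ clientID : 461; time : 1386770861254; type : new; newUser : 461002}"
type_value    <- extract_field("type", x)     # value of the "type" field
newuser_value <- extract_field("newUser", x)  # value of the "newUser" field
user_value    <- extract_field("User", x)     # no such key, so NA
```

It can be used in the same way as before, e.g. DT[, type := extract_field("type", Params)].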

I would use an external parser, for example:

library(yaml) 

DT = data.frame(
    Params=c("{ clientID : 459; time : 1386868908703; version : 6}","{ clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001}","{ clientID : 988; time : 1388939739771}","{ clientID : 459; id : 52a9ec00b73cbf0b210057e9; time : 1386868810519; user : 459001}","{ clientID : 459; time : 1388090530634}"), 
    stringsAsFactors=F 
    ) 

conv.to.yaml <- function(x){ 
    gsub('; ','\n',substr(x, 3, nchar(x)-1)) 
} 

tmp <- lapply(DT$Params, function(x) yaml.load(conv.to.yaml(x)))

Then combine the parsed lists into a data frame:

unames <- unique(unlist(sapply(tmp, names))) 
res <- as.data.frame( do.call(rbind, lapply(tmp, function(x)x[unames]))) 
colnames(res) <- unames 
res
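As an alternative to the do.call(rbind, ...) step, data.table's rbindlist() with fill = TRUE can stack named lists that have different fields, filling the gaps with NA instead of NULL. A small sketch, using a hand-written stand-in for the parsed lists (the `parsed` object and its values are illustrative, and the time field is left out here):

```r
library(data.table)

# Hypothetical stand-in for the parsed lists (`tmp` above), written by hand
parsed <- list(
    list(clientID = 459L, version = 6L),
    list(clientID = 459L, id = "52a9ea8b534b2b0b5000575f", user = 459001L),
    list(clientID = 988L)
)

# fill = TRUE matches columns by name and pads missing fields with NA
res_dt <- rbindlist(parsed, fill = TRUE)
```

The result is a regular data.table with atomic columns rather than list columns containing NULL.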

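The result is pretty close to what you have in mind, but you need to think about handling the time values better. The millisecond timestamps do not fit into a 32-bit integer, so they come out wrapped around: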

> res 
  clientID       time version                       id   user
1      459 -405527905       6                     NULL   NULL
2      459 -405612269    NULL 52a9ea8b534b2b0b5000575f 459001
3      988 1665303163    NULL                     NULL   NULL
4      459 -405626089    NULL 52a9ec00b73cbf0b210057e9 459001
5      459  816094026    NULL                     NULL   NULL
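One way to handle the time values is to take them from the raw string and store them as doubles instead of integers. A minimal sketch, assuming the values are milliseconds since the Unix epoch (the data suggests this, but it is an assumption) and using two of the example strings:

```r
# Two of the example strings from the question
params <- c(
    "{ clientID : 459; time : 1386868908703; version : 6}",
    "{ clientID : 988; time : 1388939739771}"
)

# Pull out the digits after "time : " as text, then convert to double,
# which can represent these values exactly (unlike a 32-bit integer)
time_chr <- sub(".*time : ([0-9]+).*", "\\1", params)
time_ms  <- as.numeric(time_chr)

# Assumption: milliseconds since 1970-01-01 UTC
time_utc <- as.POSIXct(time_ms / 1000, origin = "1970-01-01", tz = "UTC")
```

If the yaml route is kept, the same conversion can be applied to a time column that was extracted as a string beforehand.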

Source

2014-01-22 13:27:52 df239
