2016-10-25 77 views
-1

R - gsub function

EDIT: I have an input data frame like this:

enter image description here

The output I want looks like this:

enter image description here

Please find my explanation below. I really don't know how to give a more detailed explanation than this :(

enter image description here

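Let me explain. In the input dataset, for the rows with COL1 value "10", I want to scan the COL2 values and replace any text that repeats another COL2 value in the same group with "*". The same logic applies to all COL2 values that share a duplicated COL1 value. I want to use the gsub function for this.

I tried gsub together with paste several times to replace any repeated text pattern, but I did not get the desired output because I don't know how to match all the repeated patterns.

I have already asked this question. Since I did not receive an answer, I am reposting it.

The dput of the input data frame is attached below: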

structure(list(COL1 = c(10L, 10L, 10L, 20L, 20L, 30L, 30L, 40L, 
40L, 40L, 50L, 50L, 50L), COL2 = c("mary has life", "Don mary has life", 
"Britto mary has life", "push them fur", "push them ", "yell at this", 
"this is yell at this", "Year", "Doggy", "Horse", "This is great job", 
"great job", "Donkey")), .Names = c("COL1", "COL2"), row.names = c(NA, 
-13L), class = "data.frame") 
+1

What have you tried? You already got [an answer](http://stackoverflow.com/questions/40125508/r-eliminating-duplicate-values) to this question. What is wrong with it? – Jaap

+0

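I tried that same answer, and I tried modifying it to fit my requirements. Please note that the two questions are different. Anyone who reads them will see the difference. I also noted that I want to use the gsub function for this. I never got a relevant answer. – Rambo

Answers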

4

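You can write a function that runs gsub with each item in a group as the pattern and picks the shortest replacement (aside from the item's replacement by itself, of course):

fun <- function(col){
    # column j holds col with every match of col[j] replaced by '*'
    matches <- sapply(col, function(x){gsub(x, '\\*', col)})
    # ignore each string's replacement by itself
    diag(matches) <- NA
    # per row, keep the shortest result (the original if nothing matched)
    apply(matches, 1, function(x){x[which.min(nchar(x))]})
}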

Now apply it by group in your favorite grammar:

library(dplyr) 

df %>% group_by(COL1) %>% mutate(COL3 = fun(COL2)) 

## Source: local data frame [13 x 3] 
## Groups: COL1 [5] 
## 
##  COL1     COL2   COL3 
## <int>    <chr>   <chr> 
## 1  10  mary has life mary has life 
## 2  10 Don mary has life   Don * 
## 3  10 Britto mary has life  Britto * 
## 4  20  push them fur   *fur 
## 5  20   push them  push them 
## 6  30   yell at this yell at this 
## 7  30 this is yell at this  this is * 
## 8  40     Year   Year 
## 9  40    Doggy   Doggy 
## 10 40    Horse   Horse 
## 11 50 This is great job  This is * 
## 12 50   great job  great job 
## 13 50    Donkey  Donkey 

or keep it all in base R:

df$COL3 <- ave(df$COL2, df$COL1, FUN = fun) 

df 

## COL1     COL2   COL3 
## 1 10  mary has life mary has life 
## 2 10 Don mary has life   Don * 
## 3 10 Britto mary has life  Britto * 
## 4 20  push them fur   *fur 
## 5 20   push them  push them 
## 6 30   yell at this yell at this 
## 7 30 this is yell at this  this is * 
## 8 40     Year   Year 
## 9 40    Doggy   Doggy 
## 10 40    Horse   Horse 
## 11 50 This is great job  This is * 
## 12 50   great job  great job 
## 13 50    Donkey  Donkey 
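
One caveat with `fun` as written: each COL2 value is passed to gsub as a regular expression, so a value containing characters such as `+` or `(` may not match its own literal text. The following is a minimal sketch of a variant, assuming literal matching is what is wanted. The name `fun_fixed` is hypothetical and not part of the answer, and returning single-row groups unchanged is an added assumption about the desired behavior.

```r
# Hypothetical variant of fun(): literal (fixed) matching instead of regex matching.
fun_fixed <- function(col) {
  n <- length(col)
  # Assumption: a group with a single row has nothing to compare against.
  if (n < 2) return(col)
  # matches[i, j] is col[i] with every literal occurrence of col[j] replaced by "*"
  matches <- vapply(col, function(x) gsub(x, "*", col, fixed = TRUE),
                    character(n), USE.NAMES = FALSE)
  # Ignore each string's replacement by itself.
  diag(matches) <- NA
  # Keep the shortest candidate per row; when nothing matched, every
  # candidate is an unchanged copy, so the original string is returned.
  apply(matches, 1, function(x) x[which.min(nchar(x))])
}

# "a+b" as a regex would not match the literal text "a+b"; with fixed = TRUE it does.
fun_fixed(c("a+b", "x a+b", "c"))
```

It should drop into the same calls as before, for example `ave(df$COL2, df$COL1, FUN = fun_fixed)`.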
+0

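The code you provided works fine for the above input. But if, for example, I have two COL2 values "mouse mouse" and "mouse mouse", both values are replaced with "*", which is not desirable. Only one value should be replaced with "*", and the other should remain "mouse mouse" – Rambo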

+0

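@alistaire ... The code you provided works for the above input. But if, for example, I have two COL2 values "mouse mouse" and "mouse mouse", both values are replaced with "*", which is not desirable. Only one value should be replaced with "*", and the other should remain "mouse mouse" – Rambo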

+0

Add a line to account for duplicates, e.g. `fun <- function(col){col[duplicated(col)] <- '*'; matches <- sapply(col, function(x){gsub(x, '\\*', col)}); diag(matches) <- col; apply(matches, 1, function(x){x[which.min(nchar(x))]})}` – alistaire