Subset a data frame if a $ symbol exists in a string column

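I have a data frame with a time column and a string column. I want to subset this data frame so that I keep only the rows where the string column contains a $ symbol.

After subsetting, I want to clean the string column so that it contains only the characters that follow the $ symbol, up to the next space or symbol.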

df <- data.frame("time"=c(1:10), 
"string"=c("$ABCD test","test","test $EFG test", 
"$500 test","$HI/ hello","test $JK/", 
"testing/123","$MOO","$abc","123"))

The final output I want is:

Time string 
1  ABCD 
3  EFG 
4  500 
5  HI 
6  JK 
8  MOO 
9  abc

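It keeps only the rows that have a $ in the string column, and then keeps only the characters after the $ symbol, up to a space or symbol.

I have had some success using sub to simply pull out the string, but I have not been able to apply it to the df and subset it. Thanks for your help.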

Source

2017-03-25 newtoR

We can do this with regexpr/regmatches, extracting only the substring that follows the $:

i1 <- grep("$", df$string, fixed = TRUE) 
transform(df[i1,], string = regmatches(string, regexpr("(?<=[$])\\w+", string, perl = TRUE))) 
#   time string
# 1    1   ABCD
# 3    3    EFG
# 4    4    500
# 5    5     HI
# 6    6     JK
# 8    8    MOO
# 9    9    abc
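
To illustrate why this pattern works: (?<=[$]) is a lookbehind that requires a literal $ immediately before the match without including it, and \\w+ then consumes letters, digits and underscores, so it stops at a space or symbol. The sketch below uses a small made-up vector x (not from the question) and also shows that regmatches() drops non-matching elements, which is why the rows are filtered with grep() first.

```r
# Hypothetical sample strings, for illustration only
x <- c("$ABCD test", "test $JK/", "no dollar here")

# regexpr() gives the position of the first match per element, or -1 if none
m <- regexpr("(?<=[$])\\w+", x, perl = TRUE)

# regmatches() returns only the matched pieces; non-matching elements are dropped
matched <- regmatches(x, m)
print(matched)

# To keep one result per input instead, pre-fill with NA and assign matches back
hit <- as.integer(m) > 0L
out <- rep(NA_character_, length(x))
out[hit] <- matched
print(out)
```

The NA-filled variant may be handy if you would rather clean the column first and filter afterwards.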


Or with tidyverse syntax:

library(tidyverse) 
df %>% 
    filter(str_detect(string, fixed("$"))) %>% 
    mutate(string = str_extract(string, "(?<=[$])\\w+"))

Source

2017-03-26 04:23:44 akrun

Until someone comes up with a pretty regex solution, here is my take:

# subset for $ signs and convert to character class 
res <- df[ grepl("$", df$string, fixed = TRUE),] 
res$string <- as.character(res$string) 

# split on anything that is not alphanumeric, $ or ', keep the piece containing $, then remove the $
res$clean <- sapply(strsplit(res$string, split = "[^a-zA-Z0-9$']", perl = TRUE),
                    function(i) {
                      x <- i[grepl("$", i, fixed = TRUE)]
                      # if a string can contain more than one $, keep only the first piece:
                      # x <- i[grepl("$", i, fixed = TRUE)][1]
                      gsub("$", "", x, fixed = TRUE)
                    })
res 
#   time         string clean
# 1    1     $ABCD test  ABCD
# 3    3 test $EFG test   EFG
# 4    4      $500 test   500
# 5    5     $HI/ hello    HI
# 6    6      test $JK/    JK
# 8    8           $MOO   MOO
# 9    9           $abc   abc

Source

2017-03-25 22:23:35 zx8754

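This is great, thank you. One thing I ran into when running it on my dataset, which I had not foreseen: some strings actually contain multiple occurrences of '$string'. For example, a value might be '$ABCD test $EBC and $FB', which produces the value c("ABCD", "EBC", "FB"). Is it possible to store only the first occurrence? Thanks! – newtoR

@newtoR Use this line to get only the first occurrence: x <- i[grepl("$", i, fixed = TRUE)][1]. I have added it to the post as a comment. – zx8754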
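
As a side note on the multiple-occurrence case: regexpr() only ever reports the first match in each string, so the regexpr/regmatches approach from the other answer should already return just the first token, while gregexpr() returns all of them. A minimal sketch, assuming the example value from the comment above (the variable names s, first and all_tokens are illustrative):

```r
# Example value taken from the comment; variable names are illustrative
s <- "$ABCD test $EBC and $FB"

# First token after a $: regexpr() stops at the first match
first <- regmatches(s, regexpr("(?<=[$])\\w+", s, perl = TRUE))
print(first)

# All tokens after a $: gregexpr() finds every match, returned as a list
all_tokens <- regmatches(s, gregexpr("(?<=[$])\\w+", s, perl = TRUE))[[1]]
print(all_tokens)
```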
