2017-09-18 50 views
1

dplyr/dt: summarise and paste only if a column is not blank/NA?

My data is:

Name          House  Street      Apt  City    Postal  Phone
DUMAS PAUL    2030   GREEN ROAD       DESERT  Z0K2K1  999-577-3789
DUNN S               GREEN ROAD       DESERT  Z0K2K1  999-577-3256
FERGUSON BOB         GREEN ROAD       DESERT  Z0K2K1  999-577-3772
FITSCHEN A    3989   GREEN ROAD       DESERT  Z0K2K1  999-577-3556
BLACK CARY    2079   GREEN ROAD       DESERT  Z0K2K1  999-577-3779
BLACK RUTH    2079   GREEN ROAD       DESERT  Z0K2K1  999-577-3779

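I want to compare rows dynamically (the data is sorted by House) and, where the House # is equal, merge them into one row: concatenate the names with ", AND ", concatenate the distinct phone numbers with " OR ", and drop the rows that were merged.

I am using: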

data <- data %>% 
    group_by(House, Street, Apt, City, Postal) %>% 
    summarise(Name = first(paste(Name, collapse = ", AND ")),
              Phone = paste(unique(Phone), collapse = " OR ")) %>% 
    ungroup() %>% 
    arrange(Street, desc(House)) %>% 
    select(colnames(dataset)) %>% 
    filter(!Phone %in% dnc$`Home Phone`) 

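Problem: the dplyr code above also concatenates rows when House is NA (or blank; I display my NAs as blank) and Apt is NA (or ""), which I don't want. So with the code above I get: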

Name                        House  Street      Apt  City    Postal  Phone
DUNN S, AND FERGUSON BOB           GREEN ROAD       DESERT  Z0K2K1  9995773256 OR 9995773772
DUMAS PAUL                  2030   GREEN ROAD       DESERT  Z0K2K1  9995773789
BLACK CARY, AND BLACK RUTH  2079   GREEN ROAD       DESERT  Z0K2K1  9995773779
FITSCHEN A                  3989   GREEN ROAD       DESERT  Z0K2K1  9995773556

Note that in the output above DUNN S and FERGUSON BOB have been merged into one row. I don't want that.

dput (sorry if it doesn't help):

list(structure(list(X__1 = c(NA, NA, NA, NA, NA, NA), Name = c("DUMAS PAUL",
    "DUNN S", "FERGUSON BOB", "FITSCHEN A", "BLACK CARY", "BLACK RUTH"
    ), House = c("2030", NA, NA, "3989", "2079", "2079"), Street = c("GREEN ROAD",
    "GREEN ROAD", "GREEN ROAD", "GREEN ROAD", "GREEN ROAD", "GREEN ROAD"
    ), Apt = c(NA, NA, NA, NA, NA, NA), City = c("DESERT", "DESERT",
    "DESERT", "DESERT", "DESERT", "DESERT"), Prov = c("ZK", "ZK",
    "ZK", "ZK", "ZK", "ZK"), Postal = c("Z0K2K1", "Z0K2K1", "Z0K2K1",
    "Z0K2K1", "Z0K2K1", "Z0K2K1"), Phone = c("999-577-3789", "999-577-3256",
    "999-577-3772", "999-577-3556", "999-577-3779", "999-577-3779"
    ), `Last Appear Date` = c(NA, NA, NA, NA, NA, NA)), .Names = c("X__1",
    "Name", "House", "Street", "Apt", "City", "Prov", "Postal", "Phone",
    "Last Appear Date"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
    -6L)))

Thanks

Answers

2

Inside DT[, {...}, by=], you can write pretty much anything. In this case, if ... else works (something similar can probably be done with dplyr::do):

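library(data.table)
library(magrittr)
DT = as.data.table(data)

DT[,
    # by-columns are length 1 inside j, so this condition is a single TRUE/FALSE per group
    if (!(is.na(House) & is.na(Apt)))
        # House or Apt is known: collapse the group into one row
        .(
            Name  = Name %>% paste(collapse = ", AND "),
            Phone = Phone %>% unique %>% paste(collapse = " OR ")
        )
    else
        # House and Apt are both NA: keep every row as it is
        .(Name, Phone)
, by = .(House, Street, Apt, City, Postal)]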

   House     Street Apt   City Postal                       Name        Phone
1:  2030 GREEN ROAD  NA DESERT Z0K2K1                 DUMAS PAUL 999-577-3789
2:    NA GREEN ROAD  NA DESERT Z0K2K1                     DUNN S 999-577-3256
3:    NA GREEN ROAD  NA DESERT Z0K2K1               FERGUSON BOB 999-577-3772
4:  3989 GREEN ROAD  NA DESERT Z0K2K1                 FITSCHEN A 999-577-3556
5:  2079 GREEN ROAD  NA DESERT Z0K2K1 BLACK CARY, AND BLACK RUTH 999-577-3779


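You don't have to use magrittr here; it's just my preference for the paste parts. You may also want to add a %>% sort step to those pipes (so that the lists of phones and names are always in increasing order).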
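As a rough sketch of both remarks, the same grouping can be written without magrittr and with a sort step added. The sample table is rebuilt by hand from the question's dput, keeping only the columns that matter here and swapping the two BLACK rows so that the sort has something to do; `res` is just an illustrative name.

```r
library(data.table)

# Hand-built sample based on the question's dput (an assumption, not the real data).
# The two BLACK rows are deliberately out of order to show the effect of sort().
DT <- data.table(
  Name   = c("DUMAS PAUL", "DUNN S", "FERGUSON BOB",
             "FITSCHEN A", "BLACK RUTH", "BLACK CARY"),
  House  = c("2030", NA, NA, "3989", "2079", "2079"),
  Street = "GREEN ROAD",
  Apt    = NA_character_,
  City   = "DESERT",
  Postal = "Z0K2K1",
  Phone  = c("999-577-3789", "999-577-3256", "999-577-3772",
             "999-577-3556", "999-577-3779", "999-577-3779")
)

res <- DT[,
  if (!(is.na(House) & is.na(Apt)))
    # collapse the group; sort() keeps the pasted lists in increasing order
    .(Name  = paste(sort(Name), collapse = ", AND "),
      Phone = paste(sort(unique(Phone)), collapse = " OR "))
  else
    # House and Apt both missing: return the rows unchanged
    .(Name, Phone),
  by = .(House, Street, Apt, City, Postal)]

print(res)
```

With this sample the two rows without a House stay separate, while the two BLACK rows collapse into one row whose names are pasted in alphabetical order.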

0

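I don't think there is a "pretty" solution to this; it is a kind of processing that doesn't fit the dplyr workflow well. One workaround is to uniquely identify the houses with empty data somehow, so that they are not grouped together. One way to do that is to put "#row_number" into House when it is empty. Those rows will then never be grouped together, because each empty row gets a different number. Once the processing is done, you can simply replace the values that start with # with an empty string or NA.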

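library(dplyr)
data %>% 
    # give each row with a missing House a unique placeholder so it forms its own group
    mutate(House = if_else(House == "" | is.na(House), paste0("#", row_number()), House)) %>% 
    # does the processing... %>% 
    # turn the placeholders back into empty strings
    mutate(House = if_else(startsWith(House, "#"), "", House))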
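To make the workaround concrete, here is a minimal end-to-end sketch with the processing step filled in by the group_by/summarise from the question. The sample tibble is rebuilt by hand from the question's dput (an assumption, not the real data), `result` is an illustrative name, and the placeholders are turned back into NA rather than "". Unlike the data.table answer, this keys only on House and does not look at Apt.

```r
library(dplyr)

# Hand-built sample based on the question's dput (only the columns used here)
data <- tibble(
  Name   = c("DUMAS PAUL", "DUNN S", "FERGUSON BOB",
             "FITSCHEN A", "BLACK CARY", "BLACK RUTH"),
  House  = c("2030", NA, NA, "3989", "2079", "2079"),
  Street = "GREEN ROAD",
  Apt    = NA_character_,
  City   = "DESERT",
  Postal = "Z0K2K1",
  Phone  = c("999-577-3789", "999-577-3256", "999-577-3772",
             "999-577-3556", "999-577-3779", "999-577-3779")
)

result <- data %>%
  # unique placeholder per row with a missing House, so those rows never share a group
  mutate(House = if_else(House == "" | is.na(House),
                         paste0("#", row_number()), House)) %>%
  group_by(House, Street, Apt, City, Postal) %>%
  summarise(Name  = paste(Name, collapse = ", AND "),
            Phone = paste(unique(Phone), collapse = " OR ")) %>%
  ungroup() %>%
  # restore the placeholders to NA
  mutate(House = if_else(startsWith(House, "#"), NA_character_, House))

print(result)
```

Any prefix works as the placeholder as long as it cannot occur in a real House value; "#" is simply the one suggested above.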