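Checking for duplicate IDs

I have data with a list of people's names and their ID numbers. Some people are listed two or three times. Each person has an ID number; if they are listed more than once, their ID number stays the same as long as it is the same person. Like this: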

Name     ID
david    1
david    1
john     2
john     2
john     2
john     2
megan    3
bill     4
barbara  5
chris    6
chris    6

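I need to make sure these ID numbers are correct and that different people do not share the same ID number. To do that, I want to create a new variable that assigns new ID numbers, so that I can compare the new ID numbers against the old ones. I want to write a command that says "if their names are the same, make their ID numbers the same." How would I do that? Does this make sense?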

Source

2017-08-10 Rachel

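Take the unique names, add an ID, then merge it back. – Wen

Won't I be unable to use unique(Name) with the original data set, since the lengths will differ after the merge? – Rachel

You will be able to merge. A merge is a lookup based on common values, similar to DLookup in Access, or VLOOKUP and HLOOKUP in Excel or Calc. –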
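The "unique names, add an ID, merge" idea from the comments could be sketched roughly as follows in base R. The data frame mirrors the example in the question; the names `df`, `lookup`, `NewID` and `mismatch` are illustrative choices, not something the commenters specified.

```r
# Example data copied from the question
df <- data.frame(
  Name = c("david", "david", "john", "john", "john", "john",
           "megan", "bill", "barbara", "chris", "chris"),
  ID   = c(1, 1, 2, 2, 2, 2, 3, 4, 5, 6, 6),
  stringsAsFactors = FALSE
)

# One row per distinct name, numbered in order of first appearance
lookup <- data.frame(Name = unique(df$Name), stringsAsFactors = FALSE)
lookup$NewID <- seq_len(nrow(lookup))

# merge() matches on Name, so the shorter lookup table is expanded
# back to one row per original record (rows come back sorted by Name)
merged <- merge(df, lookup, by = "Name", all.x = TRUE)

# Rows where the old ID and the name-based ID disagree
mismatch <- merged[merged$ID != merged$NewID, ]
```

The differing lengths are not a problem because the merge repeats each lookup row for every matching record. Note that the new IDs only line up with the old ones if the originals were also assigned in order of first appearance; otherwise `mismatch` flags rows that are merely numbered differently, so a per-name count of distinct old IDs may be the more robust check.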

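There are many ways to do this, some of which are suggested above. I usually use a dplyr version to spot and remove duplicate/bad cases. Depending on your goal, here are examples of various outputs.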

library(dplyr) 

# example with one bad case 
dt <- data.frame(Name = c("david","davud","John","John","megan"),
                 ID = c(1,1,2,3,3), stringsAsFactors = FALSE)


# spot names with more than 1 unique IDs 
dt %>% 
    group_by(Name) %>% 
    summarise(NumIDs = n_distinct(ID)) %>% 
    filter(NumIDs > 1) 

# # A tibble: 1 x 2
#   Name  NumIDs
#   <chr>  <int>
# 1 John       2


# spot names with more than 1 unique IDs and the actual IDs 
dt %>% 
    group_by(Name) %>% 
    mutate(NumIDs = n_distinct(ID)) %>% 
    filter(NumIDs > 1) %>% 
    ungroup() 

# # A tibble: 2 x 3
#   Name     ID NumIDs
#   <chr> <dbl>  <int>
# 1 John      2      2
# 2 John      3      2


# spot names with more than 1 unique IDs and the actual IDs - alternative 
dt %>% 
    group_by(Name) %>% 
    mutate(NumIDs = n_distinct(ID)) %>% 
    filter(NumIDs > 1) %>% 
    group_by(Name, NumIDs) %>% 
    summarise(IDs = paste0(ID, collapse=",")) %>% 
    ungroup() 

# # A tibble: 1 x 3
#   Name  NumIDs IDs
#   <chr>  <int> <chr>
# 1 John       2 2,3
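The question also asks whether different people share the same ID. The same dplyr pattern can be turned around to group by ID instead of Name. This is only a sketch on the same toy data; `shared_ids`, `NumNames` and `Names` are hypothetical names introduced here.

```r
library(dplyr)

# Same toy data as in the answer
dt <- data.frame(Name = c("david", "davud", "John", "John", "megan"),
                 ID = c(1, 1, 2, 3, 3), stringsAsFactors = FALSE)

# spot IDs that are attached to more than 1 unique name
shared_ids <- dt %>%
    group_by(ID) %>%
    summarise(NumNames = n_distinct(Name),
              Names = paste(unique(Name), collapse = ",")) %>%
    filter(NumNames > 1)

shared_ids
```

On this data the check should flag ID 1 (david and the misspelled davud) and ID 3 (John and megan), which the name-based checks alone would not surface.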

Source

2017-08-11 12:08:39 AntoniosK
