2015-02-12 76 views
1

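Calculating the maximum difference between rows within a group

I have the following data, in which the residents of each house are sorted by age (oldest to youngest):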

data houses;    
input HouseID PersonID Age;  
datalines;    
1 1 25      
1 2 20     
2 1 32 
2 2 16 
2 3 14 
2 4 12 
3 1 44 
3 2 42 
3 3 10 
3 4 5 
; 
run; 

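I would like to calculate the maximum age difference between consecutive people within each household. For this example, that would give 5 (= 25-20), 16 (= 32-16) and 32 (= 42-10) for households 1, 2 and 3 respectively.

I could do this with a lot of merges (i.e. extract person 1, merge with an extract of person 2, and so on), but since a household can contain 20 or more people, I am looking for a more direct method.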

Answers

6

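Here is a two-pass solution. As in the other answers, the first step is to sort by age. In the second step, keep track of max_diff from row to row and output the result on the last record of each HouseID. This needs only two passes through the data.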

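/* Sort oldest-first within each house, as the question specifies */
proc sort data=houses; by houseid descending age; run;

data want;
set houses;
by houseID;

retain max_diff 0;

/* dif1(age) is current age minus previous age; negate it so an age drop is positive */
diff = -dif1(age);

/* The first row of a house has no predecessor within that house */
if first.HouseID then do;
    diff = .; max_diff = .;
end;

if diff > max_diff then max_diff = diff;
if last.houseID then output;

keep houseID max_diff;
run;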
+0

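Just be careful at the start: the OP states that the data should be in descending age order. It works here because personid appears to be assigned starting with the oldest person, but that may not be the case in real data. – Longfish 2015-02-12 10:30:07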

+0

You're right, I just copied and pasted the initial code. I'll edit the solution, thanks! – Reeza 2015-02-12 14:37:09

2
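/* Relies on PersonID being assigned oldest-first within each house */
proc sort data=houses; by houseid personid age; run;

data _t1;
set houses;
/* Negated difference from the previous row: previous age minus current age */
diff = dif1(age) * (-1);
/* Person 1 starts a new house, so there is no previous resident to compare with */
if personid = 1 then diff = .;
run;

proc sql;
create table want as
select houseid, max(diff) as Max_Diff
from _t1
group by houseid;
quit;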
+0

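Just be careful at the start: the OP states that the data should be in descending age order. It works here because personid appears to be assigned starting with the oldest person, but that may not be the case in real data. – Longfish 2015-02-12 10:30:30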

+0

Good comment. In this case personid is assigned by descending age within each household, but that may not be true for other users with the same problem. – user2568648 2015-02-12 10:41:25

+0

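Code-only answers are generally not considered good answers. An answer should explain how and why it works; in fact, that part of the answer is more important than the code. – Joe 2015-02-12 15:37:01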
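One way to address the concern raised in the comments above is to sort by descending age and detect the start of each house with FIRST.HouseID rather than testing PersonID = 1. The following is a minimal sketch under that assumption, not the answerer's original code; the dataset name houses_sorted is hypothetical and is used only so the input dataset is left untouched.

```sas
/* Sort oldest-first within each house so consecutive rows are age neighbours */
proc sort data=houses out=houses_sorted;  /* houses_sorted: hypothetical name */
    by HouseID descending Age;
run;

data _t1;
    set houses_sorted;
    by HouseID;
    /* DIF1 runs on every row so its queue stays in step with the data */
    diff = -dif1(Age);
    /* The first row of a house has no predecessor within that house */
    if first.HouseID then diff = .;
run;

proc sql;
    create table want as
    select HouseID, max(diff) as Max_Diff
    from _t1
    group by HouseID;
quit;
```

With the sample data from the question, this should give 5, 16 and 32 for houses 1, 2 and 3. A house with a single resident ends up with a missing Max_Diff, because the only diff it has is missing.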

2
proc sort data = houses;
by houseid descending age;
run;

data houses;
set houses;
by houseid;
/* Age from the previous row */
lag_age = lag1(age);
/* Note: this value is overwritten by the unconditional assignment below */
if first.houseid then age_diff = 0;
age_diff = lag_age - age;
run;

proc sql;
select houseid, max(age_diff) as max_age_diff
from houses
group by houseid;
quit;

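How it works:

First, the dataset is sorted by houseid and descending age. Next, the data step computes the difference between the age value currently in the PDV (program data vector) and the age value from the previous row. Finally, PROC SQL returns the maximum age difference for each houseid.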

+0

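Thanks a lot. Building on this, I only had to add a dummy value for the oldest person in each house, because their age_diff is calculated by subtracting their age from that of the youngest person in the previous household; in this example, the age_diff for person 1 of house 3 is calculated as -32. This could cause errors: for example, if the youngest person in house 2 were aged 80, then age_diff would be 36, so max(age_diff) would be 36 instead of the correct value of 32. – user2568648 2015-02-12 10:07:23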
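The adjustment described in this comment could look something like the sketch below. It assumes that a missing value is an acceptable dummy for the oldest person in each house, since the SQL max() aggregate ignores missing values. The dataset names houses_sorted and house_diffs are hypothetical.

```sas
proc sort data=houses out=houses_sorted;  /* hypothetical output name */
    by HouseID descending Age;
run;

data house_diffs;                         /* hypothetical output name */
    set houses_sorted;
    by HouseID;
    /* LAG1 is called on every row so its queue is never skipped */
    lag_age = lag1(Age);
    /* Oldest person in a house: no valid predecessor, so use a missing dummy */
    if first.HouseID then age_diff = .;
    else age_diff = lag_age - Age;
run;

proc sql;
    select HouseID, max(age_diff) as max_age_diff
    from house_diffs
    group by HouseID;
quit;
```

The ELSE keeps the cross-house difference (such as the -32 mentioned above) from ever being stored, so it cannot influence the per-house maximum.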

2

Just throwing one more into the mix. This is a condensed version of Reeza's answer.

/* No need to sort by PersonID as age is the only concern */ 
proc sort data = houses; 
    by HouseID Age; 
run; 
data want; 
    set houses; 
    by HouseID; 
    /* Keep the diff when a new row is loaded */ 
    retain diff; 
    /* Only replace the diff if it is larger than previous */ 
    diff = max(diff, abs(dif(Age))); 
    /* Reset diff for each new house */ 
    if first.HouseID then diff = 0; 
    /* Only output the final diff for each house */ 
    if last.HouseID; 
    keep HouseID diff; 
run; 
0

Below is an example that uses FIRST. and LAST. to make a single pass through the data (after sorting).

data houses;    
input HouseID PersonID Age;  
datalines;    
1 1 25      
1 2 20     
2 1 32 
2 2 16 
2 3 14 
2 4 12 
3 1 44 
3 2 42 
3 3 10 
3 4 5 
; 
run; 

Proc sort data=HOUSES; 
by houseid descending age ; 
run; 

Data WANT(keep=houseid max_diff); 
format houseid max_diff; 
retain max_diff age1 age2; 
Set HOUSES; 

by houseid descending age ; 

if first.houseid and last.houseid then do; 
    max_diff=0; 
    output; 
end; 
else if first.houseid then do; 
    call missing(max_diff,age1,age2); 
    age1=age; 
end; 
else if not(first.houseid or last.houseid) then do; 
    age2=age; 
    temp=age1-age2; 
    if temp>max_diff then max_diff=temp; 
    age1=age; 
end; 
else if last.houseid then do; 
    age2=age; 
    temp=age1-age2; 
    if temp>max_diff then max_diff=temp; 
    output; 
end; 
Run; 