首页 > 解决方案 > 从 SAS 中的交叉分类嵌套组创建新组 ID 的最有效方法是什么?

问题描述

我有两张桌子。第一个有大约 370,000 行,其中包含 2010 年至 2019 年间大学课程的数据。相关列是:yearprogram_idcollege_id。我保留了program_namecollege_name只是为了让它更容易理解。这是我的数据的样子:

| year | college_id | college_name         | program_id | program_name           |
|------|------------|----------------------|------------|------------------------|
| 2010 | 001        | Wilbor College       | 001        | Civil Engineering      |
| 2010 | 001        | Wilbor College       | 002        | Electrical Engineering |
| 2010 | 001        | Wilbor College       | 003        | Mechanical Engineering |
| 2010 | 001        | Wilbor College       | 021        | English                |
| 2010 | 002        | Mary College         | 031        | Physics                |
| 2010 | 003        | Francis College      | 041        | Arts                   |
| 2019 | 001        | Wilbor College       | 004        | Engineering            |
| 2019 | 101        | South Wilbor College | 022        | English Teaching       |
| 2019 | 101        | South Wilbor College | 023        | English and Spanish    |
| 2019 | 101        | South Wilbor College | 024        | English Literature     |
| 2019 | 223        | Mary&Francis College | 031        | Physics                |
| 2019 | 223        | Mary&Francis College | 032        | Astronomy              |
| 2019 | 223        | Mary&Francis College | 033        | Geophysics             |
| 2019 | 223        | Mary&Francis College | 034        | Biophysics             |
| 2019 | 223        | Mary&Francis College | 041        | Arts                   |
| 2019 | 223        | Mary&Francis College | 042        | Visual Arts            |
| 2019 | 223        | Mary&Francis College | 043        | History of Art         |
| 2019 | 223        | Mary&Francis College | 044        | Cinema                 |

第二个表有一个包含大约 15,000 行的元数据,其中包含有关链接程序的信息,以及两列program_idlinked_program_id. 这是它的外观:


| program_id | linked_program_id |
|------------|-------------------|
| 001        | 002               |
| 004        | 001               |
| 004        | 002               |
| 004        | 003               |
| 021        | 022               |
| 021        | 023               |
| 023        | 021               |
| 023        | 024               |
| 031        | 032               |
| 032        | 031               |
| 033        | 031               |
| 034        | 031               |
| 041        | 042               |
| 041        | 043               |
| 042        | 044               |

第二个表很重要,因为它允许识别相互链接的程序。有几种类型的链接,但简单地说,链接是值program_id可以随时间变化的情况。通常发生这些变化是因为多年来有些程序被拆分或合并。

大学也会发生这种情况(拆分和合并),但我没有大学的相同元数据。但是,如果我多年来可以跟踪他们的程序,就可以创建它。

我想要的是在第一个数据集中添加两个变量:

  1. group_p_idprogram_id:对于多年来以某种方式联系在一起的所有值的 id 都是相同的;
  2. group_c_idcollege_id:对于多年来以某种方式联系在一起的所有值的 id 都是相同的。

生成的数据集如下所示:

| year | col_id | college_name         | prog_id | program_name           | group_p_id | group_c_id |
|------|--------|----------------------|---------|------------------------|------------|------------|
| 2010 | 001    | Wilbor College       | 001     | Civil Engineering      | 004        | 001        |
| 2010 | 001    | Wilbor College       | 002     | Electrical Engineering | 004        | 001        |
| 2010 | 001    | Wilbor College       | 003     | Mechanical Engineering | 004        | 001        |
| 2010 | 001    | Wilbor College       | 021     | English                | 021        | 001        |
| 2010 | 002    | Mary College         | 031     | Physics                | 031        | 223        |
| 2010 | 003    | Francis College      | 041     | Arts                   | 041        | 223        |
| 2019 | 001    | Wilbor College       | 005     | Engineering            | 004        | 001        |
| 2019 | 101    | South Wilbor College | 022     | English Teaching       | 021        | 001        |
| 2019 | 101    | South Wilbor College | 023     | English and Spanish    | 021        | 001        |
| 2019 | 101    | South Wilbor College | 024     | English Literature     | 021        | 001        |
| 2019 | 223    | Mary&Francis College | 031     | Physics                | 031        | 223        |
| 2019 | 223    | Mary&Francis College | 032     | Astronomy              | 031        | 223        |
| 2019 | 223    | Mary&Francis College | 033     | Geophysics             | 031        | 223        |
| 2019 | 223    | Mary&Francis College | 034     | Biophysics             | 031        | 223        |
| 2019 | 223    | Mary&Francis College | 041     | Arts                   | 041        | 223        |
| 2019 | 223    | Mary&Francis College | 042     | Visual Arts            | 041        | 223        |
| 2019 | 223    | Mary&Francis College | 043     | History of Art         | 041        | 223        |
| 2019 | 223    | Mary&Francis College | 044     | Cinema                 | 041        | 223        |

以下是 SAS 输入格式的数据集(ID 采用数字格式以便于转换):

data have1;
INPUT year 1-4 college_id 6-8 college_name $10-29 program_id 31-33 program_name $35-56;
datalines;
2010 001 Wilbor College       001 Civil Engineering
2010 001 Wilbor College       002 Electrical Engineering
2010 001 Wilbor College       003 Mechanical Engineering
2010 001 Wilbor College       021 English
2010 002 Mary College         031 Physics
2010 003 Francis College      041 Arts
2019 001 Wilbor College       004 Engineering
2019 101 South Wilbor College 022 English Teaching
2019 101 South Wilbor College 023 English and_Spanish
2019 101 South Wilbor College 024 English Literature
2019 223 Mary&Francis College 031 Physics
2019 223 Mary&Francis College 032 Astronomy
2019 223 Mary&Francis_College 033 Geophysics
2019 223 Mary&Francis College 034 Biophysics
2019 223 Mary&Francis College 041 Arts
2019 223 Mary&Francis College 042 Visual Arts
2019 223 Mary&Francis College 043 History of Art
2019 223 Mary&Francis College 044 Cinema
;
run;

data have2;
INPUT program_id 1-3 linked_program_id 5-7;
datalines;
001 002
004 001
004 002
004 003
021 022
021 023
023 021
023 024
031 032
032 031
033 031
034 031
041 042
041 043
042 044
;
run;

data want;
INPUT year 1-4 college_id 6-8 college_name $10-29 program_id 31-33 program_name $35-56 broad_c_id 58-60 broad_c_id 62-64;
datalines;
2010 001 Wilbor College       001 Civil Engineering      004 001
2010 001 Wilbor College       002 Electrical Engineering 004 001
2010 001 Wilbor College       003 Mechanical Engineering 004 001
2010 001 Wilbor College       021 English                021 001
2010 002 Mary College         031 Physics                031 223
2010 003 Francis College      041 Arts                   041 223
2019 001 Wilbor College       005 Engineering            004 001
2019 101 South Wilbor College 022 English Teaching       021 001
2019 101 South Wilbor College 023 English and Spanish    021 001
2019 101 South Wilbor College 024 English Literature     021 001
2019 223 Mary&Francis College 031 Physics                031 223
2019 223 Mary&Francis College 032 Astronomy              031 223
2019 223 Mary&Francis College 033 Geophysics             031 223
2019 223 Mary&Francis College 034 Biophysics             031 223
2019 223 Mary&Francis College 041 Arts                   041 223
2019 223 Mary&Francis College 042 Visual Arts            041 223
2019 223 Mary&Francis College 043 History of Art         041 223
2019 223 Mary&Francis College 044 Cinema                 041 223
;
run;

棘手的是第二个表(元数据)组织得不是很好。一些 id以“多对多”的方式出现在列program_id中。linked_program_id

我试图自己解决这个问题,但是代码变得非常复杂,我不确定我是否做对了。所以这就是我所做的:

  1. 确保每个 id 都列在两列中(asprogram_idlinked_program_id):
proc sql;
create table temp_1 as
select distinct
program_id,
linked_program_id
from have2
union
select distinct
linked_program_id as program_id,
program_id as linked_program_id
from have2;
quit;
  1. 通过linked_program_id计算program_id的最大值,反之亦然,然后是每一行的最大值:
proc sql;
create table temp_2 as 
select
max(ID_0, ID_1) as ID_2,
*
from (select 
   max(linked_program_id) as ID_1, 
   *                 
   from (select 
        max(program_id) as ID_0, 
        *
        from temp_1
        group by linked_program_id) 
    group by program_id);
quit;

  1. 最后一步将相同的新 id 归因于大多数链接程序,但不是全部。因此,我在宏中重复了上一步的相同逻辑 3 次(将新 idid_2作为第一个输入):

(对于我在此处提供的示例数据集,运行一次循环就足够了。但是在我的 15,000 行数据集中,我必须运行它 3 次,直到我为所有链接程序获得相同的 id。)

%macro max_rep;
proc sql;
%do i=3 %to 9 %by 3; 
create table temp_%eval(&i.+2) as 
select
max(ID_&i., ID_%eval(&i.+1)) as ID_%eval(&i.+2),
program_id, linked_program_id
from (select 
    max(ID_&i.) as ID_%eval(&i.+1),
    *
    from (select 
        max(ID_%eval(&i.-1)) as ID_&i.,
        *
        from temp_%eval(&i.-1)
        group by linked_program_id) 
    group by program_id);

drop table temp_%eval(&i.-1);

%end;
quit;
%mend;
%max_rep;
  1. have1最后,我在and之间运行了一个连接temp_11
proc sql;
create table want as
select
t1.*,
t2.ID_11 as group_p_id
from have1 as t1 
left join (select distinct ID_11, program_id from temp_11) as t2
on t1.program_id=t2.program_id;
quit;

这给了我想要的一半。现在,对于属于同一组链接程序的每个程序,我都有一个唯一的 ID。我根据其他变量做了一些额外的更改,以获得每个链接组的更广泛的程序作为我的group_p_id(而不仅仅是最大值)。然后我意识到我仍然需要弄清楚如何为大学创建一个新的组 ID(group_c_id)。必须有一种更简单、更有效的方法是 SAS,不是吗?

任何帮助表示赞赏。

标签: sassas-macroproc-sql

解决方案


推荐阅读