How do I select the first 5 observations with respect to duplicates?

Problem description

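I have a large dataset with more than 80 000 000 rows, sorted by "name" and "income" (both name and income contain duplicates). For the first name, I want the 5 lowest incomes. For the second name, I also want the 5 lowest incomes, but any income already selected for the first name is disqualified from being selected again. And so on through the last name (if any incomes are still available by then).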

Tags: sas

Solution


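You need to sort the incomes and track which ones have already been output.

  • Use an array to sort and track the five lowest incomes of a name
  • Use a hash to track and check whether an income has already been output and is therefore ineligible for output under later names.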
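As a minimal sketch of the hash bookkeeping (the names seen, rc_before, and rc_after are illustrative only and do not come from the answer): check() returns 0 when the key is already stored and a nonzero code otherwise, and add() stores the current value of the key variable.

```sas
data _null_;
  length income 8;

  /* hash keyed on income */
  declare hash seen();
  seen.defineKey('income');
  seen.defineDone();

  income = 500;
  rc_before = seen.check();   /* nonzero: 500 has not been recorded yet */
  seen.add();                 /* record 500 as already output */
  rc_after  = seen.check();   /* 0: 500 is now recorded */

  put rc_before= rc_after=;   /* write both return codes to the log */
run;
```

This return-code convention is why a test of the form check() = 0 means "already taken, skip this row".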

Example:

An insertion sort keeps the eligible low-value incomes in order; because there are only 5 slots, it is fast.

data have;
  call streaminit(1234);
  do name = 1 to 1e6;
    do seq = 1 to rand('integer', 20);
      income = rand('integer', 20000, 1000000);
      output;
    end;
  end;
run;

data
  want (label='Lowest 5 incomes (first occurring over all names) of each name')
  want_barren(keep=name label='Names whose all incomes were previously output for earlier names')
;
  array X(5) _temporary_;

  if _n_ = 1 then do;
    if 0 then set have;
    declare hash incomes();
    incomes.defineKey('income');
    incomes.defineDone();
  end;

  * reset the five slots to a large sentinel for each new name;
  _maxmin5 = 1e15;
  do _i_ = 1 to dim(x);
    x(_i_) = 1e15;
  end;

  do _n_ = 1 by 1 until (last.name);
    set have;
    by name;

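    * skip incomes already claimed (check returns 0 when the key exists);
    if incomes.check() = 0 then continue;

    * insertion sort - lowest five not observed previously;

    if income > _maxmin5 then continue;

    do _i_ = 1 to 5;
      if income < x(_i_) then do;
        _bumped = x(5);   /* value about to be pushed out of the five slots */
        do _j_ = 5 to _i_+1 by -1;
          x(_j_) = x(_j_-1);
        end;
        x(_i_) = income;
        _maxmin5 = x(5);
        incomes.add();
        * a bumped income is no longer selected, so release it for later names;
        if _bumped < 1e15 then _rc = incomes.remove(key: _bumped);
        leave;
      end;
    end;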
  end;

  * second pass over the same name group (_n_ holds its row count) to output the selected rows;
  _outflag = 0;
  do _n_ = 1 to _n_;
    set have;

    if income in x then do;
      _outflag = 1;
      OUTPUT want;
    end;
  end;

  if not _outflag then 
    OUTPUT want_barren;

  drop _:;
run;
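
To check the step by hand, a tiny input can help. The sketch below is not part of the original answer: the dataset name have_small and its values are made up for illustration, and it assumes the input is already sorted by name and income, as described in the question.

```sas
/* Hypothetical hand-checkable input: three names, sorted by name and income. */
/* Name 1 has six incomes, name 2 repeats two of them, name 3 repeats only    */
/* incomes that name 1 already holds.                                         */
data have_small;
  input name income;
  datalines;
1 10
1 20
1 30
1 40
1 50
1 60
2 10
2 20
2 70
3 10
3 20
;
run;
```

If the three references to have in the step above are pointed at have_small, want should contain incomes 10 through 50 for name 1 and income 70 for name 2, and want_barren should contain name 3, because both of its incomes were already taken by name 1.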
