首页 > 技术文章 > joinby 命令多对多匹配

celine227 2021-07-13 11:28 原文

1. joinby 命令:多对多的匹配

*输入数据
clear
input group str3 x1
      1  "A"
      1  "B"
      1  "C"
      1  "D"
end
save file1.dta,replace

clear
input group str3 x2
      1  "M"
      1  "N"
end
save file2.dta,replace

*merge 多对多匹配

use file1.dta,clear
merge m:m group using file2.dta
list, clean noobs

*joinby 多对多匹配

use file1.dta,clear
joinby group using file2.dta
list, clean noobs

merge 命令多对多匹配结果:

group   x1   x2        _merge  
        1    A    M   matched (3)  
        1    B    N   matched (3)  
        1    C    N   matched (3)  
        1    D    N   matched (3)  

可以看出,merge 命令的多对多合并是有问题的,其会以较少数据文件的最后一行值 (比如这里的 file2.dta 的最后一行数据「group1,x2=N」) 进行重复合并。

joinby 命令多对多匹配结果:

group   x1   x2  
        1    A    N  
        1    A    M  
        1    B    M  
        1    B    N  
        1    C    N  
        1    C    M  
        1    D    N  
        1    D    M  

可以看出,joinby 命令显然更符合我们的要求。关于 joinby 命令更多详细介绍,请查看帮助文件 help joinby

2. nearmrg 命令:相似值的匹配

*生成一份数据

sysuse auto.dta, clear 
keep make price mpg
keep if make == "Toyota Celica" |    ///
        make == "BMW 320i" |         ///
        make == "Cad. Seville"  |    ///
        make == "Pont. Grand Prix" | ///
        make == "Datsun 210" 
rename make make2
save "using.dta", replace
list, clean noobs

列出数据:

make2               price   mpg  
Cad. Seville       15,906    21  
Pont. Grand Prix    5,222    19  
BMW 320i            9,735    25  
Datsun 210          4,589    35  
Toyota Celica       5,899    18  

然后,我们将该数据与 auto.dta 进行合并,并找出 using.dta 数据中价格浮动在 $50 上下的数据。

sysuse auto.dta, clear
nearmrg  using "using.dta", upper nearvar(price) genmatch(usingmatch) limit(50)
keep make price mpg make2 _merge usingmatch
list, clean noobs
make                price   mpg   make2                   _merge   usingm~h  
Datsun 210          4,589    35   Datsun 210         matched (3)      4,589  
Buick Regal         5,189    20   Pont. Grand Prix   matched (3)      5,222  
Pont. Grand Prix    5,222    19   Pont. Grand Prix   matched (3)      5,222  
Olds Cutl Supr      5,172    19   Pont. Grand Prix   matched (3)      5,222  
Dodge Magnum        5,886    16   Toyota Celica      matched (3)      5,899  
Toyota Celica       5,899    18   Toyota Celica      matched (3)      5,899  
BMW 320i            9,735    25   BMW 320i           matched (3)      9,735  
Audi 5000           9,690    17   BMW 320i           matched (3)      9,735  
Cad. Seville       15,906    21   Cad. Seville       matched (3)     15,906  

可以看出,using data 中原有 5 行数据,合并后变成了 9 行数据。之所以如此,是因为 auto.dta 中价格浮动在 50 之内的数据都被保留了下来。

推荐阅读