Sampling bias correction by adding weights

Problem description

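I have a dataset (a sample, or from a survey) of 400,000 person IDs, each with the demographic categories that person belongs to (age, ethnicity, and education level). The first rows (person IDs 0 to 30):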


person id,age,education,ethnicity
0,75_84,Some College,white
1,85_120,HS Diploma,white
2,25_34,Some College,white
3,55_64,HS Diploma,black
4,45_54,Bachelor Degree,white
5,25_34,HS Diploma,white
6,55_64,Some College,white
7,45_54,HS Diploma,white
8,18_24,Some College,white
9,75_84,Some College,white
10,45_54,HS Diploma,black
11,55_64,Some College,white
12,55_64,Graduate Degree,white
13,55_64,Graduate Degree,black
14,18_24,Some College,white
15,25_34,Some College,white
16,25_34,Some College,white
17,45_54,HS Diploma,white
18,65_74,,white
19,55_64,HS Diploma,black
20,55_64,HS Diploma,black
21,55_64,HS Diploma,black
22,35_44,Some College,white
23,35_44,Some College,white
24,35_44,Some College,white
25,18_24,Some College,black
26,55_64,Some College,white
27,55_64,Some College,white
28,55_64,Bachelor Degree,white
29,55_64,Bachelor Degree,white
30,25_34,Bachelor Degree,white

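Using Python, how can I compute a set of person-level weights (one weight per person) that de-bias the dataset? Within each category, the weights should sum to the count given for that category in the demographic ground truth dataset.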


Demographic ground truth dataset:

demographic category,number of individuals
18_24,11839159
25_34,16399632
35_44,15335704
45_54,16430762
55_64,15148777
65_74,9990412
75_84,5221430
0_4,7500407
5_9,7748669
10_14,7815759
15_17,4758751
85_120,2293226
< Than HS Diploma,12274025
Bachelor Degree,16305721
Graduate Degree,9343192
HS Diploma,25799018
Some College,28937146
asian,6145151
black,14626476
hispanic,21953456
islander,190389
white,73838168

Tags: python

Solution


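import pandas

# df is assumed to be the sample dataset shown in the question,
# with the columns: person id, age, education, ethnicity
answer = {'demographic category': [],
          'number of individuals': [],
          }

# Count how many sampled people fall into each category of each
# demographic column (rows with a missing value are skipped)
for col in ['age', 'education', 'ethnicity']:
    for k in df[col].dropna().unique():
        answer['demographic category'].append(k)
        answer['number of individuals'].append(df[df[col] == k].shape[0])

# Same layout as the ground truth table, but holding the sample counts
answer = pandas.DataFrame(answer)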
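
The code above only tabulates how many sampled people fall into each category; it does not by itself produce weights. One common way to get person-level weights whose per-category sums match known totals is raking (iterative proportional fitting). The following is a minimal sketch of that technique, not the method of the answer above. The function name `rake`, the toy `sample` frame, and the `targets` numbers are all hypothetical and chosen so that both dimensions add up to the same total (100). The ground truth table in the question does not have that property: its age, education, and ethnicity counts sum to roughly 120.5 million, 92.7 million, and 116.8 million respectively, so the targets would probably need to be reconciled (for example by restricting to a common population) before all three dimensions can be matched at once.

```python
import pandas as pd


def rake(df, targets, max_iter=100, tol=1e-9):
    """Raking sketch: rescale weights one demographic column at a time.

    df      : one row per person, one column per demographic dimension
    targets : {column name: {category: target total}} (hypothetical layout)
    """
    w = pd.Series(1.0, index=df.index)  # start with equal weights
    for _ in range(max_iter):
        max_change = 0.0
        for col, target in targets.items():
            # Current weighted total per category (missing values are ignored)
            current = w.groupby(df[col]).sum()
            # Multiplicative correction per category: target / current
            factor = pd.Series(target) / current
            # Rows with a missing or untargeted category keep their weight
            adj = df[col].map(factor).fillna(1.0)
            max_change = max(max_change, (adj - 1.0).abs().max())
            w = w * adj
        if max_change < tol:  # every correction is ~1, so margins are matched
            break
    return w


# Toy data, not taken from the question
sample = pd.DataFrame({
    "age": ["18_24", "18_24", "25_34", "25_34", "25_34"],
    "ethnicity": ["white", "black", "white", "white", "black"],
})
targets = {
    "age": {"18_24": 40, "25_34": 60},
    "ethnicity": {"white": 70, "black": 30},
}

weights = rake(sample, targets)
print(weights.groupby(sample["age"]).sum())
print(weights.groupby(sample["ethnicity"]).sum())
```

In this sketch a person with a missing value in one column (such as person 18 in the question, who has no education level) is simply left unadjusted for that column; whether that is acceptable depends on the analysis.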
