Code to assign people to different households by age

Problem description

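I am generating a synthetic population in which the number of households of each size and age composition is known. I am trying to assign people, by age, to each of these households.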

The total number of people in each age group (the column sums) should be

Children   45196
Adult     148949
Senior     12195

and the total number of people living in households of each size (1-20) should be

1    2276
2    9366
3   23739
4   47636
5   42475
6   28338
7    3675
8    3728
9    3672
10   3830
11   3894
12   3792
13   3770
14   3710
15   3795
16   3648
17   3672
18   3744
19   3800
20   3780

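I tried to code this in Python as a system of linear equations. However, the solution contains negative values, which is of no use. The code is below. How can I modify it to produce the total number of people in each age group for each household size?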

import numpy as np

# Total Population
Population = 206340

# Number of Children, Adults and Seniors
Demography = np.array([ 45196, 148949,  12195])

# Number of households by size
Household_size_distribution = np.array([2276, 4683, 7913,11909,8495,4723,525,466,408,383,354,316,290,265,253,228,216,208,200,189])

# Probability that a person of a certain age group belongs to a household of a certain size
Age_Composition = np.array([[7.000e-04, 7.702e-01, 2.291e-01],
       [1.890e-02, 8.066e-01, 1.745e-01],
       [1.486e-01, 8.027e-01, 4.870e-02],
       [2.519e-01, 7.180e-01, 3.010e-02],
       [2.732e-01, 6.719e-01, 5.490e-02],
       [3.046e-01, 6.337e-01, 6.170e-02]])

# Store Age compositions
x = np.zeros((20,3),dtype=float)
x[:5,:] = Age_Composition[:5,:]
# Age composition is the same for households with 6 or more persons
x[5:20,:] = np.repeat(Age_Composition[5][np.newaxis,:], 15,0)

# Normalize the age compositions column-wise: Children, Adults and Seniors
y = np.zeros((20,3),dtype=float)
y[:,0] = x[:20,0]/np.sum(x[:20,0])
y[:,1] = x[:20,1]/np.sum(x[:20,1])
y[:,2] = x[:20,2]/np.sum(x[:20,2])

# Store Coefficients of 60 variables
w = np.zeros((23,60),dtype=float)
w[:20,:3] = x

z = np.zeros((20,3),dtype=float)
z[:,0] = y[:,0]
w[20] = np.reshape(z,(60))

z = np.zeros((20,3),dtype=float)
z[:,1] = y[:,1]
w[21] = np.reshape(z,(60))

z = np.zeros((20,3),dtype=float)
z[:,2] = y[:,2]
w[22] = np.reshape(z,(60))

rollnumber = np.arange(0,60,step=3)
for i in range(20):
  w[i]=np.roll(w[i],rollnumber[i]) 

# Ax=B  
A = w
B = np.zeros((23,1),dtype=int)
B[:20,:] = Household_size_distribution[:,None]*np.arange(1,21)[:,None]
B[20:,:] = Demography[:,None]

# Solution to the set of linear equations
f=np.matmul(np.linalg.pinv(A),B)
f=np.reshape(f,60)

Tags: python, numpy

Solution


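Since you want to generate synthetic data according to a distribution, I think the best approach is to use some random method rather than trying to solve it with linear algebra.

The data to be generated has 206340 elements, so it fits easily in memory.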

A simple way to construct a data sample in which category i has frequency f[i] is to set the attribute of f[i] elements to i. With numpy this is easy to do:

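# f[i] is the integer number of elements that belong to category i
category = np.zeros(np.sum(f), dtype=np.int64)
start = 0
for i, fi in enumerate(f):
  category[start:start+fi] = i
  start += fi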
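
The same construction can also be written without an explicit loop. This is a minimal sketch, assuming f is a sequence of non-negative integer counts; the values of f below are made up for illustration only.

```python
import numpy as np

# Hypothetical category counts, for illustration only
f = np.array([3, 1, 2])

# Repeat category index i exactly f[i] times
category = np.repeat(np.arange(len(f)), f)

# Counting the labels recovers the requested frequencies
counts = np.bincount(category, minlength=len(f))
print(category)
print(counts)
```

np.repeat produces the labels already grouped by category, which is the same layout the loop produces.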

Moreover, the frequencies are invariant under permutation.
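
This invariance is easy to check numerically. The sketch below uses made-up labels and an arbitrary seed (both are assumptions for illustration) and compares the per-category counts before and after shuffling.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Hypothetical labels: 4 items in category 0, 2 in category 1, 3 in category 2
labels = np.repeat(np.arange(3), [4, 2, 3])
shuffled = rng.permutation(labels)

# The order changes, the per-category counts do not
before = np.bincount(labels, minlength=3)
after = np.bincount(shuffled, minlength=3)
print(before, after)
```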

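So what we can do is assign people to households, putting ppl_by_household_size[i] people into households of size i+1. Then we assign the age groups according to ppl_by_age_group, and finally we shuffle the age_group array.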

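This gives a random sample in which person i belongs to age group age_group[i] and lives in the household with index household_id[i], which has size household_size[i]. From these arrays you can easily convert the data to the format you need.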

import numpy as np

ppl_by_age_group = [45196, 148949,  12195]
ppl_by_household_size = np.array([
    2276,9366,23739,47636,42475,28338,3675,3728,3672,3830,
    3894,3792,3770,3710,3795,3648,3672,3744,3800,3780])
# check that a solution exists: each total must be divisible by its household size
assert np.all(ppl_by_household_size % np.arange(1, len(ppl_by_household_size)+1) == 0)
assert np.sum(ppl_by_household_size) == np.sum(ppl_by_age_group)
nppl = np.sum(ppl_by_age_group)
age_group = np.zeros(nppl, np.int8)
household_size = np.zeros(nppl, np.int8)
household_id = np.zeros(nppl, np.int32)
cppl = 0
nhouseholds = 0
for i,n in enumerate(ppl_by_household_size):
    # this assigns ppl_by_household_size[i] people to
    # ppl_by_household_size[i] // (i + 1) households,
    # so each household has (i + 1) members
    household_id[cppl:cppl+n] = nhouseholds + np.arange(n) // (i+1)
    household_size[cppl:cppl+n] = i+1
    cppl += n
    nhouseholds += n // (i+1)

cppl = 0
for i,n in enumerate(ppl_by_age_group):
    # this assigns ppl_by_age_group[i] people to
    # age group i
    age_group[cppl:cppl+n] = i
    cppl += n
# shuffle the age groups
age_group = np.random.permutation(age_group)
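
One possible way to turn the per-person arrays into the table the question asks for (people per age group for each household size) is a cross-tabulation. The sketch below is self-contained: it rebuilds household_size and age_group with np.repeat instead of the loops, and it uses an arbitrary seed. The name table and the seed value are assumptions introduced here for illustration.

```python
import numpy as np

ppl_by_age_group = np.array([45196, 148949, 12195])
ppl_by_household_size = np.array([
    2276, 9366, 23739, 47636, 42475, 28338, 3675, 3728, 3672, 3830,
    3894, 3792, 3770, 3710, 3795, 3648, 3672, 3744, 3800, 3780])
sizes = np.arange(1, len(ppl_by_household_size) + 1)

# Rebuild the per-person arrays (same layout as the loops, via np.repeat)
household_size = np.repeat(sizes, ppl_by_household_size)
rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
age_group = rng.permutation(np.repeat(np.arange(3), ppl_by_age_group))

# table[s - 1, a] = number of people of age group a in households of size s
table = np.zeros((len(sizes), len(ppl_by_age_group)), dtype=np.int64)
np.add.at(table, (household_size - 1, age_group), 1)

print(table.sum(axis=1))  # people per household size
print(table.sum(axis=0))  # people per age group
```

Every entry of the table is a non-negative integer, and both sets of margins match the targets by construction.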

On my machine this runs in about 5 ms.

In the real world I would not expect age group and household size to be independent, but that is another topic.
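
If that dependence matters, one commonly used technique, which is not part of the approach above, is iterative proportional fitting (IPF). The sketch below starts from the age compositions given in the question and rescales rows and columns in turn until both sets of totals match. The function name ipf and its parameters are hypothetical, and the result holds fractional expected counts rather than integers, so it would still need rounding or sampling to produce individual people.

```python
import numpy as np

def ipf(seed, row_totals, col_totals, n_iter=1000, rtol=1e-10):
    """Rescale a strictly positive seed table until its margins match the targets."""
    table = seed.astype(float).copy()
    for _ in range(n_iter):
        table *= (row_totals / table.sum(axis=1))[:, None]  # match row sums
        table *= (col_totals / table.sum(axis=0))[None, :]  # match column sums
        if np.allclose(table.sum(axis=1), row_totals, rtol=rtol, atol=0):
            break
    return table

# Age composition per household size, copied from the question (sizes 1..6)
age_composition = np.array([
    [7.000e-04, 7.702e-01, 2.291e-01],
    [1.890e-02, 8.066e-01, 1.745e-01],
    [1.486e-01, 8.027e-01, 4.870e-02],
    [2.519e-01, 7.180e-01, 3.010e-02],
    [2.732e-01, 6.719e-01, 5.490e-02],
    [3.046e-01, 6.337e-01, 6.170e-02]])
# As in the question, sizes 6..20 all share the last row
composition = np.vstack([age_composition[:5],
                         np.repeat(age_composition[5:6], 15, axis=0)])

row_totals = np.array([
    2276, 9366, 23739, 47636, 42475, 28338, 3675, 3728, 3672, 3830,
    3894, 3792, 3770, 3710, 3795, 3648, 3672, 3744, 3800, 3780], dtype=float)
col_totals = np.array([45196, 148949, 12195], dtype=float)

# Seed: people per household size split according to the age composition
table = ipf(composition * row_totals[:, None], row_totals, col_totals)
print(np.round(table, 1))
```

Because the seed is strictly positive and IPF only multiplies by positive factors, the fitted table cannot contain negative entries, unlike a pseudo-inverse solution.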

