Average size of the intersection of random samples drawn from the same population

Problem description

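Suppose we have an urn containing N balls, and we draw several random samples of random sizes from it. The balls are returned to the urn after each sample is drawn, but each individual sample is drawn without replacement.

After removing every element that appears in at least 2 samples, I need to compute the average size of each sample.

For example, if N = 2 and we have one sample of 1 element and one sample of 2 elements, the average sizes after removing the intersection are 0 for the first sample and 1 for the second.

If N = 3, with 1 element in the first sample and 2 in the second, I believe the element of the first sample has a 2/3 chance of also being in the other sample, so the average size of the first sample is 1/3 and that of the second is 1 + 1/3 = 4/3.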
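The two small cases above can be checked exactly by enumerating every possible combination of samples. The following is a minimal sketch, not part of the original question: the function name `exact_sizes_by_enumeration` is hypothetical, and it assumes every subset of a given size is equally likely. It is only practical for tiny inputs because the number of outcomes grows very quickly.

```python
from fractions import Fraction
from itertools import combinations, product


def exact_sizes_by_enumeration(population_size, samples_sizes):
    """Exact expected size of each sample after dropping shared elements.

    Hypothetical helper: enumerates every equally likely tuple of samples.
    """
    population = range(population_size)
    # All possible subsets for each sample size.
    choices = [list(combinations(population, size)) for size in samples_sizes]
    totals = [0] * len(samples_sizes)
    outcomes = 0
    for draw in product(*choices):
        outcomes += 1
        sets = [set(s) for s in draw]
        for i, current in enumerate(sets):
            # Union of all samples except the current one.
            others = set().union(*(s for j, s in enumerate(sets) if j != i))
            # Elements of the current sample that appear in no other sample.
            totals[i] += len(current - others)
    return [Fraction(total, outcomes) for total in totals]


print(exact_sizes_by_enumeration(2, [1, 2]))  # [Fraction(0, 1), Fraction(1, 1)]
print(exact_sizes_by_enumeration(3, [1, 2]))  # [Fraction(1, 3), Fraction(4, 3)]
```

Both results agree with the values worked out by hand: 0 and 1 for N = 2, and 1/3 and 4/3 for N = 3.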

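I am struggling to find a formula that computes this in every case; I suppose it can be done from the number of combinations and the sample sizes.

My values are all small: N is below 100, the sample sizes are below 10, and the number of samples is 2, 3 or 4.

I can easily approximate this with a Monte Carlo simulation (see the code below), but directly applying the correct formula would be faster.

It may be easier to understand with some Python code that computes an approximation of what I want: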

from random import sample, randint


def simulations(population_size, samples_sizes, iterations_count=10000):
    population = range(population_size)
    samples_count = len(samples_sizes)
    average_intersection_size = [0 for _ in range(samples_count)]
    for iteration in range(1, iterations_count + 1):  # start from 1
        # generate random samples
        samples = []
        for sample_index in range(samples_count):
            samples.append(sample(population, samples_sizes[sample_index]))
        # count items overlapping
        for sample_index in range(samples_count):
            # retrieve intersection size with the union of other samples
            union_of_others = set()
            for other_sample_index in range(samples_count):
                if other_sample_index == sample_index:
                    # we skip current sample
                    continue
                union_of_others |= set(samples[other_sample_index])
            n = len(set(samples[sample_index]) & union_of_others)
            # incremental mean...
            delta = n - average_intersection_size[sample_index]
            average_intersection_size[sample_index] += delta / iteration
    # output results
    print(f'population size is {population_size}')
    for sample_index in range(samples_count):
        original_size = samples_sizes[sample_index]
        new_size = original_size - average_intersection_size[sample_index]
        print(f'sample {sample_index + 1}: original_size={original_size}, new_size={new_size}')


population_size = 12
samples_count = randint(2, 4)
samples_sizes = [randint(1, population_size) for _ in range(samples_count)]

simulations(population_size, samples_sizes)

"""Example outputs

population size is 12
sample 1: original_size=10, new_size=1.859099999999989
sample 2: original_size=4, new_size=0.18249999999999833
sample 3: original_size=7, new_size=0.51390000000002
sample 4: original_size=4, new_size=0.1847000000000114

population size is 12
sample 1: original_size=1, new_size=0.4761000000000001
sample 2: original_size=5, new_size=3.821899999999997
sample 3: original_size=2, new_size=1.0762999999999967

population size is 12
sample 1: original_size=4, new_size=0.0
sample 2: original_size=4, new_size=0.0
sample 3: original_size=6, new_size=0.0
sample 4: original_size=12, new_size=2.6712000000000167

population size is 12
sample 1: original_size=8, new_size=7.335500000000002
sample 2: original_size=1, new_size=0.33550000000000246

population size is 12
sample 1: original_size=9, new_size=0.7495999999999565
sample 2: original_size=11, new_size=2.7495999999999565
"""

My goal is to avoid the simulation and apply an exact formula directly.

Tags: python-3.x, math, combinations, probability, intersection

Solution
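One possible approach, offered as a sketch rather than a confirmed answer from the original page, uses linearity of expectation. A sample of size k_j drawn uniformly without replacement contains any fixed ball with probability k_j / N, and the samples are drawn independently of each other. A given element of sample i therefore survives the removal with probability equal to the product over all j != i of (1 - k_j / N), and the expected remaining size of sample i is k_i times that product. The function name `expected_sizes` below is hypothetical; `Fraction` is used only to keep the results exact.

```python
from fractions import Fraction


def expected_sizes(population_size, samples_sizes):
    """Expected size of each sample once elements shared with any other sample are removed."""
    result = []
    for i, size in enumerate(samples_sizes):
        # Probability that one given element of sample i is absent from every other sample.
        survival = Fraction(1)
        for j, other_size in enumerate(samples_sizes):
            if j != i:
                survival *= Fraction(population_size - other_size, population_size)
        # Linearity of expectation: each of the `size` elements survives with this probability.
        result.append(size * survival)
    return result


print(expected_sizes(3, [1, 2]))  # [Fraction(1, 3), Fraction(4, 3)]
for value in expected_sizes(12, [10, 4, 7, 4]):
    print(float(value))
```

For N = 3 with sizes 1 and 2 this reproduces 1/3 and 4/3. For N = 12 with sizes 10, 4, 7 and 4 it gives roughly 1.85, 0.19, 0.52 and 0.19, which is close to the simulated values in the first example output of the question.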

