How to generate a dataset from a given count, mean, standard deviation, min, max, etc.?

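Question

I have all the statistical details reported by the pandas DataFrame.describe() method, such as count, mean, standard deviation, min, and max. I need to generate a dataset from these details. Is there any application or Python code that can do this? Any random dataset that has these statistics would do.

count 263
mean 35.790875
std 24.874763
min 0.000000
25% 16.000000
50% 32.000000
75% 49.000000
max 99.000000

Tags: python, pandas, dataset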

Solution


Hello and welcome to the forum! This is a great question, and I like it.

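I think this is non-trivial in the general case. You can fairly easily create a dataset with the correct count, mean, min, max, and percentiles, but the standard deviation is quite tricky.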
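The standard deviation can still be tuned without disturbing the mean, because moving one value down and another value up by the same amount leaves the sum unchanged while changing the spread. A minimal sketch with made-up numbers (the list and the shift of 4 are illustrative assumptions, not the question's data):

```python
import statistics

data = [0, 16, 16, 32, 49, 49, 99]

# Shift one low value down and one higher value up by the same amount.
shifted = list(data)
shifted[1] -= 4
shifted[4] += 4

# The sum (and therefore the mean) is unchanged...
print(sum(data), sum(shifted))

# ...but the sample standard deviation grows, because the values moved apart.
print(statistics.stdev(data), statistics.stdev(shifted))
```

Pushing values apart raises the standard deviation and pulling them together lowers it, so the mean can be fixed first and the spread adjusted afterwards.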

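Here is one way to get a dataset that satisfies the requirements of your example. It can be adapted to the general case, but expect many edge cases. The basic idea is to satisfy the requirements one at a time, from the easiest to the hardest, taking care not to invalidate the earlier ones as you go.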

import math

import pandas as pd

COUNT = 263
MEAN = 35.790875
STD = 24.874763
MIN = 0
P25 = 16
P50 = 32
P75 = 49
MAX = 99

# Positions of the percentiles in the sorted data. pandas interpolates linearly
# (https://stackoverflow.com/questions/39581893/pandas-find-percentile-stats-of-a-given-column)
# between the two values around position q * (COUNT - 1), so we pin both of them.
def positions(q):
    pos = q * (COUNT - 1)
    return math.floor(pos), math.ceil(pos)

P25_lo, P25_hi = positions(0.25)  # 65, 66
P50_lo, P50_hi = positions(0.50)  # 131, 131
P75_lo, P75_hi = positions(0.75)  # 196, 197
MAX_pos = COUNT - 1

# Count requirement
v = [0.0] * COUNT

# Min requirement
v[0] = MIN

# Max requirement
v[MAX_pos] = MAX

# Good, we already satisfied the easiest 3 requirements.

# Percentile requirements: pin the values at the percentile positions...
v[P25_lo] = v[P25_hi] = P25
v[P50_lo] = v[P50_hi] = P50
v[P75_lo] = v[P75_hi] = P75

# ...and fill the gaps between them with constants that keep the list sorted.
# We could also interpolate between the percentiles, even adding a bit of randomness.
low = range(1, P25_lo)                # values must stay within [MIN, P25]
mid_low = range(P25_hi + 1, P50_lo)   # values must stay within [P25, P50]
mid_high = range(P50_hi + 1, P75_lo)  # values must stay within [P50, P75]
high = range(P75_hi + 1, MAX_pos)     # values must stay within [P75, MAX]
for i in low:
    v[i] = P25
for i in mid_low:
    v[i] = P50
for i in mid_high:
    v[i] = P50
for i in high:
    v[i] = P75

# This gives us correct 25%, 50%, 75%, min, max, and count values. We are still missing mean and std.

# The mean is now about 32.38 and we need 35.790875, so we raise values between
# the 75th percentile and the max. Those can move freely between P75 and MAX.
deficit = MEAN * COUNT - sum(v)
n_full = int(deficit // (MAX - P75))  # how many values can jump all the way to MAX
assert deficit >= 0 and n_full < len(high)
for i in range(MAX_pos - n_full, MAX_pos):
    v[i] = MAX

# The remainder goes into one more value. This comes from the formula for the
# average, AVG = SUM / COUNT, solved for that single value.
partial = MAX_pos - n_full - 1
new_value = MEAN * COUNT - sum(v) + v[partial]

# The new value must be between P75 and MAX so we don't invalidate the percentiles.
assert P75 <= new_value <= MAX
v[partial] = new_value

# Now comes the tricky part: we need the correct std. It is now about 20.75 and should be 24.874763.
# As we don't want to change the mean, we change values in pairs:
# every decrease is compensated by an equal increase.
def sample_std(values):
    return math.sqrt(sum((val - MEAN) ** 2 for val in values) / (COUNT - 1))

def spread(d):
    # Lower the values below the 25th percentile by d and raise the values
    # between the 50th and 75th percentile by d.
    w = list(v)
    for i, j in zip(low, mid_high):
        w[i] -= d
        w[j] += d
    return w

# d is limited so the lowered values stay >= MIN and the raised values stay <= P75
d_min, d_max = 0.0, float(min(P25 - MIN, P75 - P50))
assert sample_std(spread(d_min)) <= STD <= sample_std(spread(d_max))

# Instead of solving for d by hand, we approximate it by bisection (the std grows with d).
# See https://en.wikipedia.org/wiki/Root-finding_algorithms
for _ in range(60):
    d = (d_min + d_max) / 2
    if sample_std(spread(d)) < STD:
        d_min = d
    else:
        d_max = d
v = spread(d_min)

df = pd.DataFrame({'col': v})

# Voila!
print(df.describe())

Output:

              col
count  263.000000
mean    35.790875
std     24.874763
min      0.000000
25%     16.000000
50%     32.000000
75%     49.000000
max     99.000000
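
To double-check a candidate dataset without pandas, the same eight numbers can be recomputed with the standard library. This is a minimal sketch, assuming Python 3.8 or later (for `statistics.quantiles`); `summarize` is a hypothetical helper name, and the `toy` list is made up for illustration. `method="inclusive"` is chosen because it interpolates linearly between the two nearest sorted values, which matches the default pandas behaviour described in the link above.

```python
import statistics

def summarize(values):
    """Recompute the numbers that DataFrame.describe() reports, stdlib only."""
    q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample std (divides by n - 1)
        "min": min(values),
        "25%": q25,
        "50%": q50,
        "75%": q75,
        "max": max(values),
    }

# Toy data whose quartiles happen to match the ones in the question.
toy = [0, 16, 16, 32, 49, 49, 99]
for name, value in summarize(toy).items():
    print(f"{name:>5}  {value:.6f}")
```

Calling `summarize(v)` on a generated list and comparing each entry with its target, within a small tolerance, shows which requirement (if any) a later step has invalidated.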
