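python - How do I generate a dataset from a given count, mean, standard deviation, min, max, etc.?
Question
I have all the statistics reported by the pandas DataFrame.describe() method: count, mean, standard deviation, min, max, and so on. I need to generate a dataset from these figures. Is there an application or Python code that can do this? Any random dataset with these statistics will do.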
count    263
mean      35.790875
std       24.874763
min        0.000000
25%       16.000000
50%       32.000000
75%       49.000000
max       99.000000
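Solution
Hi, welcome to the forum! This is a great question, I like it.
I don't think this is trivial in the general case. You can build a dataset with the correct count, mean, min, and percentiles fairly easily, but the standard deviation is considerably trickier.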
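For reference, the quantities in question can be recomputed by hand, which makes it easier to see why the percentiles are easy to pin down and the standard deviation is not. The sketch below is not part of the original answer: `summarize` is a hypothetical helper name, and it assumes the pandas defaults, namely the sample standard deviation (ddof=1) and linearly interpolated percentiles.

```python
import math

def summarize(values):
    """Recompute describe()-style statistics for a list of at least two numbers.

    Assumes the pandas defaults: sample standard deviation (ddof=1)
    and linear interpolation between neighbouring order statistics.
    """
    s = sorted(values)
    n = len(s)
    mean = sum(s) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in s) / (n - 1))

    def percentile(q):
        pos = (n - 1) * q            # fractional index into the sorted data
        lo = math.floor(pos)
        hi = min(lo + 1, n - 1)
        return s[lo] + (s[hi] - s[lo]) * (pos - lo)

    return {
        "count": n, "mean": mean, "std": std, "min": s[0],
        "25%": percentile(0.25), "50%": percentile(0.50),
        "75%": percentile(0.75), "max": s[-1],
    }

# 25% -> 1.75, 50% -> 2.5, 75% -> 3.25 for this small input
print(summarize([1, 2, 3, 4]))
```

Each percentile depends on only one or two sorted values, so it can be fixed by choosing those values, whereas the mean and the standard deviation depend on every value at once.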
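Here is one way to get a dataset that satisfies the requirements of your example. It could be adapted to the general case, but expect many edge cases. The basic idea is to satisfy the requirements one at a time, from easiest to hardest, taking care not to invalidate an earlier requirement as you move on.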
import math
import pandas as pd
COUNT = 263
MEAN = 35.790875
STD = 24.874763
MIN = 0
P25 = 16
P50 = 32
P75 = 49
MAX = 99
#Positions (0-based indices) of the percentiles
P25_pos = math.floor(0.25 * COUNT) - 1
P50_pos = math.floor(0.5 * COUNT) - 1
P75_pos = math.floor(0.75 * COUNT) - 1
MAX_pos = COUNT - 1
#Count requirement
v = [0] * COUNT
#Min requirement
v[0] = MIN
#Max requirement
v[MAX_pos] = MAX
#Good, we already satisfied the easiest 3 requirements. Notice that these are deterministic,
#there is only one way to satisfy them
#This will satisfy the 25th percentile requirement
for i in range(1, P25_pos):
    #We could also interpolate the value from MIN to P25, even adding a bit of randomness.
    v[i] = P25
v[P25_pos] = P25
#Actually pandas does some linear interpolation (https://stackoverflow.com/questions/39581893/pandas-find-percentile-stats-of-a-given-column)
#when calculating percentiles, but we can simulate that by letting the next value also be P25
if P25_pos + 1 != P50_pos:
    v[P25_pos + 1] = P25
#We do something extremely similar with the other percentiles
#Note that v[P25_pos + 2] keeps its initial value 0; this does not disturb the percentiles
for i in range(P25_pos + 3, P50_pos):
    v[i] = P50
v[P50_pos] = P50
if P50_pos + 1 != P75_pos:
    v[P50_pos + 1] = P50
for i in range(P50_pos + 1, P75_pos):
    v[i] = P50
v[P75_pos] = P75
if P75_pos + 1 != MAX_pos:
    v[P75_pos + 1] = P75
for i in range(P75_pos + 1, MAX_pos):
    v[i] = P75
#This gives us correct 25%, 50%, 75%, min, max, and count values. We are still missing the mean and std.
#At this point the mean is about 32.32, and we need to increase it a little to get 35.790875. So we manually tweak the numbers between the 75th and 100th percentile,
#that is, the numbers between pos 197 and 261.
#This would be much harder to do automatically than with a hardcoded example.
#This increases the average a bit, but not enough!
for i in range(P75_pos + 1, 215):
    v[i] = MAX
#We solve an equation to get the value v[215] needs for the mean to be what we want.
#This equation comes from the formula for the average: AVG = SUM/COUNT. We simply solve that formula for v[215].
new_value = MEAN * COUNT - sum(v) + v[215]
#The new value for v[215] should be between P75 and MAX so we don't invalidate the percentiles.
assert(P75 <= new_value)
assert(new_value <= MAX)
v[215] = new_value
#Now comes the tricky part: we need the correct std. As of now, it is 20.916364, and it should be higher: 24.874763
#For this, as we don't want to change the average, we are going to change values in pairs,
#as we need to compensate each absolute increase with an absolute decrease
for i in range(1, P25_pos - 3):
    #We can move the values between the 0th and 25th percentile between 0 and 16
    v[i] -= 12
    #Between the 25th and 50th percentile, we can move the values between 32 and 49
    v[P25_pos + 1 + i] += 12
#As of now, this got us a std of 24.258115. We need it to be a bit higher: 24.874763
#The trick we did before of imposing a value for getting the correct mean is much harder to do here,
#because the equation is much more complicated
#So we'll just approximate the value instead with a while loop. There are faster ways than this, see: https://en.wikipedia.org/wiki/Root-finding_algorithms
#This loop takes many small steps, so it can run for a while
current_std = math.sqrt(sum([(val - MEAN)**2 for val in v])/(COUNT - 1))
while STD - current_std >= 10e-5:
    for i in range(1, P25_pos - 3):
        #We can move the values between the 0th and 25th percentile between 0 and 16
        v[i] -= 0.00001
        #Between the 25th and 50th percentile, we can move the values between 32 and 49
        v[P25_pos + 1 + i] += 0.00001
    current_std = math.sqrt(sum([(val - MEAN)**2 for val in v])/(COUNT - 1))
#We tweak some further decimal points now, using a single pair of values.
#The pair must keep moving apart (lower value down, higher value up), otherwise the std would shrink
#and this loop would never end.
while STD - current_std >= 10e-9:
    v[1] -= 0.0001
    v[P25_pos + 2] += 0.0001
    current_std = math.sqrt(sum([(val - MEAN)**2 for val in v])/(COUNT - 1))
df = pd.DataFrame({'col': v})
#Voila!
print(df.describe())
Output:
              col
count  263.000000
mean    35.790875
std     24.874763
min      0.000000
25%     16.000000
50%     32.000000
75%     49.000000
max     99.000000
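The code comments mention that there are faster ways than a while loop to reach the target std. One possibility, sketched here as an assumption rather than as part of the original answer, is to solve for the shift directly: if the same amount s is subtracted from k values and added to k other values, the sum (and so the mean) stays fixed and the sum of squared deviations is a quadratic in s. `solve_shift` is a hypothetical helper name; the caller still has to check that the shifted values stay inside the ranges that keep the percentiles valid.

```python
import math
import statistics

def solve_shift(values, lower_idx, upper_idx, target_std):
    """Return s such that subtracting s at lower_idx and adding s at
    upper_idx yields the target sample std (ddof=1).

    Assumes the two index lists are disjoint and of equal length,
    so the mean is unchanged by the shift.
    """
    assert len(lower_idx) == len(upper_idx)
    n = len(values)
    mean = sum(values) / n
    ss_now = sum((x - mean) ** 2 for x in values)
    ss_target = target_std ** 2 * (n - 1)
    # SS(s) = ss_now + b*s + a*s**2; the mean terms cancel because
    # both index lists have the same length.
    a = len(lower_idx) + len(upper_idx)
    b = 2 * (sum(values[i] for i in upper_idx)
             - sum(values[i] for i in lower_idx))
    c = ss_now - ss_target
    disc = b * b - 4 * a * c
    if disc < 0:
        raise ValueError("target std is not reachable with this pairing")
    return (-b + math.sqrt(disc)) / (2 * a)

# Small made-up example: raise the std of six values to 8.0
data = [0, 4, 4, 10, 10, 20]
low, high = [1, 2], [3, 4]
s = solve_shift(data, low, high, 8.0)
shifted = list(data)
for i in low:
    shifted[i] -= s
for i in high:
    shifted[i] += s
print(statistics.mean(shifted), statistics.stdev(shifted))
```

This replaces many small steps with one square root, at the cost of having to verify the bounds on the shifted values afterwards.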