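python - How to create an "exponentially smoothed" variable (hard)
Problem description
I have a dataframe with IDs and the choices those IDs made. Each choice is related to a city. The choice set is the list of integers [10, 20, 30, 40, 50, 60], and the city set is the list of strings ['XX', 'YY', 'ZZ']. Note: one or more choices may be related to the same city. For example, choices 20 and 30 are both related to city 'YY'.
Here is the dataframe: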
ID choice city
1 10 XX
1 10 XX
1 20 YY
1 10 XX
1 30 YY
1 40 ZZ
2 20 YY
2 50 ZZ
2 50 ZZ
2 50 ZZ
2 10 XX
3 30 YY
3 30 YY
3 60 ZZ
3 60 ZZ
3 60 ZZ
3 10 XX
Here is the choice-to-city dataframe:
choice city
10 XX
20 YY
30 YY
40 ZZ
50 ZZ
60 ZZ
Another dataframe tells us how many choices each city has:
city count
XX 1
YY 2
ZZ 3
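The count table above can be derived from the choice-to-city table rather than typed by hand. A minimal sketch, assuming the mapping is held in a dataframe called choice_city (a name chosen here for illustration, not one from the question):

```python
import pandas as pd

# Choice-to-city mapping, copied from the question
choice_city = pd.DataFrame({
    "choice": [10, 20, 30, 40, 50, 60],
    "city": ["XX", "YY", "YY", "ZZ", "ZZ", "ZZ"],
})

# Number of choices related to each city
city_count = choice_city.groupby("city").size().reset_index(name="count")
print(city_count)
```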
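I want to create one variable per choice: '10_Var', '20_Var', '30_Var', '40_Var', '50_Var', '60_Var'. In the first row of each ID, if the first choice is related to city 'XX', for example, then the variable '10_Var' gets the value 0.8 / (# of choices related to that city), where 0.8 is a parameter, and every other variable, which is not related to that city, gets the value (1 - 0.8) / (# of choices - # of choices related to city 'XX').
After this step the data should look like this: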
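The first-row rule can be sketched for a single ID as follows. The helper name initial_row and the two dictionaries are illustrative assumptions, filled in from the tables in the question:

```python
import numpy as np

P = 0.8  # smoothing parameter from the question

# Assumed lookup tables, taken from the question's dataframes
choice_city = {10: "XX", 20: "YY", 30: "YY", 40: "ZZ", 50: "ZZ", 60: "ZZ"}
city_count = {"XX": 1, "YY": 2, "ZZ": 3}


def initial_row(first_city):
    """Hypothetical helper: starting values for one ID, given its first city."""
    cities = np.array(list(choice_city.values()))
    n_same = city_count[first_city]          # choices related to the first city
    n_other = len(choice_city) - n_same      # all remaining choices
    return np.where(cities == first_city, P / n_same, (1 - P) / n_other)


print(initial_row("XX"))  # first row of ID 1
print(initial_row("YY"))  # first row of IDs 2 and 3
```

With this rule the values in a first row always sum to 1, because P is shared among the choices of the first city and 1 - P among the rest.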
ID choice city 10_Var 20_Var 30_Var 40_Var 50_Var 60_Var
1 10 XX 0.80 0.04 0.04 0.04 0.04 0.04
1 10 XX NaN NaN NaN NaN NaN NaN
1 20 YY NaN NaN NaN NaN NaN NaN
1 10 XX NaN NaN NaN NaN NaN NaN
1 30 YY NaN NaN NaN NaN NaN NaN
1 40 ZZ NaN NaN NaN NaN NaN NaN
2 20 YY 0.05 0.40 0.40 0.05 0.05 0.05
2 50 ZZ NaN NaN NaN NaN NaN NaN
2 50 ZZ NaN NaN NaN NaN NaN NaN
2 50 ZZ NaN NaN NaN NaN NaN NaN
2 10 XX NaN NaN NaN NaN NaN NaN
3 30 YY 0.05 0.40 0.40 0.05 0.05 0.05
3 30 YY NaN NaN NaN NaN NaN NaN
3 60 ZZ NaN NaN NaN NaN NaN NaN
3 60 ZZ NaN NaN NaN NaN NaN NaN
3 60 ZZ NaN NaN NaN NaN NaN NaN
3 10 XX NaN NaN NaN NaN NaN NaN
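From the second row onward (for each ID), the variable '10_Var', for example, gets the value:
(0.8 * previous value) + (1 - 0.8) * {1 if the last choice (the one in the previous row) is related to city 'XX', 0 otherwise} / (# of choices related to city 'XX')
and likewise for every other variable.
Note: this must be done separately for each ID.
Expected result: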
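One step of this recurrence can be checked by hand against the first rows of ID 1. This is only a sketch: the helper name update and the cities array (the city of each variable, in column order) are assumptions introduced for illustration.

```python
import numpy as np

P = 0.8  # smoothing parameter from the question

# City behind each variable, in the order 10_Var ... 60_Var
cities = np.array(["XX", "YY", "YY", "ZZ", "ZZ", "ZZ"])
city_count = {"XX": 1, "YY": 2, "ZZ": 3}


def update(prev, last_city):
    """Hypothetical helper: one smoothing step from the previous row's values."""
    hit = (cities == last_city).astype(float)  # 1 where the variable's city matches
    return P * prev + (1 - P) * hit / city_count[last_city]


row1 = np.array([0.80, 0.04, 0.04, 0.04, 0.04, 0.04])  # ID 1, first row
row2 = update(row1, "XX")  # the choice in row 1 was 10 (XX)
row3 = update(row2, "XX")  # the choice in row 2 was 10 (XX)
row4 = update(row3, "YY")  # the choice in row 3 was 20 (YY)
print(np.round(np.vstack([row2, row3, row4]), 6))
```

Note that each row uses the city of the row before it, not its own city, which is why the effect of choosing 20 (YY) in the third row of ID 1 only shows up in the fourth row.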
ID choice city 10_Var 20_Var 30_Var 40_Var 50_Var 60_Var
1 10 XX 0.800000 0.040000 0.040000 0.040000 0.040000 0.040000
1 10 XX 0.840000 0.032000 0.032000 0.032000 0.032000 0.032000
1 20 YY 0.872000 0.025600 0.025600 0.025600 0.025600 0.025600
1 10 XX 0.697600 0.120480 0.120480 0.020480 0.020480 0.020480
1 30 YY 0.758080 0.096384 0.096384 0.016384 0.016384 0.016384
1 40 ZZ 0.606464 0.177107 0.177107 0.013107 0.013107 0.013107
2 20 YY 0.050000 0.400000 0.400000 0.050000 0.050000 0.050000
2 50 ZZ 0.040000 0.420000 0.420000 0.040000 0.040000 0.040000
2 50 ZZ 0.032000 0.336000 0.336000 0.098667 0.098667 0.098667
2 50 ZZ 0.025600 0.268800 0.268800 0.145600 0.145600 0.145600
2 10 XX 0.020480 0.215040 0.215040 0.183147 0.183147 0.183147
3 30 YY 0.050000 0.400000 0.400000 0.050000 0.050000 0.050000
3 30 YY 0.040000 0.420000 0.420000 0.040000 0.040000 0.040000
3 60 ZZ 0.032000 0.436000 0.436000 0.032000 0.032000 0.032000
3 60 ZZ 0.025600 0.348800 0.348800 0.092267 0.092267 0.092267
3 60 ZZ 0.020480 0.279040 0.279040 0.140480 0.140480 0.140480
3 10 XX 0.016384 0.223232 0.223232 0.179051 0.179051 0.179051
This question may help: Creating an "exponentially smoothed" variable - Pandas
Solution
Here is a possible solution:
import numpy as np
import pandas as pd

# Smoothing parameter
P = 0.8


def exp_smooth(g):
    # First row: share P among the choices of the first city
    # and 1 - P among all remaining choices.
    city = g.iloc[0].City
    rows = [np.where(cities == city,
                     P/cic[city],
                     (1-P)/(len(choices)-cic[city]))]
    # Later rows: decay the previous row by P, then add 1 - P shared
    # among the choices of the previous row's city.
    for i in range(len(g) - 1):
        city = g.iloc[i].City
        rows.append(rows[-1]*P+(1-P)*np.where(cities == city, 1, 0)/cic[city])
    return np.array(rows)


df = pd.DataFrame([[1, 10, "XX"], [1, 10, "XX"], [1, 20, "YY"], [1, 10, "XX"],
                   [1, 30, "YY"], [1, 40, "ZZ"], [2, 20, "YY"], [2, 50, "ZZ"],
                   [2, 50, "ZZ"], [2, 50, "ZZ"], [2, 10, "XX"], [3, 30, "YY"],
                   [3, 30, "YY"], [3, 60, "ZZ"], [3, 60, "ZZ"], [3, 60, "ZZ"],
                   [3, 10, "XX"]],
                  columns=("ID", "Choice", "City"))
# Choice -> city mapping, and number of choices per city
chc = {10: "XX", 20: "YY", 30: "YY", 40: "ZZ", 50: "ZZ", 60: "ZZ"}
cic = {"XX": 1, "YY": 2, "ZZ": 3}
choices = np.unique(df.Choice)
cities = np.vectorize(lambda ch: chc[ch])(choices)
# Stack the per-ID results; this relies on df being sorted by ID
# with a default integer index so that the rows line up in concat.
var_arr = np.concatenate([exp_smooth(g) for _, g in df.groupby("ID")], axis=0)
var_df = pd.DataFrame(var_arr, columns=[f"var_{c}" for c in choices])
df = pd.concat([df, var_df], axis=1)
print(df)
It produces the expected result:
ID Choice City var_10 var_20 var_30 var_40 var_50 var_60
0 1 10 XX 0.800000 0.040000 0.040000 0.040000 0.040000 0.040000
1 1 10 XX 0.840000 0.032000 0.032000 0.032000 0.032000 0.032000
2 1 20 YY 0.872000 0.025600 0.025600 0.025600 0.025600 0.025600
3 1 10 XX 0.697600 0.120480 0.120480 0.020480 0.020480 0.020480
4 1 30 YY 0.758080 0.096384 0.096384 0.016384 0.016384 0.016384
5 1 40 ZZ 0.606464 0.177107 0.177107 0.013107 0.013107 0.013107
6 2 20 YY 0.050000 0.400000 0.400000 0.050000 0.050000 0.050000
7 2 50 ZZ 0.040000 0.420000 0.420000 0.040000 0.040000 0.040000
8 2 50 ZZ 0.032000 0.336000 0.336000 0.098667 0.098667 0.098667
9 2 50 ZZ 0.025600 0.268800 0.268800 0.145600 0.145600 0.145600
10 2 10 XX 0.020480 0.215040 0.215040 0.183147 0.183147 0.183147
11 3 30 YY 0.050000 0.400000 0.400000 0.050000 0.050000 0.050000
12 3 30 YY 0.040000 0.420000 0.420000 0.040000 0.040000 0.040000
13 3 60 ZZ 0.032000 0.436000 0.436000 0.032000 0.032000 0.032000
14 3 60 ZZ 0.025600 0.348800 0.348800 0.092267 0.092267 0.092267
15 3 60 ZZ 0.020480 0.279040 0.279040 0.140480 0.140480 0.140480
16 3 10 XX 0.016384 0.223232 0.223232 0.179051 0.179051 0.179051
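The solution above appears to depend on the rows already being grouped by ID in sorted order and on df having a default integer index, since the stacked array is attached by position. If that cannot be guaranteed, one option is to carry each group's own index along and join on it. The following is a sketch of that variation, not the answer's code; the helper name smooth_group and the small interleaved test frame are made up for illustration.

```python
import numpy as np
import pandas as pd

P = 0.8
chc = {10: "XX", 20: "YY", 30: "YY", 40: "ZZ", 50: "ZZ", 60: "ZZ"}
cic = {"XX": 1, "YY": 2, "ZZ": 3}
choices = np.array(sorted(chc))
cities = np.array([chc[c] for c in choices])


def smooth_group(g):
    """Hypothetical helper: smoothed values for one ID, indexed like g."""
    city_seq = g["City"].to_numpy()
    first = city_seq[0]
    row = np.where(cities == first, P / cic[first],
                   (1 - P) / (len(choices) - cic[first]))
    rows = [row]
    # Each later row is driven by the city of the row before it
    for prev_city in city_seq[:-1]:
        row = P * row + (1 - P) * (cities == prev_city) / cic[prev_city]
        rows.append(row)
    return pd.DataFrame(rows, index=g.index,
                        columns=[f"var_{c}" for c in choices])


# Illustrative data: interleaved IDs and a non-default index
df = pd.DataFrame(
    {"ID": [2, 1, 2, 1],
     "Choice": [20, 10, 50, 10],
     "City": ["YY", "XX", "ZZ", "XX"]},
    index=[10, 11, 12, 13],
)
var_df = pd.concat([smooth_group(g) for _, g in df.groupby("ID", sort=False)])
out = df.join(var_df)  # aligns on the original index
print(out)
```

Because the join is on the index, the original row order of df is kept and each row receives the values computed within its own ID.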