How to create an "exponential smoothing" variable (hard)

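Problem description

I have a data frame with IDs and the choices made by those IDs. Each choice is related to a city. The set of choices is the list of integers [10, 20, 30, 40, 50, 60], and the set of cities is the list of strings ['XX', 'YY', 'ZZ']. Note: one or more choices may be related to the same city. For example, choices 20 and 30 are both related to city 'YY'.

This is the data frame: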

 ID  choice city
  1      10   XX
  1      10   XX
  1      20   YY
  1      10   XX
  1      30   YY
  1      40   ZZ
  2      20   YY
  2      50   ZZ
  2      50   ZZ
  2      50   ZZ
  2      10   XX
  3      30   YY
  3      30   YY
  3      60   ZZ
  3      60   ZZ
  3      60   ZZ
  3      10   XX

This is the choice-city data frame:

 choice city
     10   XX
     20   YY
     30   YY
     40   ZZ
     50   ZZ
     60   ZZ

Another data frame tells us how many choices are related to each city:

city  count
  XX      1
  YY      2
  ZZ      3
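
The count table does not have to be maintained by hand, since it follows from the choice-city table. A minimal sketch, assuming the lookup is held in a pandas data frame with the column names shown above (the variable names `choice_city` and `city_count` are illustrative, not from the question):

```python
import pandas as pd

# Choice-to-city lookup, copied from the table above.
choice_city = pd.DataFrame({
    "choice": [10, 20, 30, 40, 50, 60],
    "city": ["XX", "YY", "YY", "ZZ", "ZZ", "ZZ"],
})

# Count the distinct choices related to each city.
city_count = (
    choice_city.groupby("city")["choice"]
    .nunique()
    .rename("count")
    .reset_index()
)
print(city_count)
```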

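I want to create one variable per choice: '10_Var', '20_Var', '30_Var', '40_Var', '50_Var', '60_Var'. In the first row of each ID, if the first choice is related to city 'XX', for example, then the variable '10_Var' gets the value 0.8 / (# of choices related to that city), where 0.8 is a parameter, and every other variable, whose choice is not related to that city, gets the value (1 - 0.8) / (# of choices - # of choices related to city 'XX').

This is how the data should look after this step: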

 ID  choice city  10_Var  20_Var  30_Var  40_Var  50_Var  60_Var
  1      10   XX    0.80    0.04    0.04    0.04    0.04    0.04
  1      10   XX     NaN     NaN     NaN     NaN     NaN     NaN
  1      20   YY     NaN     NaN     NaN     NaN     NaN     NaN
  1      10   XX     NaN     NaN     NaN     NaN     NaN     NaN
  1      30   YY     NaN     NaN     NaN     NaN     NaN     NaN
  1      40   ZZ     NaN     NaN     NaN     NaN     NaN     NaN
  2      20   YY    0.05    0.40    0.40    0.05    0.05    0.05
  2      50   ZZ     NaN     NaN     NaN     NaN     NaN     NaN
  2      50   ZZ     NaN     NaN     NaN     NaN     NaN     NaN
  2      50   ZZ     NaN     NaN     NaN     NaN     NaN     NaN
  2      10   XX     NaN     NaN     NaN     NaN     NaN     NaN
  3      30   YY    0.05    0.40    0.40    0.05    0.05    0.05
  3      30   YY     NaN     NaN     NaN     NaN     NaN     NaN
  3      60   ZZ     NaN     NaN     NaN     NaN     NaN     NaN
  3      60   ZZ     NaN     NaN     NaN     NaN     NaN     NaN
  3      60   ZZ     NaN     NaN     NaN     NaN     NaN     NaN
  3      10   XX     NaN     NaN     NaN     NaN     NaN     NaN
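
The first-row rule can be written down directly with NumPy. This is a sketch under the assumption that the six variables are kept in choice order; the helper name `initial_row` is made up for illustration:

```python
import numpy as np

P = 0.8  # smoothing parameter from the question
choices = np.array([10, 20, 30, 40, 50, 60])
cities = np.array(["XX", "YY", "YY", "ZZ", "ZZ", "ZZ"])  # city of each choice

def initial_row(first_city):
    """Values of the six variables in the first row of an ID."""
    related = cities == first_city   # choices that belong to the first city
    n_related = related.sum()
    # P is shared among the related choices, 1 - P among all the others.
    return np.where(related, P / n_related, (1 - P) / (len(choices) - n_related))

print(initial_row("XX"))  # first row of ID 1
print(initial_row("YY"))  # first row of IDs 2 and 3
```

For 'YY' both '20_Var' and '30_Var' receive 0.8 / 2, because two choices are related to that city.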

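From the second row onward (for each ID), the variable '10_Var', for example, gets the value (0.8 * previous value) + (1 - 0.8) * {1 if the **last** choice is related to city 'XX', 0 otherwise} / (# of choices related to city 'XX'), and likewise for every other variable. Here the "last" choice is the choice in the previous row, not the current one.

Note: this should be done separately for each ID.

Expected result: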

 ID  choice city    10_Var    20_Var    30_Var    40_Var    50_Var    60_Var
  1      10   XX  0.800000  0.040000  0.040000  0.040000  0.040000  0.040000
  1      10   XX  0.840000  0.032000  0.032000  0.032000  0.032000  0.032000
  1      20   YY  0.872000  0.025600  0.025600  0.025600  0.025600  0.025600
  1      10   XX  0.697600  0.120480  0.120480  0.020480  0.020480  0.020480
  1      30   YY  0.758080  0.096384  0.096384  0.016384  0.016384  0.016384
  1      40   ZZ  0.606464  0.177107  0.177107  0.013107  0.013107  0.013107
  2      20   YY  0.050000  0.400000  0.400000  0.050000  0.050000  0.050000
  2      50   ZZ  0.040000  0.420000  0.420000  0.040000  0.040000  0.040000
  2      50   ZZ  0.032000  0.336000  0.336000  0.098667  0.098667  0.098667
  2      50   ZZ  0.025600  0.268800  0.268800  0.145600  0.145600  0.145600
  2      10   XX  0.020480  0.215040  0.215040  0.183147  0.183147  0.183147
  3      30   YY  0.050000  0.400000  0.400000  0.050000  0.050000  0.050000
  3      30   YY  0.040000  0.420000  0.420000  0.040000  0.040000  0.040000
  3      60   ZZ  0.032000  0.436000  0.436000  0.032000  0.032000  0.032000
  3      60   ZZ  0.025600  0.348800  0.348800  0.092267  0.092267  0.092267
  3      60   ZZ  0.020480  0.279040  0.279040  0.140480  0.140480  0.140480
  3      10   XX  0.016384  0.223232  0.223232  0.179051  0.179051  0.179051
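
The recursion can be checked by hand against the first rows of ID 1. The sketch below applies one update at a time, using the city chosen in the previous row; the helper name `update` is illustrative only:

```python
import numpy as np

P = 0.8
cities = np.array(["XX", "YY", "YY", "ZZ", "ZZ", "ZZ"])  # city of each choice

def update(prev_values, prev_city):
    """One smoothing step, driven by the city chosen in the previous row."""
    related = cities == prev_city
    # Decay every value by P, then spread 1 - P over the related choices.
    return P * prev_values + (1 - P) * related / related.sum()

row0 = np.array([0.8, 0.04, 0.04, 0.04, 0.04, 0.04])  # first row of ID 1
row1 = update(row0, "XX")  # previous choice was 10 (XX)
row2 = update(row1, "XX")  # previous choice was 10 (XX)
row3 = update(row2, "YY")  # previous choice was 20 (YY)
print(np.round(np.vstack([row1, row2, row3]), 6))
```

Up to rounding, the three computed rows agree with rows 2 to 4 of ID 1 in the expected result.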

This question may help: Creating an "exponential smoothing" variable - Pandas

Tags: python, pandas

Solution


Here is one possible solution:

import numpy as np
import pandas as pd

# Parameter
P = 0.8

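def exp_smooth(g):
    """Return one row of smoothed values for each row of g (the rows of one ID).

    Uses the module-level names cities, cic and choices defined further down.
    """
    # First row: P is shared among the choices of the first city,
    # 1 - P among all other choices.
    city = g.iloc[0].City
    rows = [np.where(cities == city,
                     P/cic[city],
                     (1-P)/(len(choices)-cic[city]))]
    # Later rows: decay the previous values by P and add weight to the
    # choices of the city chosen in the previous row.
    for i in range(len(g) - 1):
        city = g.iloc[i].City
        rows.append(rows[-1]*P+(1-P)*np.where(cities == city, 1, 0)/cic[city])
    return np.array(rows)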

df = pd.DataFrame([[1, 10, "XX"], [1, 10, "XX"], [1, 20, "YY"], [1, 10, "XX"],
                   [1, 30, "YY"], [1, 40, "ZZ"], [2, 20, "YY"], [2, 50, "ZZ"],
                   [2, 50, "ZZ"], [2, 50, "ZZ"], [2, 10, "XX"], [3, 30, "YY"],
                   [3, 30, "YY"], [3, 60, "ZZ"], [3, 60, "ZZ"], [3, 60, "ZZ"],
                   [3, 10, "XX"]],
                  columns=("ID", "Choice", "City"))
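# chc maps each choice to its city; cic gives the number of choices per city.
chc = {10: "XX", 20: "YY", 30: "YY", 40: "ZZ", 50: "ZZ", 60: "ZZ"}
cic = {"XX": 1, "YY": 2, "ZZ": 3}
# Sorted distinct choices that occur in df, and the city of each one.
choices = np.unique(df.Choice)
cities = np.vectorize(lambda ch: chc[ch])(choices)

# groupby iterates the IDs in sorted order. The stacked rows line up with df
# because df is already sorted by ID and has the default 0..n-1 index.
var_arr = np.concatenate([exp_smooth(g) for _, g in df.groupby("ID")], axis=0)
var_df = pd.DataFrame(var_arr, columns=[f"var_{c}" for c in choices])
df = pd.concat([df, var_df], axis=1)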

df now contains the expected result:

    ID  Choice City    var_10    var_20    var_30    var_40    var_50    var_60
0    1      10   XX  0.800000  0.040000  0.040000  0.040000  0.040000  0.040000
1    1      10   XX  0.840000  0.032000  0.032000  0.032000  0.032000  0.032000
2    1      20   YY  0.872000  0.025600  0.025600  0.025600  0.025600  0.025600
3    1      10   XX  0.697600  0.120480  0.120480  0.020480  0.020480  0.020480
4    1      30   YY  0.758080  0.096384  0.096384  0.016384  0.016384  0.016384
5    1      40   ZZ  0.606464  0.177107  0.177107  0.013107  0.013107  0.013107
6    2      20   YY  0.050000  0.400000  0.400000  0.050000  0.050000  0.050000
7    2      50   ZZ  0.040000  0.420000  0.420000  0.040000  0.040000  0.040000
8    2      50   ZZ  0.032000  0.336000  0.336000  0.098667  0.098667  0.098667
9    2      50   ZZ  0.025600  0.268800  0.268800  0.145600  0.145600  0.145600
10   2      10   XX  0.020480  0.215040  0.215040  0.183147  0.183147  0.183147
11   3      30   YY  0.050000  0.400000  0.400000  0.050000  0.050000  0.050000
12   3      30   YY  0.040000  0.420000  0.420000  0.040000  0.040000  0.040000
13   3      60   ZZ  0.032000  0.436000  0.436000  0.032000  0.032000  0.032000
14   3      60   ZZ  0.025600  0.348800  0.348800  0.092267  0.092267  0.092267
15   3      60   ZZ  0.020480  0.279040  0.279040  0.140480  0.140480  0.140480
16   3      10   XX  0.016384  0.223232  0.223232  0.179051  0.179051  0.179051
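
The code above depends on module-level names and on df being sorted by ID. If that cannot be guaranteed, one possible variant passes the lookup in as an argument and aligns the new columns on the row index. This is only a sketch, not the original answer's code: the function name `smooth_by_id` and its parameters are introduced here for illustration, and it assumes df has a unique index.

```python
import numpy as np
import pandas as pd

def smooth_by_id(df, choice_city, p=0.8):
    """Return df plus one var_<choice> column per choice, smoothed per ID."""
    choices = np.array(sorted(choice_city))
    cities = np.array([choice_city[c] for c in choices])
    columns = [f"var_{c}" for c in choices]
    blocks = []
    for _, g in df.groupby("ID", sort=False):
        # First row: share p among the choices of the first city.
        related = cities == g["City"].iloc[0]
        n = related.sum()
        row = np.where(related, p / n, (1 - p) / (len(choices) - n))
        rows = [row]
        # Later rows: update with the city chosen in the previous row.
        for prev_city in g["City"].iloc[:-1]:
            related = cities == prev_city
            row = p * row + (1 - p) * related / related.sum()
            rows.append(row)
        # Keep the group's own index so the join below aligns row by row.
        blocks.append(pd.DataFrame(np.vstack(rows), index=g.index, columns=columns))
    return df.join(pd.concat(blocks))

chc = {10: "XX", 20: "YY", 30: "YY", 40: "ZZ", 50: "ZZ", 60: "ZZ"}
data = pd.DataFrame({
    "ID": [1] * 6 + [2] * 5 + [3] * 6,
    "Choice": [10, 10, 20, 10, 30, 40, 20, 50, 50, 50, 10, 30, 30, 60, 60, 60, 10],
})
data["City"] = data["Choice"].map(chc)

result = smooth_by_id(data, chc)
print(result.round(6))
```

Because the join is on the index, the rows of one ID do not have to be contiguous, as long as they appear in chronological order within each ID.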
