featuretools: cumulative unique_value counts using timestamps, grouped by user

Problem description

I have a dataset like this:

   user_id  event_name  event_timestamp             origin
0  1001790  deals       2020-01-01 12:07:05.089002
1  1001818  purchase    2019-10-30 09:15:38.810000  ICN
2  1001969  deals       2019-12-16 01:11:06.595004
3  1001969  deals       2019-12-16 01:11:22.811008
4  1001969  purchase    2019-12-21 12:20:24.405000  PUS
5  1001969  view_item   2019-12-21 12:22:01.318000  ICN

import featuretools as ft

es = ft.EntitySet(id="dataset")

variable_types = {
    'event_timestamp': ft.variable_types.Datetime,
    'user_id': ft.variable_types.Id,
    'origin': ft.variable_types.Categorical,
    'event_name': ft.variable_types.Categorical,
}

es.entity_from_dataframe(
    entity_id='total',
    dataframe=total,
    index='event_timestamp',
    variable_types=variable_types,
)

es.normalize_entity(
    base_entity_id='total',
    new_entity_id='users',
    index='user_id',
    copy_variables=['event_timestamp'],
    make_time_index=False,
)

es.normalize_entity(
    base_entity_id='total',
    new_entity_id='origin',
    index='origin',
    make_time_index=False,
)

es.normalize_entity(
    base_entity_id='total',
    new_entity_id='event_name',
    index='event_name',
    make_time_index=False,
)

I want a result like this:

                                    NUM_UNIQUE(total.event_name)  NUM_UNIQUE(total.origin)
user_id time                                                                              
1001818 2019-10-30 09:15:38.810000                         1                             1
1001969 2019-12-21 12:11:06.595004                         1                             0
        2019-12-21 12:11:22.811008                         1                             0
        2019-12-21 12:20:24.405000                         1                             1
        2019-12-21 12:22:01.318000                         2                             2
1001790 2020-01-01 12:07:05.089002                         1                             1 

So if I set the window to 5 minutes, then for user_id 1001969 the cumulative count should not carry over between the second and third rows.

Tags: featuretools

Solution


You can get a rolling window by applying a training window at each cutoff time. These are the cutoff times:

 user_id                       time
 1001969 2019-12-21 12:11:06.595004
 1001969 2019-12-21 12:11:22.811008
 1001969 2019-12-21 12:20:24.405000
 1001969 2019-12-21 12:22:01.318000
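
The answer does not show how the `cutoff_time` object was created. A minimal sketch of one way to build it with pandas, assuming it is a DataFrame whose first column identifies the user (`user_id`) and whose second column holds the cutoff timestamps (`time`), with values copied from the table above:

```python
import pandas as pd

# Cutoff times for user 1001969, copied from the table above.
# Assumed layout: instance id column first, then a datetime column named "time".
cutoff_time = pd.DataFrame({
    "user_id": [1001969] * 4,
    "time": pd.to_datetime([
        "2019-12-21 12:11:06.595004",
        "2019-12-21 12:11:22.811008",
        "2019-12-21 12:20:24.405000",
        "2019-12-21 12:22:01.318000",
    ]),
})

print(cutoff_time)
```

Each row asks for one set of feature values: the features for that user as of that timestamp.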

In DFS, I applied a 5-minute training window to each cutoff time.

fm, fd = ft.dfs(
    target_entity='users',
    entityset=es,
    agg_primitives=['num_unique'],
    trans_primitives=[],
    cutoff_time=cutoff_time,
    cutoff_time_in_index=True,
    training_window='5 minutes',
)

With that, the cumulative counts work as expected, as the following output shows.

                                    NUM_UNIQUE(total.origin)  NUM_UNIQUE(total.event_name)
user_id time                                                                              
1001969 2019-12-21 12:11:06.595004                       NaN                           NaN
        2019-12-21 12:11:22.811008                       NaN                           NaN
        2019-12-21 12:20:24.405000                       1.0                           1.0
        2019-12-21 12:22:01.318000                       2.0                           2.0
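
To see why these numbers come out, the windowing logic can be reproduced in plain pandas. This is only a sketch of the idea, not how featuretools implements it. It assumes the window for each cutoff covers the 5 minutes up to and including the cutoff, which is consistent with the output above. The helper `windowed_nunique` and the column names `n_event_name` and `n_origin` are hypothetical.

```python
import pandas as pd

# Events for user 1001969, copied from the sample data in the question.
events = pd.DataFrame({
    "user_id": [1001969] * 4,
    "event_name": ["deals", "deals", "purchase", "view_item"],
    "event_timestamp": pd.to_datetime([
        "2019-12-16 01:11:06.595004",
        "2019-12-16 01:11:22.811008",
        "2019-12-21 12:20:24.405000",
        "2019-12-21 12:22:01.318000",
    ]),
    "origin": [None, None, "PUS", "ICN"],
})

# Cutoff times from the answer.
cutoffs = pd.DataFrame({
    "user_id": [1001969] * 4,
    "time": pd.to_datetime([
        "2019-12-21 12:11:06.595004",
        "2019-12-21 12:11:22.811008",
        "2019-12-21 12:20:24.405000",
        "2019-12-21 12:22:01.318000",
    ]),
})


def windowed_nunique(events, cutoffs, window):
    """Count distinct values per user within (cutoff - window, cutoff]."""
    rows = []
    for user_id, cutoff in cutoffs.itertuples(index=False):
        in_window = events[
            (events["user_id"] == user_id)
            & (events["event_timestamp"] > cutoff - window)
            & (events["event_timestamp"] <= cutoff)
        ]
        rows.append({
            "user_id": user_id,
            "time": cutoff,
            # nunique() ignores missing values and returns 0 for an empty window
            "n_event_name": in_window["event_name"].nunique(),
            "n_origin": in_window["origin"].nunique(),
        })
    return pd.DataFrame(rows).set_index(["user_id", "time"])


result = windowed_nunique(events, cutoffs, pd.Timedelta(minutes=5))
print(result)
```

The first two cutoffs have no events in the preceding 5 minutes, so this sketch reports 0 where the DFS output shows NaN. The last two cutoffs see one and then two distinct values, matching the 1.0 and 2.0 above.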
