Limiting feature generation to specific entities in FeatureTools

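Problem description

I'm trying to understand how to use primitive_options in FeatureTools (version 0.16) to include only certain entities. According to the documentation, I should use include_entities:

List of entities to be included when creating features for the primitive(s). All other entities will be ignored (list[str]).

Simple case

Here is some example code: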

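import pprint

import featuretools as ft
from featuretools.primitives import GreaterThanScalar

# Demo EntitySet with customers, sessions, transactions, and products
esd1 = ft.demo.load_mock_customer(return_entityset=True)

# Print the feature definitions DFS generates for the "customers" entity
def run_dfs(esd, primitive_options=None):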
    feature_defs = ft.dfs(
        entityset=esd,
        target_entity="customers",
        agg_primitives=["count"],
        where_primitives=["count",GreaterThanScalar(value=0)],
        trans_primitives=[GreaterThanScalar(value=0)],
        primitive_options=primitive_options,
        max_depth=4,
        features_only=True
    )
    pprint.pprint(feature_defs)

run_dfs(esd1)

This produces:

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>,
 <Feature: COUNT(sessions) > 0>,
 <Feature: COUNT(transactions) > 0>]

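Suppose I'm interested in the session and transaction counts, and in whether the session count is greater than 0. Based on the documentation, I would reach for include_entities here: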

run_dfs(esd1, primitive_options={
          "greater_than_scalar":{
              "include_entities":['sessions']}
        })

However, the resulting output is:

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>]

Both GreaterThanScalar features are now gone. If I use ignore_entities instead, I get:

run_dfs(esd1, primitive_options={
            "greater_than_scalar":{
                "ignore_entities":["transactions"],
            }
        })

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>,
 <Feature: COUNT(sessions) > 0>]

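So that works, but I'm not sure why ignore_entities gives me the result I need while include_entities does not. Am I missing something?

A more complex case

Although I more or less got the simple case working, what I really want is something more complex. I want a boolean feature that tells me whether there were more than zero sessions on a specific device.

Doing this: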

esd2 = ft.demo.load_mock_customer(return_entityset=True)
esd2['sessions'].add_interesting_values()
run_dfs(esd2)

produces:

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>,
 <Feature: COUNT(sessions WHERE device = desktop)>,
 <Feature: COUNT(sessions WHERE device = tablet)>,
 <Feature: COUNT(sessions WHERE device = mobile)>,
 <Feature: COUNT(transactions) > 0>,
 <Feature: COUNT(sessions) > 0>,
 <Feature: COUNT(transactions WHERE sessions.device = mobile)>,
 <Feature: COUNT(transactions WHERE sessions.device = desktop)>,
 <Feature: COUNT(transactions WHERE sessions.device = tablet)>,
 <Feature: COUNT(sessions WHERE device = desktop) > 0>,
 <Feature: COUNT(sessions WHERE device = tablet) > 0>,
 <Feature: COUNT(sessions WHERE device = mobile) > 0>,
 <Feature: COUNT(transactions WHERE sessions.device = tablet) > 0>,
 <Feature: COUNT(transactions WHERE sessions.device = mobile) > 0>,
 <Feature: COUNT(transactions WHERE sessions.device = desktop) > 0>]

The features I need are the fourth through sixth from the bottom. If I try to restrict dfs to the sessions entity and the device variable:

run_dfs(esd2, primitive_options={
            "greater_than_scalar":{
                "ignore_entities":["transactions"],
                "include_variables":{"sessions":["device"]}
            }
        })

the result is:

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>,
 <Feature: COUNT(sessions WHERE device = desktop)>,
 <Feature: COUNT(sessions WHERE device = tablet)>,
 <Feature: COUNT(sessions WHERE device = mobile)>,
 <Feature: COUNT(transactions WHERE sessions.device = mobile)>,
 <Feature: COUNT(transactions WHERE sessions.device = desktop)>,
 <Feature: COUNT(transactions WHERE sessions.device = tablet)>]

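There are no GreaterThanScalar features.

Is there a way to have dfs give me just the three GreaterThanScalar features I want here?

Update: a third case

Is there a way to restrict what gets counted in the where clause? For example: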

esd3 = ft.demo.load_mock_customer(return_entityset=True)
esd3['sessions'].add_interesting_values()
esd3['products'].add_interesting_values()

run_dfs(esd3, primitive_options={
            "greater_than_scalar":{
                "ignore_entities":["transactions","sessions"],
            },
            "count":{
                "ignore_variables":{"transactions":['session_id']}
            }
        })

gives:

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>,
 <Feature: COUNT(sessions WHERE device = desktop)>,
 <Feature: COUNT(sessions WHERE device = tablet)>,
 <Feature: COUNT(sessions WHERE device = mobile)>,
 <Feature: COUNT(transactions WHERE sessions.device = mobile)>,
 <Feature: COUNT(transactions WHERE products.brand = B)>,
 <Feature: COUNT(transactions WHERE sessions.device = tablet)>,
 <Feature: COUNT(transactions WHERE products.brand = A)>,
 <Feature: COUNT(transactions WHERE sessions.device = desktop)>]

Is it possible to restrict the COUNT(transactions WHERE ...) features to products only? I still want to keep the COUNT(sessions ...) features.

Tags: featuretools

Solution


Adding 'session_id' from the 'sessions' entity to the include_variables option will generate the features you are looking for:

primitive_options={
    "greater_than_scalar":{
         "ignore_entities":["transactions"],
         "include_variables":{"sessions":["session_id", "device"]}}}

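The Count primitive uses the entity's index as its base, along with any where columns. If you include only the where columns in the GreaterThanScalar primitive options, dfs ends up ignoring all of the Count features, because they all rely on a column that is implicitly ignored (the entity index). In this case, the desired GreaterThanScalar features build on Count features from the 'sessions' entity, so adding the 'sessions' entity index ('session_id') to the include_variables option allows the desired features to be generated.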
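To avoid forgetting the index column, one option is a small helper that adds it to every include_variables entry. This is only a sketch: add_entity_indexes and its entity_indexes argument are hypothetical names, not part of FeatureTools, and the index names are supplied by hand.

```python
def add_entity_indexes(include_variables, entity_indexes):
    """Return a copy of include_variables with each entity's index column added.

    Hypothetical helper, not part of FeatureTools. entity_indexes maps an
    entity name to the name of its index column.
    """
    result = {}
    for entity, variables in include_variables.items():
        index = entity_indexes[entity]
        # Put the index first and avoid listing it twice.
        result[entity] = [index] + [v for v in variables if v != index]
    return result


include_variables = add_entity_indexes(
    {"sessions": ["device"]},       # the where column we care about
    {"sessions": "session_id"},     # index of the 'sessions' entity
)

primitive_options = {
    "greater_than_scalar": {
        "ignore_entities": ["transactions"],
        "include_variables": include_variables,
    }
}
print(primitive_options)
```

The resulting dictionary has the same content as the options shown above and can be passed to run_dfs(esd2, primitive_options=primitive_options).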

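Additionally, in the first example, which uses include_entities, the GreaterThanScalar features are missing because the 'customers' entity (the target entity) was not included. The Count features are all aggregation features on the 'customers' entity; they represent the number of items per customer. To use these Count features, the GreaterThanScalar primitive needs to be allowed to use the 'customers' entity, where the Count features live, as well as the entity the desired Count features are based on (in this case 'sessions').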
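Following that explanation, the include_entities version of the first example would list both entities. Below is a minimal sketch; include_entities_options is a hypothetical helper (not a FeatureTools function) that always keeps the target entity in the list, and I have not run the resulting options against FeatureTools 0.16 here.

```python
def include_entities_options(primitive_name, target_entity, source_entities):
    """Build a primitive_options dict that keeps the target entity in scope.

    Hypothetical helper, not part of FeatureTools.
    """
    # The target entity has to be listed as well, because the aggregation
    # features (e.g. COUNT(sessions)) are defined on it.
    entities = [target_entity] + [e for e in source_entities if e != target_entity]
    return {primitive_name: {"include_entities": entities}}


options = include_entities_options("greater_than_scalar", "customers", ["sessions"])
print(options)
# {'greater_than_scalar': {'include_entities': ['customers', 'sessions']}}
```

Passing this as run_dfs(esd1, primitive_options=options) should, by the reasoning above, bring back the COUNT(sessions) > 0 feature.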

