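featuretools - Restricting feature generation to specific entities in FeatureTools
Problem description
I am trying to understand how to use primitive_options in FeatureTools (version 0.16) to include only a certain entity. According to the documentation, I should use include_entities:
List of entities to include when creating features for the primitive. All other entities will be ignored (list[str]).
Simple case
Here is some sample code: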
import pprint
import featuretools as ft
from featuretools.primitives import GreaterThanScalar
esd1 = ft.demo.load_mock_customer(return_entityset=True)
def run_dfs(esd, primitive_options={}):
    # Print the feature definitions that DFS generates for "customers".
    feature_defs = ft.dfs(
        entityset=esd,
        target_entity="customers",
        agg_primitives=["count"],
        where_primitives=["count", GreaterThanScalar(value=0)],
        trans_primitives=[GreaterThanScalar(value=0)],
        primitive_options=primitive_options,
        max_depth=4,
        features_only=True,
    )
    pprint.pprint(feature_defs)
run_dfs(esd1)
This produces:
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions) > 0>,
<Feature: COUNT(transactions) > 0>]
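Suppose I am interested in the session and transaction counts, and in whether the session count is greater than 0. Based on the documentation, I would use include_entities here: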
run_dfs(esd1, primitive_options={
    "greater_than_scalar": {
        "include_entities": ["sessions"],
    }
})
However, the resulting output is:
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>]
Both GreaterThanScalar features are now gone. If I use ignore_entities instead, I get:
run_dfs(esd1, primitive_options={
    "greater_than_scalar": {
        "ignore_entities": ["transactions"],
    }
})
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions) > 0>]
So that works, but I am not sure why ignore_entities gives the result I need while include_entities does not. Am I missing something?
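A more complex case
Although I more or less got the simple case working, what I really want is something more complex. I want a boolean feature that tells me whether there were more than zero sessions on a particular device.
Doing this: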
esd2 = ft.demo.load_mock_customer(return_entityset=True)
esd2['sessions'].add_interesting_values()
run_dfs(esd2)
produces:
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions WHERE device = desktop)>,
<Feature: COUNT(sessions WHERE device = tablet)>,
<Feature: COUNT(sessions WHERE device = mobile)>,
<Feature: COUNT(transactions) > 0>,
<Feature: COUNT(sessions) > 0>,
<Feature: COUNT(transactions WHERE sessions.device = mobile)>,
<Feature: COUNT(transactions WHERE sessions.device = desktop)>,
<Feature: COUNT(transactions WHERE sessions.device = tablet)>,
<Feature: COUNT(sessions WHERE device = desktop) > 0>,
<Feature: COUNT(sessions WHERE device = tablet) > 0>,
<Feature: COUNT(sessions WHERE device = mobile) > 0>,
<Feature: COUNT(transactions WHERE sessions.device = tablet) > 0>,
<Feature: COUNT(transactions WHERE sessions.device = mobile) > 0>,
<Feature: COUNT(transactions WHERE sessions.device = desktop) > 0>]
The features I need are the 4th through 6th from the bottom. If I try to restrict dfs to the sessions entity and the device variable:
run_dfs(esd2, primitive_options={
    "greater_than_scalar": {
        "ignore_entities": ["transactions"],
        "include_variables": {"sessions": ["device"]},
    }
})
the result is:
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions WHERE device = desktop)>,
<Feature: COUNT(sessions WHERE device = tablet)>,
<Feature: COUNT(sessions WHERE device = mobile)>,
<Feature: COUNT(transactions WHERE sessions.device = mobile)>,
<Feature: COUNT(transactions WHERE sessions.device = desktop)>,
<Feature: COUNT(transactions WHERE sessions.device = tablet)>]
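There are no GreaterThanScalar features.
Is there a way to have dfs give me only the three GreaterThanScalar features I want here?
Update: a third case
Is there a way to restrict what gets counted in the where clause? For example: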
esd3 = ft.demo.load_mock_customer(return_entityset=True)
esd3['sessions'].add_interesting_values()
esd3['products'].add_interesting_values()
run_dfs(esd3, primitive_options={
    "greater_than_scalar": {
        "ignore_entities": ["transactions", "sessions"],
    },
    "count": {
        "ignore_variables": {"transactions": ["session_id"]},
    }
})
gives:
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions WHERE device = desktop)>,
<Feature: COUNT(sessions WHERE device = tablet)>,
<Feature: COUNT(sessions WHERE device = mobile)>,
<Feature: COUNT(transactions WHERE sessions.device = mobile)>,
<Feature: COUNT(transactions WHERE products.brand = B)>,
<Feature: COUNT(transactions WHERE sessions.device = tablet)>,
<Feature: COUNT(transactions WHERE products.brand = A)>,
<Feature: COUNT(transactions WHERE sessions.device = desktop)>]
Is it possible to restrict the COUNT(transactions WHERE ...) features to products only? I would still like to keep the COUNT(sessions ...) features.
Solution
Adding 'session_id' from the 'sessions' entity to the include_variables option will generate the features you are looking for:
primitive_options={
    "greater_than_scalar": {
        "ignore_entities": ["transactions"],
        "include_variables": {"sessions": ["session_id", "device"]},
    }
}
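The Count primitive uses the index of an entity as its base, along with any where columns. If you include only the where column in the GreaterThanScalar primitive options, dfs ends up ignoring all of the Count features, because they all use an implicitly ignored column (the entity index). In this case, the desired GreaterThanScalar features are built on Count features that use the 'sessions' entity, so adding the 'sessions' entity index ('session_id') to the include_variables option allows the desired features to be generated.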
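To make that reasoning concrete, here is a toy model of the filtering rule described above. It is only an illustration under stated assumptions: `primitive_allows` is a hypothetical helper, not part of featuretools, and the dependency sets are written out by hand rather than taken from the library.

```python
# Toy model only: NOT featuretools code. It mimics the rule described above,
# where a primitive may use a feature only if every variable that feature
# depends on is allowed by include_variables.
def primitive_allows(feature_deps, include_variables):
    """Return True if all (entity, variable) dependencies are allowed.

    Entities that are absent from include_variables are left unrestricted.
    """
    for entity, variable in feature_deps:
        allowed = include_variables.get(entity)
        if allowed is not None and variable not in allowed:
            return False
    return True

# COUNT(sessions WHERE device = ...) depends on the sessions index (what is
# counted) and on the where column. These sets are assumed for illustration.
count_by_device = {("sessions", "session_id"), ("sessions", "device")}

only_device = {"sessions": ["device"]}
with_index = {"sessions": ["session_id", "device"]}

print(primitive_allows(count_by_device, only_device))  # index is excluded
print(primitive_allows(count_by_device, with_index))   # index is included
```

With only 'device' listed, the index dependency fails the check, which mirrors why no GreaterThanScalar features appeared in the question's second example.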
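Additionally, in the first example, which uses include_entities, the GreaterThanScalar features are missing because the 'customers' entity (the target entity) was not included. The Count features are all aggregation features in the 'customers' entity; they represent the number of items per customer. To use these Count features, the GreaterThanScalar primitive needs to be allowed to use the 'customers' entity, where the Count features live, as well as the entity the desired Count features are based on (in this case 'sessions').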
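Applied to the first example, that explanation suggests options along the following lines. This is a sketch that simply follows the reasoning above; it has not been verified against featuretools 0.16 here, and the variable name `include_options` is only for illustration.

```python
# Sketch based on the explanation above; not verified against featuretools 0.16.
include_options = {
    "greater_than_scalar": {
        # "customers" is the target entity, where the COUNT features live;
        # "sessions" is the entity the desired COUNT feature is computed from.
        "include_entities": ["customers", "sessions"],
    }
}
# Usage with the helper from the question:
#     run_dfs(esd1, primitive_options=include_options)
```

If the explanation holds, this should keep COUNT(sessions) > 0 while leaving out COUNT(transactions) > 0, matching what ignore_entities produced in the question.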