首页 > 解决方案 > Snowflake: clustering on datetime key stored in variant field does not work / do partition pruning

问题描述

We are ingesting data into Snowflake via the kafka connector. To increase the data read performance / scan less partitions we decided to add a clustering key to a a key / combination of keys stored in the RECORD_CONTENT variant field.

The data in the RECORD_CONTENT field looks like this:

{
  "jsonSrc": {
    "Integerfield": 1,
    "SourceDateTime": "2020-06-30 05:33:08:345",
    *REST_OF_THE_KEY_VALUE_PAIRS*
}

Now, the issue is that clustering on a datetime col like SourceDateTime does NOT work:

CLUSTER BY (to_date(RECORD_CONTENT:jsonSrc:loadDts::datetime))

...while clustering on a field like Integerfield DOES work:

CLUSTER BY (RECORD_CONTENT:jsonSrc:Integerfield::int )

Not working means: when using a filter on RECORD_CONTENT:jsonSrc:loadDts::datetime, it has no effect on the partitions scanned, while filtering on RECORD_CONTENT:jsonSrc:Integerfield::int does perform partition pruning.

What is wrong here? Is this a bug?

Note that:

标签: database-performancesnowflake-cloud-data-platformclustering-key

解决方案


为了更好地修剪和减少存储消耗,如果您的半结构化数据包括:日期和时间戳,尤其是非 ISO 8601 日期和时间戳,我们建议将您的对象和关键数据扁平化到单独的关系列中,作为字符串值

字符串中的数字

数组

日期和时间戳等非本机值在加载到 VARIANT 列时存储为字符串,因此与存储在具有相应数据类型的关系列中相比,对这些值的操作可能会更慢并且消耗更多空间。

请参阅此链接: https ://docs.snowflake.com/en/user-guide/semistructured-considerations.html#storing-semi-structured-data-in-a-variant-column-vs-flattening-the-nested-结构体


推荐阅读