Apache Flink: is keyBy data distribution a logical or a physical operator?

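Question

According to the Apache Flink documentation, the KeyBy transformation logically partitions a stream into disjoint partitions. All records with the same key are assigned to the same partition.

Is KeyBy a 100% logical transformation? Doesn't it include physical partitioning of the data for distribution across the cluster nodes? If so, how can it guarantee that all records with the same key are assigned to the same partition?

For example, suppose we receive a distributed data stream from an Apache Kafka cluster of n nodes. The Apache Flink cluster running our streaming job consists of m nodes. When the keyBy transformation is applied to the incoming data stream, how does it guarantee logical data partitioning? Or does it involve physical data partitioning across the cluster nodes?

It seems I am confusing logical and physical data partitioning.

Tags: apache-flink, distributed-computing, flink-streaming, data-partitioning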

Answer


The keyspace of all possible keys is divided into some number of key groups. The number of key groups (which is the same as the maximum parallelism) is a configuration parameter you can set for a job; the default value is 128, unless the job's parallelism is high enough that Flink derives a larger value.

Each key belongs to exactly one key group. When a cluster is launched, the key groups are divided among the task managers. If the cluster is started from a checkpoint or savepoint, those snapshots are indexed by key group, and each task manager loads the state for the keys in the key groups it has been assigned.

While a job is running, every task manager knows the key selector functions used to compute the keys, and how keys map onto key groups. The TMs also know the partitioning of key groups to task managers. This makes it straightforward to route each message to the task manager responsible for that message's key.

Details:

The key group that a key belongs to is computed roughly like this:

Object key = keySelector.getKey(record);  // the result of your KeySelector function
int keyHash = key.hashCode();
int keyGroupId = MathUtils.murmurHash(keyHash) % maxParallelism;
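If you want to play with this without a Flink dependency, here is a minimal, self-contained sketch of the same idea. The `scramble` function is a stand-in I made up for illustration; it is not Flink's `MathUtils.murmurHash`, so the key group numbers it prints will generally differ from what Flink computes. The class and method names are likewise hypothetical.

```java
final class KeyGroupAssignmentSketch {

    // Stand-in bit mixer (assumption): NOT Flink's MathUtils.murmurHash. It only
    // plays the same role of scrambling hashCode() into a non-negative int.
    static int scramble(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h & Integer.MAX_VALUE; // clear the sign bit so % never goes negative
    }

    // Maps a key to a key group in the range [0, maxParallelism).
    static int assignToKeyGroup(Object key, int maxParallelism) {
        return scramble(key.hashCode()) % maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;
        for (String key : new String[] {"alice", "bob", "carol"}) {
            System.out.println(key + " -> key group " + assignToKeyGroup(key, maxParallelism));
        }
    }
}
```

The point to notice is that the result depends only on the key's `hashCode()` and on `maxParallelism`, so any node that evaluates it for equal keys gets the same key group.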

The index of the operator instance to which elements from a given key group should be routed, given the actual parallelism and maxParallelism, is computed with integer division as

keyGroupId * parallelism / maxParallelism
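A small worked example may make that formula more concrete. This sketch assumes a maxParallelism of 128 and a parallelism of 4 (values picked for illustration); the class and method names are hypothetical, and only the arithmetic comes from the formula above.

```java
final class OperatorIndexSketch {

    // Integer division rounds down, so consecutive key groups land on the
    // same operator instance until the next multiple is reached.
    static int operatorIndexForKeyGroup(int keyGroupId, int parallelism, int maxParallelism) {
        return keyGroupId * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int parallelism = 4;
        int maxParallelism = 128;
        for (int keyGroupId : new int[] {0, 31, 32, 127}) {
            System.out.println("key group " + keyGroupId + " -> operator instance "
                    + operatorIndexForKeyGroup(keyGroupId, parallelism, maxParallelism));
        }
        // prints:
        // key group 0 -> operator instance 0
        // key group 31 -> operator instance 0
        // key group 32 -> operator instance 1
        // key group 127 -> operator instance 3
    }
}
```

With these numbers each operator instance is responsible for a contiguous range of 32 key groups.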

The actual code is in org.apache.flink.runtime.state.KeyGroupRangeAssignment if you want to take a look.

One major takeaway is that the key groups are disjoint, and they span the keyspace. In other words, it's not possible for a key to come along that doesn't belong to one of the key groups. Every key belongs to exactly one of the key groups, and every key group belongs to one of the task managers.
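To convince yourself of the "disjoint and spanning" property, you can count how many key groups each parallel instance ends up with. The sketch below is illustrative only: it reuses the index formula from above, assumes a maxParallelism of 128 and a parallelism of 3, and uses hypothetical class and method names.

```java
final class KeyGroupCoverageSketch {

    // Counts the key groups owned by each parallel instance. Every key group
    // increments exactly one counter, which is why the ranges are disjoint.
    static int[] keyGroupsPerInstance(int parallelism, int maxParallelism) {
        int[] counts = new int[parallelism];
        for (int keyGroupId = 0; keyGroupId < maxParallelism; keyGroupId++) {
            counts[keyGroupId * parallelism / maxParallelism]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = keyGroupsPerInstance(3, 128);
        int total = 0;
        for (int i = 0; i < counts.length; i++) {
            System.out.println("instance " + i + " owns " + counts[i] + " key groups");
            total += counts[i];
        }
        System.out.println("total = " + total);
        // prints:
        // instance 0 owns 43 key groups
        // instance 1 owns 43 key groups
        // instance 2 owns 42 key groups
        // total = 128
    }
}
```

The counts always add up to maxParallelism, so no key group (and therefore no key) is left without an owner, even when the parallelism does not divide the number of key groups evenly.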

