
Can you "cast" a PCollection&lt;T&gt; to a PCollection&lt;byte[]&gt; to avoid deserialization?

Problem description

My situation is that I'm writing records out to Kafka after a reshuffle and I'd like to avoid deserializing to my object type and then reserializing to the same byte array before writing out to Kafka. My pipeline is currently CPU bound and it seems like this would be a good way to reduce CPU usage.

I'm using the Flink runner (Flink 1.12). My data comes from Kafka (not controlled by me) in big compressed chunks, and I break each chunk into individual records (the fanout is tens to hundreds of thousands of output records per input record) before writing them to another Kafka topic. To get reasonable throughput I have to reshuffle the output records across multiple workers, but this incurs some deserialization and serialization cost.
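A minimal sketch of that pipeline shape, assuming hypothetical topic and broker names, a `String`-keyed output record, and a stubbed `ExplodeChunkFn` for the fanout step (none of these names are from the question; the output values are shown as raw `byte[]` for brevity, whereas in the question they are domain objects with a Beam coder):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.KV;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ChunkPipeline {

  /** Hypothetical fanout DoFn: one compressed chunk in, 10k-100k records out. */
  static class ExplodeChunkFn extends DoFn<KV<byte[], byte[]>, KV<String, byte[]>> {
    @ProcessElement
    public void process(@Element KV<byte[], byte[]> chunk,
                        OutputReceiver<KV<String, byte[]>> out) {
      // Decompression and record-splitting logic would go here; stubbed out.
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(KafkaIO.<byte[], byte[]>read()
            .withBootstrapServers("broker:9092")           // assumed address
            .withTopic("compressed-chunks")                // assumed topic
            .withKeyDeserializer(ByteArrayDeserializer.class)
            .withValueDeserializer(ByteArrayDeserializer.class)
            .withoutMetadata())
        .apply("ExplodeChunks", ParDo.of(new ExplodeChunkFn()))
        // Spread the fanned-out records across workers for throughput; this is
        // the step that forces a coder encode/decode round trip per record.
        .apply(Reshuffle.viaRandomKey())
        .apply(KafkaIO.<String, byte[]>write()
            .withBootstrapServers("broker:9092")
            .withTopic("records")                          // assumed topic
            .withKeySerializer(StringSerializer.class)
            .withValueSerializer(ByteArraySerializer.class));
    p.run().waitUntilFinish();
  }
}
```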

My Kafka serializer / deserializer actually delegate to the appropriate Beam coder, so the serialized content should be the same for both systems.

Actually, my type is PCollection&lt;KV&lt;K, V&gt;&gt;, and I'd like to cast to PCollection&lt;KV&lt;K, byte[]&gt;&gt;, but if the single-type case is already possible then that may suffice for my needs.
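Since a `PCollection` cannot simply be cast to a different element type, the effect the question is after can be sketched by encoding each value with its coder once, upstream of the reshuffle, so that everything downstream moves plain `byte[]`. The method and parameter names below are illustrative assumptions, not Beam API guarantees:

```java
import org.apache.beam.sdk.coders.ByteArrayCoder;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.CoderException;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.util.CoderUtils;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class EncodeValues {

  /** Pure helper: encode one value with its Beam coder. */
  static <V> byte[] encodeValue(Coder<V> coder, V value) {
    try {
      return CoderUtils.encodeToByteArray(coder, value);
    } catch (CoderException e) {
      throw new RuntimeException(e);
    }
  }

  /**
   * Encode values to byte[] before the reshuffle, so the shuffle and the
   * Kafka sink both handle raw bytes instead of the domain object.
   */
  static <K, V> PCollection<KV<K, byte[]>> encodeValues(
      PCollection<KV<K, V>> input, Coder<K> keyCoder, Coder<V> valueCoder) {
    return input
        .apply("EncodeValues",
            MapElements.via(new SimpleFunction<KV<K, V>, KV<K, byte[]>>() {
              @Override
              public KV<K, byte[]> apply(KV<K, V> kv) {
                // One coder pass per record, before the shuffle; downstream
                // stages only copy bytes.
                return KV.of(kv.getKey(), encodeValue(valueCoder, kv.getValue()));
              }
            }))
        .setCoder(KvCoder.of(keyCoder, ByteArrayCoder.of()));
  }
}
```

Downstream of the reshuffle the Kafka sink can then write with `ByteArraySerializer`, so (given the coder-delegating Kafka serializer described above) the bytes on the wire are unchanged and the decode-then-re-encode round trip after the shuffle disappears.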

Tags: java, serialization, apache-kafka, apache-beam

Solution

