Processing JSON data from Kafka using Structured Streaming

Problem Description

I want to convert incoming JSON data from Kafka into a DataFrame.

I am using Structured Streaming with Scala 2.12.

Most people add a hard-coded schema, but if the JSON can have additional fields, that requires changing the code base every time, which is tedious.

One approach is to write the data to a file and infer the schema from it, but I would rather avoid doing that.

Is there any other way to approach this problem?

Edit: I found a way to turn a JSON string into a DataFrame, but I can't extract that string from the stream source. Is it possible to extract it?
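For reference, extracting the raw JSON payload from the stream source is usually just a cast of the Kafka value column; here is a minimal sketch (the topic name and bootstrap server are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-json").getOrCreate()

    // Kafka delivers each record's value as binary; cast it to a string
    // to get the raw JSON payload out of the streaming source.
    val jsonStrings = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(value AS STRING) AS json")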

Tags: json, scala, apache-kafka, schema, spark-structured-streaming

Solution


  1. One way is to store the schema itself in the message headers (not in the key or value).

    Though this increases the message size, it makes it easy to parse the JSON value without any external resource such as a file or a schema registry (see the first sketch after this list).

    New messages can carry new schemas while old messages are still processed with their old ones, because each schema travels inside the message it describes.

  2. Alternatively, you can version the schemas and include an id for every schema in the message headers, or a magic byte in the key or value, and resolve the schema from there (see the second sketch below).

    This is the approach followed by the Confluent Schema Registry. It lets you step through different versions of the same schema and see how it has evolved over time.
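
A minimal sketch of the first approach, assuming the producer stored the output of Spark's schema.json in a header named "schema" (the header key, topic name, and bootstrap server are all assumptions). It relies on the includeHeaders option of the Kafka source, available since Spark 3.0:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.{DataType, StructType}

    object HeaderSchemaStream {

      // Parse one micro-batch: records may carry different schemas, so group
      // by the schema header and parse each group with its own schema.
      def parseBatch(batch: DataFrame, batchId: Long): Unit = {
        import batch.sparkSession.implicits._

        // Headers arrive as array<struct<key: string, value: binary>>.
        val withSchema = batch.select(
          $"value".cast("string").as("json"),
          expr("filter(headers, h -> h.key = 'schema')[0].value")
            .cast("string").as("schemaJson")
        )

        withSchema.select($"schemaJson").distinct().collect().foreach { row =>
          val schemaJson = row.getString(0)
          val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]
          withSchema
            .filter($"schemaJson" === schemaJson)
            .select(from_json($"json", schema).as("data"))
            .select("data.*")
            .show(truncate = false) // replace with the real sink
        }
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("header-schema-demo").getOrCreate()

        spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .option("includeHeaders", "true") // Spark 3.0+ exposes Kafka headers
          .load()
          .writeStream
          .foreachBatch(parseBatch _)
          .start()
          .awaitTermination()
      }
    }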
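
And a sketch of the second approach. The LocalSchemaRegistry object and the "schemaId" header key are hypothetical stand-ins; a real setup would resolve ids against the Confluent Schema Registry or similar. The parseBatch function plugs into the same foreachBatch wiring as above (the stream must still be read with includeHeaders set to true):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.{DataType, StructType}

    // Hypothetical versioned registry: schema id -> Spark schema. In practice
    // this lookup would hit a real schema registry instead of a local Map.
    object LocalSchemaRegistry {
      private val schemas: Map[String, StructType] = Map(
        "1" -> DataType.fromJson(
          """{"type":"struct","fields":[
            |  {"name":"id","type":"long","nullable":false,"metadata":{}},
            |  {"name":"name","type":"string","nullable":true,"metadata":{}}
            |]}""".stripMargin).asInstanceOf[StructType],
        "2" -> DataType.fromJson(
          """{"type":"struct","fields":[
            |  {"name":"id","type":"long","nullable":false,"metadata":{}},
            |  {"name":"name","type":"string","nullable":true,"metadata":{}},
            |  {"name":"email","type":"string","nullable":true,"metadata":{}}
            |]}""".stripMargin).asInstanceOf[StructType]
      )
      def lookup(id: String): StructType = schemas(id)
    }

    // Inside foreachBatch: read the "schemaId" header, look up the schema,
    // and parse each group of records with its own version.
    def parseBatch(batch: DataFrame, batchId: Long): Unit = {
      import batch.sparkSession.implicits._
      val withId = batch.select(
        $"value".cast("string").as("json"),
        expr("filter(headers, h -> h.key = 'schemaId')[0].value")
          .cast("string").as("schemaId")
      )
      withId.select($"schemaId").distinct().collect().foreach { row =>
        val id = row.getString(0)
        withId.filter($"schemaId" === id)
          .select(from_json($"json", LocalSchemaRegistry.lookup(id)).as("data"))
          .select("data.*")
          .show(truncate = false) // replace with the real sink
      }
    }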

