How to write Spark SQL batch job results into Apache Druid?

Problem Description

I want to write the results of a Spark batch job into Apache Druid. I know Druid has native batch ingestion, such as index_parallel, and that Druid runs those Map-Reduce-style jobs inside its own cluster. But I only want to use Druid as a data store: I want to aggregate the data in Spark, outside the Druid cluster, and then send the results to the Druid cluster.

Druid has Tranquility for real-time ingestion. I could use Tranquility to send the batch data, but that would not be efficient. How can I send batch results to Druid efficiently?

Tags: apache-spark, apache-spark-sql, druid

Solution


You can write the results to a Kafka topic and run a Kafka indexing job (supervisor) to ingest them.
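As a rough sketch of the Spark side, assuming a hypothetical aggregation with an event_time timestamp column and a placeholder topic named spark_batch_results, the batch result can be serialized to JSON and written to Kafka like this (the spark-sql-kafka connector has to be on the classpath for the kafka sink to be available):

```scala
// Hypothetical sketch: publish a Spark SQL batch result to a Kafka topic so a
// Druid Kafka indexing supervisor can ingest it. The input path, topic name,
// broker list and column names are placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, to_json}

object ResultsToKafka {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("aggregate-and-publish").getOrCreate()

    // Any batch aggregation; Druid will need a timestamp column (event_time here).
    val result = spark.read.parquet("hdfs:///path/to/input")
      .groupBy("event_time", "country")
      .count()

    // The Kafka sink expects a string/binary "value" column, so serialize each row as JSON.
    result
      .select(to_json(struct(result.columns.map(result(_)): _*)).alias("value"))
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-1:9092,kafka-2:9092")
      .option("topic", "spark_batch_results")
      .save()

    spark.stop()
  }
}
```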

We have been using this mechanism for indexing data, and there is no windowPeriod restriction in it (unlike Tranquility): it accepts even older timestamps. However, if a shard is already finalized, such late data ends up creating new shards in the same segment interval.

For example, with day-sized segments I end up with two shards in that segment interval: segment-11-11-2019-1 (100 MB) and segment-11-11-2019-2 (10 MB, for data received on 12 Nov with an event time of 11 Nov).

With auto compaction turned on, these two shards will be merged.

https://druid.apache.org/docs/latest/development/extensions-core/kafka-ingestion.html
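For reference, the supervisor described in that doc is created by POSTing a spec to the Overlord. Below is a minimal sketch for the hypothetical topic above; the datasource, column, host and port values are assumptions, and the exact spec layout varies between Druid versions, so treat it only as an outline:

```scala
// Hypothetical sketch: submit a minimal Kafka supervisor spec to the Druid Overlord.
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object SubmitKafkaSupervisor {
  // Datasource, topic, column and broker names are placeholders; a real spec
  // would usually also declare metrics and tuning options.
  val spec: String =
    """{
      |  "type": "kafka",
      |  "spec": {
      |    "dataSchema": {
      |      "dataSource": "spark_batch_results",
      |      "timestampSpec": { "column": "event_time", "format": "auto" },
      |      "dimensionsSpec": { "dimensions": ["country"] },
      |      "granularitySpec": { "segmentGranularity": "DAY", "queryGranularity": "NONE" }
      |    },
      |    "ioConfig": {
      |      "topic": "spark_batch_results",
      |      "inputFormat": { "type": "json" },
      |      "consumerProperties": { "bootstrap.servers": "kafka-1:9092,kafka-2:9092" }
      |    },
      |    "tuningConfig": { "type": "kafka" }
      |  }
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    // The Overlord keeps the supervisor running; it reads the topic and
    // hands off segments as they fill up.
    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://overlord:8090/druid/indexer/v1/supervisor"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(spec))
      .build()

    val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    println(s"${response.statusCode()} ${response.body()}")
  }
}
```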

https://druid.apache.org/docs/latest/tutorials/tutorial-compaction.html
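Automatic compaction itself is enabled per datasource on the Coordinator. A minimal sketch for the hypothetical datasource above (the host, port and skipOffsetFromLatest value are assumptions):

```scala
// Hypothetical sketch: enable automatic compaction for one datasource by
// POSTing a compaction config to the Druid Coordinator.
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object EnableAutoCompaction {
  def main(args: Array[String]): Unit = {
    // skipOffsetFromLatest keeps compaction away from the most recent day so it
    // does not race with segments the Kafka supervisor is still writing.
    val config =
      """{ "dataSource": "spark_batch_results", "skipOffsetFromLatest": "P1D" }"""

    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://coordinator:8081/druid/coordinator/v1/config/compaction"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(config))
      .build()

    val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    println(response.statusCode())
  }
}
```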

Or you can simply accumulate the results in HDFS and then use Hadoop batch ingestion driven by cron jobs. Auto compaction works well for this option too.
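A minimal sketch of that path, again with placeholder locations: the Spark job lands each day's results under a dated HDFS directory, and a scheduled (e.g. cron-driven) script then submits a batch ingestion task whose input points at that directory.

```scala
// Hypothetical sketch: write the aggregated results to HDFS; a scheduled job
// later submits a Hadoop (index_hadoop) or native (index_parallel) ingestion
// task referencing this path to the Overlord at /druid/indexer/v1/task.
import org.apache.spark.sql.SparkSession

object ResultsToHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("aggregate-to-hdfs").getOrCreate()

    // Same hypothetical aggregation as above; only the sink changes.
    val result = spark.read.parquet("hdfs:///path/to/input")
      .groupBy("event_time", "country")
      .count()

    // One directory per day keeps the ingestion spec simple: the task's input
    // just points at the day's path. The date would normally be a job parameter.
    result.write
      .mode("overwrite")
      .json("hdfs:///staging/druid/spark_batch_results/dt=2019-11-11")

    spark.stop()
  }
}
```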

