apache-spark - How do I write Spark SQL batch job results to Apache Druid?
Question
I want to write Spark batch result data to Apache Druid. I know Druid has native batch ingestion, such as index_parallel, where Druid runs the Map-Reduce jobs in its own cluster. But I only want to use Druid as a data store: I want to aggregate the data externally, in the Spark cluster, and then send it to the Druid cluster.
Druid has Tranquility for real-time ingestion. I could use Tranquility to send the batch data, but that is not efficient. How can I send batch results to Druid efficiently?
Solution
You can write the results to a Kafka topic and run the Kafka indexing service to index them.
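A minimal sketch of the Spark-to-Kafka step. In a real job you would serialize each result row to JSON and write it with Spark's Kafka sink (roughly `df.selectExpr("to_json(struct(*)) AS value").write.format("kafka")...`); the stdlib code below builds the equivalent message payloads by hand. The row shape, topic name, and broker address are assumptions, not from the answer.

```python
import json

# Hypothetical aggregated rows; in a real job these come from a Spark
# DataFrame, and the Kafka sink would be configured with something like:
#   df.selectExpr("to_json(struct(*)) AS value").write.format("kafka")
#     .option("kafka.bootstrap.servers", "broker:9092")  # assumed address
#     .option("topic", "druid-input")                    # assumed topic
#     .save()
rows = [
    {"timestamp": "2019-11-11T00:00:00Z", "channel": "web", "clicks": 100},
    {"timestamp": "2019-11-11T01:00:00Z", "channel": "app", "clicks": 42},
]

def to_kafka_messages(rows):
    """Serialize each row to a JSON string: one Kafka message per row,
    so Druid's Kafka indexing can parse them with a JSON inputFormat."""
    return [json.dumps(row, sort_keys=True) for row in rows]

messages = to_kafka_messages(rows)
print(messages[0])
```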
We have been using this mechanism for indexing data, and it has no windowPeriod restriction: it accepts even older timestamps. However, if a shard has already been finalized, late data ends up creating new shards in the same segment.
For example, with day-sized segments you can end up with two shards in the same segment: segment-11-11-2019-1 (100 MB) and segment-11-11-2019-2 (10 MB, for data received on 12 Nov with an event time of 11 Nov).
With auto compaction turned on, these two shards will be merged.
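A sketch of what an auto-compaction config for that datasource might look like. The field names follow Druid's coordinator compaction API, but the datasource name, size threshold, and offset are assumptions for illustration:

```python
import json

# Hypothetical auto-compaction config; it would be POSTed to the
# coordinator at /druid/coordinator/v1/config/compaction.
compaction_config = {
    "dataSource": "spark_results",       # assumed datasource name
    "inputSegmentSizeBytes": 419430400,  # only compact segments under ~400 MB
    "skipOffsetFromLatest": "P1D",       # leave the most recent day alone
    "granularitySpec": {"segmentGranularity": "DAY"},
}

print(json.dumps(compaction_config, indent=2))
```

With this in place, the coordinator periodically merges the small late-data shards (like segment-11-11-2019-2 above) into their day segment.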
https://druid.apache.org/docs/latest/development/extensions-core/kafka-ingestion.html
https://druid.apache.org/docs/latest/tutorials/tutorial-compaction.html
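To start the Kafka indexing service for a topic, you submit a supervisor spec to the Overlord. A minimal sketch, assuming JSON messages with a `timestamp` column, one `channel` dimension, and the topic/broker names used above (all assumptions):

```python
import json

# Hypothetical Kafka supervisor spec; it would be POSTed to the Overlord
# at /druid/indexer/v1/supervisor.
supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "spark_results",  # assumed datasource name
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["channel"]},
            "granularitySpec": {
                "segmentGranularity": "DAY",
                "queryGranularity": "HOUR",
            },
        },
        "ioConfig": {
            "topic": "druid-input",  # assumed topic name
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "broker:9092"},
        },
    },
}

print(json.dumps(supervisor_spec)[:60])
```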
Or you can simply accumulate the results in HDFS and then run Hadoop batch ingestion via cron jobs. Auto compaction works well with this option too.
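For the HDFS route, a sketch of an ingestion spec is below. It uses the native index_parallel task (the one the question mentions) with an HDFS input source rather than a classic Hadoop task; the datasource name, path, and schema are assumptions:

```python
import json

# Hypothetical native batch ingestion spec reading Spark output from HDFS.
# A cron job could submit it to the Overlord at /druid/indexer/v1/task,
# e.g.: curl -X POST -H 'Content-Type: application/json' -d @spec.json ...
ingestion_spec = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "spark_results",  # assumed datasource name
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["channel"]},
            "metricsSpec": [
                {"type": "longSum", "name": "clicks", "fieldName": "clicks"}
            ],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "DAY",
                "queryGranularity": "HOUR",
            },
        },
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {
                "type": "hdfs",
                "paths": "hdfs://namenode:8020/output/spark-results/",  # assumed path
            },
            "inputFormat": {"type": "json"},
        },
    },
}

print(json.dumps(ingestion_spec)[:60])
```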