azure - Databricks stream to batch process

Problem description

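I'm using Databricks and enjoying the Auto Loader feature. Basically, it sets up the infrastructure to consume data in micro-batches. It works well for the initial raw (or bronze) table.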

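Where I get a bit lost is how to append to my other tables: staging (or silver). The most complicated part is the load from staging (silver) into the DW layer (gold). Using the MERGE command is one way to go, but performance may degrade at scale.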
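To make the MERGE step concrete, here is a minimal sketch of the pattern I have in mind: a `foreachBatch` handler that upserts each micro-batch into the gold table. The table `gold.fact_orders`, the view `silver_updates`, the key column `order_id`, and the helper names are all hypothetical, and it assumes PySpark 3.3+ (for `DataFrame.sparkSession`) with a Delta target table.

```python
def build_merge_sql(target_table, source_view, key_columns):
    """Return a Delta Lake MERGE statement that upserts source_view into target_table."""
    if not key_columns:
        raise ValueError("at least one key column is required")
    # Join condition on the business key(s); t = target, s = source.
    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in key_columns)
    return (
        f"MERGE INTO {target_table} AS t "
        f"USING {source_view} AS s "
        f"ON {on_clause} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def upsert_to_gold(batch_df, batch_id):
    """foreachBatch handler: upsert one micro-batch into the (hypothetical) gold table."""
    # Keep one row per key so MERGE does not see several source rows for one target row.
    deduped = batch_df.dropDuplicates(["order_id"])
    deduped.createOrReplaceTempView("silver_updates")
    # Run the MERGE on the session attached to the micro-batch (assumes PySpark 3.3+).
    deduped.sparkSession.sql(
        build_merge_sql("gold.fact_orders", "silver_updates", ["order_id"])
    )


# Wiring, shown as comments because it needs a live Spark session
# (silver_table and gold_checkpoint_path are placeholders):
# (spark.readStream.table("silver_table")
#      .writeStream
#      .foreachBatch(upsert_to_gold)
#      .option("checkpointLocation", gold_checkpoint_path)
#      .trigger(once=True)
#      .start())
```

With `trigger(once=True)` this would behave like a scheduled batch job that only processes rows added to silver since the last checkpoint, so the MERGE source stays small even if the gold table is large.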

I'm looking for best practices for feeding my fact tables with both streaming (micro-batch) and batch loads.

For reference, here is my cloudFiles read configuration:

raw_df = (spark
          .readStream.format("cloudFiles")
          .options(**cloudfile)
          .load(raw_path)
         )
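For context, the `cloudfile` dictionary unpacked above could look roughly like the sketch below. The option keys are standard Auto Loader options, but the values (JSON source format, the schema location path) are placeholders for illustration, not my actual settings.

```python
# Hypothetical Auto Loader options; the values are placeholders.
cloudfile = {
    "cloudFiles.format": "json",                         # format of the incoming files
    "cloudFiles.schemaLocation": "/mnt/raw/_schema",     # where the inferred schema is tracked
    "cloudFiles.inferColumnTypes": "true",               # infer column types instead of all strings
}
```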

Writing with the trigger option (I want to schedule the job with ADF):

autoloader_query = (raw_df.writeStream
                    .format("delta")
                    .trigger(once=True)
                    .option("checkpointLocation", checkpoint_path)
                    .partitionBy("p_date", "p_hour")
                    .table("raw_table")
                   )

# Wait for the Auto Loader query to finish
autoloader_query.awaitTermination()

# Show the progress reported by the Auto Loader query
autoloader_query.recentProgress

I'm looking for best practices for going from streaming to batch. Thanks!

Tags: azure, apache-spark, databricks, azure-databricks, aws-databricks
