apache-spark - 更新输出模式下的 Watermark 是否会清除 Spark Structured Streaming 中的存储状态?
问题描述
我正在开发一个火花流应用程序,在了解接收器和水印逻辑的同时,我无法找到一个明确的答案,即我是否在以更新输出模式输出聚合时使用具有 10 分钟阈值的水印,是否会间歇性在 10 分钟阈值到期后清除 spark 保持的状态?
解决方案
Watermark allows late arriving data to be considered for inclusion against already computed results for a period of time using windows. Its premise is that it tracks back to a point in time (threshold) before which it is assumed no more late events are supposed to arrive, but if they do, they are discarded.
As a consequence one needs to maintain the state of window / aggregate already computed to handle these potential late updates based on event time. However, this costs resources, and if done infinitely, this would blow up a Structured Streaming App.
Will the intermittent state maintained by spark be cleared off after the 10 min threshold has expired? Yes, it will. There is by design as there is no point holding any longer a state that can no longer be updated due to the threshold having been expired.
You need to run through some simple examples as I note it is easy to forget the subtlety of output.
See Why does streaming query with update output mode print out all rows? which gives an excellent example of update mode output as well. Also this gives an even better update example: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Even better - this blog with some good graphics: https://towardsdatascience.com/watermarking-in-spark-structured-streaming-9e164f373e9
推荐阅读
- c# - 在 ASP .Net 的一个 zip 文件夹中下载多个 csv 文件
- sql - Flink Create View 或 Table as Select
- angular - 每次按下添加购物车按钮时,我都会尝试在产品模板上显示产品数量
- python - Autokeras 在 Trial 1 后消耗所有 GPU
- java - 使用现有声音输入在 Java 中为 Windows 创建麦克风
- python - 模板已就位,但 Flask 在查找 HTML 文件时仍然给出“404 Not Found”页面
- firebase - Future Builder 需要永远在颤动中获取图像
- html - 如何创建一个 HTML 表单,在没有 PHP 的情况下通过电子邮件将响应发送到您的收件箱?
- swift - 队列任务完成前 Swift Spinner 消失
- salesforce-marketing-cloud - 自定义旅程构建器活动短信 - Salesforce Marketing Cloud