Does a watermark in update output mode clear the stored state in Spark Structured Streaming?

Question

I am developing a Spark Structured Streaming application. While studying sinks and watermark logic, I could not find a clear answer to the following: if I use a watermark with a 10-minute threshold while outputting an aggregation in update output mode, will the intermediate state maintained by Spark be cleared after the 10-minute threshold expires?

Tags: apache-spark, spark-streaming, spark-structured-streaming

Solution


Watermarking allows late-arriving data to be considered for inclusion in already-computed windowed results for a period of time. Its premise is that it tracks a point in time (the threshold) before which no more late events are expected to arrive; if events do arrive after that point, they are discarded.

As a consequence, the state of the windows / aggregates already computed must be maintained in order to handle these potential late updates based on event time. However, this costs resources, and if done indefinitely it would blow up a Structured Streaming application.
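To make this concrete, here is a toy model in plain Python (not Spark's actual implementation; all names are illustrative) of why per-window state must be kept: a late event has to be able to update an aggregate that was already computed, and in update output mode only the changed window would be re-emitted.

```python
from collections import defaultdict

# Toy model: per-window running counts, keyed by window start (in minutes).
WINDOW_MIN = 10

state = defaultdict(int)  # window start -> running count

def process(event_time_min):
    """Assign the event to its 10-minute window and update the count."""
    window_start = (event_time_min // WINDOW_MIN) * WINDOW_MIN
    state[window_start] += 1
    # In update output mode, only this changed window would be emitted.
    return window_start, state[window_start]

process(12)         # window [10, 20) -> count 1
process(25)         # window [20, 30) -> count 1
late = process(14)  # late event still updates window [10, 20) -> count 2
```

Without the retained `state` dictionary, the late event at minute 14 could not be folded into the `[10, 20)` window that was already emitted.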

Will the intermediate state maintained by Spark be cleared after the 10-minute threshold has expired? Yes, it will. This is by design: there is no point holding on to state that can no longer be updated once the threshold has expired.
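The eviction logic can be sketched as follows. This is a simplified pure-Python simulation, not Spark's code (Spark advances the watermark at micro-batch boundaries, and the exact comparison may differ): the watermark is the maximum event time seen so far minus the threshold, windows that end at or before the watermark are dropped, and events older than the watermark are discarded.

```python
# Toy simulation of watermark-driven state eviction.
THRESHOLD_MIN = 10   # the 10-minute watermark threshold from the question
WINDOW_MIN = 10      # 10-minute tumbling windows

state = {}           # window start -> count
max_event_time = 0

def ingest(event_time_min):
    global max_event_time
    watermark = max_event_time - THRESHOLD_MIN
    if event_time_min < watermark:
        return "discarded"  # too late: behind the watermark
    window_start = (event_time_min // WINDOW_MIN) * WINDOW_MIN
    state[window_start] = state.get(window_start, 0) + 1
    max_event_time = max(max_event_time, event_time_min)
    # Evict windows that can no longer receive late events.
    watermark = max_event_time - THRESHOLD_MIN
    for ws in list(state):
        if ws + WINDOW_MIN <= watermark:
            del state[ws]
    return "kept"

ingest(12)           # window [10, 20) created
ingest(35)           # watermark moves to 25 -> window [10, 20) is evicted
result = ingest(14)  # 14 < watermark (25) -> discarded
```

Note that the event at minute 14 would have been a legitimate late update in the earlier model, but once the watermark has passed the end of its window, both the state and the event are gone, which is exactly the trade-off the threshold expresses.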

You should run through some simple examples yourself, as it is easy to forget the subtleties of the output modes.

See Why does streaming query with update output mode print out all rows?, which gives an excellent example of update mode output as well. This also gives a good update example: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html

Even better - this blog with some good graphics: https://towardsdatascience.com/watermarking-in-spark-structured-streaming-9e164f373e9
