apache-spark - Cached DataFrame is refreshed after truncating the table

Problem description

Here are the steps:

scala> val df = sql("select * from table")
df: org.apache.spark.sql.DataFrame = [num: int]

scala> df.cache
res13: df.type = [num: int]

scala> df.collect
res14: Array[org.apache.spark.sql.Row] = Array([10], [10])

scala> df
res15: org.apache.spark.sql.DataFrame = [num: int]

scala> df.show
+---+
|num|
+---+
| 10|
| 10|
+---+


scala> sql("truncate table table")
res17: org.apache.spark.sql.DataFrame = []

scala> df.show
+---+
|num|
+---+
+---+

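My question is: why does df get refreshed? I expected it to stay cached in memory, so that truncating the table would not remove its data.

Any ideas would be greatly appreciated.

Thanks

Tags: apache-spark, apache-spark-sql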

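Solution

You should never rely on cache for correctness. In Spark, cache is a performance optimization, and even with the most defensive StorageLevel (MEMORY_AND_DISK_SER_2) the cached data is not guaranteed to be preserved.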
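For reference, here is a minimal sketch of how a storage level is requested explicitly. The session settings and the sample DataFrame are assumptions made for illustration; in spark-shell the `spark` session already exists and `getOrCreate()` simply returns it. Requesting a replicated, disk-backed level only changes where Spark tries to keep the blocks. It does not turn the cache into a durable copy of the data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Assumed local session for illustration; spark-shell provides `spark` already.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("persist-sketch")
  .getOrCreate()

// Hypothetical sample data standing in for the table in the question.
val df = spark.range(3).toDF("num")

// Ask for the serialized, memory-and-disk, 2x-replicated storage level.
df.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

// Caching is lazy: an action is needed before any blocks are stored.
df.count()

// Shows the level registered for this DataFrame.
println(df.storageLevel)
```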

Code similar to what you used in your question may work in some cases, but do not assume that this is guaranteed or deterministic behavior.
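If the goal is to keep the rows after the source table is truncated, one option is to take an explicit snapshot instead of relying on the cache. The sketch below is only one possible approach and assumes the data is small enough to fit on the driver; the table name `demo_nums` and the session settings are hypothetical. For larger data, writing the snapshot to a separate table or to files would be the analogous step.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

// Assumed local session for illustration; spark-shell provides `spark` already.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("snapshot-sketch")
  .getOrCreate()

// Hypothetical table that mirrors the one in the question.
spark.sql("DROP TABLE IF EXISTS demo_nums")
spark.sql("CREATE TABLE demo_nums (num INT) USING parquet")
spark.sql("INSERT INTO demo_nums VALUES (10), (10)")

val df = spark.table("demo_nums")

// Pull the rows to the driver and rebuild a DataFrame from that local copy.
// The new DataFrame no longer refers to the table's files.
val rows: java.util.List[Row] = df.collectAsList()
val snapshot: DataFrame = spark.createDataFrame(rows, df.schema)

spark.sql("TRUNCATE TABLE demo_nums")

// Reads from the local copy, not from the truncated table.
snapshot.show()
```

Because `snapshot` is built from rows held on the driver, it does not depend on the cache or on the table's current contents.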
