How persisting derived DataFrames works in Scala and how it affects performance

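Question

Can you explain, using the example below, the effect of persisting and unpersisting a DataFrame in Scala? What effect does persist/unpersist have on derived DataFrames? In the example below, I unpersist dcRawAll because it is no longer used. However, I have read that we should not unpersist a DataFrame until all actions on the DataFrames derived from it have completed, because the cache gets removed (or is never created). (Assume that every DataFrame has some more operations performed on it before it is unpersisted.)

Can you explain the performance impact on the query below? What can be done to optimize it?

Thanks in advance for your help.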

    val dcRawAll = dataframe.select("C1","C2","C3","C4")   //dataframe is persisted
    dcRawAll.persist()

    val statsdcRawAll = dcRawAll.count()

    val dc = dcRawAll.where(col("c1").isNotNull)

    dc.persist()
    dcRawAll.unpersist(false)

    val statsdc = dc.count()

    val dcclean = dc.where(col("c2") === "SomeValue")
    dcclean.persist()
    dc.unpersist()

Tags: scala, performance, dataframe, apache-spark, persist

Answer


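Your code, as written, gets little benefit from caching. Keep in mind that .persist() is lazy: it computes nothing on its own and only marks the DataFrame so that its data is cached the first time an action runs on it. It returns the DataFrame it was called on, so writing dcRawAll.persist() on its own line, as you do, has the same effect as chaining it onto the definition, which is what I do below for brevity.

Even so, the caching does not help the way you hope, because each cache is dropped before anything reads it. Below I annotate your code to explain in more detail what likely happens during execution.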

    import org.apache.spark.sql.functions.col

    //dcRawAll will contain a DataFrame that will be cached after its next action
    val dcRawAll = dataframe.select("C1","C2","C3","C4").persist()

    //after this line, dcRawAll is calculated, then cached
    val statsdcRawAll = dcRawAll.count()

    //dc will contain a DataFrame that will be cached after its next action
    val dc = dcRawAll.where(col("c1").isNotNull).persist()

    //at this point, you've removed the dcRawAll cache without ever having used it,
    //since dc has not had an action performed on it yet
    //if you want to make use of this cache, move the unpersist _after_ the
    //dc.count()
    dcRawAll.unpersist(false)

    //dcRawAll is recalculated from scratch, and then dc is calculated from that
    //and then cached
    val statsdc = dc.count()

    //dcclean will contain a DataFrame that will be cached after its next action
    //(note === for a Column comparison; == would compare the Column object itself)
    val dcclean = dc.where(col("c2") === "SomeValue").persist()

    //at this point, you've removed the dc cache without ever having used it
    //if you perform a dcclean.count() before this, it will utilize the dc cache
    //and stage the cache for dcclean, to be used on some other dcclean action
    dc.unpersist()

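Basically, you need to make sure you do not call .unpersist() on a DataFrame until every DataFrame that depends on it has had its actions performed. Reading up on the difference between transformations and actions (for example in the Spark programming guide) will help you understand why.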
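
The ordering described above could look roughly like the following. This is only a minimal sketch: the local SparkSession, the app name, and the three toy rows standing in for the question's `dataframe` are assumptions made for illustration, and the extra dcclean.count() is there only so that the dc cache is actually read before it is dropped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session and made-up rows replace the question's `dataframe`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("persist-order-sketch")
  .getOrCreate()
import spark.implicits._

val rows: Seq[(String, String, Int, Int)] = Seq(
  ("a", "SomeValue", 1, 10),
  ("b", "Other", 2, 20),
  (null, "SomeValue", 3, 30)
)
val dataframe = rows.toDF("C1", "C2", "C3", "C4")

// Mark dcRawAll for caching; the count() is the action that fills the cache.
val dcRawAll = dataframe.select("C1", "C2", "C3", "C4").persist()
val statsdcRawAll = dcRawAll.count()

// Run the action on dc BEFORE dropping its parent, so dc is built from the
// dcRawAll cache and its own cache gets filled.
val dc = dcRawAll.where(col("C1").isNotNull).persist()
val statsdc = dc.count()
dcRawAll.unpersist(false) // non-blocking; dcRawAll is no longer needed

// Same pattern one level down: act on dcclean first, then drop dc.
val dcclean = dc.where(col("C2") === "SomeValue").persist()
val statsdcclean = dcclean.count()
dc.unpersist()
```

Whether each cache is really populated and reused can be checked in the Storage tab of the Spark UI, or by inspecting a DataFrame's storageLevel. If a DataFrame is only ever used once, it may be simpler not to persist it at all, since filling a cache has its own cost.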

