Spark allows filtering/selecting a non-existent column

Problem description

I've run into some very strange, or at least unexpected, behavior with Spark SQL library version 2.2.1. I have a simple DataFrame:

val df = spark.read.parquet("...")
df: org.apache.spark.sql.DataFrame = [FIELD_A: string, FIELD_B: string ... 244 more fields]

val tmp = df.select("FIELD_A")
tmp: org.apache.spark.sql.DataFrame = [FIELD_A: string]

If I then do the following, I get unexpected behavior:

// the following line fails, as expected: FIELD_B is not in tmp's schema
tmp.select("FIELD_B").count()

// the following line WORKS, unexpectedly, even though FIELD_B was dropped
tmp.where($"FIELD_B" === "abc").count()

Why can I filter on a column that was not selected and so should not be present in the DataFrame tmp? I could not find any reason or related explanation online.

Thanks!

Tags: scala, apache-spark, apache-spark-sql

Solution

It looks like the optimizer rearranges the select (Project) and where (Filter) stages without first checking whether the original plan is actually valid. You can see this by printing the plans:

scala> case class Foo(FIELD_A: String, FIELD_B: String)
defined class Foo

scala> val tmp = List(Foo("q", "w"), Foo("e", "r")).toDF
tmp: org.apache.spark.sql.DataFrame = [FIELD_A: string, FIELD_B: string]

scala> tmp.select("FIELD_A").where($"FIELD_B" === "r").explain(true)
== Parsed Logical Plan ==
'Filter ('FIELD_B = r)
+- Project [FIELD_A#2]
   +- LocalRelation [FIELD_A#2, FIELD_B#3]

== Analyzed Logical Plan ==
FIELD_A: string
Project [FIELD_A#2]
+- Filter (FIELD_B#3 = r)
   +- Project [FIELD_A#2, FIELD_B#3]
      +- LocalRelation [FIELD_A#2, FIELD_B#3]

== Optimized Logical Plan ==
Project [FIELD_A#2]
+- Filter (isnotnull(FIELD_B#3) && (FIELD_B#3 = r))
   +- LocalRelation [FIELD_A#2, FIELD_B#3]

== Physical Plan ==
*Project [FIELD_A#2]
+- *Filter (isnotnull(FIELD_B#3) && (FIELD_B#3 = r))
   +- LocalTableScan [FIELD_A#2, FIELD_B#3]

scala> tmp.select("FIELD_A").where($"FIELD_B" === "r").collect()
res5: Array[org.apache.spark.sql.Row] = Array([e])
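The Analyzed Logical Plan above shows what happens: during analysis, Spark widens the inner Project to `[FIELD_A, FIELD_B]` so the filter's reference to `FIELD_B` resolves against the underlying relation, and then re-projects `FIELD_A` on top to preserve the declared output. The following is a minimal sketch of that kind of resolution rule, using toy plan classes (these are hypothetical, not Spark's actual API, and the rule name `resolveMissingRefs` is made up for illustration):

```scala
// Toy logical-plan classes, NOT Spark's real API: a sketch of how an
// analyzer rule can resolve a filter column "through" a projection
// that dropped it.
sealed trait Plan { def output: Set[String] }
case class Relation(cols: Set[String]) extends Plan {
  def output: Set[String] = cols
}
case class Project(cols: Set[String], child: Plan) extends Plan {
  def output: Set[String] = cols
}
case class Filter(col: String, child: Plan) extends Plan {
  def output: Set[String] = child.output
}

// Hypothetical resolution rule: if the filter's column is missing from
// the child Project's output but available underneath it, widen the
// Project so the filter resolves, then re-project the original columns
// on top. This reproduces the shape of the Analyzed Logical Plan above.
def resolveMissingRefs(plan: Plan): Plan = plan match {
  case Filter(c, Project(cols, child)) if !cols(c) && child.output(c) =>
    Project(cols, Filter(c, Project(cols + c, child)))
  case other => other
}

val base = Relation(Set("FIELD_A", "FIELD_B"))
val resolved = resolveMissingRefs(Filter("FIELD_B", Project(Set("FIELD_A"), base)))
println(resolved)
```

The net effect is exactly what the question observed: the filter is resolved against columns the projection appeared to have removed, instead of being rejected up front.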
