apache-spark - Error when querying a Hive table with a map data type in Spark SQL, but the same query works in HiveQL
Problem
I have a Hive table with the following structure:
+---------------+--------------+----------------------+
| column_value | metric_name | key |
+---------------+--------------+----------------------+
| A37B | Mean | {0:"202006",1:"1"} |
| ACCOUNT_ID | Mean | {0:"202006",1:"2"} |
| ANB_200 | Mean | {0:"202006",1:"3"} |
| ANB_201 | Mean | {0:"202006",1:"4"} |
| AS82_RE | Mean | {0:"202006",1:"5"} |
| ATTR001 | Mean | {0:"202007",1:"2"} |
| ATTR001_RE | Mean | {0:"202007",1:"3"} |
| ATTR002 | Mean | {0:"202007",1:"4"} |
| ATTR002_RE | Mean | {0:"202007",1:"5"} |
| ATTR003 | Mean | {0:"202008",1:"3"} |
| ATTR004 | Mean | {0:"202008",1:"4"} |
| ATTR005 | Mean | {0:"202008",1:"5"} |
| ATTR006 | Mean | {0:"202009",1:"4"} |
| ATTR006       | Mean         | {0:"202009",1:"5"}   |
+---------------+--------------+----------------------+
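For reference, a table of this shape can be declared with DDL along these lines (a sketch; the table name `your_data` is the placeholder used in the queries in this post, and the exact column types are inferred from the sample rows, not stated in the question):

```sql
-- Hypothetical DDL matching the sample rows above.
CREATE TABLE your_data (
  column_value STRING,
  metric_name  STRING,
  `key`        MAP<INT, STRING>
);
```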
I need to write a Spark SQL query that filters on the key column with a NOT IN condition over the combination of the two map entries.
The following query runs fine in HiveQL from Beeline:
select * from your_data where key[0] between '202006' and '202009' and key NOT IN ( map(0,"202009",1,"5") );
But when I try the same query in Spark SQL, I get this error:
cannot resolve ... due to data type mismatch: map<int,string>
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:115)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:107)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:277)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
    ... (recursive transformUp/mapChildren frames repeat)
Please help!
Solution
I got the answer from a different question I had asked earlier. The analysis error occurs because Spark SQL does not support equality comparisons (and therefore IN / NOT IN) on map-typed values, while Hive does; the fix is to rewrite the map comparison as comparisons on the individual entries. This query works fine:
select * from your_data where key[0] between '202006' and '202009' and NOT (key[0]='202009' and key[1]='5');
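If more than one key combination has to be excluded, writing a `NOT (... and ...)` clause per combination gets verbose. One alternative sketch (an assumption on my part, not from the accepted answer, and assuming the map always holds string values at keys 0 and 1) is to serialize the two entries into a single string so that NOT IN applies directly:

```sql
-- Sketch: encode the (key[0], key[1]) pair as one string so NOT IN
-- can exclude several combinations at once in Spark SQL.
select *
from your_data
where key[0] between '202006' and '202009'
  and concat(key[0], ':', key[1]) not in ('202009:5', '202008:4');
```

Note that if either entry can be NULL, `concat` yields NULL and the NOT IN predicate filters the row out, so the explicit per-entry comparisons are safer when NULLs are possible.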