apache-spark - Spark 2.2 cannot handle map columns in aggregation expressions
Question
How can I use GROUP BY or DISTINCT with a complex-type column that contains a map?
case class Foo(id: Int, stuff: Map[String, Int])
val xx = Seq(
  Foo(1, Map("first" -> 1, "second" -> 2)),
  Foo(1, Map("first" -> 1, "second" -> 2)),
  Foo(3, Map("fourth" -> 4, "fifth" -> 5))
).toDF  // toDF needs spark.implicits._ (already in scope in spark-shell)
xx.distinct.show
xx.groupBy("id", "stuff").count.show
The error is:
expression `stuff` cannot be used as a grouping expression because its data type map<string,int> is not an orderable data type
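Perhaps this is fixed in Spark 2.4?
However, I am currently limited to 2.2. Is there a solution for 2.2?
Could the column be converted to JSON instead? I need a structure in which each record can have different fields (Spark creates the struct/JSON dynamically for each group).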
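As a rough sketch of what such a JSON conversion could look like, the helper below serializes a `Map[String, Int]` to a JSON string with its keys in sorted order, so that two equal maps always yield the same string, and a string column can be grouped on. The names `MapJson` and `canonicalJson` are hypothetical, and the escaping is deliberately minimal (an assumption: keys contain no control characters).

```scala
// Pure Scala helper; no Spark dependency is needed to try it out.
object MapJson {
  // Escape backslashes and double quotes only. Other control characters
  // are not handled in this sketch.
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  // Serialize the map as a JSON object with keys in sorted order, so that
  // equal maps always produce the same string.
  def canonicalJson(m: Map[String, Int]): String =
    m.toSeq
      .sortBy(_._1)
      .map { case (k, v) => quote(k) + ":" + v }
      .mkString("{", ",", "}")
}
```

With Spark on the classpath, this could presumably be wrapped as a UDF, for example `val toKey = udf(MapJson.canonicalJson _)`, and used as `xx.withColumn("stuff_json", toKey(col("stuff"))).groupBy("id", "stuff_json").count`. Sorting the keys matters because the entry order of a map is not guaranteed to be the same from row to row.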
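Edit
- Manually serializing to JSON is a workaround (but a rather clumsy one)
- Instead of a map-typed column, I could also use an array of a custom case class, i.e. Seq[Foo]; case class Foo(column:String, column_value:String, value:String). This allows DISTINCT to work, but the format seems rather unintuitive for any third party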
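The array-of-case-class workaround can be sketched in plain Scala as follows. This is an illustration under assumptions rather than the code used in the question: the `Entry` record and the `toEntries`/`fromEntries` helpers are hypothetical names, and the entries are sorted by key so that equal maps turn into equal sequences.

```scala
object MapEntries {
  // Hypothetical record type for one map entry; field names are illustrative.
  case class Entry(key: String, value: Int)

  // Turn a map into entries sorted by key, so equal maps give equal sequences.
  def toEntries(m: Map[String, Int]): Seq[Entry] =
    m.toSeq.sortBy(_._1).map { case (k, v) => Entry(k, v) }

  // Rebuild the original map from its entries, e.g. after grouping.
  def fromEntries(es: Seq[Entry]): Map[String, Int] =
    es.map(e => e.key -> e.value).toMap
}
```

In Spark, something along the lines of `xx.as[Foo].map(f => (f.id, MapEntries.toEntries(f.stuff))).toDF("id", "stuff").distinct` would then operate on an array-of-struct column instead of a map. A consumer that prefers the map shape can convert back with `fromEntries` after the aggregation.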