python - How to aggregate 2 columns into a map in pyspark
Problem description
I have a DataFrame like this:
a = spark.createDataFrame([['Alice', '2020-03-03', '1'], ['Bob', '2020-03-03', '1'], ['Bob', '2020-03-05', '2']], ['name', 'dt', 'hits'])
a.show()
+-----+----------+----+
| name| dt|hits|
+-----+----------+----+
|Alice|2020-03-03| 1|
| Bob|2020-03-03| 1|
| Bob|2020-03-05| 2|
+-----+----------+----+
I want to aggregate the dt and hits columns into a map:
+-----+-----------------------------------+
| name| map                               |
+-----+-----------------------------------+
|Alice| {'2020-03-03': 1}                 |
|  Bob| {'2020-03-03': 1, '2020-03-05': 2}|
+-----+-----------------------------------+
But this code throws an exception:
from pyspark.sql import functions as F
a = a.groupBy(F.col('name')).agg(F.create_map(F.col('dt'), F.col('hits')))
Py4JJavaError: An error occurred while calling o2920.agg.
: org.apache.spark.sql.AnalysisException: expression '`dt`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Aggregate [name#1329], [name#1329, map(dt#1330, hits#1331) AS map(dt, hits)#1361]
+- LogicalRDD [name#1329, dt#1330, hits#1331], false
What am I doing wrong?
Solution
For Spark 2.4+, you can use map_from_arrays like this:
from pyspark.sql import functions as F
a.groupBy("name").agg(
    F.map_from_arrays(F.collect_list("dt"),
                      F.collect_list("hits")).alias("map")
).show(truncate=False)
#+-----+----------------------------------+
#|name |map |
#+-----+----------------------------------+
#|Bob |[2020-03-03 -> 1, 2020-03-05 -> 2]|
#|Alice|[2020-03-03 -> 1] |
#+-----+----------------------------------+