首页 > 解决方案 > PySpark 最近的值替换组中的所有其他值

问题描述

我们有以下 PySpark 数据框:

+----+----------+----------+----------+---------+
|year|language_1| summary_1|language_2|summary_2|
+----+----------+----------+----------+---------+
|2013|      Java|     Great|    Python| Briliant|
|2014|    Python|   Awesome|     Scala| Horrible|
|2015|    Python|   Amazing|      Java|      Wow|
|2016|    Python|Incredible|       C++|     Nice|
|2017|     Scala|      Good|       C++|    Noway|
|2018|     Scala| Fantastic|       C++|     Cool|
+----+----------+----------+----------+---------+

这个问题有点难解释,请多多包涵。对于 langauges_1 和languages_2 中的所有相同语言,我希望能够使用“year”列作为决胜局调整summary_1 和summary_2 列的值,因此相同的语言应该选择该语言的最大年份的行,并且将 summary_1 和 summary_2 中的所有摘要更改为等于最大年份行的摘要)。因此,例如对于 Python,我希望能够用“Incredible”替换所有摘要,因为带有“Incredible”的行是 Python 最近的一年。等等等等。所以它会导致:

+----+----------+----------+----------+----------+
|year|language_1| summary_1|language_2| summary_2|
+----+----------+----------+----------+----------+
|2013|      Java|       Wow|    Python|Incredible|
|2014|    Python|Incredible|     Scala| Fantastic|
|2015|    Python|Incredible|      Java|       Wow|
|2016|    Python|Incredible|       C++|      Cool|
|2017|     Scala| Fantastic|       C++|      Cool|
|2018|     Scala| Fantastic|       C++|      Cool|
+----+----------+----------+----------+----------+

标签: apache-sparkpysparkapache-spark-sql

解决方案


不确定这是否是最好的方法,但您可以先融化数据框,以便只有 3 列(年份、语言、摘要),使用上一个问题的答案,然后旋转数据框以恢复原始结构:

df2 = df.selectExpr(
    'year', 
    'posexplode(array(struct(language_1 as language, summary_1 as summary), struct(language_2 as language, summary_2 as summary)))'
).select(
    'year', 'pos', 'col.*'
).withColumn(
    'summary', 
    F.max(F.struct('year', 'summary')).over(Window.partitionBy('language'))['summary']
).groupBy('year').pivot('pos').agg(
    F.first(F.struct('language', 'summary'))
).select(
    'year', '0.*', '1.*'
).toDF(*df.columns).orderBy('year')

df2.show()
+----+----------+----------+----------+----------+
|year|language_1| summary_1|language_2| summary_2|
+----+----------+----------+----------+----------+
|2013|      Java|       Wow|    Python|Incredible|
|2014|    Python|Incredible|     Scala| Fantastic|
|2015|    Python|Incredible|      Java|       Wow|
|2016|    Python|Incredible|       C++|      Cool|
|2017|     Scala| Fantastic|       C++|      Cool|
|2018|     Scala| Fantastic|       C++|      Cool|
+----+----------+----------+----------+----------+

推荐阅读