How to filter array column values in PySpark

Problem description

I have a PySpark DataFrame with many columns, including an array-type column and a string column:

numbers  <Array>              |    name<String>
------------------------------|----------------
["160001","160021"]           |     A
------------------------------|----------------
["160001","1600", "42345"]    |     B
------------------------------|----------------
["160001","9867", "42345"]    |     C
------------------------------|----------------
["160001","8650", "2345"]     |     A
------------------------------|----------------
["2456","78568", "42345"]     |     B
-----------------------------------------------

I want to drop the 4-digit values from the numbers column when the name column is not "B", and keep them when the name column is "B". For example:

In rows 2 and 5, "1600" and "2456" have 4 digits and the name column is "B", so they should stay in the column value:

------------------------------|----------------
["160001","1600", "42345"]    |     B
------------------------------|----------------
["2456","78568", "42345"]     |     B
-----------------------------------------------

In rows 3 and 4, the numbers column contains 4-digit values but the name is not "B", so those values should be dropped.

Example:

------------------------------|----------------
["160001","9867", "42345"]    |     C
------------------------------|----------------
["160001","8650", "2345"]     |     A
------------------------------|----------------

Expected result:

numbers  <Array>              |    name<String>
------------------------------|----------------
["160001","160021"]           |     A
------------------------------|----------------
["160001","1600", "42345"]    |     B
------------------------------|----------------
["160001", "42345"]           |     C
------------------------------|----------------
["160001"]                    |     A
------------------------------|----------------
["2456","78568", "42345"]     |     B
-----------------------------------------------

How can I do this? Thanks.

Tags: apache-spark, pyspark, apache-spark-sql

Solution


Since Spark 2.4, you can use the higher-order function FILTER to filter an array. Combining it with an if expression should solve the problem:

df.selectExpr("if(name != 'B', FILTER(numbers, x -> length(x) != 4), numbers) AS numbers", "name")
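
To sanity-check the rule against the sample rows, or as a possible fallback on Spark versions before 2.4 where FILTER is not available, the same logic can be sketched in plain Python. The names `filter_numbers` and `make_spark_udf` are hypothetical helpers introduced here, not part of the original answer, and the sketch assumes the array elements are non-null strings:

```python
def filter_numbers(numbers, name):
    """Drop 4-character entries unless name is 'B'.

    Hypothetical helper mirroring the SQL expression above.
    Assumes every element of `numbers` is a non-null string.
    """
    # Keep the array untouched for name == 'B'. A null name is also left
    # untouched, matching SQL where `name != 'B'` is not true for NULL.
    if numbers is None or name is None or name == "B":
        return numbers
    return [x for x in numbers if len(x) != 4]


def make_spark_udf():
    """Wrap filter_numbers as a Spark UDF (hypothetical helper).

    PySpark is imported lazily so the pure function above can be
    exercised without a Spark installation.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    return udf(filter_numbers, ArrayType(StringType()))


# Possible usage on a DataFrame `df` with columns `numbers` and `name`:
#   from pyspark.sql.functions import col
#   df = df.withColumn("numbers", make_spark_udf()(col("numbers"), col("name")))
```

A Python UDF is generally slower than the built-in FILTER expression, so this would only be worth considering when the higher-order function cannot be used.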
