python - Removing null values from an array after merging columns - pyspark
Problem description
I have this PySpark DataFrame, df:
+---------+----+----+----+----+----+----+----+----+----+
|partition|   1|   2|   3|   4|   5|   6|   7|   8|   9|
+---------+----+----+----+----+----+----+----+----+----+
|        7|null|null|null|null|null|null| 0.7|null|null|
|        1| 0.2| 0.1| 0.3|null|null|null|null|null|null|
|        8|null|null|null|null|null|null|null| 0.8|null|
|        4|null|null|null| 0.4| 0.5| 0.6|null|null| 0.9|
+---------+----+----+----+----+----+----+----+----+----+
whose columns I combined into a single array column:
+---------+--------------------+
|partition|            vec_comb|
+---------+--------------------+
|        7|      [,,,,,,,, 0.7]|
|        1|[,,,,,, 0.1, 0.2,...|
|        8|      [,,,,,,,, 0.8]|
|        4|[,,,,, 0.4, 0.5, ...|
+---------+--------------------+
How can I remove the null values from the arrays in the vec_comb column?
Expected output:
+---------+--------------------+
|partition|            vec_comb|
+---------+--------------------+
|        7|               [0.7]|
|        1|     [0.1, 0.2, 0.3]|
|        8|               [0.8]|
|        4|[0.4, 0.5, 0.6, 0.9]|
+---------+--------------------+
I've tried this (obviously wrong, but I can't wrap my head around it):
def clean_vec(array):
    new_Array = []
    for element in array:
        if type(element)==FloatType():
            new_Array.append(element)
    return new_Array

udf_clean_vec = F.udf(f=(lambda c: clean_vec(c)), returnType=ArrayType(FloatType()))
df = df.withColumn('vec_comb_cleaned', udf_clean_vec('vec_comb'))
Solution
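Without using PySpark-specific features, you can also build the list by filtering out the NaN values directly. Note that this operates on a pandas DataFrame (for example, the result of toPandas()), not on the Spark DataFrame itself: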
import pandas as pd
# For each row, keep only the non-NaN values of columns 1..9 as a list
df['vec_comb'] = df.iloc[:, 1:10].apply(lambda r: list(filter(pd.notna, r)), axis=1)
df
# Output:
   partition    1    2    3    4    5    6    7    8    9              vec_comb
0          7  NaN  NaN  NaN  NaN  NaN  NaN  0.7  NaN  NaN                 [0.7]
1          1  0.2  0.1  0.3  NaN  NaN  NaN  NaN  NaN  NaN       [0.2, 0.1, 0.3]
2          8  NaN  NaN  NaN  NaN  NaN  NaN  NaN  0.8  NaN                 [0.8]
3          4  NaN  NaN  NaN  0.4  0.5  0.6  NaN  NaN  0.9  [0.4, 0.5, 0.6, 0.9]
Then drop the old columns by selecting only the two you want:
df = df[['partition', 'vec_comb']]
df
# Output:
   partition              vec_comb
0          7                 [0.7]
1          1       [0.2, 0.1, 0.3]
2          8                 [0.8]
3          4  [0.4, 0.5, 0.6, 0.9]
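To stay in PySpark instead, the UDF attempted in the question can probably be made to work by testing for None rather than comparing against FloatType(): inside a Python UDF, nulls arrive as None and values as plain Python floats, so type(element)==FloatType() is never true. A minimal sketch, assuming vec_comb is an array of numbers; add_cleaned_column is a hypothetical helper name, and PySpark is imported lazily so the filtering logic can be checked without a Spark session:

```python
def clean_vec(array):
    """Return the non-null elements of `array`, keeping their order."""
    if array is None:  # the array value itself may be null for a row
        return None
    return [element for element in array if element is not None]


def add_cleaned_column(df, src="vec_comb", dst="vec_comb_cleaned"):
    """Hypothetical helper: apply clean_vec to a Spark array column via a UDF."""
    # Imported here so clean_vec stays usable without PySpark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, FloatType

    udf_clean_vec = F.udf(clean_vec, returnType=ArrayType(FloatType()))
    return df.withColumn(dst, udf_clean_vec(F.col(src)))


# Plain-Python check of the filtering logic on rows like those in the question
print(clean_vec([None, None, 0.7]))      # [0.7]
print(clean_vec([None, 0.1, 0.2, 0.3]))  # [0.1, 0.2, 0.3]
```

With a Spark DataFrame this would be called as df = add_cleaned_column(df). Spark 2.4 and later also provide a SQL higher-order function filter, so F.expr("filter(vec_comb, x -> x is not null)") is a likely UDF-free alternative.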