What is the most efficient way to randomly change values to null values in PySpark?

Problem Description

Trying to figure out how to randomly replace the values of a specific column in PySpark with nulls. So, change a dataframe like this:

| A  | B  |
|----|----|
| 1  | 2  |
| 3  | 4  |
| 5  | 6  |
| 7  | 8  |
| 9  | 10 |
| 11 | 12 |

and randomly change 25% of the values in column "B" to null values:

| A  | B    |
|----|------|
| 1  | 2    |
| 3  | NULL |
| 5  | 6    |
| 7  | NULL |
| 9  | NULL |
| 11 | 12   |

Tags: python-3.x, apache-spark, pyspark

Solution


Thanks to @pault, I was able to answer my own question using the question he posted, which you can find here.

Essentially I ran something like this:

import pyspark.sql.functions as f

# Keep the value in 'Val' when rand() > 0.25; otherwise replace it with null
df1 = df.withColumn('Val', f.when(f.rand() > 0.25, df['Val']).otherwise(f.lit(None)))

This randomly replaces roughly 25% of the values in the column 'Val' with None (null).
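
For completeness, below is a minimal, self-contained sketch applying the same when/otherwise pattern to the column 'B' from the question. The SparkSession setup, the variable names, and the seed passed to f.rand() are assumptions added here; the seed only serves to make the random masking reproducible.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# The example DataFrame from the question
df = spark.createDataFrame(
    [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12)],
    ["A", "B"],
)

# Keep the original value of 'B' when rand() > 0.25, otherwise replace it
# with a null literal. The seed is optional and only makes the run repeatable.
df_masked = df.withColumn(
    "B",
    f.when(f.rand(seed=42) > 0.25, f.col("B")).otherwise(f.lit(None)),
)

df_masked.show()
```

Note that f.rand() draws an independent uniform value per row, so about 25% of the rows end up null in expectation rather than exactly 25%; if an exact count is required, you would need to sample specific rows instead.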

