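python - Pyspark: replace DF values when the value is in a list
Problem description
I am trying to write a pyspark script that clears personal information from a pyspark df. My df looks like this: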
hashed_customer firstname lastname email order_id status timestamp
eater 1_uuid 1_firstname 1_lastname 1_email 12345 OPTED_IN 2020-05-14 20:45:15
eater 2_uuid 2_firstname 2_lastname 2_email 23456 OPTED_IN 2020-05-14 20:29:22
eater 3_uuid 3_firstname 3_lastname 3_email 34567 OPTED_IN 2020-05-14 19:31:55
eater 4_uuid 4_firstname 4_lastname 4_email 45678 OPTED_IN 2020-05-14 17:49:27
I have another pyspark df containing the customers that need to be removed from the customer_temp_tb table, which looks like this:
hashed_customer eaterstatus
eater 1_uuid OPTED_OUT
eater 3_uuid OPTED_OUT
If a user is in the second df, I want to remove that user's firstname, lastname, and email from the first df. So far, I have built a list of hashed_customers from the second df with:
cust_opt_out_id = [row.hashed_eater_uuid for row in df_out.collect()]
What I need now is a way to blank out firstname, lastname, and email in the first df whenever the hashed_customer ID is in the second df, so that the end result looks like this:
hashed_customer firstname lastname email order_id status timestamp
eater 1_uuid NaN NaN NaN 12345 OPTED_IN 2020-05-14 20:45:15
eater 2_uuid 2_firstname 2_lastname 2_email 23456 OPTED_IN 2020-05-14 20:29:22
eater 3_uuid NaN NaN NaN 34567 OPTED_IN 2020-05-14 19:31:55
eater 4_uuid 4_firstname 4_lastname 4_email 45678 OPTED_IN 2020-05-14 17:49:27
How can I write a function to do this? I know that in pandas it is simple:
df_cust_out.loc[df_in['hashed_customer'].isin(cust_opt_out_id),['firstname','lastname', 'email']]=np.nan
But this does not work in pyspark.
Solution
To replicate your exact logic, we can do the following (comments inline):
from pyspark.sql import functions as F

# collect the opted-out customer IDs from the second dataframe into a Python list
opt_out_ids = [row[0] for row in df2.select("hashed_customer").collect()]
cols_to_update = ['firstname', 'lastname', 'email']  # list of cols to update
# use when + otherwise in a loop for the cols_to_update
cond = [F.when(F.col('hashed_customer').isin(opt_out_ids), F.lit(None))
         .otherwise(F.col(col)).alias(col)
        for col in cols_to_update]
# select the changed columns and other columns
final = df1.select(*cond, *[a for a in df1.columns if a not in cols_to_update])
# order as the original dataframe
final.select(*df1.columns).show()
+---------------+-----------+----------+-------+--------+--------+-------------------+
|hashed_customer| firstname| lastname| email|order_id| status| timestamp|
+---------------+-----------+----------+-------+--------+--------+-------------------+
| eater 1_uuid| null| null| null| 12345|OPTED_IN|2020-05-14 20:45:15|
| eater 2_uuid|2_firstname|2_lastname|2_email| 23456|OPTED_IN|2020-05-14 20:29:22|
| eater 3_uuid| null| null| null| 34567|OPTED_IN|2020-05-14 19:31:55|
| eater 4_uuid|4_firstname|4_lastname|4_email| 45678|OPTED_IN|2020-05-14 17:49:27|
+---------------+-----------+----------+-------+--------+--------+-------------------+