How to convert a SQL FILTER clause to PySpark

Problem description

I have the following SQL:

freecourse_info_step_8 as (
-- How many questions were answered correctly in that block
select *, 
    count(question_number) FILTER (WHERE answered = true) over(partition by hacker_rank_id, freecourse_version, question_block, freecourse_users_id) as answered_correct_in_block
from freecourse_info_step_7
),

I converted it to PySpark:

column_list = ["hacker_rank_id", "freecourse_version", "question_block", "freecourse_users_id"]
window = Window.partitionBy([f.col(x) for x in column_list])
freecourse_info_step_8 = freecourse_info_step_7.withColumn('answered_correct_in_block',
                                                           f.when(f.col('answered') == True, f.count('question_number').over(window)))

I suspect this code behaves differently from the SQL. Am I right? How do I correctly translate this SQL to PySpark?

PySpark's spark.sql() method does not work with FILTER here.

Tags: sql, apache-spark, pyspark

Solution


freecourse_info_step_8 = freecourse_info_step_7.withColumn(
    'answered_correct_in_block',
    f.count(f.when(f.col('answered'), f.col('question_number'))).over(window)
)

The count function should be outside the condition. when() without an otherwise() returns NULL whenever answered is false, and count() ignores NULLs, so count(when(...)) over the window reproduces the SQL FILTER (WHERE answered = true) semantics. Note that the column must be passed as f.col('question_number'); a bare string in when() is treated as a string literal, not a column reference.

