How to loop one dataframe over another and get a single matching record in PySpark

Problem description

**Dataframe 1**

 +---+--------+------------+
 |key|dc_count|dc_day_count|
 +---+--------+------------+
 |123|13      |66          |
 |123|13      |12          |
 +---+--------+------------+


**Rule dataframe**

 +---+-------------+--------------+--------+
 |key|rule_dc_count|rule_day_count|rule_out|
 +---+-------------+--------------+--------+
 |123|2            |30            |139     |
 |123|null         |null          |64      |
 |124|2            |30            |139     |
 |124|null         |null          |64      |
 +---+-------------+--------------+--------+

If dc_count > rule_dc_count and dc_day_count > rule_day_count, take the corresponding rule_out;
otherwise take the other rule_out (the row whose rule columns are null).

Expected output

 +---+--------+
 |key|rule_out|
 +---+--------+
 |123|139     |
 |124|64      |
 +---+--------+

Tags: python, apache-spark, pyspark

Solution


Assuming the expected output is:

+---+--------+
|key|rule_out|
+---+--------+
|123|139     |
+---+--------+

the following Spark SQL query (written here in Scala) should work:

spark.sql(
  """
    |SELECT
    |  t1.key, t2.rule_out
    |FROM table1 t1 JOIN table2 t2
    |  ON t1.key = t2.key
    |  AND t1.dc_count > t2.rule_dc_count
    |  AND t1.dc_day_count > t2.rule_day_count
  """.stripMargin)
  .show(false)
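
Since the question is tagged pyspark, here is a minimal PySpark sketch of the same join. The dataframe names df1/df2 are assumptions, and the sample data is recreated from the question so the snippet is self-contained:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data from the question (df1/df2 are assumed names).
df1 = spark.createDataFrame(
    [(123, 13, 66), (123, 13, 12)],
    ["key", "dc_count", "dc_day_count"],
)
df2 = spark.createDataFrame(
    [(123, 2, 30, 139), (123, None, None, 64),
     (124, 2, 30, 139), (124, None, None, 64)],
    ["key", "rule_dc_count", "rule_day_count", "rule_out"],
)

# Register temp views so the SQL can reference them as table1/table2.
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")

spark.sql("""
    SELECT t1.key, t2.rule_out
    FROM table1 t1
    JOIN table2 t2
      ON t1.key = t2.key
     AND t1.dc_count > t2.rule_dc_count
     AND t1.dc_day_count > t2.rule_day_count
""").show(truncate=False)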

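Note that the join above only produces the row for key 123. To also emit the fallback rule_out (64) for key 124, as the expected output in the question shows, one possible DataFrame-API sketch (reusing the assumed df1/df2 above) is:

from pyspark.sql import functions as F

# Rule rows whose thresholds are beaten by some data row keep their rule_out.
matched = (
    df2.join(
        df1,
        (df2.key == df1.key)
        & (df1.dc_count > df2.rule_dc_count)
        & (df1.dc_day_count > df2.rule_day_count),
        "left_semi",  # keep only df2 rows that found a matching data row
    )
    .select("key", "rule_out")
)

# Keys with no threshold match fall back to the null-threshold row.
fallback = (
    df2.filter(F.col("rule_dc_count").isNull())
    .select("key", "rule_out")
    .join(matched, "key", "left_anti")  # drop keys already matched
)

matched.unionByName(fallback).show()

The left_semi join keeps each qualifying rule row at most once even when several data rows beat its thresholds, which keeps the output to a single record per key.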