PySpark: count the number of distinct dates an item appears on in a DataFrame

Problem description

Suppose I have a dataframe like this:

date         offer        member
2020-01-01    o1           m1
2020-01-01    o2           m1
2020-01-01    o1           m2
2020-01-01    o2           m2
2020-01-02    o1           m3
2020-01-02    o2           m3
2020-01-03    o1           m4

I need to count the number of distinct days on which each offer appears:

date         offer        member    count
2020-01-01    o1           m1       3
2020-01-01    o2           m1       2
2020-01-01    o1           m2       3
2020-01-01    o2           m2       2
2020-01-02    o1           m3       3
2020-01-02    o2           m3       2
2020-01-03    o1           m4       3

Can someone help me with how to do this in PySpark? I am new to it.

Tags: python, apache-spark, apache-spark-sql

Solution

The idea: take the distinct (date, offer) pairs, count the dates per offer, then join the counts back onto the original rows. The answer below uses the Scala API:

  import spark.implicits._  // needed for toDF and the 'symbol column syntax

  val source1DF = Seq(
    ("2020-01-01", "o1", "m1"),
    ("2020-01-01", "o2", "m1"),
    ("2020-01-01", "o1", "m2"),
    ("2020-01-01", "o2", "m2"),
    ("2020-01-02", "o1", "m3"),
    ("2020-01-02", "o2", "m3"),
    ("2020-01-03", "o1", "m4")
  ).toDF("date", "offer", "member")

  // Distinct (date, offer) pairs, then count dates per offer.
  // groupBy(...).count already names the result column "count".
  val tmp1DF = source1DF.select('date, 'offer).dropDuplicates()
  val tmp2DF = tmp1DF.groupBy("offer").count

  val resultDF = source1DF
    .join(tmp2DF, source1DF.col("offer") === tmp2DF.col("offer"))
    .select(
      source1DF.col("date"),
      source1DF.col("offer"),
      source1DF.col("member"),
      tmp2DF.col("count")
    )

  resultDF.show(false)
  //  +----------+-----+------+-----+
  //  |date      |offer|member|count|
  //  +----------+-----+------+-----+
  //  |2020-01-01|o1   |m1    |3    |
  //  |2020-01-01|o2   |m1    |2    |
  //  |2020-01-01|o1   |m2    |3    |
  //  |2020-01-01|o2   |m2    |2    |
  //  |2020-01-02|o1   |m3    |3    |
  //  |2020-01-02|o2   |m3    |2    |
  //  |2020-01-03|o1   |m4    |3    |
  //  +----------+-----+------+-----+
