首页 > 解决方案 > 加入两个数据帧后的 Pyspark 重复

问题描述

我有两个 Pyspark df

df1

TransID  Date     custusername
1        11/01      1A
2        11/01      1A
3        11/02      1A
4        11/02      1A
5        11/03      1A

df2

custusername   Date    CustID
1A             11/01    xx1
1A             11/02    xx1
1A             11/03    xx2

加入两个数据帧并计数后的所需输出

Date   CustID   Count
11/01   xx1      2
11/02   xx1      2
11/03   xx2      1

我得到的实际输出是

11/01   xx1      2
11/01   xx2      2
11/02   xx1      2
11/02   xx2      2
11/03   xx1      1
11/03   xx2      1

因为 CustID 在 11/03 更新,所以我的计数重复了。

我的代码

join = [df1.custusername == df2.custusername]
joined = df1.join(df2, join, "inner")

标签: pythondataframejoinpyspark

解决方案


有两个数据框:

df1 = spark.createDataFrame([
    (1, "11/01", "1A"),
    (2, "11/01", "1A"),
    (3, "11/02", "1A"),
    (4, "11/02", "1A"),
    (5, "11/03", "1A"),
], schema=['TransId', 'Date', 'custusername'])
df1.show()
+-------+-----+------------+
|TransId| Date|custusername|
+-------+-----+------------+
|      1|11/01|          1A|
|      2|11/01|          1A|
|      3|11/02|          1A|
|      4|11/02|          1A|
|      5|11/03|          1A|
+-------+-----+------------+
df2 = spark.createDataFrame([
    ("1A", "11/01", "xx1"),
    ("1A", "11/02", "xx1"),
    ("1A", "11/03", "xx2"),
], schema=['custusername', 'Date', 'CustId'])
df2.show()
+------------+-----+------+
|custusername| Date|CustId|
+------------+-----+------+
|          1A|11/01|   xx1|
|          1A|11/02|   xx1|
|          1A|11/03|   xx2|
+------------+-----+------+

我将按 和 对第一个 DataFrame 进行Date分组custusername

df1_group = df1.groupBy('Date', 'custusername').count()
df1_group.show()
+-----+------------+-----+
| Date|custusername|count|
+-----+------------+-----+
|11/01|          1A|    2|
|11/03|          1A|    1|
|11/02|          1A|    2|
+-----+------------+-----+

然后只需加入df2

df = df1_group.join(df2, on=['custusername', 'Date'], how='left')
df.show()
+------------+-----+-----+------+
|custusername| Date|count|CustId|
+------------+-----+-----+------+
|          1A|11/01|    2|   xx1|
|          1A|11/03|    1|   xx2|
|          1A|11/02|    2|   xx1|
+------------+-----+-----+------+

推荐阅读