首页 > 解决方案 > 需要对带有查找数据框的数据框执行多列连接

问题描述

我有两个这样的数据框

+---+---+---+---+---+
| c1| c2| c3| c4| c5|
+---+---+---+---+---+
|  0|  1|  2|  3|  4|
|  5|  6|  7|  8|  9|
+---+---+---+---+---+
+---+---+
|key|val|
+---+---+
|  0|  A|
|  1|  B|
|  2|  C|
|  3|  D|
|  4|  E|
|  5|  F|
|  6|  G|
|  7|  H|
|  8|  I|
|  9|  J|
+---+---+

我想用 df2 中的等效键查找 df1 上的每一列,并从 df2 返回查找 val。

这是生成两个输入数据帧的代码

df1 = sc.parallelize([('0','1','2','3','4',), ('5','6','7','8','9',)]).toDF(['c1','c2','c3','c4','c5'])
df1.show()
df2 = sc.parallelize([('0','A',), ('1','B', ),('2','C', ),('3','D', ),('4','E',),\
                     ('5','F',), ('6','G', ),('7','H', ),('8','I', ),('9','J',)]).toDF(['key','val'])
df2.show()


I want to join the above to produce the following

+---+---+---+---+---+---+---+---+---+---+
| c1| c2| c3| c4| c5|lu1|lu2|lu3|lu4|lu5|
+---+---+---+---+---+---+---+---+---+---+
|  0|  1|  2|  3|  4|A  |B  |C  |D  |E  |
|  5|  6|  7|  8|  9|F  |G  |H  |I  |J  |
+---+---+---+---+---+---+---+---+--+----+

我可以让它像这样适用于单个列,但我不确定如何将它扩展到所有列

df1.join(df2, df1.c1 == df2.key).select('c1','val').show()

+---+---+
| c1|val|
+---+---+
|  0|  A|
|  5|  F|
+---+---+

标签: dataframejoinpyspark

解决方案


您可以链接连接:

df1
    .join(df2, on=df1.c1 == df2.key, how='left')
    .withColumnRenamed('val', 'lu1') \
    .join(df2, on=df1.c2 == df2.key, how='left) \
    .withColumnRenamed('val', 'lu2') \
    .etc

您甚至可以循环执行此操作,但不要使用太多列:

from pyspark.sql import functions as f

df = df1
for i in range(1, 6):
    df = df \
        .join(df2.alias(str(i)), on=f.col('c{}'.format(i)) == f.col("{}.key".format(i)), how='left') \
        .withColumnRenamed('val', 'lu{}'.format(i))

df \
    .select('c1', 'c2', 'c3', 'c4', 'c5', 'lu1', 'lu2', 'lu3', 'lu4', 'lu5') \
    .show()

输出

+---+---+---+---+---+---+---+---+---+---+
| c1| c2| c3| c4| c5|lu1|lu2|lu3|lu4|lu5|
+---+---+---+---+---+---+---+---+---+---+
|  5|  6|  7|  8|  9|  F|  G|  H|  I|  J|
|  0|  1|  2|  3|  4|  A|  B|  C|  D|  E|
+---+---+---+---+---+---+---+---+---+---+

推荐阅读