Pyspark - Selecting rows in dataframe based on values that exist in another dataframe

Problem description

Assume these two pyspark dataframes:

dfA

id
1
2
3
4

dfB

src,dst
2  ,3
1  ,3
3  ,4
4  ,1
7  ,3
1  ,8

How can I get this desired output:

resultDf

src,dst
2  ,3
1  ,3
3  ,4
4  ,1

Basically, I want to select the rows of dfB whose src and dst values both appear in dfA.

Tags: python, pyspark

Solution


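I was able to do this with spark.sql, after registering both dataframes as temporary views named dfA and dfB (for example with createOrReplaceTempView):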

resultDf = spark.sql("SELECT * FROM dfB WHERE src IN (SELECT id FROM dfA) AND dst IN (SELECT id FROM dfA)")
