How to drop rows of a pyspark dataframe if they're in another dataframe based on the values from two columns?

Problem description

I have two dataframes: one with user and item columns, and another with all user-item pairs and their scores. Their schemas are user | item and user | item | item2 | rating2 | score.

I want to remove all rows from the second dataframe whose user and item appear in the first dataframe. I can't use subtract, since the two dataframes don't have the same number of columns.

Is this something that could be accomplished with an anti join?

Tags: python, sql, python-3.x, apache-spark, pyspark

Solution


Yes, a left anti join does exactly this: it keeps only the rows of df2 that have no matching (user, item) pair in df1.

df2.join(df1, on=['user', 'item'], how="left_anti")
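To see what the anti join computes, here is a minimal pure-Python sketch of the same filtering logic, using made-up sample data (the column values are hypothetical; PySpark performs the equivalent filtering, distributed, when you pass how="left_anti"):

```python
# Left-anti-join semantics in plain Python (hypothetical sample data).
# df1: (user, item) pairs whose rows should be dropped from df2.
df1_rows = [(1, "a"), (2, "b")]

# df2: (user, item, item2, rating2, score) rows to be filtered.
df2_rows = [
    (1, "a", "x", 4.0, 0.9),  # (1, "a") is in df1 -> dropped
    (1, "c", "y", 3.0, 0.5),  # no match in df1    -> kept
    (2, "b", "z", 5.0, 0.8),  # (2, "b") is in df1 -> dropped
    (3, "a", "x", 2.0, 0.1),  # no match in df1    -> kept
]

# Keep only rows whose (user, item) key does NOT appear in df1.
exclude = set(df1_rows)
result = [row for row in df2_rows if (row[0], row[1]) not in exclude]
print(result)
```

Note that the anti join only needs the join keys to match, so the differing number of columns between the two dataframes is not a problem.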
