Join on 1 = 1 on pyspark

Problem description

Good day,

I wish to check if my variable exists in either of two tables and get the result in single table for furture processing. I figured that simple:

'''
   select concent, concent_big from
     (select count(*) as concent from concent where core_id = "{}") as a
   left join
     (select count(*) as concent_big from concent_2 where core_id = "{}") as b
   on 1 = 1
'''

However, this does not seem to be allowed. It's a bit confusing, since I've done similar things in SQL before; now PySpark is giving me a hard time. I came up with a workaround, but it's (imho) silly:

'''
   select concent, concent_big from
     (select count(*) as concent, 1 as tmp_key from concent where core_id = "{}") as a
   left join
     (select count(*) as concent_big, 1 as tmp_key from concent_2 where core_id = "{}") as b
   on a.tmp_key = b.tmp_key
'''

Any ideas how to do this more elegantly?

Tags: pyspark, apache-spark-sql

Solution


Why not just use crossJoin? One word of warning - this produces a full Cartesian product, so your tables may explode in size, but that seems to be exactly the effect you want. You can read about it here: https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#pyspark.sql.DataFrame.crossJoin

Edit: when using Spark SQL, the language follows the ANSI SQL standard, so the command becomes CROSS JOIN.

Hope this helps.

