How to access a column from a sub query in outer query using Spark SQL

Problem Description

In Spark SQL, I executed the following query

select 
  Retailer_country,
  max(sold) 
from (
  select 
    Retailer_country,
    count(*) as sold 
  from ProductInfo 
  where year=2013 and Product_type='Watches' 
  group by Retailer_country
)

against a dataset that has these columns:

Retailer_country
Order_method_type
Retailer_type
Product_line
Product_type
Product
Year
Quarter
Revenue
Quantity
Gross_margin

It returned this exception:

org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and '__auto_generated_subquery_name.`Retailer_country`' is not an aggregate function. Wrap '(max(__auto_generated_subquery_name.`solds`) AS `max(solds)`)' in windowing function(s) or wrap '__auto_generated_subquery_name.`Retailer_country`' in first() (or first_value) if you don't care which value you get.;;
Aggregate [Retailer_country#10, max(solds#77L) AS max(solds)#80L]
+- SubqueryAlias `__auto_generated_subquery_name`
   +- Aggregate [Retailer_country#10], [Retailer_country#10, count(1) AS solds#77L]
      +- Filter ((cast(year#16 as int) = 2013) && (Product_type#14 = Watches))
         +- SubqueryAlias `productinfo`
            +- Relation[Retailer_country#10,Order_method_type#11,Retailer_type#12,Product_line#13,Product_type#14,Product#15,Year#16,Quarter#17,Revenue#18,Quantity#19,Gross_margin#20] csv

When I executed the same query in an online compiler against the same table structure, it returned

USA|4

Sample Data:

USA|Fax|Outdoors Shop|Camping Equipment|Cooking Gear|TrailChef Deluxe Cook Set|2012|Q1 2012|59628.66|489|0.34754797
USA|Fax|Outdoors Shop|Camping Equipment|Cooking Gear|TrailChef Deluxe Cook Set|2013|Q1 2012|59628.66|489|0.34754797
USA|Fax|Outdoors Shop|Camping Equipment|watch|TrailChef Deluxe Cook Set|2013|Q1 2012|59628.66|489|0.34754797
USA|Fax|Outdoors Shop|Camping Equipment|watch|TrailChef Deluxe Cook Set|2013|Q1 2012|59628.66|489|0.34754797
USA|Fax|Outdoors Shop|Camping Equipment|watch|TrailChef Deluxe Cook Set|2013|Q1 2012|59628.66|489|0.34754797
UK|Fax|Outdoors Shop|Camping Equipment|watch|TrailChef Deluxe Cook Set|2013|Q1 2012|59628.66|489|0.34754797
UK|Fax|Outdoors Shop|Camping Equipment|watch|TrailChef Deluxe Cook Set|2013|Q1 2012|59628.66|489|0.34754797
USA|Fax|Outdoors Shop|Camping Equipment|watch|TrailChef Deluxe Cook Set|2013|Q1 2012|59628.66|489|0.34754797
UK|Fax|Outdoors Shop|Camping Equipment|Cooking Gear|TrailChef Deluxe Cook Set|2013|Q1 2012|59628.66|489|0.34754797
UK|Fax|Outdoors Shop|Camping Equipment|Cooking Gear|TrailChef Deluxe Cook Set|2012|Q1 2012|59628.66|489|0.34754797

Why does the result differ, and how can I display Retailer_country in Spark? Any help solving this is appreciated.

Tags: java, sql, apache-spark, dataset

Solution


You have two aggregates: one for the sub-query and one for the main query. Since the outer query mixes a plain column (Retailer_country) with an aggregate (max(sold)), it needs its own group by clause. The query should look like this:

select 
  Retailer_country,
  max(sold) 
from (
  select 
    Retailer_country,
    count(*) as sold 
  from ProductInfo 
  where year=2013 and Product_type='Watches' 
  group by Retailer_country)
group by Retailer_country
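
For reference, here is a minimal sketch (Scala) of running that query in Spark. It assumes a SparkSession named spark and a hypothetical file path; the pipe delimiter matches the sample data, and the header option assumes the file carries a header row with the column names listed above:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ProductInfo").getOrCreate()

// Hypothetical path; delimiter matches the pipe-separated sample data,
// and "header" assumes the file's first row holds the column names.
val df = spark.read
  .option("header", "true")
  .option("delimiter", "|")
  .csv("/path/to/ProductInfo.csv")

df.createOrReplaceTempView("ProductInfo")

spark.sql("""
  select Retailer_country, max(sold)
  from (
    select Retailer_country, count(*) as sold
    from ProductInfo
    where year = 2013 and Product_type = 'Watches'
    group by Retailer_country
  )
  group by Retailer_country
""").show()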

However, taking a closer look at your query: the subquery already returns one row per distinct Retailer_country because of its group by, so the outer max(sold) aggregate adds nothing. Your final query is effectively just this:

select 
  Retailer_country,
  count(*) as sold 
from ProductInfo 
where year=2013 and Product_type='Watches' 
group by Retailer_country
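
The same query expressed with the DataFrame API, a sketch under the same assumptions as above (the df loaded from CSV):

import org.apache.spark.sql.functions.col

// Same filter and grouping as the SQL above; Spark casts the string
// Year column for the numeric comparison, as seen in the analyzed plan.
df.filter(col("Year") === 2013 && col("Product_type") === "Watches")
  .groupBy("Retailer_country")
  .count()                              // produces a column named "count"
  .withColumnRenamed("count", "sold")
  .show()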

Edit: per the comment, to return only the single country with the highest count:

select 
  Retailer_country,
  count(*) as sold 
from ProductInfo 
where year=2013 and Product_type='Watches' 
group by Retailer_country
order by sold desc
limit 1
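
And a DataFrame API sketch of this top-1 variant, under the same assumptions:

import org.apache.spark.sql.functions.{col, desc}

df.filter(col("Year") === 2013 && col("Product_type") === "Watches")
  .groupBy("Retailer_country")
  .count()
  .orderBy(desc("count"))               // highest count first
  .limit(1)                             // keep only the top country
  .show()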
