首页 > 解决方案 > SQL BigQuery 资源超出了对大表(1 亿行)进行 LEFT JOIN 的操作,我应该怎么做才能获得良好的性能/优化?

问题描述

我想从我的客户表(1 亿行)和我的活动表(2000 行)中检索信息。我一直将活动表用作LEFT 表,因为我想要那里的所有数据(我知道这COUNT(DISTINCT)对良好的性能没有帮助,但我不知道 DISTINCT 之外的其他子句)。

我一直在使用这个查询:

SELECT 
  COUNT(DISTINCT
    CASE
     WHEN CLIENT.date <= CAMP.end_date AND CLIENT.date >= CAMP.start_date AND CLIENT.freq <= 1 THEN client_id
END
) AS new_client,

FROM CAMP
 LEFT JOIN CLIENT
 ON CAMP.country=CLIENT.country
新客户
1 6543210

这个单一的查询花了我 23 分钟来检索数据。这只是我真正想要使用的查询的一部分(在下一个查询中我有错误“resourcesExceeded 400”)。我没有使用WHERE子句,因为我需要所有的客户。看看他们是新客户还是老客户。

我真正想使用的查询是这个:

SELECT 

 COUNT(DISTINCT CASE WHEN CLIENT.date <= CAMP.end_date AND CLIENT.date >= CAMP.start_date  AND CLIENT.freq <= 1 THEN client_id  END) AS new_client,

COUNT(DISTINCT CASE WHEN CLIENT.date <= CAMP.end_date AND CLIENT.date >= CAMP.start_date  AND CLIENT.freq > 1 THEN client_id  END) AS old_client,

COUNT(DISTINCT CASE WHEN CLIENT.date <= CAMP.end_date AND CLIENT.date >= CAMP.start_date AND CLIENT.country IN ("USA","EUROPA","MEXICO","OCEANIA","ASIA") AND  freq = 1 THEN client_id END ) + COUNT(DISTINCT CASE WHEN CLIENT.date <= CAMP.end_date AND CLIENT.date >= CAMP.start_date AND CLIENT.country NOT IN ("USA","EUROPA","MEXICO","OCEANIA","ASIA") AND  year_freq >= 1 AND  year_freq <= 3 THEN client_id END ) AS low_activity,

COUNT(DISTINCT CASE WHEN CLIENT.date <= CAMP.end_date AND CLIENT.date >= CAMP.start_date AND CLIENT.country IN ("USA","EUROPA","MEXICO","OCEANIA","ASIA") AND  freq = 2 THEN client_id END ) + COUNT(DISTINCT CASE WHEN CLIENT.date <= CAMP.end_date AND CLIENT.date >= CAMP.start_date AND CLIENT.country NOT IN ("USA","EUROPA","MEXICO","OCEANIA","ASIA") AND  year_freq > 3 AND  year_freq <= 6 THEN client_id END ) AS medium_activity,

COUNT(DISTINCT CASE WHEN CLIENT.date <= CAMP.end_date AND CLIENT.date >= CAMP.start_date AND CLIENT.country IN ("USA","EUROPA","MEXICO","OCEANIA","ASIA") AND  freq >= 3 THEN client_id END ) + COUNT(DISTINCT CASE WHEN CLIENT.date <= CAMP.end_date AND CLIENT.date >= CAMP.start_date AND CLIENT.country NOT IN ("USA","EUROPA","MEXICO","OCEANIA","ASIA") AND  year_freq > 6 THEN client_id END ) AS high_activity,

FROM CAMP
LEFT JOIN CLIENT
ON CAMP.country = CLIENT.country

此查询运行 1 小时,然后因错误“超出资源”而中断。所以我问你是否有另一种方法来改进查询。

我想要的是在我做特别活动的那几周以及在老客户中的medium_activity客户端和high_activity客户端中,获得新老客户的份额。new_client 将是low_activity

我 99% 确定这个查询会因为太多COUNT(DISTINCT). 作为 SQL 中的菜鸟,我不知道如何优化此查询。如果有更好的东西COUNT(DISTINCT)知道会很酷。

客户端表未使用任何分区

任何改进此查询的性能/优化的建议、意见、帮助、新子句、新语句等都是非常受欢迎的。

标签: sqlperformancegoogle-bigqueryquery-optimization

解决方案


推荐阅读