Optimizing a query with large data and a large GROUP BY

Problem description

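I have a query that I want to optimize. It consists of a large number of groupings and joins. Originally, the query joined first and then performed the GROUP BY. I would like to group the columns first and only then join the remaining columns.

The problem appears when the JOIN is performed, because the join columns are not used in the GROUP BY. So I don't know how to optimize it.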

SELECT  
            a.create_datetime_date,
            a.company_code,
            a.system_code,
            a.type_id,
            a.status_id,
            a.response_id,
            a.subject_id,
            a.providers_channels_id,
            a.currency,
            a.complaint,
            a.complaint_type,
            a.returned,
-- online
            a.api_type,
            --b.source,
            a.device,
            a.chk_validated,
            a.country,
            a.customer, 
            a.application, 
            a.application_version, 
            a.language,
            a.intercompany,
-- cards
            g.card_brand,
            g.card_type,
            g.mpi_result,
            g.three_ds_type,
            g.operation_category,
            g.credit_card_operation_type,
            g.issuer_country,
-- pos
            a.location_id,
            a.terminal_id,
-- provider_date
            b.subject_id,
            b.providers_channels_id,
            c.card_brand,
            c.card_type,
            c.issuer_country,
            c.three_ds_type,
            c.operation_category,
            c.credit_card_operation_type,
-- agr
             a.trans_count,
             a.trans_value,
             a.turnover_pln,
             a.income_pln,
             a.cost_pln,
             a.time_to_status,
            a.id_array,
            'DAILY_NEW'
--3869958
    FROM    ( SELECT 
            z1.create_datetime_date,
            z1.company_code,
            z1.system_code,
            z1.type_id,
            z1.status_id,
            z1.response_id,
            z1.subject_id,
            z1.providers_channels_id,
            z1.currency,
            z1.complaint,
            z1.complaint_type,
            z1.returned,
            z1.api_type,
            z1.device,
            z1.chk_validated,
            z1.country,
            z1.customer, 
            z1.application, 
            z1.application_version, 
            z1.language,
            z1.intercompany,
            z1.location_id,
            z1.terminal_id,
            count(z1.id) as trans_count,
            sum(z1.value_pln) as trans_value,
            sum(z1.turnover_pln) as turnover_pln,
            sum(z1.income_pln)  as income_pln,
            sum(z1.cost_pln)    as cost_pln,
            sum(z1.extract_epoch) as time_to_status,
            array_agg(z1.id) as  id_array
             FROM risk.transactions_for_test z1
    WHERE   z1.create_datetime          >= date_trunc('month', date '2020-06-30') - interval '1 month' * 4  AND
            z1.create_datetime          < '2020-06-30'                      AND
            z1.company_code             in ('dotpay')
             
        GROUP BY 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23 
    ) a
    LEFT JOIN risk.transactions b on a.provider_transaction_id=substring(b.external_id, length(b.company_code)+length(b.system_code)+3)
    LEFT JOIN risk.transactions_statuses c  on b.id=c.transaction_id and c.is_last=TRUE
    LEFT JOIN risk.transactions_statuses g  on a.id=g.transaction_id
    LEFT JOIN risk.dict_statuses    e       on a.status_id=e.id
    WHERE g.is_last =TRUE   

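As you can see, I first group and aggregate the columns of table A, and then I want to join the other tables. However, the columns required for the joins (e.g. a.provider_transaction_id and a.id) are not available in table A, because it is a subquery with grouping.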

Edit:

Output of EXPLAIN (ANALYZE, BUFFERS):

"GroupAggregate  (cost=26623251.90..29880446.29 rows=19159967 width=527) (actual time=731329.744..780749.029 rows=3869958 loops=1)"
"  Group Key: ((a.create_datetime)::date), a.company_code, a.system_code, a.type_id, a.status_id, a.response_id, a.subject_id, a.providers_channels_id, a.currency, a.complaint, a.complaint_type, a.returned, a.api_type, (CASE WHEN (upper((a.user_agent)::text) ~~ '%ANDROID%'::text) THEN 'Mobile'::text WHEN (upper((a.user_agent)::text) ~~ '%IPHONE%'::text) THEN 'Mobile'::text WHEN (upper((a.user_agent)::text) ~~ '%IPAD%'::text) THEN 'Mobile'::text WHEN (upper((a.user_agent)::text) ~~ '%WINDOWS%'::text) THEN 'Desktop'::text WHEN (upper((a.user_agent)::text) ~~ '%MACINTOSH%'::text) THEN 'Desktop'::text ELSE 'Other'::text END), a.chk_validated, a.country, a.customer, a.application, a.application_version, a.language, a.intercompany, g.card_brand, g.card_type, g.mpi_result, g.three_ds_type, g.operation_category, g.credit_card_operation_type, g.issuer_country, a.location_id, a.terminal_id, b.subject_id, b.providers_channels_id, c.card_brand, c.card_type, c.issuer_country, c.three_ds_type, c.operation_category, c.credit_card_operation_type"
"  Buffers: shared hit=7974752 read=13700294, temp read=3013159 written=4126575"
"  ->  Sort  (cost=26623251.90..26671151.82 rows=19159967 width=365) (actual time=731329.710..761678.063 rows=33047423 loops=1)"
"        Sort Key: ((a.create_datetime)::date), a.system_code, a.type_id, a.status_id, a.response_id, a.subject_id, a.providers_channels_id, a.currency, a.complaint, a.complaint_type, a.returned, a.api_type, (CASE WHEN (upper((a.user_agent)::text) ~~ '%ANDROID%'::text) THEN 'Mobile'::text WHEN (upper((a.user_agent)::text) ~~ '%IPHONE%'::text) THEN 'Mobile'::text WHEN (upper((a.user_agent)::text) ~~ '%IPAD%'::text) THEN 'Mobile'::text WHEN (upper((a.user_agent)::text) ~~ '%WINDOWS%'::text) THEN 'Desktop'::text WHEN (upper((a.user_agent)::text) ~~ '%MACINTOSH%'::text) THEN 'Desktop'::text ELSE 'Other'::text END), a.chk_validated, a.country, a.customer, a.application, a.application_version, a.language, a.intercompany, g.card_brand, g.card_type, g.mpi_result, g.three_ds_type, g.operation_category, g.credit_card_operation_type, g.issuer_country, a.location_id, a.terminal_id, b.subject_id, b.providers_channels_id, c.card_brand, c.card_type, c.issuer_country, c.three_ds_type, c.operation_category, c.credit_card_operation_type"
"        Sort Method: external merge  Disk: 4159856kB"
"        Buffers: shared hit=7974752 read=13700294, temp read=3013159 written=4126575"
"        ->  Gather  (cost=19135164.08..22426189.66 rows=19159967 width=365) (actual time=591167.903..639688.023 rows=33047423 loops=1)"
"              Workers Planned: 4"
"              Workers Launched: 4"
"              Buffers: shared hit=7974752 read=13700294, temp read=2493177 written=3606590"
"              ->  Parallel Hash Left Join  (cost=19134164.08..20509192.96 rows=4789992 width=365) (actual time=583965.274..621130.313 rows=6609485 loops=5)"
"                    Hash Cond: (b.id = c.transaction_id)"
"                    Buffers: shared hit=7974752 read=13700294, temp read=2493177 written=3606590"
"                    ->  Merge Left Join  (cost=17121862.99..18283927.65 rows=4789992 width=397) (actual time=577937.423..599347.150 rows=6609485 loops=5)"
"                          Merge Cond: ((a.provider_transaction_id)::text = (""substring""((b.external_id)::text, ((length((b.company_code)::text) + length((b.system_code)::text)) + 3))))"
"                          Buffers: shared hit=7496767 read=12528129, temp read=2493177 written=3606590"
"                          ->  Sort  (cost=5822528.38..5832924.28 rows=4158360 width=396) (actual time=89454.725..91606.161 rows=6609485 loops=5)"
"                                Sort Key: a.provider_transaction_id"
"                                Sort Method: external merge  Disk: 1415328kB"
"                                Worker 0:  Sort Method: external merge  Disk: 1402632kB"
"                                Worker 1:  Sort Method: external merge  Disk: 1443424kB"
"                                Worker 2:  Sort Method: external merge  Disk: 1406288kB"
"                                Worker 3:  Sort Method: external merge  Disk: 1418312kB"
"                                Buffers: shared hit=478054 read=4234477, temp read=885748 written=885753"
"                                ->  Parallel Hash Join  (cost=3454200.51..5365366.95 rows=4158360 width=396) (actual time=58629.118..82044.806 rows=6609485 loops=5)"
"                                      Hash Cond: (g.transaction_id = a.id)"
"                                      Buffers: shared hit=478034 read=4234477"
"                                      ->  Parallel Seq Scan on transactions_statuses g  (cost=0.00..1884282.54 rows=10241484 width=58) (actual time=0.025..19525.597 rows=8172165 loops=5)"
"                                            Filter: is_last"
"                                            Rows Removed by Filter: 10567399"
"                                            Buffers: shared hit=478033 read=1172005"
"                                      ->  Parallel Hash  (cost=3387908.45..3387908.45 rows=5303365 width=346) (actual time=58597.628..58597.628 rows=6609485 loops=5)"
"                                            Buckets: 33554432  Batches: 1  Memory Usage: 8003456kB"
"                                            Buffers: shared hit=1 read=3062472"
"                                            ->  Parallel Seq Scan on transactions a  (cost=0.00..3387908.45 rows=5303365 width=346) (actual time=0.061..54622.612 rows=6609485 loops=5)"
"                                                  Filter: ((create_datetime < '2020-06-30 00:00:00'::timestamp without time zone) AND ((company_code)::text = 'dotpay'::text) AND (create_datetime >= (date_trunc('month'::text, ('2020-06-30'::date)::timestamp with time zone) - '4 mons'::interval)))"
"                                                  Rows Removed by Filter: 3804450"
"                                                  Buffers: shared hit=1 read=3062472"
"                          ->  Materialize  (cost=11299334.60..11559682.96 rows=52069672 width=53) (actual time=488480.181..503766.021 rows=18586727 loops=5)"
"                                Buffers: shared hit=7018713 read=8293652, temp read=1607429 written=2720837"
"                                ->  Sort  (cost=11299334.60..11429508.78 rows=52069672 width=53) (actual time=488480.119..502399.521 rows=18586727 loops=5)"
"                                      Sort Key: (""substring""((b.external_id)::text, ((length((b.company_code)::text) + length((b.system_code)::text)) + 3)))"
"                                      Sort Method: external merge  Disk: 4353304kB"
"                                      Worker 0:  Sort Method: external merge  Disk: 4353304kB"
"                                      Worker 1:  Sort Method: external merge  Disk: 4353312kB"
"                                      Worker 2:  Sort Method: external merge  Disk: 4353304kB"
"                                      Worker 3:  Sort Method: external merge  Disk: 4353312kB"
"                                      Buffers: shared hit=7018713 read=8293652, temp read=1607429 written=2720837"
"                                      ->  Seq Scan on transactions b  (cost=0.00..3583169.72 rows=52069672 width=53) (actual time=42.577..106061.723 rows=52069673 loops=5)"
"                                            Buffers: shared hit=7018713 read=8293652"
"                    ->  Parallel Hash  (cost=1884282.54..1884282.54 rows=10241484 width=56) (actual time=5992.972..5992.972 rows=8172165 loops=5)"
"                          Buckets: 67108864  Batches: 1  Memory Usage: 2494880kB"
"                          Buffers: shared hit=477873 read=1172165"
"                          ->  Parallel Seq Scan on transactions_statuses c  (cost=0.00..1884282.54 rows=10241484 width=56) (actual time=1247.782..3608.702 rows=8172165 loops=5)"
"                                Filter: is_last"
"                                Rows Removed by Filter: 10567399"
"                                Buffers: shared hit=477873 read=1172165"
"Planning Time: 5.222 ms"
"JIT:"
"  Functions: 175"
"  Options: Inlining true, Optimization true, Expressions true, Deforming true"
"  Timing: Generation 27.114 ms, Inlining 222.291 ms, Optimization 3565.200 ms, Emission 2446.257 ms, Total 6260.862 ms"
"Execution Time: 781253.458 ms"

Edit 2: the original query that I want to optimize:

SELECT  
            a.create_datetime::date,
            a.company_code,
            a.system_code,
            a.type_id,
            a.status_id,
            a.response_id,
            a.subject_id,
            a.providers_channels_id,
            a.currency,
            a.complaint,
            a.complaint_type,
            a.returned,
-- online
            a.api_type,
            --b.source,
            case 
                when upper(a.user_agent) like '%ANDROID%'   then 'Mobile'
                when upper(a.user_agent) like '%IPHONE%'    then 'Mobile'
                when upper(a.user_agent) like '%IPAD%'  then 'Mobile'
                when upper(a.user_agent) like '%WINDOWS%'   then 'Desktop'
                when upper(a.user_agent) like '%MACINTOSH%' then 'Desktop'
                else 'Other'
            end,
            a.chk_validated,
            a.country,
            a.customer, 
            a.application, 
            a.application_version, 
            a.language,
            a.intercompany,
-- cards
            g.card_brand,
            g.card_type,
            g.mpi_result,
            g.three_ds_type,
            g.operation_category,
            g.credit_card_operation_type,
            g.issuer_country,
-- pos
            a.location_id,
            a.terminal_id,
-- provider_date
            b.subject_id,
            b.providers_channels_id,
            c.card_brand,
            c.card_type,
            c.issuer_country,
            c.three_ds_type,
            c.operation_category,
            c.credit_card_operation_type,
-- agr
            count(a.id) as trans_count,
            sum(a.value_pln) as trans_value,
            sum(a.turnover_pln) as turnover_pln,
            sum(a.income_pln)   as income_pln,
            sum(a.cost_pln)     as cost_pln,
            sum(EXTRACT(EPOCH FROM (a.change_datetime - a.create_datetime))) as time_to_status,
            array_agg(a.id),
            'DAILY_NEW'

    FROM    risk.transactions a
    LEFT JOIN risk.transactions b on a.provider_transaction_id=substring(b.external_id, length(b.company_code)+length(b.system_code)+3)
    LEFT JOIN risk.transactions_statuses c  on b.id=c.transaction_id and c.is_last=TRUE
    LEFT JOIN risk.transactions_statuses g  on a.id=g.transaction_id
    LEFT JOIN risk.dict_statuses    e       on a.status_id=e.id
    WHERE   a.create_datetime           >= date_trunc('month', date '2020-06-30') - interval '1 month' * 4  AND
            a.create_datetime           < '2020-06-30'                      AND
            a.company_code              in ('dotpay')   AND
            g.is_last                   =TRUE
    GROUP by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38;

Tags: postgresql, optimization, group-by

Solution


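You think that would be faster, but PostgreSQL does not. It estimates that the grouping will not actually remove any rows (19159967 both before and after), so pushing the grouping down below the joins looks unattractive to the planner.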
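One way to check that estimate is to run only the aggregating part under EXPLAIN (ANALYZE, BUFFERS) and compare the planner's `rows=` estimate on the aggregate node with the actual row count. In the plan posted in the question, the GroupAggregate node was estimated at 19159967 rows, while it actually reduced 33047423 input rows to 3869958. The following is a minimal sketch, assuming a shortened column list for readability; it is not the full query.

```sql
-- Sketch only: a reduced version of the grouping from the question,
-- used to compare estimated vs. actual number of groups.
-- The three grouping columns are an arbitrary subset chosen for illustration.
EXPLAIN (ANALYZE, BUFFERS)
SELECT  z1.create_datetime::date AS create_datetime_date,
        z1.company_code,
        z1.status_id,
        count(z1.id)             AS trans_count
FROM    risk.transactions z1
WHERE   z1.create_datetime >= date_trunc('month', date '2020-06-30') - interval '1 month' * 4
  AND   z1.create_datetime <  '2020-06-30'
  AND   z1.company_code IN ('dotpay')
GROUP BY 1, 2, 3;
```

If the estimated and actual row counts on the aggregate node differ widely, that is consistent with the planner misjudging how much the grouping shrinks the data.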

To force it to run the way you want, you can move the subquery named "a" out of the main body of the query and into a CTE. That is:

WITH a AS MATERIALIZED (<your current subquery a>)
SELECT ... FROM a
LEFT JOIN...
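As a slightly fuller sketch of that shape, the example below uses a shortened column list and only the join to risk.dict_statuses, whose join column (status_id) is one of the grouping columns. The choice of columns and the alias dict_status_id are assumptions made for illustration, not the complete query.

```sql
-- Sketch only: grouping happens inside the materialized CTE,
-- the join happens afterwards on the already aggregated rows.
WITH a AS MATERIALIZED (
    SELECT  z1.create_datetime::date AS create_datetime_date,
            z1.company_code,
            z1.status_id,
            count(z1.id)             AS trans_count,
            sum(z1.value_pln)        AS trans_value
    FROM    risk.transactions z1
    WHERE   z1.create_datetime >= date_trunc('month', date '2020-06-30') - interval '1 month' * 4
      AND   z1.create_datetime <  '2020-06-30'
      AND   z1.company_code IN ('dotpay')
    GROUP BY 1, 2, 3
)
SELECT  a.create_datetime_date,
        a.company_code,
        a.status_id,
        a.trans_count,
        a.trans_value,
        e.id AS dict_status_id      -- hypothetical alias, for illustration
FROM    a
LEFT JOIN risk.dict_statuses e ON a.status_id = e.id;
```

Note that the outer query can only join on columns the CTE exposes. Joins on a.id or a.provider_transaction_id would require those columns in the CTE's output, which is the difficulty described in the question.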

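The MATERIALIZED keyword is only needed from PostgreSQL 12 onwards. Before that, a CTE was always materialized.

Will this actually be faster? I don't know; try it and see.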
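The difference between versions can be sketched as follows. From PostgreSQL 12 on, a non-recursive, side-effect-free CTE that is referenced only once may be folded into the outer query unless MATERIALIZED is specified; NOT MATERIALIZED requests the opposite. The tiny query body here is an assumption chosen to keep the example short.

```sql
-- PostgreSQL 11 and earlier: the keyword does not exist,
-- and the CTE is always computed separately.
WITH a AS (
    SELECT status_id, count(id) AS trans_count
    FROM   risk.transactions
    GROUP  BY status_id
)
SELECT a.status_id, a.trans_count FROM a;

-- PostgreSQL 12 and later: ask explicitly for the old behavior.
WITH a AS MATERIALIZED (
    SELECT status_id, count(id) AS trans_count
    FROM   risk.transactions
    GROUP  BY status_id
)
SELECT a.status_id, a.trans_count FROM a;

-- PostgreSQL 12 and later: explicitly allow the planner to inline the CTE.
WITH a AS NOT MATERIALIZED (
    SELECT status_id, count(id) AS trans_count
    FROM   risk.transactions
    GROUP  BY status_id
)
SELECT a.status_id, a.trans_count FROM a;
```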

