首页 > 解决方案 > DB2 中 WITH 查询的 SQL 查询性能改进

问题描述

我在示例中给出的查询运行速度非常慢。我已经在my_task表中关闭了 400 万条记录。

我们可以对此进行任何形式的性能改进吗?

以下表为例,

在这里,我放置了数字start_dtend_dt不是放置timestamp格式。

有空的附加说明end_dt意味着它是一个活动记录并且正在由工人处理。

T_ID |start_dt |end_dt |code       |p_id
-----|---------|-------|-----------|---
1    |8        |4      |INPROGRESS |110
1    |4        |       |ASSIGNED   |110
4    |10       |4      |INPROGRESS |110
4    |4        |       |ASSIGNED   |110
5    |4        |4      |INPROGRESS |110
6    |12       |12     |INPROGRESS |110
6    |8        |8      |ASSIGNED   |110
6    |8        |       |DONE       |110
2    |12       |12     |INPROGRESS |210
2    |8        |8      |ASSIGNED   |210
2    |8        |       |DONE       |210
3    |12       |12     |INPROGRESS |111

输出看起来像,

P_ID |avg_bgn_diff |assigned |in_progress |completed | comp_diff
-----|-------------|---------|------------|----------|----------
110  | 4           |   2     |    1       |     1    |      10
210  | null        |   0     |    0       |     1    |      8
111  | null        |   0     |    1       |     0    |      null

输出解释:我已经用虚构的名称掩盖了原始查询表 ref 可以被破坏我为此提前道歉。

我有以下查询,

WITH a AS (
    SELECT
        t1.t_id AS t_id,
        t1.start_dt AS start_dt,
        t1.end_dt AS end_dt,
        t1.code AS code,
        t2.p_id AS p_id
    FROM
        my_task t2
        INNER JOIN my_task_ref t1 ON t1.t_id = t2.t_id
        INNER JOIN my_people p1 ON t2.p_id = p1.p_id
    WHERE
        -- ignore DONE tasks
        t1.t_id NOT IN (
            SELECT t.t_id
            FROM my_task t
            WHERE t.code = 'DONE' AND trunc(t.execution_dt) < trunc(current_timestamp)
        )
        and p1.department_id = '1234' 
    ORDER BY p_id DESC
) SELECT
    d.p_id,
    d.avg_bgn_diff
    ,e.assigned
    ,e.in_progress
    ,e.completed
    ,g.comp_diff
  FROM
  `-- find average time for persons for diff ASSIGNMENT
    (
        SELECT c.p_id,AVG(c.bgn_diff) AS avg_bgn_diff
        FROM(
                SELECT b.p_id,timestampdiff(4,current_timestamp - a.start_dt) AS bgn_diff
                FROM ( SELECT p_id,t_id,start_dt FROM a WHERE end_dt IS NULL ) b
                LEFT OUTER JOIN  ( SELECT p_id, t_id,start_dt FROM a WHERE 
                     code = 'ASSIGNED' AND   end_dt IS NULL ) x ON x.p_id = b.p_id
            ) c  GROUP BY C.p_id
    ) d
    -- find count of each codes person has
    INNER JOIN (
        SELECT 
            p_id,
            SUM( CASE WHEN code = 'ASSIGNED' THEN 1 ELSE 0 END ) AS assigned,
            SUM( CASE WHEN code = 'INPROGRESS' THEN 1 ELSE 0 END ) AS in_progress,
            SUM( CASE WHEN code = 'DONE' AND trunc(start_dt) = trunc(current_timestamp)
                    THEN 1 ELSE 0 END ) AS completed
        FROM
            a where end_dt IS NULL
        GROUP BY p_id
    ) e on D.p_id=E.p_id 
    -- find total avg diff of entire task took to compelete.
    LEFT OUTER JOIN (
        SELECT F.p_id,AVG(f.bgn_diff) AS comp_diff
        FROM
            (
                SELECT a.p_id, timestampdiff(4,b.start_dt - a.start_dt) AS bgn_diff
                FROM (
                        SELECT p_id, t_id, start_dt FROM a WHERE code = 'INPROGRESS'
                    ) a
                    INNER JOIN (
                        SELECT p_id, t_id, start_dt FROM a
                        WHERE code = 'DONE' AND   trunc(start_dt) = trunc(current_timestamp)
                    ) b ON a.t_id = b.t_id
            ) f GROUP BY F.p_id
    ) g ON D.p_id=G.p_id
WITH
ur;

我们可以用不同的方式编写它来提高性能吗?

注意:索引存在于所有必要的列中。

提前致谢。

标签: sqlperformancedb2

解决方案


如果您提供一个查询EXPLAIN计划、一个索引列表,或许可以更好地解释您正在尝试做什么(并更正表引用的语法错误c),我们当然可以做得更好,但是这个版本的查询可能能够加快速度。

请注意整个评论!

WITH Incomplete_Task AS (SELECT My_Task_Ref.t_id,
                                My_Task_Ref.start_dt, My_Task_Ref.end_dt,
                                My_Task_Ref.code,
                                Task_A.p_id
                         FROM My_Task AS Task_A
                         JOIN My_Task_Ref
                           ON My_Task_Ref.t_id = Task_A.t_id
                         JOIN My_People
                           ON My_People.p_id = My_Task_Ref.p_id
                              AND My_People.department_id = '1234'
                         -- NOT IN should be fine, I just default to NOT EXISTS
                         WHERE NOT EXISTS (SELECT 1
                                           FROM My_Task AS Task_B
                                           WHERE Task_B.t_id = Task_A.t_id
                                           AND Task_B.code = 'DONE'
                                           -- Calling a function on a column can 
                                           -- cause indices to be ignored
                                           AND Task_B.execution_dt < TIMESTAMP(CURRENT_DATE)))

SELECT Average_Time_And_Code_Count.p_id,
       Average_Time_And_Code_Count.average_begin_difference,
       COALESCE(Average_Time_And_Code_Count.assigned, 0),
       COALESCE(Average_Time_And_Code_Count.in_progress, 0),
       COALESCE(Average_Time_And_Code_Count.completed, 0),
       Average_Complete_Time.average_complete_difference
FROM (SELECT p_id,
             -- The join you had previously was almost certainly duplicating 
             -- some rows, distorting the results.
             AVG(CASE WHEN code = 'ASSIGNED' 
                      -- TIMESTAMPDIFF works off an estimate, and will be wrong
                      -- if a task takes more than a month.
                      THEN TIMESTAMPDIFF(4, CURRENT_TIMESTAMP - A.start_dt) END) AS average_begin_difference,
             SUM(CASE WHEN code = 'ASSIGNED' 
                               THEN 1 END) AS assigned,
             SUM(CASE WHEN code = 'INPROGRESS' 
                               THEN 1 END) AS in_progress,
             SUM(CASE WHEN code = 'DONE' 
                                    AND start_dt >= TIMESTAMP(CURRENT_DATE) 
                               THEN 1 END) AS completed
      FROM Filtered_Task
      WHERE end_dt IS NULL
      GROUP BY p_id) AS Average_Time_And_Code_Count
-- I'm not convinced this measures what you think it does,
-- but I'm not sure what it is you think you _are_ measuring....
LEFT JOIN (SELECT p_id, TIMESTAMPDIFF(4, Done.start_dt - InProgress.start_dt) AS average_complete_difference
           FROM Filtered_Task AS InProgress
           JOIN Filtered_Task AS Done
             ON InProgress.t_id = Done.t_id
                AND Done.code = 'DONE'
                AND Done.start_dt >= TIMESTAMP(CURRENT_DATE)
           WHERE InProgress.code = 'INPROGRESS') AS Average_Complete_Time
       ON Average_Complete_Time.p_id = Averate_Time_And_Code_Count.p_id

推荐阅读