首页 > 解决方案 > postgresql - 尝试在范围标准内对类似数据进行分组

问题描述

嗨——我的 sql 技能不是最好的,但我被分配到一个深入数据的项目。基本上,我需要根据多个标准为类似项目分配一个组值。我可以很好地匹配 2 列,但只有当行的值落在彼此的特定范围内时,第 3 列和第 4 列才适用。

示例标准

 1. Match Item Code on first 4 digits after leading zeros.
 2. Match Description based on first word only.
 3. Outlet Number, only group if value is +/- 200 of each other
 4. Cost, only group if value is within .75 of each other

“组”值应该是这样的(这完全是虚构的数据):

数据示例] 1 如果上面的链接不起作用,这里是数据的文本:


Item_Code   Description   Cat_Number Cost  Group
123401      HM Book Green   5000    15.50   1
0123402     HM Book Blue    5200    14.75   2
123403      HM Book Orange  5100    14.75   2
0123404     HM Book Red     4700    14.75   3
123405      HM Book Yellow  5200    14.50   2
798321      Car Light Bulb  4000    5.25    4
798322      Car Light Lens  4610    5.50    5
798330      Car Light fuse  4600    4.75    5
0074789     Pencil #2       9900    1.25    6
747889      Pencil          9800    0.99    6
747890      Pencil #1       9100    1.10    7
074788      Pencils         9345    1.99    8
074785      Pencil B        8000    1.50    9

这是我前两列的内容,效果很好,甚至似乎在成本列上工作,但如果在 200 + / - 彼此的范围 - 我尝试过的没有任何工作:

    SELECT SUBSTR(TRIM(item_code, '0'), 1, 4), 
         SPLIT_PART(description, ' ', 1), 
         cost::numeric(19,2),
         DENSE_RANK() OVER (ORDER BY SUBSTR(TRIM(item_code, '0'), 1, 4), 
         SPLIT_PART(description, ' ', 1), cost::numeric(19,2)) AS group_number
    FROM some_items
    GROUP BY SUBSTR(TRIM(item_code, '0'), 1, 4), 
            SPLIT_PART(description, ' ', 1),  
            (
               (cost = cost +.75), 
               (cost = cost -.75)
            ), cost
    ORDER BY item_code;

我尝试了与 Cost 相同的逻辑,我尝试了公用表表达式和连接、嵌套选择、Having 子句的不同组合——但一切都回到我想以某种方式在组中合并一个范围 for猫号也是。成本似乎正在发挥作用也可能是一种侥幸——但到目前为止,它看起来一直是准确的。我非常感谢任何人可以提供的任何帮助或指导。谢谢!!

标签: postgresql

解决方案


我找到了诀窍!我得到的组,即使它们没有与示例相同的组 ID,也由相同的对象组成。

生成的查询既丑陋又复杂,我会在这个答案的最后把它带给你,但实际上最有趣的是这个技巧。

一开始,为了简洁起见,我将只考虑列cat_number(因为成本相同);在那之后,我会把所有的和平放在一起。

让我们一步一步来!

 select item_code, description, cat_number, 
 cat_number - lag(cat_number,1,cat_number) over(order by cat_number) as le 
 from table1;

我们有这个输出:

| item_code | description    | cat_number | le   |
| --------- | -------------- | ---------- | ---- |
| 798321    | Car Light Bulb | 4000       | 0    |
| 798330    | Car Light fuse | 4600       | 600  |
| 798322    | Car Light Lens | 4610       | 10   |
| 123404    | HM Book Red    | 4700       | 90   |
| 123401    | HM Book Green  | 5000       | 300  |
| 123403    | HM Book Orange | 5100       | 100  |
| 123405    | HM Book Yellow | 5200       | 100  |
| 123402    | HM Book Blue   | 5200       | 0    |
| 74785     | Pencil B       | 8000       | 2800 |
| 747890    | Pencil #1      | 9100       | 1100 |
| 74788     | Pencils        | 9345       | 245  |
| 747889    | Pencil         | 9800       | 455  |
| 74789     | Pencil #2      | 9900       | 100  |

lecat_number包含每一行的值与前一行的值之间的差异。

再一步:

select item_code
         , description
         , cat_number
         , case 
                when le >= 200 then cat_number 
                else null 
            end as lle from(
                    select item_code, description, cat_number, 
                    cat_number - lag(cat_number,1,cat_number) over(order by cat_number) as le 
                    from table1
            ) s1;

| item_code | description    | cat_number | lle  |
| --------- | -------------- | ---------- | ---- |
| 798321    | Car Light Bulb | 4000       |      |
| 798330    | Car Light fuse | 4600       | 4600 |
| 798322    | Car Light Lens | 4610       |      |
| 123404    | HM Book Red    | 4700       |      |
| 123401    | HM Book Green  | 5000       | 5000 |
| 123403    | HM Book Orange | 5100       |      |
| 123405    | HM Book Yellow | 5200       |      |
| 123402    | HM Book Blue   | 5200       |      |
| 74785     | Pencil B       | 8000       | 8000 |
| 747890    | Pencil #1      | 9100       | 9100 |
| 74788     | Pencils        | 9345       | 9345 |
| 747889    | Pencil         | 9800       | 9800 |
| 74789     | Pencil #2      | 9900       |      |

lle仅当对应的值cat_number不在与前列相同的范围内时,我们才有一个值。

再一步

    select item_code
             , description
             , cat_number
             , max(lle) over (order by cat_number) from (
                     select *
                     , case 
                            when le >= 200 then cat_number 
                            else null 
                        end as lle from(
                                select item_code, description, cat_number, 
                                cat_number - lag(cat_number,1,cat_number) over(order by cat_number) as le 
                                from table1
                        ) s1
               )s2;

| item_code | description    | cat_number | max  |
| --------- | -------------- | ---------- | ---- |
| 798321    | Car Light Bulb | 4000       |      |
| 798330    | Car Light fuse | 4600       | 4600 |
| 798322    | Car Light Lens | 4610       | 4600 |
| 123404    | HM Book Red    | 4700       | 4600 |
| 123401    | HM Book Green  | 5000       | 5000 |
| 123403    | HM Book Orange | 5100       | 5000 |
| 123405    | HM Book Yellow | 5200       | 5000 |
| 123402    | HM Book Blue   | 5200       | 5000 |
| 74785     | Pencil B       | 8000       | 8000 |
| 747890    | Pencil #1      | 9100       | 9100 |
| 74788     | Pencils        | 9345       | 9345 |
| 747889    | Pencil         | 9800       | 9800 |
| 74789     | Pencil #2      | 9900       | 9800 |

我们用迄今为止遇到的最大值填充空值......现在我们可以使用dense_rank()(正如你所猜测的)

    select item_code
             , description
             , cat_number
             , dense_rank() over (order by llle) from (
                             select *
                             , max(lle) over (order by cat_number) as llle from (
                                             select *
                                             , case 
                                                    when le >= 200 then cat_number 
                                                    else null 
                                                end as lle from(
                                                        select item_code, description, cat_number, 
                                                        cat_number - lag(cat_number,1,cat_number) over(order by cat_number) as le 
                                                        from table1
                                                ) s1
                               )s2
               )s3;

| item_code | description    | cat_number | dense_rank |
| --------- | -------------- | ---------- | ---------- |
| 798330    | Car Light fuse | 4600       | 1          |
| 798322    | Car Light Lens | 4610       | 1          |
| 123404    | HM Book Red    | 4700       | 1          |
| 123405    | HM Book Yellow | 5200       | 2          |
| 123402    | HM Book Blue   | 5200       | 2          |
| 123401    | HM Book Green  | 5000       | 2          |
| 123403    | HM Book Orange | 5100       | 2          |
| 74785     | Pencil B       | 8000       | 3          |
| 747890    | Pencil #1      | 9100       | 4          |
| 74788     | Pencils        | 9345       | 5          |
| 74789     | Pencil #2      | 9900       | 6          |
| 747889    | Pencil         | 9800       | 6          |
| 798321    | Car Light Bulb | 4000       | 7          |

我们在同一组中只有cat_number指定范围内的值!

现在将所有内容放在同一个查询中!

 select item_code
             , description
             , cat_number
             , cost
             , dense_rank() over (
                        order by 
                            substring(item_code::text, 0, 4),
                            substring(description, 0, 4),
                            llle,
                            lllc) 
              from (
                 select *
                 , max(lle) over (order by cat_number) as llle
                 , max(llc) over (order by cost) lllc from (
                         select *
                         , case 
                                when le >= 200 then cat_number 
                                else null 
                            end as lle
                         , case
                               when lc >= 0.75 then cost
                                else null
                            end as llc
                           from(
                                    select item_code, description, cat_number,cost, 
                                    cat_number - lag(cat_number,1,cat_number) over(order by cat_number) as le,
                                    cost - lag(cost, 1, cost) over (order by cost) as lc
                                    from table1
                            ) s1
                    )s2
               )s3;

| item_code | description    | cat_number | cost  | dense_rank |
| --------- | -------------- | ---------- | ----- | ---------- |
| 123404    | HM Book Red    | 4700       | 14.75 | 1          |
| 123402    | HM Book Blue   | 5200       | 14.75 | 2          |
| 123405    | HM Book Yellow | 5200       | 14.50 | 2          |
| 123403    | HM Book Orange | 5100       | 14.75 | 2          |
| 123401    | HM Book Green  | 5000       | 15.50 | 3          |
| 74785     | Pencil B       | 8000       | 1.50  | 4          |
| 747890    | Pencil #1      | 9100       | 1.10  | 5          |
| 74788     | Pencils        | 9345       | 1.99  | 6          |
| 747889    | Pencil         | 9800       | 0.99  | 7          |
| 74789     | Pencil #2      | 9900       | 1.25  | 7          |
| 798322    | Car Light Lens | 4610       | 5.50  | 8          |
| 798330    | Car Light fuse | 4600       | 4.75  | 8          |
| 798321    | Car Light Bulb | 4000       | 5.25  | 9          |

Ps:item_code开头没有0,我用了一个子字符串而不是你更正确的表达,但概念是一样的。

再见


推荐阅读