Remove duplicates from a table by priority order

Problem description

I have a table with the following sample data:

+----+----------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
| id |            url             |           description           |         description_hash         |  city  | latitude | longitude | service | sid |
+----+----------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
|  1 | www.website.com/sdadsd12d1 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Paris  | 48.85670 | 2.35146   | website |   1 |
|  2 | www.page.com/gfdg3df2f2    | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Paris  | 48.85670 | 2.35146   | page    |   2 |
|  3 | www.site.com/sdjbhsjhd17   | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Paris  | 48.85670 | 2.35146   | site    |   3 |
|  4 | www.site.com/sdsdadqwd12   | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Berlin | 52.51704 | 13.38886  | site    |   3 |
|  5 | www.page.com/dgdg2wg3      | Some description here version 2 | 764ed2b4f0d28e45332816c7beedb706 | Berlin | 52.51704 | 13.38886  | page    |   2 |
|  6 | www.webpage.com/8f8fj2h    | Some description here version 2 | 764ed2b4f0d28e45332816c7beedb706 | Berlin | 52.51704 | 13.38886  | webpage |   4 |
+----+----------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+

My task is to remove duplicate rows: I want unique combinations of description_hash, service, and latitude (city). Until today I had been using the following query:

    update my_table_data
    set description_hash = md5(description::text)
    where description_hash is null;

    drop view if exists temp_view_duplicates;
    create view temp_view_duplicates as
    with a as (
        -- distinct (hash, service) pairs
        select distinct description_hash, service
        from my_table_data
    ), b as (
        -- hashes that occur with more than one service
        select description_hash
        from a
        group by description_hash
        having count(*) > 1
    ), c as (
        -- the (hash, service) pairs belonging to those hashes
        select a.description_hash, a.service
        from a
        join b on a.description_hash = b.description_hash
    ), d as (
        -- add the latitude to each such pair
        select distinct latitude, description_hash, service
        from my_table_data
        where description_hash in (select description_hash from c)
    ), e as (
        -- (hash, latitude) combinations that are duplicated
        select description_hash, latitude
        from d
        group by description_hash, latitude
        having count(*) > 1
    )
    select min(ctid) as min_ctid, description_hash, latitude
    from my_table_data
    where description_hash in (select description_hash from e)
    group by description_hash, latitude
    order by description_hash;

    delete from my_table_data a
    using (
        select min_ctid, description_hash, latitude
        from temp_view_duplicates
    ) b
    where a.description_hash = b.description_hash
      and a.latitude = b.latitude
      and a.ctid <> b.min_ctid;

The result is fine, but the query deletes rows arbitrarily, keeping whichever row happens to have the lowest ctid. Example result:

+----+----------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
| id |            url             |           description           |         description_hash         |  city  | latitude | longitude | service | sid |
+----+----------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
|  1 | www.website.com/sdadsd12d1 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Paris  | 48.85670 | 2.35146   | website |   1 |
|  4 | www.site.com/sdsdadqwd12   | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Berlin | 52.51704 | 13.38886  | site    |   3 |
|  5 | www.page.com/dgdg2wg3      | Some description here version 2 | 764ed2b4f0d28e45332816c7beedb706 | Berlin | 52.51704 | 13.38886  | page    |   2 |
+----+----------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+

Now I want to change my query into one that takes a priority order of services (sid) into account when deleting.

Two example results with their priority lists:

Priorities:

  1. webpage
  2. page
  3. website
+----+--------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
| id |           url            |           description           |         description_hash         |  city  | latitude | longitude | service | sid |
+----+--------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
|  2 | www.page.com/gfdg3df2f2  | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Paris  | 48.85670 | 2.35146   | page    |   2 |
|  4 | www.site.com/sdsdadqwd12 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Berlin | 52.51704 | 13.38886  | site    |   3 |
|  6 | www.webpage.com/8f8fj2h  | Some description here version 2 | 764ed2b4f0d28e45332816c7beedb706 | Berlin | 52.51704 | 13.38886  | webpage |   4 |
+----+--------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+

Priorities:

  1. site
  2. website
  3. page
+----+--------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
| id |           url            |           description           |         description_hash         |  city  | latitude | longitude | service | sid |
+----+--------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
|  3 | www.site.com/sdjbhsjhd17 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Paris  | 48.85670 | 2.35146   | site    |   3 |
|  4 | www.site.com/sdsdadqwd12 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Berlin | 52.51704 | 13.38886  | site    |   3 |
|  5 | www.page.com/dgdg2wg3    | Some description here version 2 | 764ed2b4f0d28e45332816c7beedb706 | Berlin | 52.51704 | 13.38886  | page    |   2 |
+----+--------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+

In general:

Does anyone know how to do this? I have been thinking about it all day and I am stuck. I have considered custom-sorting the rows within each description_hash group.

EDIT:

What is the priority?

The current query deletes duplicate records at random, i.e. based on ctid, so I have no control over which records are removed.

My problem is that I want to control this, and to be able to define it with a priority list of sites, from the most important to the least important.

The deletion logic should be: when duplicates are found, check which sites they come from and keep the row from the site that ranks highest on the priority list.

SQL for testing:

create table my_table_data(
    id int,
    url text,
    description text,
    description_hash text,
    city text,
    latitude double precision,
    longitude double precision,
    service text,
    sid int
);

insert into my_table_data(id, url, description, description_hash, city, latitude, longitude, service, sid) values
(1, 'www.website.com/sdadsd12d1', 'Some description here version 1', '94b35433ecd64545db9c9129b877ea49', 'Paris',  48.85670, 2.35146,  'website', 1),
(2, 'www.page.com/gfdg3df2f2',    'Some description here version 1', '94b35433ecd64545db9c9129b877ea49', 'Paris',  48.85670, 2.35146,  'page',    2),
(3, 'www.site.com/sdjbhsjhd17',   'Some description here version 1', '94b35433ecd64545db9c9129b877ea49', 'Paris',  48.85670, 2.35146,  'site',    3),
(4, 'www.site.com/sdsdadqwd12',   'Some description here version 1', '94b35433ecd64545db9c9129b877ea49', 'Berlin', 52.51704, 13.38886, 'site',    3),
(5, 'www.page.com/dgdg2wg3',      'Some description here version 2', '764ed2b4f0d28e45332816c7beedb706', 'Berlin', 52.51704, 13.38886, 'page',    2),
(6, 'www.webpage.com/8f8fj2h',    'Some description here version 2', '764ed2b4f0d28e45332816c7beedb706', 'Berlin', 52.51704, 13.38886, 'webpage', 4);

Tags: sql, postgresql

Solution


I'm not sure how to answer this other than in general terms, but let me give it a try.

My best guess is that your definition of a "duplicate" is the same description (or description hash) at the same location (latitude/longitude, or city).

If that is the case, then to remove those records while prioritizing some of them by a defined, sortable criterion, the row_number window function should do the trick.

For example, this query prioritizes the lowest sid: anything with row number = 1 is kept, and everything else gets deleted. If you want a criterion other than sid, or secondary and tertiary criteria, just add them to the order by:

select
  id, url, description, description_hash, city, latitude, longitude, service, sid,
  row_number() over (partition by description_hash, latitude, longitude order by sid) as rn
from my_table_data

With that in place, the records can be deleted in a single step:

with dupes as (
  select
    id, row_number() over (partition by description_hash, latitude, longitude order by sid) as rn
  from my_table_data
)
delete from my_table_data m
where exists (
  select null
  from dupes d
  where
    d.id = m.id and
    d.rn > 1
)

Both the window function and the semi-join are quite efficient, so even 1M records should be fast.
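The question asks for a priority over service names rather than over sid. One way to express that (my own variation, not part of the original answer) is PostgreSQL's array_position function (available since 9.5), which returns the index of an element in an array, or NULL if it is absent; with nulls last, services missing from the list sort after the listed ones, and id breaks any remaining ties deterministically. A sketch for a priority list of webpage, then page, then website:

```sql
with dupes as (
  select
    id,
    row_number() over (
      partition by description_hash, latitude, longitude
      order by array_position(array['webpage', 'page', 'website'], service)
               nulls last,
               id
    ) as rn
  from my_table_data
)
delete from my_table_data m
where exists (
  select null
  from dupes d
  where
    d.id = m.id and
    d.rn > 1
);
```

Changing the priority simply means changing the array literal; no other part of the statement needs to be touched.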

Hope this gives you the building blocks you need to finish the task.
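To double-check the priority rule outside the database, here is a small pure-Python sketch (my own illustration, not part of the original answer) that replays a webpage > page > website priority list against the question's sample rows; the hashes are abbreviated to h1/h2 for readability, and the grouping key mirrors the partition used above:

```python
rows = [
    # (id, description_hash, latitude, longitude, service)
    (1, "h1", 48.85670, 2.35146, "website"),
    (2, "h1", 48.85670, 2.35146, "page"),
    (3, "h1", 48.85670, 2.35146, "site"),
    (4, "h1", 52.51704, 13.38886, "site"),
    (5, "h2", 52.51704, 13.38886, "page"),
    (6, "h2", 52.51704, 13.38886, "webpage"),
]

priority = ["webpage", "page", "website"]
rank = {service: i for i, service in enumerate(priority)}

# For each (hash, latitude, longitude) group keep the row whose service ranks
# highest; unlisted services get the worst rank, and the lower id wins ties so
# the outcome is deterministic.
best = {}
for rid, dhash, lat, lon, service in rows:
    key = (dhash, lat, lon)
    candidate = (rank.get(service, len(priority)), rid)
    if key not in best or candidate < best[key]:
        best[key] = candidate

kept = sorted(rid for _, rid in best.values())
print(kept)  # → [2, 4, 6]
```

The surviving ids 2, 4 and 6 match the question's first example result, which is a good sanity check before running the delete for real.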

