sql - Delete duplicates from a table in priority order
Problem description
I have a table with sample data:
+----+----------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
| id | url | description | description_hash | city | latitude | longitude | service | sid |
+----+----------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
| 1 | www.website.com/sdadsd12d1 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Paris | 48.85670 | 2.35146 | website | 1 |
| 2 | www.page.com/gfdg3df2f2 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Paris | 48.85670 | 2.35146 | page | 2 |
| 3 | www.site.com/sdjbhsjhd17 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Paris | 48.85670 | 2.35146 | site | 3 |
| 4 | www.site.com/sdsdadqwd12 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Berlin | 52.51704 | 13.38886 | site | 3 |
| 5 | www.page.com/dgdg2wg3 | Some description here version 2 | 764ed2b4f0d28e45332816c7beedb706 | Berlin | 52.51704 | 13.38886 | page | 2 |
| 6 | www.webpage.com/8f8fj2h | Some description here version 2 | 764ed2b4f0d28e45332816c7beedb706 | Berlin | 52.51704 | 13.38886 | webpage | 4 |
+----+----------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
My task is to delete duplicate rows. I want unique combinations of description_hash, service and latitude (city). Until today I have been using the following query:
update my_table_data
set description_hash = md5(description::text)
where description_hash is null;
DROP VIEW temp_view_duplicates;
CREATE VIEW temp_view_duplicates AS WITH A
AS (
SELECT Distinct
description_hash
, service
FROM my_table_data
)
, B
AS (
SELECT description_hash
FROM A
GROUP BY
description_hash
HAVING COUNT(*) > 1
), C
AS (
SELECT A.description_hash,
A.service
FROM A
JOIN B
ON A.description_hash = B.description_hash
order by description_hash
), D AS
(
select distinct latitude, description_hash, service
from my_table_data
where description_hash in (SELECT description_hash FROM C)
order by description_hash
), E AS
(SELECT description_hash, latitude
FROM D
GROUP BY
description_hash, latitude
HAVING COUNT(*) > 1)
SELECT min(ctid) as min_ctid, description_hash, latitude
FROM my_table_data
WHERE description_hash in (SELECT description_hash FROM E)
group by description_hash, latitude
order by description_hash;
DELETE FROM my_table_data a USING (
SELECT min_ctid, description_hash, latitude
FROM temp_view_duplicates
) b
WHERE a.description_hash = b.description_hash AND a.latitude = b.latitude
AND a.ctid <> b.min_ctid;
The results are fine, but the query deletes rows at random, based on ctid. Example result:
+----+----------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
| id | url | description | description_hash | city | latitude | longitude | service | sid |
+----+----------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
| 1 | www.website.com/sdadsd12d1 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Paris | 48.85670 | 2.35146 | website | 1 |
| 4 | www.site.com/sdsdadqwd12 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Berlin | 52.51704 | 13.38886 | site | 3 |
| 5 | www.page.com/dgdg2wg3 | Some description here version 2 | 764ed2b4f0d28e45332816c7beedb706 | Berlin | 52.51704 | 13.38886 | page | 2 |
+----+----------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
Now I want to change my query so that it takes an order (priority) of services (sid) into account when deleting.
Two example results with priority lists:
Priorities:
- webpage
- page
- site
- website
+----+--------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
| id | url | description | description_hash | city | latitude | longitude | service | sid |
+----+--------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
| 2 | www.page.com/gfdg3df2f2 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Paris | 48.85670 | 2.35146 | page | 2 |
| 4 | www.site.com/sdsdadqwd12 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Berlin | 52.51704 | 13.38886 | site | 3 |
| 6 | www.webpage.com/8f8fj2h | Some description here version 2 | 764ed2b4f0d28e45332816c7beedb706 | Berlin | 52.51704 | 13.38886 | webpage | 4 |
+----+--------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
Priorities:
- site
- page
- website
- webpage
+----+--------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
| id | url | description | description_hash | city | latitude | longitude | service | sid |
+----+--------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
| 3 | www.site.com/sdjbhsjhd17 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Paris | 48.85670 | 2.35146 | site | 3 |
| 4 | www.site.com/sdsdadqwd12 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Berlin | 52.51704 | 13.38886 | site | 3 |
| 5 | www.page.com/dgdg2wg3 | Some description here version 2 | 764ed2b4f0d28e45332816c7beedb706 | Berlin | 52.51704 | 13.38886 | page | 2 |
+----+--------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
In general:
- I use PostgreSQL
- I use an md5 hash because the descriptions are long and comparing them directly takes too much time
- I want to run this query daily on ~1M rows
Does anyone have an idea? I have been thinking about this all day and I am stuck. I considered custom-sorting the rows within each description_hash group.
Edit:
What do the priorities mean?
The current query deletes duplicate records randomly, i.e. based on ctid, so I have no control over which records get deleted.
My problem is that I want to control this and be able to define it with a priority list, from the most important site to the least important one.
The deletion logic should be: when you hit duplicates, check which sites they come from and keep the one that is highest on the priority list.
SQL for testing:
create table my_table_data(
id int,
url text,
description text,
description_hash text,
city text,
latitude double precision,
longitude double precision,
service text,
sid int
);
insert into my_table_data(id, url, description, description_hash, city, latitude, longitude, service, sid)
values(1, 'www.website.com/sdadsd12d1', 'Some description here version 1', '94b35433ecd64545db9c9129b877ea49', 'Paris', 48.85670, 2.35146, 'website', 1);
insert into my_table_data(id, url, description, description_hash, city, latitude, longitude, service, sid)
values(2, 'www.page.com/gfdg3df2f2', 'Some description here version 1', '94b35433ecd64545db9c9129b877ea49', 'Paris', 48.85670, 2.35146, 'page', 2);
insert into my_table_data(id, url, description, description_hash, city, latitude, longitude, service, sid)
values(3, 'www.site.com/sdjbhsjhd17', 'Some description here version 1', '94b35433ecd64545db9c9129b877ea49', 'Paris', 48.85670, 2.35146, 'site', 3);
insert into my_table_data(id, url, description, description_hash, city, latitude, longitude, service, sid)
values(4, 'www.site.com/sdsdadqwd12', 'Some description here version 1', '94b35433ecd64545db9c9129b877ea49', 'Berlin', 52.51704, 13.38886, 'site', 3);
insert into my_table_data(id, url, description, description_hash, city, latitude, longitude, service, sid)
values(5, 'www.page.com/dgdg2wg3', 'Some description here version 2', '764ed2b4f0d28e45332816c7beedb706', 'Berlin', 52.51704, 13.38886, 'page', 2);
insert into my_table_data(id, url, description, description_hash, city, latitude, longitude, service, sid)
values(6, 'www.webpage.com/8f8fj2h', 'Some description here version 2', '764ed2b4f0d28e45332816c7beedb706', 'Berlin', 52.51704, 13.38886, 'webpage', 4);
Solution
I am not sure how to answer this other than in general terms, but let me give it a try.
My best guess is that your definition of a "duplicate" is the same description (or description hash) at the same location (latitude/longitude or city).
If that is the case, then to delete those records while prioritizing some of them by a defined, sortable criterion, you should be able to use the row_number window function.
For example, this query prioritizes the lowest sid: anything with row number = 1 is kept, and everything else gets deleted. If you want a criterion other than sid, or secondary/tertiary criteria, just add them to the "order by":
select
id, url, description, description_hash, city, latitude, longitude, service, sid,
row_number() over (partition by description_hash, latitude, longitude order by sid) as rn
from my_table_data
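To encode an explicit priority list instead of ordering by sid, Postgres's array_position can turn each service name into a rank. A sketch using the first priority list from the question (webpage, page, site, website); a service missing from the array gets a NULL rank, which sorts last by default, so unknown services are deleted first:

```sql
select
    id, url, service, sid,
    row_number() over (
        partition by description_hash, latitude, longitude
        order by array_position(
            array['webpage', 'page', 'site', 'website'], service)
    ) as rn
from my_table_data;
```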
In that case, the records can be deleted in a single step:
with dupes as (
select
id, row_number() over (partition by description_hash, latitude, longitude order by sid) as rn
from my_table_data
)
delete from my_table_data m
where exists (
select null
from dupes d
where
d.id = m.id and
d.rn > 1
)
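Putting the two together, a delete driven by a priority list might look like this (a sketch; swap in the second priority list's array literal to get the question's other expected result). Against the test data above, this keeps ids 2, 4 and 6, matching the first expected result:

```sql
with dupes as (
    select
        id,
        row_number() over (
            partition by description_hash, latitude, longitude
            order by array_position(
                array['webpage', 'page', 'site', 'website'], service)
        ) as rn
    from my_table_data
)
delete from my_table_data m
using dupes d
where d.id = m.id
  and d.rn > 1;
```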
Window functions and semi-joins are both very efficient; 1M records should be fast.
Hopefully this gives you the building blocks to get the job done.