sql - Deleting duplicate records from an unindexed table with PostgreSQL
Problem description
I have a table that looks like this:
Subject_id  subject_name  Standard  Rank  Previous_subject_id
13          ABC           1st       1     21
13          ABC           1st       1     23
13          ABC           1st       1     13
25          def           3rd       6     42
25          def           3rd       6     25
25          def           3rd       6     28
25          XYZ           2nd       7     26
29          PQR           1st       1     31
As you can see, the duplicated rows are identical in every column except previous_subject_id.
Rule 1
I want to delete every row that satisfies subject_id = previous_subject_id.
Rule 2
If duplicate subject_ids still remain after rule 1, keep only the first record that occurs.
As you can see in the sample output below, I kept only the first record that occurred.
I would like the output to look like this:
Subject_id  subject_name  Standard  Rank  Previous_subject_id
13          ABC           1st       1     21
25          def           3rd       6     42
25          XYZ           2nd       7     26
29          PQR           1st       1     31
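The only problem is that my table has 285,000 records and is not indexed. Once the duplicates are deleted, I will be able to create an index on subject_id, because the values will then be unique.
Here is what I tried: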
select * from subject_class a
inner join
subject_class b
on a.subject_id = b.previous_subject_id
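However, the query above keeps running for a very long time because of the missing index. Is there a more efficient way to do this?
And how do I actually discard the rows?
Could you please help me?
Solution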
I don't understand why you are using a JOIN when this looks like a simple delete:
DELETE FROM subject_class WHERE subject_id = previous_subject_id;
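Also, 285,000 rows is not a lot, so performance should be fine. However, 285,000 * 285,000 (about 81 billion) is a big number, and that is roughly how many row pairs your JOIN query may have to work through.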
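OK, now we have a problem. In a relational database there is no concept of "first" or "last". Rows have no inherent order unless you tell the database what to order them by. In your example, you intuitively picked two rows from the list purely because that is the order in which they appeared when you listed them. That order, however, is not guaranteed. It may well be the order in which the data was inserted into the heap (the unindexed table), but that is almost impossible to reproduce reliably and is beyond the scope of this question.
What I can do is give you a deterministic way to delete the rows. Because this is a bit more involved, I'll set up some test data: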
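To make the point about ordering concrete, here is a minimal PostgreSQL sketch (not part of the original answer). It assumes a hypothetical temp table named subject_class_demo holding a few of the rows from the question, and it defines "first" explicitly as "smallest previous_subject_id" through the ORDER BY inside a window function. That choice of ordering is an assumption; any other column or rule could be substituted.

```sql
-- Hypothetical demo table; the name subject_class_demo is an assumption.
-- "standard" and "rank" are double-quoted only as a precaution.
CREATE TEMP TABLE subject_class_demo (
    subject_id          int,
    subject_name        varchar(20),
    "standard"          varchar(20),
    "rank"              int,
    previous_subject_id int
);

INSERT INTO subject_class_demo VALUES
    (13, 'ABC', '1st', 1, 21),
    (13, 'ABC', '1st', 1, 23),
    (25, 'def', '3rd', 6, 42),
    (25, 'def', '3rd', 6, 28);

-- "First" only has meaning relative to the ORDER BY inside the window:
-- within each subject_id, the row with the smallest previous_subject_id
-- is numbered 1, the next one 2, and so on.
SELECT subject_id,
       previous_subject_id,
       ROW_NUMBER() OVER (
           PARTITION BY subject_id
           ORDER BY previous_subject_id
       ) AS rn
FROM subject_class_demo
ORDER BY subject_id, rn;
```

With this ordering, the row 25/28 is numbered before 25/42, which is why a rule based on the minimum value keeps a different row than the one picked by eye in the question.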
DECLARE @subject_class TABLE (
subject_id INT,
subject_name VARCHAR(20),
[standard] VARCHAR(20),
[rank] INT,
previous_subject_id INT);
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 21;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 23;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 13;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 42;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 25;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 28;
This is basically your setup: the data you listed, in a table with no indexes.
The first part is easy:
DELETE FROM @subject_class WHERE subject_id = previous_subject_id; --fixes 2 records
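The setup and first delete above use table-variable syntax (DECLARE @subject_class TABLE, square-bracket identifiers), which is SQL Server's T-SQL rather than PostgreSQL. Since the question is about PostgreSQL, here is a rough equivalent as a sketch. The temp table name subject_class_demo is made up for illustration, and "standard" and "rank" are double-quoted purely as a precaution.

```sql
-- Hypothetical PostgreSQL stand-in for the @subject_class table variable.
CREATE TEMP TABLE subject_class_demo (
    subject_id          int,
    subject_name        varchar(20),
    "standard"          varchar(20),
    "rank"              int,
    previous_subject_id int
);

-- Same six rows as the T-SQL test data.
INSERT INTO subject_class_demo VALUES
    (13, 'ABC', '1st', 1, 21),
    (13, 'ABC', '1st', 1, 23),
    (13, 'ABC', '1st', 1, 13),
    (25, 'def', '3rd', 6, 42),
    (25, 'def', '3rd', 6, 25),
    (25, 'def', '3rd', 6, 28);

-- Rule 1: remove rows that point at themselves (13/13 and 25/25).
DELETE FROM subject_class_demo
WHERE subject_id = previous_subject_id;

-- Four rows should remain: 13/21, 13/23, 25/42, 25/28.
SELECT * FROM subject_class_demo
ORDER BY subject_id, previous_subject_id;
```

This statement needs a single pass over the table and no join, so the missing index should not matter much for it.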
The second part is a bit more complicated, so I used a common table expression:
WITH cte AS (
SELECT
subject_id,
MIN(previous_subject_id) AS min_previous_subject_id
FROM
@subject_class
GROUP BY
subject_id)
DELETE
s
FROM
@subject_class s
INNER JOIN cte c ON c.subject_id = s.subject_id AND c.min_previous_subject_id != s.previous_subject_id;
SELECT * FROM @subject_class;
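This first determines the minimum previous_subject_id for each subject_id and assumes that this is the only row we want to keep. There are plenty of other ways to do this: you could pick the highest value, or come up with some more complex rule.
This does not give you exactly what you asked for. Instead, the result you get is: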
subject_id  subject_name  standard  rank  previous_subject_id
13          ABC           1st       1     21
25          def           3rd       6     28
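However, this is deterministic, in that you get the same result every time you run the query.
You wanted the query to delete rows only when the "other" fields match, so here it is: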
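The DELETE s FROM ... INNER JOIN form used above is T-SQL. PostgreSQL expresses the same idea with DELETE ... USING. The following is a sketch under the same assumptions as the answer (keep the minimum previous_subject_id per subject_id); the temp table subject_class_demo is a hypothetical stand-in for the real table.

```sql
-- Hypothetical demo table with the six test rows.
CREATE TEMP TABLE subject_class_demo (
    subject_id          int,
    subject_name        varchar(20),
    "standard"          varchar(20),
    "rank"              int,
    previous_subject_id int
);

INSERT INTO subject_class_demo VALUES
    (13, 'ABC', '1st', 1, 21),
    (13, 'ABC', '1st', 1, 23),
    (13, 'ABC', '1st', 1, 13),
    (25, 'def', '3rd', 6, 42),
    (25, 'def', '3rd', 6, 25),
    (25, 'def', '3rd', 6, 28);

-- Rule 1: drop self-referencing rows.
DELETE FROM subject_class_demo
WHERE subject_id = previous_subject_id;

-- Rule 2 (as interpreted by the answer): per subject_id, keep only
-- the row with the smallest previous_subject_id.
WITH cte AS (
    SELECT subject_id,
           MIN(previous_subject_id) AS min_previous_subject_id
    FROM subject_class_demo
    GROUP BY subject_id
)
DELETE FROM subject_class_demo AS s
USING cte AS c
WHERE c.subject_id = s.subject_id
  AND c.min_previous_subject_id <> s.previous_subject_id;

-- Remaining rows: 13/21 and 25/28, matching the result shown above.
SELECT * FROM subject_class_demo
ORDER BY subject_id;
```

The GROUP BY scans the table once, so this avoids the row-by-row self-join from the question.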
DECLARE @subject_class TABLE (
subject_id INT,
subject_name VARCHAR(20),
[standard] VARCHAR(20),
[rank] INT,
previous_subject_id INT);
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 21;
INSERT INTO @subject_class SELECT 13, 'ABF', '1st', 1, 23;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 13;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 42;
INSERT INTO @subject_class SELECT 25, 'dez', '3rd', 6, 25;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 28;
DELETE FROM @subject_class WHERE subject_id = previous_subject_id;
WITH cte AS (
SELECT
subject_id,
subject_name,
[standard],
[rank],
MIN(previous_subject_id) AS min_previous_subject_id
FROM
@subject_class
GROUP BY
subject_id,
subject_name,
[standard],
[rank])
DELETE
s
FROM
@subject_class s
INNER JOIN cte c ON c.subject_id = s.subject_id
AND c.subject_name = s.subject_name
AND c.[standard] = s.[standard]
AND c.[rank] = s.[rank]
WHERE
c.min_previous_subject_id != s.previous_subject_id;
SELECT * FROM @subject_class;
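This time we end up with 3 rows:
- The "dez" row is still deleted, because its subject_id and previous_subject_id are the same.
- The "ABF" row is kept, because its subject name does not match.
This time with the updated data: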
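For PostgreSQL, the same grouped delete might look like the sketch below. It uses the modified test data from this step and adds RETURNING, a PostgreSQL feature that reports which rows were removed, which can be handy for checking a delete like this before trusting it on 285,000 rows. The table name subject_class_demo is again an assumption.

```sql
-- Hypothetical demo table with the 'ABF' / 'dez' variant of the data.
CREATE TEMP TABLE subject_class_demo (
    subject_id          int,
    subject_name        varchar(20),
    "standard"          varchar(20),
    "rank"              int,
    previous_subject_id int
);

INSERT INTO subject_class_demo VALUES
    (13, 'ABC', '1st', 1, 21),
    (13, 'ABF', '1st', 1, 23),
    (13, 'ABC', '1st', 1, 13),
    (25, 'def', '3rd', 6, 42),
    (25, 'dez', '3rd', 6, 25),
    (25, 'def', '3rd', 6, 28);

-- Rule 1: drop self-referencing rows (13/13 and the 'dez' row 25/25).
DELETE FROM subject_class_demo
WHERE subject_id = previous_subject_id;

-- Rule 2: rows count as duplicates only if all other columns match.
WITH cte AS (
    SELECT subject_id,
           subject_name,
           "standard",
           "rank",
           MIN(previous_subject_id) AS min_previous_subject_id
    FROM subject_class_demo
    GROUP BY subject_id, subject_name, "standard", "rank"
)
DELETE FROM subject_class_demo AS s
USING cte AS c
WHERE c.subject_id   = s.subject_id
  AND c.subject_name = s.subject_name
  AND c."standard"   = s."standard"
  AND c."rank"       = s."rank"
  AND c.min_previous_subject_id <> s.previous_subject_id
RETURNING s.*;  -- lists the deleted rows (here: 25, 'def', '3rd', 6, 42)
```

Note that the equality joins skip rows where any of the compared columns is NULL; if the real table contains NULLs, IS NOT DISTINCT FROM would be one way to handle them.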
DECLARE @subject_class TABLE (
subject_id INT,
subject_name VARCHAR(20),
[standard] VARCHAR(20),
[rank] INT,
previous_subject_id INT);
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 21;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 23;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 13;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 42;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 25;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 28;
INSERT INTO @subject_class SELECT 25, 'XYZ', '2nd', 7, 26;
INSERT INTO @subject_class SELECT 29, 'PQR', '1st', 1, 31;
DELETE FROM @subject_class WHERE subject_id = previous_subject_id;
WITH cte AS (
SELECT
subject_id,
subject_name,
[standard],
[rank],
MIN(previous_subject_id) AS min_previous_subject_id
FROM
@subject_class
GROUP BY
subject_id,
subject_name,
[standard],
[rank])
DELETE
s
FROM
@subject_class s
INNER JOIN cte c ON c.subject_id = s.subject_id
AND c.subject_name = s.subject_name
AND c.[standard] = s.[standard]
AND c.[rank] = s.[rank]
WHERE
c.min_previous_subject_id != s.previous_subject_id;
SELECT * FROM @subject_class;
I get the following result:
subject_id  subject_name  standard  rank  previous_subject_id
13          ABC           1st       1     21
25          def           3rd       6     28
25          XYZ           2nd       7     26
29          PQR           1st       1     31
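Does that match what you expected? Not exactly, but that is because you are still relying on "first" when no such concept exists. I get the same number of rows and essentially the same result; I just chose different rows to keep than you did.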
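As an alternative sketch for PostgreSQL (not the approach used in this answer), rule 2 can also be written with ROW_NUMBER() and the ctid system column, which identifies a row's physical location in a table that has no key. Unlike the MIN-based delete, this also collapses rows that are identical in every column, including previous_subject_id. The table name subject_class_demo and the index name are assumptions, and the ordering by previous_subject_id is the same arbitrary but deterministic rule as above.

```sql
-- Hypothetical demo table with the full data set from the question.
CREATE TEMP TABLE subject_class_demo (
    subject_id          int,
    subject_name        varchar(20),
    "standard"          varchar(20),
    "rank"              int,
    previous_subject_id int
);

INSERT INTO subject_class_demo VALUES
    (13, 'ABC', '1st', 1, 21),
    (13, 'ABC', '1st', 1, 23),
    (13, 'ABC', '1st', 1, 13),
    (25, 'def', '3rd', 6, 42),
    (25, 'def', '3rd', 6, 25),
    (25, 'def', '3rd', 6, 28),
    (25, 'XYZ', '2nd', 7, 26),
    (29, 'PQR', '1st', 1, 31);

-- Rule 1: drop self-referencing rows.
DELETE FROM subject_class_demo
WHERE subject_id = previous_subject_id;

-- Rule 2: number the rows inside each duplicate group and delete
-- everything except row number 1, matching rows by ctid.
DELETE FROM subject_class_demo AS s
USING (
    SELECT ctid AS row_location,
           ROW_NUMBER() OVER (
               PARTITION BY subject_id, subject_name, "standard", "rank"
               ORDER BY previous_subject_id
           ) AS rn
    FROM subject_class_demo
) AS d
WHERE s.ctid = d.row_location
  AND d.rn > 1;

-- Index subject_id once the cleanup is done (index name is made up).
CREATE INDEX subject_class_demo_subject_id_idx
    ON subject_class_demo (subject_id);

-- Remaining rows: 13/21, 25/28 (def), 25/26 (XYZ), 29/31.
SELECT * FROM subject_class_demo
ORDER BY subject_id, subject_name;
```

The index here is deliberately not UNIQUE: in both the expected output and this result, subject_id 25 appears twice (once for "def" and once for "XYZ"), so a unique index on subject_id alone would be rejected unless those rows are also reduced to one.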