Deleting duplicate records from an unindexed table using PostgreSQL

Problem description

I have a table that looks like this:

Subject_id  subject_name  Standard  Rank  Previous_subject_id
13          ABC           1st       1     21
13          ABC           1st       1     23
13          ABC           1st       1     13
25          def           3rd       6     42
25          def           3rd       6     25
25          def           3rd       6     28
25          XYZ           2nd       7     26
29          PQR           1st       1     31

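As you can see, all columns and values are identical except for the previous_subject_id column (and one row).

Rule 1

What I would like to do is delete all rows where subject_id = previous_subject_id.

Rule 2

If there are still duplicate subject_ids after rule 1, keep only the first (occurring) record.

As you can see in the sample output below, I have kept only the first occurring record.

I would like the output to look like this: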

Subject_id  subject_name  Standard  Rank  Previous_subject_id
13          ABC           1st       1     21
25          def           3rd       6     42
25          XYZ           2nd       7     26
29          PQR           1st       1     31

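The only problem is that my table has 285,000 records and is not indexed. Once the records are deleted, I will be able to put an index on subject_id, as the values become unique.

This is what I tried: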

select * from subject_class a
inner join 
subject_class b
on a.subject_id = b.previous_subject_id

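The above query keeps running for a very long time because of the missing index. Is there a more efficient way to do this?

And how do I then discard those rows?

Could you please help me?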

Tags: sql, postgresql, indexing, psql, sql-delete

Solution


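I don't understand why you are using a JOIN when this looks as simple as:

DELETE FROM subject_class WHERE subject_id = previous_subject_id;

Also, 285,000 rows is not a lot, so performance should be fine. However, 285,000 * 285,000 (about 81 billion) is a big number, and that is basically the number of row pairs your JOIN query has to work through if every row is compared with every other row.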


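Okay, now we have a problem. In a relational database there is no concept of "first" or "last". Rows have no inherent order unless you tell the database what to order them by. In your example you picked two rows from the list by eye, purely because that is the order they appeared in when you listed them. That order, however, is not guaranteed. It may well be the order in which the data was inserted into the heap (the unindexed table), but that is almost impossible to reproduce and is outside the scope of this question.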
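To make the point about ordering concrete, here is a minimal PostgreSQL sketch, assuming the table is called subject_class as in the question. It numbers the rows of each subject_id according to an explicit ORDER BY, so "first" means whatever that ordering says. Ordering by previous_subject_id is only an illustrative choice; any column or expression that ranks the rows unambiguously would do.

```sql
-- Sketch: "first" only exists relative to an explicit ORDER BY.
-- Assumes the question's table subject_class; the ordering column
-- (previous_subject_id) is an illustrative choice, not a requirement.
SELECT
    s.*,
    ROW_NUMBER() OVER (
        PARTITION BY s.subject_id
        ORDER BY s.previous_subject_id   -- decides which row gets rn = 1
    ) AS rn
FROM subject_class AS s
ORDER BY s.subject_id, rn;
```

Rows with rn = 1 are then the "first" row of each subject_id under that ordering, and the choice is repeatable as long as the ordering column has no ties within a group.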

What I can do is offer a deterministic way to delete rows. Because this is a little more involved, I will set up some test data:

DECLARE @subject_class TABLE (
    subject_id INT,
    subject_name VARCHAR(20),
    [standard] VARCHAR(20),
    [rank] INT,
    previous_subject_id INT);
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 21;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 23;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 13;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 42;  
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 25;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 28;

This is basically your setup: the data you listed, in a table with no indexes.

The first part is easy:

DELETE FROM @subject_class WHERE subject_id = previous_subject_id; --fixes 2 records

The second part is slightly more complicated, so I used a common table expression:

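-- For each subject_id, find the smallest previous_subject_id (the row to keep)
WITH cte AS (
    SELECT
        subject_id,
        MIN(previous_subject_id) AS min_previous_subject_id
    FROM
        @subject_class
    GROUP BY
        subject_id)
-- Delete every row of that subject_id whose previous_subject_id is not the minimum
DELETE
    s
FROM
    @subject_class s
    INNER JOIN cte c ON c.subject_id = s.subject_id AND c.min_previous_subject_id != s.previous_subject_id;
SELECT * FROM @subject_class;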

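This first determines the minimum previous_subject_id for each subject_id and assumes that this is the only row we want to keep. There are plenty of other ways to do this: you could pick the highest value instead, or come up with more complex rules.

This does not give you exactly what you asked for. Instead, this is the result you get: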

subject_id  subject_name    standard    rank    previous_subject_id
13          ABC             1st         1       21
25          def             3rd         6       28

However, it is deterministic, in that you get the same result every time you run the query.
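The snippets in this answer use SQL Server syntax (DECLARE @... TABLE, bracketed identifiers), which PostgreSQL does not accept. Since the question is about PostgreSQL, here is a hedged sketch of the same two steps in that dialect. The temporary table name subject_class_demo is made up for illustration so that the real table is not touched, and "standard" and "rank" are double-quoted purely as a precaution.

```sql
-- PostgreSQL sketch of the two steps, run against a scratch copy of the sample data.
-- subject_class_demo is a hypothetical name used only for this demonstration.
CREATE TEMP TABLE subject_class_demo (
    subject_id          integer,
    subject_name        varchar(20),
    "standard"          varchar(20),
    "rank"              integer,
    previous_subject_id integer
);

INSERT INTO subject_class_demo VALUES
    (13, 'ABC', '1st', 1, 21),
    (13, 'ABC', '1st', 1, 23),
    (13, 'ABC', '1st', 1, 13),
    (25, 'def', '3rd', 6, 42),
    (25, 'def', '3rd', 6, 25),
    (25, 'def', '3rd', 6, 28);

-- Rule 1: a single pass over the table, no join needed.
DELETE FROM subject_class_demo
WHERE subject_id = previous_subject_id;

-- Rule 2: per subject_id, keep only the row with the smallest previous_subject_id.
DELETE FROM subject_class_demo AS s
USING (
    SELECT subject_id, MIN(previous_subject_id) AS min_previous_subject_id
    FROM subject_class_demo
    GROUP BY subject_id
) AS c
WHERE c.subject_id = s.subject_id
  AND c.min_previous_subject_id <> s.previous_subject_id;

SELECT * FROM subject_class_demo ORDER BY subject_id;
```

PostgreSQL has no DELETE ... FROM ... JOIN form, so the join against the aggregated rows is written with DELETE ... USING instead. On the sample rows this should leave the same two rows as the result shown above (13 with 21, and 25 with 28).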


You want the query to delete only rows where the "other" fields match, so here it is:

DECLARE @subject_class TABLE (
    subject_id INT,
    subject_name VARCHAR(20),
    [standard] VARCHAR(20),
    [rank] INT,
    previous_subject_id INT);
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 21;
INSERT INTO @subject_class SELECT 13, 'ABF', '1st', 1, 23;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 13;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 42;  
INSERT INTO @subject_class SELECT 25, 'dez', '3rd', 6, 25;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 28;

DELETE FROM @subject_class WHERE subject_id = previous_subject_id;

WITH cte AS (
    SELECT
        subject_id,
        subject_name,
        [standard],
        [rank],
        MIN(previous_subject_id) AS min_previous_subject_id
    FROM
        @subject_class
    GROUP BY
        subject_id,
        subject_name,
        [standard],
        [rank])
DELETE
    s
FROM
    @subject_class s
    INNER JOIN cte c ON c.subject_id = s.subject_id 
        AND c.subject_name = s.subject_name 
        AND c.[standard] = s.[standard]
        AND c.[rank] = s.[rank]
WHERE
    c.min_previous_subject_id != s.previous_subject_id;

SELECT * FROM @subject_class;

This time we end up with 3 rows:
- the "dez" row is still deleted, because its subject_id and previous_subject_id are the same;
- the "ABF" row is kept, because its subject name does not match.


This time with the updated data:

DECLARE @subject_class TABLE (
    subject_id INT,
    subject_name VARCHAR(20),
    [standard] VARCHAR(20),
    [rank] INT,
    previous_subject_id INT);
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 21;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 23;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 13;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 42;  
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 25;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 28;
INSERT INTO @subject_class SELECT 25, 'XYZ', '2nd', 7, 26;
INSERT INTO @subject_class SELECT 29, 'PQR', '1st', 1, 31;

DELETE FROM @subject_class WHERE subject_id = previous_subject_id;

WITH cte AS (
    SELECT
        subject_id,
        subject_name,
        [standard],
        [rank],
        MIN(previous_subject_id) AS min_previous_subject_id
    FROM
        @subject_class
    GROUP BY
        subject_id,
        subject_name,
        [standard],
        [rank])
DELETE
    s
FROM
    @subject_class s
    INNER JOIN cte c ON c.subject_id = s.subject_id 
        AND c.subject_name = s.subject_name 
        AND c.[standard] = s.[standard]
        AND c.[rank] = s.[rank]
WHERE
    c.min_previous_subject_id != s.previous_subject_id;

SELECT * FROM @subject_class;

I get the following result:

subject_id  subject_name    standard    rank    previous_subject_id
13          ABC             1st         1       21
25          def             3rd         6       28
25          XYZ             2nd         7       26
29          PQR             1st         1       31

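Does that match what you expected? Not exactly, but that is because you are still relying on "first" when no such concept exists. I get the same number of rows and essentially the same result. I just chose different rows to keep than you did.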
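For completeness, here is one way the final version might look in PostgreSQL, run directly against the question's subject_class table. This is a sketch rather than a tested migration: it swaps the MIN-based join for ROW_NUMBER() combined with the system column ctid (the physical row identifier), which also removes rows that are identical in every column, something the MIN comparison would leave in place. Wrapping it in a transaction is an added precaution so the result can be inspected before anything is made permanent.

```sql
-- Sketch for PostgreSQL; assumes the question's table subject_class.
BEGIN;

-- Rule 1: drop rows that point at themselves.
DELETE FROM subject_class
WHERE subject_id = previous_subject_id;

-- Rule 2: within each group of otherwise identical rows, keep the row with the
-- smallest previous_subject_id (rn = 1) and delete the rest.
-- ctid identifies the physical row, so exact duplicates can be told apart.
DELETE FROM subject_class AS s
USING (
    SELECT
        ctid AS row_ctid,
        ROW_NUMBER() OVER (
            PARTITION BY subject_id, subject_name, "standard", "rank"
            ORDER BY previous_subject_id
        ) AS rn
    FROM subject_class
) AS d
WHERE s.ctid = d.row_ctid
  AND d.rn > 1;

-- Inspect what is left before deciding.
SELECT * FROM subject_class ORDER BY subject_id, previous_subject_id;

ROLLBACK;  -- replace with COMMIT once the remaining rows look right
```

On the sample data this should keep the same four rows as the result above. Changing the ORDER BY inside the window is the place to encode a different notion of which row to keep.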

