Removing duplicates from Postgres

Problem description

I need help writing a query that selects the rows with a duplicate productID. The table has 4 columns:

id,property_id,status,price
20,13356,sold,200000
24,78436,sold,730000
12504,13356,sold,200000
...

I currently have the following Python script:

from psycopg2.extensions import AsIs
import psycopg2
import psycopg2.extras  # needed for psycopg2.extras.DictCursor below
conn = psycopg2.connect(...)
cursor = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)


def get_dict_sql(cur, query, single=False):
    cur.execute(query)
    if single:
        return dict(cur.fetchone())
    z = cur.fetchall()
    return [dict(row) for row in z]


columns = ['property_id', 'status', 'price']
seen = set()
rows = get_dict_sql(cursor, "SELECT * FROM listings")
insert_statement = 'insert into listings_temp (%s) values %s'
for row in rows:
    if row['product_id'] in seen:
        continue
    seen.add(row['product_id'])
    values = [row[column] for column in columns]
    # Fill in the column list and the values first, then execute the rendered statement once
    q2 = cursor.mogrify(insert_statement, (AsIs(','.join(columns)), tuple(values)))
    cursor.execute(q2)

conn.commit()

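I created a new table to hold the deduplicated data and started this script 26 hours ago, but it still has not finished. Is there a way to select only the rows whose product_id is duplicated? Or, better yet, a query that runs directly in Postgres?

Tags: postgresql, psycopg2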

Solution

The PostgreSQL way to get the duplicates:

Demo: db<>fiddle

This gives you the duplicates:

SELECT
    *
FROM (
    SELECT
        *,
        row_number() OVER (PARTITION BY product_id)
    FROM
        listings
) s
WHERE row_number >= 2

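The row_number() window function adds a running row count to each element of a group (here the groups are defined by PARTITION BY product_id). With that, you can fetch only the rows whose row number is >= 2. Because the window has no ORDER BY, which row of each group receives number 1 (and is therefore kept) is not guaranteed; add an ORDER BY inside OVER (...) if you need to control that.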
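To see the numbering at work without a Postgres server, here is a minimal sketch that runs the same kind of query against an in-memory SQLite database from Python. SQLite is only a stand-in here (it needs version 3.25 or later for window functions); the table layout, the sample rows (modeled on the question), the `rn` alias, the ORDER BY id inside the window, and the helper names `make_demo_db` and `find_duplicates` are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical sample data modeled on the rows shown in the question.
SAMPLE_ROWS = [
    (20, 13356, "sold", 200000),
    (24, 78436, "sold", 730000),
    (12504, 13356, "sold", 200000),
]

# Same idea as the answer's query. ORDER BY id inside the window is an
# added assumption so that the lowest id of each group gets number 1.
DUPLICATES_QUERY = """
SELECT id, product_id
FROM (
    SELECT
        *,
        row_number() OVER (PARTITION BY product_id ORDER BY id) AS rn
    FROM listings
) s
WHERE rn >= 2
ORDER BY id
"""


def make_demo_db():
    """Create an in-memory database holding the sample listings."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE listings ("
        "id INTEGER PRIMARY KEY, product_id INTEGER, status TEXT, price INTEGER)"
    )
    conn.executemany("INSERT INTO listings VALUES (?, ?, ?, ?)", SAMPLE_ROWS)
    return conn


def find_duplicates(conn):
    """Return (id, product_id) for every row numbered 2 or higher in its group."""
    return conn.execute(DUPLICATES_QUERY).fetchall()


if __name__ == "__main__":
    print(find_duplicates(make_demo_db()))
```

With the three sample rows, only the second listing for product_id 13356 is reported, since the first one in each group gets row number 1.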

To remove the fetched records directly, you can combine the SELECT statement with a DELETE statement:

Step-by-step demo: db<>fiddle

DELETE FROM t
WHERE id IN
(
    SELECT
        id
    FROM (
        SELECT
            *,
            row_number() OVER (PARTITION BY product_id)
        FROM
            t
    ) s
    WHERE row_number >= 2
);
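The effect of the DELETE can be checked in the same way. The following is a rough sketch, again using an in-memory SQLite database (3.25 or later) as a stand-in for Postgres; the table name `listings`, the sample rows, the `rn` alias, the ORDER BY id inside the window, and the helper names `make_listings_db` and `delete_duplicates` are assumptions for illustration, not part of the original answer.

```python
import sqlite3

# Delete every row that is numbered 2 or higher within its product_id group.
# ORDER BY id is an added assumption: it makes the lowest id the survivor.
DELETE_DUPLICATES = """
DELETE FROM listings
WHERE id IN (
    SELECT id
    FROM (
        SELECT
            id,
            row_number() OVER (PARTITION BY product_id ORDER BY id) AS rn
        FROM listings
    ) s
    WHERE rn >= 2
)
"""


def make_listings_db():
    """Create an in-memory table with hypothetical rows modeled on the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE listings ("
        "id INTEGER PRIMARY KEY, product_id INTEGER, status TEXT, price INTEGER)"
    )
    conn.executemany(
        "INSERT INTO listings VALUES (?, ?, ?, ?)",
        [
            (20, 13356, "sold", 200000),
            (24, 78436, "sold", 730000),
            (12504, 13356, "sold", 200000),
        ],
    )
    return conn


def delete_duplicates(conn):
    """Run the DELETE in one statement and return the number of removed rows."""
    cur = conn.execute(DELETE_DUPLICATES)
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    db = make_listings_db()
    print("removed:", delete_duplicates(db))
    print("remaining ids:", [r[0] for r in db.execute("SELECT id FROM listings ORDER BY id")])
```

Doing the work in a single statement keeps all of the row comparison inside the database, instead of fetching every row into Python and inserting the survivors one at a time as the script in the question does.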
