首页 > 解决方案 > 在跳过重复项时将 CSV 导入到 postgreSQL

问题描述

我是一个完全的 Python(和编码)初学者,所以这可能很难看。

我有一个 CSV 文件,我想将它导入我的 postgresSQL 数据库。CSV 有大量我不想要的重复项。我相信我可以很好地阅读 CSV,并可以很好地添加到数据库中,但是我在跳过重复项时遇到了麻烦。每次我运行下面的代码时,我都会插入一行,然后它就会失败。

我现在只看关键,但是一旦这个工作正常,还有一大堆其他列,添加 [...]

# Setup

import csv
import psycopg2


# Read a value from the CSV to see if it's in dbItems

with open('meh_0.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)

    connection = psycopg2.connect("host=localhost dbname=postgres user=postgres port=5433 password=removed")
    cursor = connection.cursor()
    cursor.execute('SELECT handid FROM handlist')
    dbItems = cursor.fetchall()

    print(dbItems)

    for i in range(0, 200):
        rowKey = next(reader)
        print('rowKey[0] is: ' + rowKey[0])

        found = False
        for row in dbItems:
            for element in row:
                if element == int(rowKey[0]):
                    found = True
                    break

            if found:
                break


# Then either add to the DB or skip

        if not found:
            print(rowKey[0] + ' NOT found in dbItems\n')
            sqlCommand = 'INSERT INTO handlist VALUES (' + rowKey[0] + ')'
            cursor.execute(sqlCommand)
            connection.commit()

        else:
            print(rowKey[0] + ' is found in dbItems\n')

我可能已经将一些不需要的东西移到了我的“while”循环中,我试图看看发生了什么变化。哦,最大 200 的范围是任意的,CSV 文件很大。

错误:

rowKey[0] is: 34756717
34756717 is found in dbItems

rowKey[0] is: 34756717
34756717 is found in dbItems

rowKey[0] is: 34756717
34756717 is found in dbItems

rowKey[0] is: 34756718
34756718 NOT found in dbItems

rowKey[0] is: 34756718
34756718 NOT found in dbItems

Traceback (most recent call last):
  File "C:/Python/MyPythonScripts/RIO r5.py", line 40, in <module>
    cursor.execute(sqlCommand)
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "handlist_pkey"
DETAIL:  Key (handid)=(34756718) already exists.

>>> 

所以它跳过了我之前运行它时添加的所有键,添加了新的,但是当它迭代循环时不会跳过新的。

大多数情况下,我想知道为什么它不起作用。但我想也有很多更简单的方法可以做到这一点,如果需要,我很乐意复制这些方法。

标签: pythonpython-3.xcsv

解决方案


您可以将现有密钥添加到set,然后检查 csv 行中的密钥是否是该集合的成员。检查集合成员的成本与集合的大小无关,因此在这里使用集合是一个很好的数据结构。如果 csv 键在集合中,我们移动到下一行,否则我们将该行添加到数据库并将键添加到集合中。

# Setup

import csv
import psycopg2


# Read a value from the CSV to see if it's in dbItems

with open('meh_0.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)

    connection = psycopg2.connect("host=localhost dbname=postgres user=postgres port=5433 password=removed")
    cursor = connection.cursor()
    cursor.execute('SELECT handid FROM handlist')
    dbItems = cursor.fetchall()

    # Build a set of existing keys
    existing = {k for k, in dbItems}


    for i in range(0, 200):
        rowKey = next(reader)
        print('rowKey[0] is: ' + rowKey[0])

        # Database keys are ints, csv values are strings...
        candidate = int(rowKey[0])

        if candidate in existing:
            # back to the top of the for loop
            continue

        # Add to the DB

        # Use the recommended way of building queries
        # https://www.psycopg.org/docs/usage.html#passing-parameters-to-sql-queries
        sqlCommand = 'INSERT INTO handlist VALUES (%s)'
        cursor.execute(sqlCommand, (candidate,))
        connection.commit()

        # Add our key to the set
        existing.add(candidate)

推荐阅读