python - 在跳过重复项时将 CSV 导入到 postgreSQL
问题描述
我是一个完全的 Python(和编码)初学者,所以这可能很难看。
我有一个 CSV 文件,我想将它导入我的 postgresSQL 数据库。CSV 有大量我不想要的重复项。我相信我可以很好地阅读 CSV,并可以很好地添加到数据库中,但是我在跳过重复项时遇到了麻烦。每次我运行下面的代码时,我都会插入一行,然后它就会失败。
我现在只看关键,但是一旦这个工作正常,还有一大堆其他列,添加 [...]
# Setup
import csv
import psycopg2
# Read a value from the CSV to see if it's in dbItems
with open('meh_0.csv', 'r') as f:
reader = csv.reader(f)
next(reader)
connection = psycopg2.connect("host=localhost dbname=postgres user=postgres port=5433 password=removed")
cursor = connection.cursor()
cursor.execute('SELECT handid FROM handlist')
dbItems = cursor.fetchall()
print(dbItems)
for i in range(0, 200):
rowKey = next(reader)
print('rowKey[0] is: ' + rowKey[0])
found = False
for row in dbItems:
for element in row:
if element == int(rowKey[0]):
found = True
break
if found:
break
# Then either add to the DB or skip
if not found:
print(rowKey[0] + ' NOT found in dbItems\n')
sqlCommand = 'INSERT INTO handlist VALUES (' + rowKey[0] + ')'
cursor.execute(sqlCommand)
connection.commit()
else:
print(rowKey[0] + ' is found in dbItems\n')
我可能已经将一些不需要的东西移到了我的“while”循环中,我试图看看发生了什么变化。哦,最大 200 的范围是任意的,CSV 文件很大。
错误:
rowKey[0] is: 34756717
34756717 is found in dbItems
rowKey[0] is: 34756717
34756717 is found in dbItems
rowKey[0] is: 34756717
34756717 is found in dbItems
rowKey[0] is: 34756718
34756718 NOT found in dbItems
rowKey[0] is: 34756718
34756718 NOT found in dbItems
Traceback (most recent call last):
File "C:/Python/MyPythonScripts/RIO r5.py", line 40, in <module>
cursor.execute(sqlCommand)
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "handlist_pkey"
DETAIL: Key (handid)=(34756718) already exists.
>>>
所以它跳过了我之前运行它时添加的所有键,添加了新的,但是当它迭代循环时不会跳过新的。
大多数情况下,我想知道为什么它不起作用。但我想也有很多更简单的方法可以做到这一点,如果需要,我很乐意复制这些方法。
解决方案
您可以将现有密钥添加到set,然后检查 csv 行中的密钥是否是该集合的成员。检查集合成员的成本与集合的大小无关,因此在这里使用集合是一个很好的数据结构。如果 csv 键在集合中,我们移动到下一行,否则我们将该行添加到数据库并将键添加到集合中。
# Setup
import csv
import psycopg2
# Read a value from the CSV to see if it's in dbItems
with open('meh_0.csv', 'r') as f:
reader = csv.reader(f)
next(reader)
connection = psycopg2.connect("host=localhost dbname=postgres user=postgres port=5433 password=removed")
cursor = connection.cursor()
cursor.execute('SELECT handid FROM handlist')
dbItems = cursor.fetchall()
# Build a set of existing keys
existing = {k for k, in dbItems}
for i in range(0, 200):
rowKey = next(reader)
print('rowKey[0] is: ' + rowKey[0])
# Database keys are ints, csv values are strings...
candidate = int(rowKey[0])
if candidate in existing:
# back to the top of the for loop
continue
# Add to the DB
# Use the recommended way of building queries
# https://www.psycopg.org/docs/usage.html#passing-parameters-to-sql-queries
sqlCommand = 'INSERT INTO handlist VALUES (%s)'
cursor.execute(sqlCommand, (candidate,))
connection.commit()
# Add our key to the set
existing.add(candidate)
推荐阅读
- angular - 通用继承方法可以处理 Typescript Click 事件吗
- powershell - powershell 转义 unicode (utf8)
- scala - 案例类选项参数的模糊隐含值
- google-apps-script - 使用 Google 表单预订日历
- reactjs - 突变 React Apollo 后触发了不必要的重新获取
- regex - 将常规行向下移动特定数量的行
- oracle - oracle,过程,带参数的表
- node.js - 无法安装 npx create-react-app@3.3.0 nucampsite
- javascript - 如何在 Laravel foreach 循环中显示活动的 Bootstrap Accordion?
- delphi - Delphi TTask:用ExecuteWork在主线程中执行任务