Python Dedupe.io: problem reading data from SQL Server

Problem description

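I'm trying to pull a large dataset out of SQL Server and deduplicate it with the Python dedupe library. I'm using pyodbc as the database connector, but I can't figure out how to get the SQL Server data into the correct format. The same process works fine with MySQL, but here the rows are not read as dicts, and I can't work out what format the data needs to be in. At the moment I'm seeing the following error: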

TypeError: row indices must be integers, not str

This is the code that tries to build the data:

cur = con.cursor()

print("\n\nExecuting TOMIS Select")
cur.execute(TOMISSelect)
print("\nSelect Complete")
colHeader = [column[0] for column in cur.description]
temp_d = {0:tuple(colHeader)}
temp_data = {(i+1): row for i, row in enumerate(cur)}
temp_d.update(temp_data)

if os.path.exists(training_file):
    print("\nReading labeled examples from ", training_file)
    with open(training_file) as tf:
        deduper.prepare_training(temp_d, tf)
else:
    print("\nManual Training")
    deduper.prepare_training(temp_d)

This is the output and the full traceback:

Manual Training
Traceback (most recent call last):

  File "C:\Users\01-workspace\02-dedupe\TOMISDeDupe\TomisFullDeDupe.py", line 134, in <module>
    deduper.prepare_training(temp_d)

  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\api.py", line 806, in prepare_training
    self.sample(data, sample_size, blocked_proportion, original_length)

  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\api.py", line 838, in sample
    index_include=examples)

  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\labeler.py", line 403, in __init__
    self.candidates = super().sample(data, blocked_proportion, sample_size)

  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\labeler.py", line 43, in sample
    data)

  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\sampling.py", line 22, in blockedSample
    *args))

  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\sampling.py", line 62, in dedupeSamplePredicates
    items)

  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\sampling.py", line 73, in dedupeSamplePredicate
    column = record[field]

TypeError: row indices must be integers, not str

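I have tried several different ways of reading the data from SQL Server, to no avail. The MySQL query dumps the data in the correct dictionary format, but I can't seem to get the data into that format with SQL Server.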

Tags: python, sql-server, duplicates, python-dedupe

Solution


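I think you need to do something like the following. pyodbc returns Row objects, which can be indexed by position or by attribute but not by a string key, while dedupe looks up each field by name (record[field] in the traceback), so every record has to be a dict keyed by column name: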

colHeader = tuple(column[0] for column in cur.description)
temp_d = {i: dict(zip(colHeader, row)) for i, row in enumerate(cur)}
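
To see the conversion end to end without a SQL Server instance, here is a minimal sketch. It assumes that any DB-API cursor behaves like the pyodbc one for this purpose, so the standard-library sqlite3 module stands in for pyodbc. The people table, its columns, and the rows_to_dedupe_dict helper are hypothetical names made up for illustration; they are not part of dedupe or pyodbc.

```python
import sqlite3


def rows_to_dedupe_dict(cursor):
    """Convert the rows of an executed DB-API cursor into
    {record_id: {column_name: value}}, the shape the answer above builds."""
    # cursor.description holds one sequence per column; item 0 is the column name.
    col_names = tuple(column[0] for column in cursor.description)
    return {i: dict(zip(col_names, row)) for i, row in enumerate(cursor)}


# Assumption: sqlite3 replaces pyodbc so the sketch runs anywhere.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, city TEXT)")  # hypothetical table
con.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ann Lee", "Memphis"), ("Anne Lee", "Memphis")],
)

cur = con.cursor()
cur.execute("SELECT name, city FROM people ORDER BY name")
data_d = rows_to_dedupe_dict(cur)

# Each record can now be looked up by field name, which is what failed before.
print(data_d[0]["name"])
```

With pyodbc the same helper should work on the cursor returned after cur.execute(TOMISSelect), and the resulting dict is what would be passed to deduper.prepare_training. Note that the header row is no longer stored under key 0; the column names live in each record's keys instead.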
