pandas: iterate over dataframe, do SQL query for each row

Problem description

I have a dataframe and a 5 million row local Postgres database. In each row of the dataframe, I want to add a column that is the result of a query against the Postgres database.

This is what I have right now:

for index, row in df_tf.iterrows():
    row = dict(row)
    company_number = row['National ID']
    q = 'select name from companies where company_number=%s'
    cursor.execute(q, [company_number])
    results = cursor.fetchall()
    if results:
        row['name'] = results[0][0]
    else:
        row['name'] = ''
    writer.writerow(row)

So I'm iterating over the rows and writing the results to a local CSV.

Is there a way I could do this more neatly, and keep the results in a local dataframe?

I know I could load the Postgres data into pandas and join directly, but it's rather large and slow, so I would prefer to use a Postgres query.

Tags: python, pandas

Solution


My first pass would look something like this (untested):

import pandas
import psycopg2
import csv
import contextlib

def get_company_name(cursor, company_number):
    query = 'SELECT name FROM companies WHERE company_number=%s;'
    cursor.execute(query, [company_number])
    results = cursor.fetchone()
    return results[0] if results else ''

df_tf = pandas.DataFrame("...")
with contextlib.ExitStack() as ctx:
    connection = ctx.enter_context(psycopg2.connect("..."))
    cursor = ctx.enter_context(connection.cursor())
    file_out = ctx.enter_context(open("results.csv", "w"))
    writer = csv.DictWriter(file_out, fieldnames=["National ID", "Name"])
    writer.writeheader()
    for _, row in df_tf.iterrows():
        row = dict(row)
        row['Name'] = get_company_name(cursor, row['National ID'])
        writer.writerow(row)

Depending on the data in your dataframe, you may want to cache the results from get_company_name(). I imagine there are better answers, but this is what I would try.
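Since the question also asks about keeping the results in a local dataframe, another option is to fetch all the needed names in one round trip and merge them back in pandas, rather than one query per row. A sketch under my own assumptions (`attach_names` is a hypothetical helper; the commented query relies on psycopg2 adapting a Python list to a Postgres array for `= ANY(...)`):

```python
import pandas as pd

def attach_names(df_tf, names):
    # names: DataFrame with columns 'National ID' and 'Name',
    # e.g. fetched in a single query such as:
    #   names = pd.read_sql(
    #       'SELECT company_number AS "National ID", name AS "Name" '
    #       'FROM companies WHERE company_number = ANY(%(ids)s);',
    #       connection,
    #       params={'ids': df_tf['National ID'].unique().tolist()},
    #   )
    # Left-join so rows with no match are kept, then blank out
    # missing names to mirror the '' fallback in the loop version.
    merged = df_tf.merge(names, on='National ID', how='left')
    merged['Name'] = merged['Name'].fillna('')
    return merged
```

This trades the 5-million-row full load the question wants to avoid for a single filtered query, since only the IDs actually present in `df_tf` are sent to Postgres.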

