python - Speeding up data insertion from a pandas DataFrame into MySQL

Problem description

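I need to insert a 60000x24 DataFrame into a MySQL database (MariaDB) using SQLAlchemy and Python. The database runs locally, and the insertion is also performed locally. So far I have been using a LOAD DATA INFILE SQL query, but this requires dumping the DataFrame to a CSV file first, which takes about 1.5-2 seconds. The problem is that I have to insert 40 or more of these DataFrames, so time is critical.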

If I use df.to_sql, the problem gets much worse: inserting each DataFrame takes at least 7 (and up to 30) seconds.

The code I am using is provided below:

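# `table` is a reserved word in MySQL/MariaDB, so it is quoted with backticks
sql_query = "CREATE TABLE IF NOT EXISTS `table`(A FLOAT, B FLOAT, C FLOAT)"  # 24 columns of type float
cursor.execute(sql_query)
# if_exists="replace" drops and recreates the table on every call
data.to_sql("table", con=connection, if_exists="replace", chunksize=1000)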

This takes 7 to 30 seconds to execute. With LOAD DATA, the code looks like this:

sql_query = "CREATE TABLE IF NOT EXISTS `table`(A FLOAT, B FLOAT, C FLOAT)"  # 24 columns of type float
cursor.execute(sql_query)
data.to_csv("/tmp/data.csv")
# Identifiers are quoted with backticks; single quotes would denote a string literal
sql_query = "LOAD DATA LOW_PRIORITY INFILE '/tmp/data.csv' REPLACE INTO TABLE `table` FIELDS TERMINATED BY ','"
cursor.execute(sql_query)

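This takes 1.5 to 2 seconds, mainly because of dumping the file to CSV. I could improve the last step slightly by using LOCK TABLES, but then no data is added to the database. So my question is: is there any way to speed up this process, either by tweaking LOAD DATA or to_sql?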
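
For reference, here is a minimal sketch of how a LOAD DATA statement like the one in the question could be assembled. It assumes the CSV starts with a header row (which to_csv writes by default) and that the columns in the file match the table's columns; the helper name build_load_data_sql and its parameters are hypothetical, not part of any library. It only builds the SQL string and does not touch a database.

```python
def build_load_data_sql(csv_path: str, table: str, skip_header: bool = True) -> str:
    """Build a LOAD DATA INFILE statement for a comma-separated file."""
    # Backticks quote identifiers in MySQL/MariaDB; a backtick inside the
    # name is escaped by doubling it.
    quoted_table = "`" + table.replace("`", "``") + "`"
    # The file path is a string literal, so escape backslashes and single quotes.
    quoted_path = "'" + csv_path.replace("\\", "\\\\").replace("'", "\\'") + "'"
    parts = [
        f"LOAD DATA INFILE {quoted_path}",
        f"REPLACE INTO TABLE {quoted_table}",
        "FIELDS TERMINATED BY ','",
    ]
    if skip_header:
        # Skip the header line so it is not loaded as a data row.
        parts.append("IGNORE 1 LINES")
    return " ".join(parts)


statement = build_load_data_sql("/tmp/data.csv", "table")
print(statement)
```

The resulting string can be passed to cursor.execute() as in the question. Note that to_csv also writes the index as the first column unless index=False is given, so the file may contain one more column than the table expects.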

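Update: By using an alternative function to dump the DataFrame to a CSV file, given in this answer to "What is the fastest way to output a large DataFrame into a CSV file?", I was able to improve performance a little, but not significantly. Best,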

Tags: python, mysql, pandas, performance, mariadb

If you know the data format (I assume all floats), you can use numpy.savetxt() to drastically reduce the time needed to create the CSV:

%timeit df.to_csv(csv_fname)
2.22 s ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  

from numpy import savetxt
%timeit savetxt(csv_fname, df.values, fmt='%f', header=','.join(df.columns), delimiter=',')
714 ms ± 37.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note that you may need to run

df = df.reset_index()

beforehand so that the rows are numbered with a unique key and the .to_csv() formatting style is retained.
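
Putting the two steps together, a minimal sketch might look like the following. It assumes every column (and the index) is numeric; the function name dump_float_frame and the small demo frame are hypothetical stand-ins for the real 60000x24 data. One detail worth knowing: numpy.savetxt prefixes the header line with "# " by default, so comments="" is passed here to get a plain header row.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def dump_float_frame(df: pd.DataFrame, path: str) -> None:
    """Write an all-numeric DataFrame to CSV with numpy.savetxt."""
    # Turn the index into an ordinary first column, similar to what
    # to_csv() writes by default.
    out = df.reset_index()
    np.savetxt(
        path,
        out.to_numpy(dtype=float),
        fmt="%f",  # fixed notation with six decimal places
        delimiter=",",
        header=",".join(map(str, out.columns)),
        comments="",  # suppress the default "# " prefix on the header line
    )


# Hypothetical demo data standing in for the real frame.
demo = pd.DataFrame(np.arange(6, dtype=float).reshape(3, 2), columns=["A", "B"])
csv_path = os.path.join(tempfile.mkdtemp(), "data.csv")
dump_float_frame(demo, csv_path)

# Read the file back to check that it parses as an ordinary CSV.
round_trip = pd.read_csv(csv_path)
print(round_trip)
```

Because fmt="%f" keeps only six decimal places, a format such as "%.9g" may be a better fit when more precision is needed.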
