Postgres (PG 12) in recovery mode after a failed delete query on a partitioned table

Question

I have code that used to work on a plain table and stopped working once the same table was partitioned into many sub-partitions.

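In a distributed application (Spark), our code runs batch delete queries in parallel from different machines at the same time, each deleting different records.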

Most of the queries succeed, but one of them fails with what looks like a socket connection timeout:

java.sql.BatchUpdateException: Batch entry 0 DELETE FROM my_table WHERE vessel_id='xxxxxx' AND day='2020-09-15 00:00:00+00'::timestamp was aborted: An I/O error occurred while sending to the backend.  Call getNextException to see other errors in the batch.

Caused by: java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:210)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)

When the code retries the task, the connection fails with:
FATAL:  the database system is in recovery mode

In the database log I see:

2020-09-21 16:44:27 UTC::@:[26848]:DETAIL:  Failed process was running: DELETE FROM my_table WHERE vessel_id=$1 AND day=$2
2020-09-21 16:44:27 UTC::@:[26848]:LOG:  terminating any other active server processes
2020-09-21 16:44:27 UTC:172.31.4.110(59468):postgres@postgres:[27705]:WARNING:  terminating connection because of crash of another server process
2020-09-21 16:44:27 UTC:172.31.4.110(59468):postgres@postgres:[27705]:DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-09-21 16:44:27 UTC:172.31.4.110(59468):postgres@postgres:[27705]:HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-09-21 16:44:27 UTC:10.3.1.138(57926):rdsrepladmin@[unknown]:[26740]:WARNING:  terminating connection because of crash of another server process
2020-09-21 16:44:27 UTC:10.3.1.138(57926):rdsrepladmin@[unknown]:[26740]:DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-09-21 16:44:27 UTC:10.3.1.138(57926):rdsrepladmin@[unknown]:[26740]:HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-09-21 16:44:27 UTC::@:[22480]:WARNING:  terminating connection because of crash of another server process
2020-09-21 16:44:27 UTC::@:[22480]:DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-09-21 16:44:27 UTC::@:[22480]:HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-09-21 16:44:27 UTC:127.0.0.1(31826):rdsadmin@rdsadmin:[27967]:FATAL:  the database system is in recovery mode

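Any idea why the database fails when the table is partitioned? And why are all the other connections from the other machines closed and the database put into recovery mode?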

Tags: postgresql, apache-spark

Answer


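After going through the logs, I found that the problem was insufficient memory. This database instance is the primary, so it handles writes, replication, and deletes, and it did not have enough memory to do all of that at the same time. As the log above shows, once one server process exits abnormally the postmaster terminates every other connection and the database goes through recovery, which is why the other machines were disconnected too.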

The fix was simply to add more memory. Nothing fancy.
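
On the client side, the log's HINT ("In a moment you should be able to reconnect to the database and repeat your command") suggests that a retry with a short wait can let a task survive the restart. This does not fix the memory shortage; it only keeps a retry from failing immediately with "the database system is in recovery mode". Below is a minimal sketch, not what the original code did. The class name RecoveryRetry and its parameters are made up for illustration, and it assumes the JDBC driver reports the SQLSTATE on the exception: 57P03 (cannot_connect_now) is what PostgreSQL returns while it is starting up or recovering, and class 08 covers connection failures.

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of any library.
final class RecoveryRetry {
    private RecoveryRetry() {}

    // SQLSTATE 57P03 = cannot_connect_now (server starting up or in recovery).
    // SQLSTATE class 08 = connection exceptions (for example a reset socket).
    static boolean isTransient(SQLException e) {
        String state = e.getSQLState();
        return state != null && (state.equals("57P03") || state.startsWith("08"));
    }

    // Runs the action, retrying with exponential backoff while the failure looks transient.
    // The action should open a fresh connection on every call, because the old
    // connection is gone once the server has terminated it.
    static <T> T withRetry(Callable<T> action, int maxAttempts, long initialDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (SQLException e) {
                // Give up on non-transient errors or when attempts are exhausted.
                if (!isTransient(e) || attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(delay);
                delay = Math.min(delay * 2, 30_000L); // cap the wait at 30 seconds
            }
        }
    }
}
```

The delete in the question is keyed on vessel_id and day, so running it again after a failed attempt should be harmless; that is what makes a blind retry reasonable here.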

