首页 > 解决方案 > Postgres.exe 崩溃并关闭所有应用程序,恢复并再次运行

问题描述

我正在运行一个应用程序,其中约有 20 个进程连接到 Windows Server 2016 上的 postgres DB (10.0)。

大约一个月以来,我意外地崩溃了 postgres.exe。为了隔离问题,我通过设置 log_min_duration_statement = 0 扩展了日志记录

这将创建更详细的日志文件。我能看到的是:

LOG:  server process (PID xxxxx) was terminated by exception
0xFFFFFFFF DETAIL:  Failed process was running: COMMIT HINT:  See C
include file "ntstatus.h" for a description of the hexadecimal value.

然后它会像这样拆除所有 20 个进程:

LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.Then DB recovers:
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted; last known up at 2021-06-11 18:17:18 CEST

DB 进入恢复模式

FATAL:  the database system is in recovery mode
FATAL:  the database system is in recovery mode
FATAL:  the database system is in recovery mode
FATAL:  the database system is in recovery mode
LOG:  database system was not properly shut down; automatic recovery in progress
...
LOG:  redo starts at 1B2/33319E58
FATAL:  the database system is in recovery mode
LOG:  invalid record length at 1B2/33D29930: wanted 24, got 0
LOG:  redo done at 1B2/33D29908
LOG:  last completed transaction was at log time 2021-06-11 18:21:39.830526+02
FATAL:  the database system is in recovery mode
...
FATAL:  the database system is in recovery mode
LOG:  database system is ready to accept connections

现在它又像往常一样运行了

我可以识别为 20 个应用程序进程之一运行的 postgres.exe 崩溃的 PID xxxxx。它并不总是同一个。这大约每 5-10 天发生一次。

谁能给我一些建议如何追查这次崩溃的原因?

使用的扩展:

oracle_fdw 2.0.0, PostgreSQL 10.0, Oracle client 11.2.0.3.0, Oracle server 11.2.0.2.0

崩溃转储:

按照链接: https ://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows

尽管 postgres 用户对安全选项卡中的 crashdump 文件夹具有“完全控制权”,但它不会写入任何内容。文件夹保持为空。

对@Laurenz Albe 评论的跟进:COMMIT 不是崩溃的原因。这是会话的最后一个成功执行的命令。在下面的例子中解释:

流程得到一份工作并开始做它的工作

2021-06-15 16:27:51.100 CEST [25604] LOG:  duration: 0.061 ms  statement: DISCARD ALL
2021-06-15 16:27:51.100 CEST [25604] LOG:  duration: 0.012 ms  statement: BEGIN
2021-06-15 16:27:51.100 CEST [25604] LOG:  duration: 0.015 ms  statement: SET TRANSACTION ISOLATION LEVEL READ COMMITTED

现在在 25604 会话中进行了很多操作,其中包括 oracle 外部数据包装器

2021-06-15 16:28:13.792 CEST [25604] LOG:  duration: 0.016 ms  execute <unnamed>: FETCH ALL FROM "<unnamed portal 689>"

成功完成操作(数据库中的事务数据)

2021-06-15 16:28:13.823 CEST [25604] LOG:  duration: 0.059 ms  statement: COMMIT

甲骨文外国数据包装器在不同的会议中进行了很多行动

7 分钟后请求下一个作业,现在 postgres.exe 崩溃

2021-06-15 16:36:01.524 CEST [17904] LOG:  server process (PID 25604) was terminated by exception 0xFFFFFFFF

进程不做 DISCARD ALL, BEGIN 和 SET TRANSACTION ISOLATION LEVEL READ COMMITTED 它立即崩溃

我的结论:“可能损坏的共享内存”是由之前的一个进程启动的。最后一次成功的 COMMIT 和新请求之间的含义。这是发生问题的 7 分钟时间跨度。

关于这个结论的一些反馈?

标签: postgresql

解决方案


推荐阅读