首页 > 技术文章 > Error reading packet from server: Lost connection to MySQL server during query ( server_errno=20131)

voldemort 2020-11-27 17:39 原文

问题描述:

一晚上收到四五次主从断开、恢复的短信,查看生产环境(主从架构)MySQL(5.6)从库错误日志的ERROR信息,总体来说,是从库丢失了与主库的连接,导致IO Thread重新连接。

2019-12-02 03:46:44 47114 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
2019-12-02 03:46:44 47114 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'binlog.002295' at position 386140629
2019-12-02 03:46:44 47114 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2019-12-02 03:46:54 47114 [ERROR] Slave I/O: error reconnecting to master 'repli@192.168.11.10:3306' - retry-time: 60  retries: 1, Error_code: 2013
2019-12-02 03:48:04 47114 [ERROR] Slave I/O: error reconnecting to master 'repli@192.168.11.10:3306' - retry-time: 60  retries: 2, Error_code: 2013
2019-12-02 03:49:14 47114 [ERROR] Slave I/O: error reconnecting to master 'repli@192.168.11.10:3306' - retry-time: 60  retries: 3, Error_code: 2013
2019-12-02 03:50:24 47114 [ERROR] Slave I/O: error reconnecting to master 'repli@192.168.11.10:3306' - retry-time: 60  retries: 4, Error_code: 2013
2019-12-02 03:51:34 47114 [ERROR] Slave I/O: error reconnecting to master 'repli@192.168.11.10:3306' - retry-time: 60  retries: 5, Error_code: 2013
2019-12-02 03:52:44 47114 [ERROR] Slave I/O: error reconnecting to master 'repli@192.168.11.10:3306' - retry-time: 60  retries: 6, Error_code: 2013
2019-12-02 03:53:54 47114 [ERROR] Slave I/O: error reconnecting to master 'repli@192.168.11.10:3306' - retry-time: 60  retries: 7, Error_code: 2013
2019-12-02 03:55:04 47114 [ERROR] Slave I/O: error reconnecting to master 'repli@192.168.11.10:3306' - retry-time: 60  retries: 8, Error_code: 2013
2019-12-02 03:56:06 47114 [Note] Slave: connected to master 'repli@192.168.11.10:3306',replication resumed in log 'binlog.002295' at position 386140629
2019-12-02 04:01:00 47114 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
2019-12-02 04:01:00 47114 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'binlog.002295' at position 45291556

查询官方文档:

B.4.2.3 Lost connection to MySQL server

There are three likely causes for this error message.

Usually it indicates network connectivity trouble and you should check the condition of your network if this error occurs frequently. If the error message includes “during query,” this is probably the case you are experiencing. 

Sometimes the “during query” form happens when millions of rows are being sent as part of one or more queries. If you know that this is happening, you should try increasing net_read_timeout from its default of 30 seconds to 60 seconds or longer, sufficient for the data transfer to complete.

More rarely, it can happen when the client is attempting the initial connection to the server. In this case, if your connect_timeout value is set to only a few seconds, you may be able to resolve the problem by increasing it to ten seconds, perhaps more if you have a very long distance or slow connection. You can determine whether you are experiencing this more uncommon cause by using SHOW GLOBAL STATUS LIKE 'Aborted_connects'. It will increase by one for each initial connection attempt that the server aborts. You may see “reading authorization packet” as part of the error message; if so, that also suggests that this is the solution that you need.

If the cause is none of those just described, you may be experiencing a problem with BLOB values that are larger than max_allowed_packet, which can cause this error with some clients. Sometime you may see an ER_NET_PACKET_TOO_LARGE error, and that confirms that you need to increase max_allowed_packet. 

根据官方文档,逐步排查:

  • 首先可以排除网络原因,因为是局域网,网络环境没有问题,也不存在防火墙拦截等;

  • 由于在从库的报错信息中包含有during query,没有其他如:reading authorization packet,ER_NET_PACKET_TOO_LARGE等内容,因此本次的错误原因应该是下面这里:

Sometimes the “during query” form happens when millions of rows are being sent as part of one or more queries. If you know that this is happening, you should try increasing net_read_timeout from its default of 30 seconds to 60 seconds or longer, sufficient for the data transfer to complete.
根据官方建议,可增加参数net_read_timeout的值
  • 结合主从复制的场景,查询二进制日志,并询问业务人员,主库在该时间段会产生许多大的事务。由于默认从库连接主库读取数据时,默认读取30秒(net_read_timeout参数的默认值),30s后从库连接进程会关闭。具体来说,从库的IO Thread在读取事务时,由于事务太大,可能在30秒内读不完。但是从库解析到事务不完整,认为是与主库的连接出现问题,因此会重新连接主库。

但是调整从库的net_read_timeout=900后,仍然会出现该错误!!

因此,猜想可能是主库向从库发送数据时,在限定时间(net_write_timeout,默认30s)内不能发送完一个大事务,因此无论从库读取时间多长,都不能读完该事务。

因此调整主库的net_write_timeout=300

后面没有继续报该错误。

推荐阅读