cassandra - cassandra 节点因 oom 错误而被关闭
问题描述
Cassandra 节点由于 OOM 而关闭,并检查了我在下面看到的 /var/log/message。
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: java invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: java cpuset=/ mems_allowed=0
....
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 DMA32: 1294*4kB (UM) 932*8kB (UEM) 897*16kB (UEM) 483*32kB (UEM) 224*64kB (UEM) 114*128kB (UEM) 41*256kB (UEM) 12*512kB (UEM) 7*1024kB (UE
M) 2*2048kB (EM) 35*4096kB (UM) = 242632kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 Normal: 5319*4kB (UE) 3233*8kB (UEM) 960*16kB (UE) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 62500kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 38109 total pagecache pages
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 0 pages in swap cache
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Swap cache stats: add 0, delete 0, find 0/0
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Free swap = 0kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Total swap = 0kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 16394647 pages RAM
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 0 pages HighMem/MovableOnly
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 310559 pages reserved
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2634] 0 2634 41614 326 82 0 0 systemd-journal
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2690] 0 2690 29793 541 27 0 0 lvmetad
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2710] 0 2710 11892 762 25 0 -1000 systemd-udevd
.....
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [13774] 0 13774 459778 97729 429 0 0 Scan Factory
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14506] 0 14506 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14586] 0 14586 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14588] 0 14588 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14589] 0 14589 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14598] 0 14598 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14599] 0 14599 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14600] 0 14600 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14601] 0 14601 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [19679] 0 19679 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [19680] 0 19680 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 9084] 1007 9084 2822449 260291 810 0 0 java
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 8509] 1007 8509 17223585 14908485 32510 0 0 java
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [21877] 0 21877 461828 97716 318 0 0 ScanAction Mgr
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [21884] 0 21884 496653 98605 340 0 0 OAS Manager
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [31718] 89 31718 25474 486 48 0 0 pickup
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 4891] 1007 4891 26999 191 9 0 0 iostat
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 4957] 1007 4957 26999 192 10 0 0 iostat
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Out of memory: Kill process 8509 (java) score 928 or sacrifice child
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Killed process 8509 (java) total-vm:68894340kB, anon-rss:59496344kB, file-rss:137596kB, shmem-rss:0kB
除了带有搜索和监视代理的 dse cassandra 之外,没有其他东西在该主机上运行。最大堆大小设置为 31g,cassandra java 进程在出错时似乎正在使用 ~57gb(ram 为 62gb)。所以我猜jvm开始使用大量内存并触发了oom错误。我的理解正确吗?这是 linux 触发的 jvm kill,因为 jvm 消耗的内存超过了可用内存?
因此,在这种情况下,jvm 使用的最大值为 31g,剩余的 26gb 使用的是非堆内存。通常这个过程大约需要 42g 并且在 oom 时刻它消耗 57g 的事实我怀疑 java 进程是罪魁祸首而不是受害者。
在问题发生时没有进行堆转储,我现在已经配置了它。但是,即使进行堆转储,它也有助于确定谁在消耗更多内存。heapdump 只会dump heap 内存区域,应该用什么dump non-heapdump?本机内存跟踪是我遇到的一件事。发生 oom 时有什么方法可以转储本机内存?监视 jvm 内存以诊断 oom 错误的最佳方法是什么?
解决方案
这可能没有帮助..
您可能不会得到 heapdump,因为 oom-killer 是内核功能。jvm没有机会写heapdump。并且 SIGKILL 不能被捕获并且不会生成核心转储。(Unix默认操作)
http://programmergamer.blogspot.com/2013/05/clarification-on-sigint-sigterm-sigkill.html
推荐阅读
- python - 当某些键具有多个值时如何按键打印字典
- sql - 无数据时显示数字零
- sql-server - 在这个完整的查询中代码是如何执行的
- r - 我想在一个图上绘制两个数据框
- git - AWS CodeBuild DOWNLOAD_SOURCE 错误:CLIENT_ERROR:未找到参考增量
- spring - 不按顺序同时运行 Spring 测试用例
- java - Java:如何指定相对文件路径?
- regex - Python re, 获取 findall 找到的第一个元素
- javascript - 如何创建一个foreach循环来反向打印数组?
- javascript - 如何在动态组件标签内重用vue中的组件?