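ignite - Threads get stuck when accessing an atomic reference or atomic long on Apache Ignite
Problem description
This is about a fairly recent issue we have been facing. We run 2 client instances and 26 Apache Ignite instances, all on AWS R4.2xLarge nodes. Recently we have seen that when a thread tries to get an atomicLong or atomicReference, it gets stuck and never returns. The issue usually occurs on 1 or 2 Ignite instances. I am not sure why this happens, so any help would be greatly appreciated.
Here is the thread dump taken while trying to get the atomicReference: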
"main" #1 prio=5 os_prio=0 cpu=3528.41ms elapsed=1067.33s allocated=312M defined_classes=9309 tid=0x00007f4ce4046fc0 nid=0x1537 waiting on condition [0x00007f4cece90000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native Method)
- parking to wait for <0x00007f4cbfe7c7d0> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(java.base@11.0.7/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.7/AbstractQueuedSynchronizer.java:885)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1039)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1345)
at java.util.concurrent.CountDownLatch.await(java.base@11.0.7/CountDownLatch.java:232)
at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7612)
at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.awaitInitialization(DataStructuresProcessor.java:1147)
at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.getAtomic(DataStructuresProcessor.java:506)
at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.atomicReference(DataStructuresProcessor.java:744)
at org.apache.ignite.internal.IgniteKernal.atomicReference(IgniteKernal.java:3743)
at org.apache.ignite.internal.IgniteKernal.atomicReference(IgniteKernal.java:3732)
at company.explore.cache.persist.SavedAudienceLocationProvider.getSavedAudienceLocation(SavedAudienceLocationProvider.java:27)
at company.explore.listeners.lifecycle.LifecycleListener.configureSavedAudienceLocation(LifecycleListener.java:45)
at company.explore.listeners.lifecycle.LifecycleListener.onLifecycleEvent(LifecycleListener.java:38)
at org.apache.ignite.internal.IgniteKernal.notifyLifecycleBeans(IgniteKernal.java:725)
at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1156)
at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2038)
at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1730)
- locked <0x00007f4cbf072a38> (a org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance)
at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1158)
at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1076)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:962)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:861)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:731)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:700)
at org.apache.ignite.Ignition.start(Ignition.java:348)
at org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:301)
Because this thread is stuck, any Ignition.ignite call also blocks, so jobs cannot complete:
"pub-#22" #48 prio=5 os_prio=0 cpu=5.76ms elapsed=1036.50s allocated=421K defined_classes=6 tid=0x00007f4ce4cf3990 nid=0x1607 waiting on condition [0x00007f40375f6000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native Method)
- parking to wait for <0x00007f4cbf16d9e0> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(java.base@11.0.7/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.7/AbstractQueuedSynchronizer.java:885)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1039)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1345)
at java.util.concurrent.CountDownLatch.await(java.base@11.0.7/CountDownLatch.java:232)
at org.apache.ignite.internal.util.IgniteUtils.awaitQuiet(IgniteUtils.java:7657)
at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.grid(IgnitionEx.java:1671)
at org.apache.ignite.internal.IgnitionEx.grid(IgnitionEx.java:1389)
at org.apache.ignite.internal.IgnitionEx.grid(IgnitionEx.java:1258)
at org.apache.ignite.Ignition.ignite(Ignition.java:489)
at company.explore.dataload.person.LoadPersonAttributeJob.call(LoadPersonAttributeJob.java:58)
at company.explore.dataload.person.LoadPersonAttributeJob.call(LoadPersonAttributeJob.java:31)
at org.apache.ignite.internal.processors.closure.GridClosureProcessor$C2.execute(GridClosureProcessor.java:1855)
at org.apache.ignite.internal.processors.job.GridJobWorker$2.call(GridJobWorker.java:568)
at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6817)
at org.apache.ignite.internal.processors.job.GridJobWorker.execute0(GridJobWorker.java:562)
at org.apache.ignite.internal.processors.job.GridJobWorker.body(GridJobWorker.java:491)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.7/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.7/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base@11.0.7/Thread.java:834)
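The two dumps above share one pattern: a thread parked in CountDownLatch.await() with no timeout stays in WAITING until some other thread counts the latch down. In the second dump, Ignition.ignite is parked in IgnitionEx$IgniteNamedInstance.grid, while the first dump shows "main" still inside IgniteKernal.start. The sketch below reproduces that pattern with the JDK only. It is not Ignite code, and the names StartupLatchSketch, startLatch, and awaitStarted are hypothetical. It also shows a bounded wait, which at least lets a caller notice the hang.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// JDK-only illustration; not Ignite code. All names here are hypothetical.
final class StartupLatchSketch {

    // Bounded wait: returns true if the latch was released within timeoutMs, false otherwise.
    static boolean awaitStarted(CountDownLatch startLatch, long timeoutMs)
            throws InterruptedException {
        return startLatch.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        // Stands in for a "startup finished" latch that is never released.
        CountDownLatch startLatch = new CountDownLatch(1);

        Thread job = new Thread(() -> {
            try {
                startLatch.await(); // unbounded wait, like the parked threads in the dumps
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "job");
        job.setDaemon(true);
        job.start();

        Thread.sleep(200); // give the job thread time to park
        System.out.println("job state: " + job.getState());

        // A caller using a timeout gets an answer instead of hanging.
        System.out.println("released within 100 ms: " + awaitStarted(startLatch, 100));

        // Once the latch is counted down, the parked thread can finish.
        startLatch.countDown();
        job.join(1000);
        System.out.println("job state after countDown: " + job.getState());
    }
}
```

A bounded wait does not remove whatever keeps initialization from completing; it only turns a silent hang into a failure that can be logged and investigated.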
Here is another instance: a thread waiting on a CountDownLatch while trying to get an atomicLong:
"pub-#489" #608 prio=5 os_prio=0 cpu=16.80ms elapsed=7076.10s allocated=2409K defined_classes=17 tid=0x00007f48c8014c60 nid=0x5bd5 waiting on condition [0x00007f48359e1000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native Method)
- parking to wait for <0x00007f518aba6060> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(java.base@11.0.7/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.7/AbstractQueuedSynchronizer.java:885)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1039)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1345)
at java.util.concurrent.CountDownLatch.await(java.base@11.0.7/CountDownLatch.java:232)
at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7612)
at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.awaitInitialization(DataStructuresProcessor.java:1147)
at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.getAtomic(DataStructuresProcessor.java:506)
at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.atomicLong(DataStructuresProcessor.java:463)
at org.apache.ignite.internal.IgniteKernal.atomicLong(IgniteKernal.java:3716)
at org.apache.ignite.internal.IgniteKernal.atomicLong(IgniteKernal.java:3705)
at company.explore.cache.persist.person.SerializationStatus.getSerializeCounter(SerializationStatus.java:86)
at company.explore.cache.persist.person.SerializationStatus.startNodeSerialization(SerializationStatus.java:21)
at company.explore.cache.persist.personv2.PersonSerializationJob.serializePeopleData(PersonSerializationJob.java:98)
at company.explore.cache.persist.personv2.PersonSerializationJob.run(PersonSerializationJob.java:75)
at org.apache.ignite.internal.processors.closure.GridClosureProcessor$C4.execute(GridClosureProcessor.java:1944)
at org.apache.ignite.internal.processors.job.GridJobWorker$2.call(GridJobWorker.java:568)
at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6817)
at org.apache.ignite.internal.processors.job.GridJobWorker.execute0(GridJobWorker.java:562)
at org.apache.ignite.internal.processors.job.GridJobWorker.body(GridJobWorker.java:491)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.7/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.7/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base@11.0.7/Thread.java:834)
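These problems only started appearing in the last 2 months or so. Before that, the system had been very stable for a long time. I did not post the entire thread dump because it is very large. If needed, I can post it to pastebin or upload it somewhere.
Since the issue does not occur consistently, I am not sure how to create a reproducer project. I can provide any logs if needed.
Edit:
The full thread dumps have been posted on pastebin. Links:
Thread dump for the atomic reference case: pastebin.com/ydNMFSEP
Thread dump for the atomic long case: pastebin.com/psJgwi3F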
Solution