Threads get stuck when accessing an atomic reference or atomic long on Apache Ignite

Problem description

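We recently started running into the following issue. We run 2 client instances and 26 Apache Ignite instances, all on AWS R4.2xLarge nodes. When a thread tries to obtain an atomicLong or atomicReference, it gets stuck and never returns. This usually happens on 1 or 2 of the Ignite instances. I am not sure why this happens, so any help would be appreciated.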

Here is the thread dump taken while trying to obtain the atomicReference:

"main" #1 prio=5 os_prio=0 cpu=3528.41ms elapsed=1067.33s allocated=312M defined_classes=9309 tid=0x00007f4ce4046fc0 nid=0x1537 waiting on condition  [0x00007f4cece90000]
   java.lang.Thread.State: WAITING (parking)
                at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native Method)
                - parking to wait for  <0x00007f4cbfe7c7d0> (a java.util.concurrent.CountDownLatch$Sync)
                at java.util.concurrent.locks.LockSupport.park(java.base@11.0.7/LockSupport.java:194)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.7/AbstractQueuedSynchronizer.java:885)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1039)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1345)
                at java.util.concurrent.CountDownLatch.await(java.base@11.0.7/CountDownLatch.java:232)
                at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7612)
                at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.awaitInitialization(DataStructuresProcessor.java:1147)
                at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.getAtomic(DataStructuresProcessor.java:506)
                at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.atomicReference(DataStructuresProcessor.java:744)
                at org.apache.ignite.internal.IgniteKernal.atomicReference(IgniteKernal.java:3743)
                at org.apache.ignite.internal.IgniteKernal.atomicReference(IgniteKernal.java:3732)
                at company.explore.cache.persist.SavedAudienceLocationProvider.getSavedAudienceLocation(SavedAudienceLocationProvider.java:27)
                at company.explore.listeners.lifecycle.LifecycleListener.configureSavedAudienceLocation(LifecycleListener.java:45)
                at company.explore.listeners.lifecycle.LifecycleListener.onLifecycleEvent(LifecycleListener.java:38)
                at org.apache.ignite.internal.IgniteKernal.notifyLifecycleBeans(IgniteKernal.java:725)
                at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1156)
                at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2038)
                at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1730)
                - locked <0x00007f4cbf072a38> (a org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance)
                at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1158)
                at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1076)
                at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:962)
                at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:861)
                at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:731)
                at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:700)
                at org.apache.ignite.Ignition.start(Ignition.java:348)
                at org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:301)

Because this thread is stuck, any Ignition.ignite call also fails to return, and jobs cannot complete:

"pub-#22" #48 prio=5 os_prio=0 cpu=5.76ms elapsed=1036.50s allocated=421K defined_classes=6 tid=0x00007f4ce4cf3990 nid=0x1607 waiting on condition  [0x00007f40375f6000]
   java.lang.Thread.State: WAITING (parking)
                at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native Method)
                - parking to wait for  <0x00007f4cbf16d9e0> (a java.util.concurrent.CountDownLatch$Sync)
                at java.util.concurrent.locks.LockSupport.park(java.base@11.0.7/LockSupport.java:194)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.7/AbstractQueuedSynchronizer.java:885)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1039)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1345)
                at java.util.concurrent.CountDownLatch.await(java.base@11.0.7/CountDownLatch.java:232)
                at org.apache.ignite.internal.util.IgniteUtils.awaitQuiet(IgniteUtils.java:7657)
                at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.grid(IgnitionEx.java:1671)
                at org.apache.ignite.internal.IgnitionEx.grid(IgnitionEx.java:1389)
                at org.apache.ignite.internal.IgnitionEx.grid(IgnitionEx.java:1258)
                at org.apache.ignite.Ignition.ignite(Ignition.java:489)
                at company.explore.dataload.person.LoadPersonAttributeJob.call(LoadPersonAttributeJob.java:58)
                at company.explore.dataload.person.LoadPersonAttributeJob.call(LoadPersonAttributeJob.java:31)
                at org.apache.ignite.internal.processors.closure.GridClosureProcessor$C2.execute(GridClosureProcessor.java:1855)
                at org.apache.ignite.internal.processors.job.GridJobWorker$2.call(GridJobWorker.java:568)
                at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6817)
                at org.apache.ignite.internal.processors.job.GridJobWorker.execute0(GridJobWorker.java:562)
                at org.apache.ignite.internal.processors.job.GridJobWorker.body(GridJobWorker.java:491)
                at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.7/ThreadPoolExecutor.java:1128)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.7/ThreadPoolExecutor.java:628)
                at java.lang.Thread.run(java.base@11.0.7/Thread.java:834)
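
One way to read the two dumps above together: the "main" thread is parked on one latch inside Ignition.start (reached from a lifecycle callback), while the "pub-#22" job thread is parked on a different latch inside Ignition.ignite. The following is a minimal sketch of that dependency using only java.util.concurrent. It is a model, not Ignite's actual implementation: the class, the latch names, and the assumption that the second latch is released only after start finishes are all hypothetical. Timeouts are used so the sketch terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class LatchChainSketch {
    // Hypothetical stand-in for the latch awaited in awaitInitialization (first dump).
    // Nothing in this model ever counts it down.
    static final CountDownLatch dataStructuresInit = new CountDownLatch(1);

    // Hypothetical stand-in for the latch awaited in Ignition.ignite (second dump).
    // Assumption: it is released only once start has finished.
    static final CountDownLatch nodeStarted = new CountDownLatch(1);

    /** Models the "main" thread: start blocks in a lifecycle callback. */
    static boolean startNode(long timeoutMs) throws InterruptedException {
        boolean initialized = dataStructuresInit.await(timeoutMs, TimeUnit.MILLISECONDS);
        if (initialized) {
            nodeStarted.countDown(); // never reached in this model
        }
        return initialized;
    }

    /** Models a job thread that looks up the node and waits for start to finish. */
    static boolean lookUpNode(long timeoutMs) throws InterruptedException {
        return nodeStarted.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<Boolean> start = pool.submit(() -> startNode(200));
        Future<Boolean> job = pool.submit(() -> lookUpNode(400));
        System.out.println("start completed: " + start.get());
        System.out.println("job got node: " + job.get());
        pool.shutdown();
    }
}
```

In this model both calls time out and print false. With untimed await calls, as in the dumps, both threads would stay parked indefinitely.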

Similarly, here is a thread waiting on a CountDownLatch while trying to obtain an atomicLong:

"pub-#489" #608 prio=5 os_prio=0 cpu=16.80ms elapsed=7076.10s allocated=2409K defined_classes=17 tid=0x00007f48c8014c60 nid=0x5bd5 waiting on condition  [0x00007f48359e1000]
   java.lang.Thread.State: WAITING (parking)
                at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native Method)
                - parking to wait for  <0x00007f518aba6060> (a java.util.concurrent.CountDownLatch$Sync)
                at java.util.concurrent.locks.LockSupport.park(java.base@11.0.7/LockSupport.java:194)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.7/AbstractQueuedSynchronizer.java:885)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1039)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1345)
                at java.util.concurrent.CountDownLatch.await(java.base@11.0.7/CountDownLatch.java:232)
                at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7612)
                at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.awaitInitialization(DataStructuresProcessor.java:1147)
                at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.getAtomic(DataStructuresProcessor.java:506)
                at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.atomicLong(DataStructuresProcessor.java:463)
                at org.apache.ignite.internal.IgniteKernal.atomicLong(IgniteKernal.java:3716)
                at org.apache.ignite.internal.IgniteKernal.atomicLong(IgniteKernal.java:3705)
                at company.explore.cache.persist.person.SerializationStatus.getSerializeCounter(SerializationStatus.java:86)
                at company.explore.cache.persist.person.SerializationStatus.startNodeSerialization(SerializationStatus.java:21)
                at company.explore.cache.persist.personv2.PersonSerializationJob.serializePeopleData(PersonSerializationJob.java:98)
                at company.explore.cache.persist.personv2.PersonSerializationJob.run(PersonSerializationJob.java:75)
                at org.apache.ignite.internal.processors.closure.GridClosureProcessor$C4.execute(GridClosureProcessor.java:1944)
                at org.apache.ignite.internal.processors.job.GridJobWorker$2.call(GridJobWorker.java:568)
                at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6817)
                at org.apache.ignite.internal.processors.job.GridJobWorker.execute0(GridJobWorker.java:562)
                at org.apache.ignite.internal.processors.job.GridJobWorker.body(GridJobWorker.java:491)
                at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.7/ThreadPoolExecutor.java:1128)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.7/ThreadPoolExecutor.java:628)
                at java.lang.Thread.run(java.base@11.0.7/Thread.java:834)
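
For diagnosis only (this is not a fix for the underlying hang), a job could bound how long it waits for such a lookup, so that a stuck call shows up in the logs instead of silently holding a pool thread. The sketch below uses only the JDK. BoundedLookup, getWithTimeout, and the Supplier standing in for a call such as ignite.atomicLong(...) are hypothetical names introduced for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class BoundedLookup {
    /**
     * Runs a possibly blocking lookup on a helper thread and returns
     * the fallback value if it does not finish within timeoutMs.
     */
    static <T> T getWithTimeout(Supplier<T> lookup, long timeoutMs, T fallback) {
        // Daemon thread so a lookup that ignores interruption cannot keep the JVM alive.
        ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-lookup");
            t.setDaemon(true);
            return t;
        });
        Callable<T> task = lookup::get;
        Future<T> future = exec.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the helper thread
            System.err.println("lookup did not return within " + timeoutMs + " ms");
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return fallback;
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            exec.shutdownNow();
        }
    }
}
```

The awaits in the dumps are interruptible (acquireSharedInterruptibly), which is why the sketch cancels with interruption. Whether interrupting such a call is safe in a given Ignite version is an assumption that would need checking.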

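These issues only started appearing in the past 2 months or so. Before that, the system had been very stable for a long time. I have not posted the full thread dump because it is large. If needed, I can post it to pastebin or upload it somewhere.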

Since the issue does not occur consistently, I am not sure how to create a reproducer project. I can provide any logs if needed.

Edit:

The full thread dumps are now on pastebin. Links below:

Atomic reference thread dump: pastebin.com/ydNMFSEP

Atomic long thread dump: pastebin.com/psJgwi3F

Tags: ignite, gridgain
