Ignite FileAlreadyExistsException during WAL archiving

Problem description

We are using GridGain version 8.8.10 and JDK version 1.8.

We have a 3-node Ignite cluster in Azure Kubernetes with native persistence enabled. Some of our Ignite pods go into CrashLoopBackOff with the following exception:

[07:45:45,477][WARNING][main][FileWriteAheadLogManager] Content of WAL working directory needs rearrangement, some WAL segments will be moved to archive: /gridgain/walarchive/node00-71fcf5d3-faf7-4d2b-abae-bd0621bb12a1. Segments from 0000000000000001.wal to 0000000000000008.wal will be moved, total number of files: 8. This operation may take some time.
[07:45:45,480][SEVERE][main][IgniteKernal] Exception during start processors, node will be stopped and close connections
class org.apache.ignite.IgniteCheckedException: Failed to start processor: GridProcessorAdapter []
    at org.apache.ignite.internal.IgniteKernal.startProcessor(IgniteKernal.java:1938)
    at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1159)
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:1787)
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1711)
    at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1141)
    at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1059)
    at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:945)
    at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:844)
    at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:714)
    at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:683)
    at org.apache.ignite.Ignition.start(Ignition.java:344)
    at org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:290)
Caused by: class org.apache.ignite.internal.processors.cache.persistence.StorageException: Failed to move WAL segment [src=/gridgain/wal/node00-71fcf5d3-faf7-4d2b-abae-bd0621bb12a1/0000000000000001.wal, dst=/gridgain/walarchive/node00-71fcf5d3-faf7-4d2b-abae-bd0621bb12a1/0000000000000001.wal]
    at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.moveSegmentsToArchive(FileWriteAheadLogManager.java:3326)
    at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.prepareAndCheckWalFiles(FileWriteAheadLogManager.java:1542)
    at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.start0(FileWriteAheadLogManager.java:494)
    at org.apache.ignite.internal.processors.cache.GridCacheSharedManagerAdapter.start(GridCacheSharedManagerAdapter.java:60)
    at org.apache.ignite.internal.processors.cache.GridCacheProcessor.start(GridCacheProcessor.java:605)
    at org.apache.ignite.internal.IgniteKernal.startProcessor(IgniteKernal.java:1935)
    ... 11 more
Caused by: java.nio.file.FileAlreadyExistsException: /gridgain/walarchive/node00-71fcf5d3-faf7-4d2b-abae-bd0621bb12a1/0000000000000001.wal
    at java.base/sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:450)
    at java.base/sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:267)
    at java.base/java.nio.file.Files.move(Files.java:1422)
    at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.moveSegmentsToArchive(FileWriteAheadLogManager.java:3307)
    ... 16 more
[07:45:45,482][SEVERE][main][IgniteKernal] Got exception while starting (will rollback startup routine).

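It looks like a file with the same name already exists in the archive directory when the WAL segment is moved there, and it cannot be overwritten. Are we missing some specific configuration for WAL archiving?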

        <property name="dataStorageConfiguration">
            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
                <!-- set the size of wal segments to 128MB -->
                <property name="walSegmentSize" value="#{128 * 1024 * 1024}"/>
                <property name="writeThrottlingEnabled" value="true"/>
                <!-- Set the page size to 8 KB -->
                <property name="pageSize" value="#{8 * 1024}"/>
                <property name="defaultDataRegionConfiguration">
                    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                        <property name="name" value="Default_Region"/>
                        <!-- Memory region of 20 MB initial size. -->
                        <property name="initialSize" value="#{20 * 1024 * 1024}"/>
                        <!-- Memory region of 8 GB max size. -->
                        <property name="maxSize" value="#{8L * 1024 * 1024 * 1024}"/>
                        <!-- Enabling eviction for this memory region. -->
                        <property name="pageEvictionMode" value="RANDOM_2_LRU"/>
                        <property name="persistenceEnabled" value="true"/>
                        <!-- Increasing the buffer size to 1 GB. -->
                        <property name="checkpointPageBufferSize" value="#{1024L * 1024 * 1024}"/>
                    </bean>
                </property>
                <property name="walPath" value="/gridgain/wal"/>
                <property name="walArchivePath" value="/gridgain/walarchive"/>
            </bean>
        </property>
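
The innermost cause in the stack trace is a plain `java.nio.file.Files.move` call, which the JDK documents as failing with `FileAlreadyExistsException` when the target exists and `REPLACE_EXISTING` is not passed. The following is a minimal sketch of that behaviour in isolation, plus a small check for segment names present in both directories. It does not use or reproduce Ignite's internal archiving logic. The class name `WalMoveCollisionDemo` and both helper methods are hypothetical, and the temporary directories only stand in for `/gridgain/wal/<node>` and `/gridgain/walarchive/<node>`.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical diagnostic sketch; not part of Ignite or GridGain.
class WalMoveCollisionDemo {

    // Lists *.wal file names that exist in both the WAL work directory
    // and the WAL archive directory, sorted by name.
    static List<String> collidingSegments(Path walDir, Path archiveDir) throws IOException {
        try (Stream<Path> files = Files.list(walDir)) {
            return files
                .map(p -> p.getFileName().toString())
                .filter(name -> name.endsWith(".wal"))
                .filter(name -> Files.exists(archiveDir.resolve(name)))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    // Creates a same-named file in two temporary directories and tries a plain move.
    // Returns true if the move is rejected with FileAlreadyExistsException.
    static boolean plainMoveFailsOnExistingTarget() throws IOException {
        Path walDir = Files.createTempDirectory("wal");
        Path archiveDir = Files.createTempDirectory("walarchive");
        Path src = Files.write(walDir.resolve("0000000000000001.wal"), new byte[] {1});
        Path dst = Files.write(archiveDir.resolve("0000000000000001.wal"), new byte[] {2});
        try {
            // No StandardCopyOption.REPLACE_EXISTING, so an existing target is an error.
            Files.move(src, dst);
            return false;
        } catch (FileAlreadyExistsException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            // Usage: java WalMoveCollisionDemo <walDir> <archiveDir>
            System.out.println(collidingSegments(Paths.get(args[0]), Paths.get(args[1])));
        } else {
            System.out.println("plain move rejected: " + plainMoveFailsOnExistingTarget());
        }
    }
}
```

Run against the node's two directories, the first helper only reports which segment names collide. It cannot tell which copy is the valid one, so it is a way to confirm the symptom, not a fix.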

Has anyone run into a similar issue when running an Ignite cluster on Kubernetes?

We are seeing this on GKE. On AKS it works fine. We are using the Apache Ignite Operator.

https://ignite.apache.org/docs/latest/installation/kubernetes/gke-deployment

Tags: ignite, gridgain
