consul - Microsoft Orleans in a Kubernetes StatefulSet pod crashes after a few restarts
Problem description
I am running a Microsoft Orleans v3.4.3 cluster in K8S, with Consul as the clustering provider:
siloBuilder
    .UseConsulClustering(opt =>
    {
        opt.Address = new Uri(AppConfig.Orleans.ConsulUrl);
        opt.AclClientToken = AppConfig.Orleans.AclClientToken;
    })
    .Configure<ClusterOptions>(options =>
    {
        options.ClusterId = AppConfig.Orleans.ClusterID;
        options.ServiceId = AppConfig.Orleans.ServiceID;
    })
    .UseKubernetesHosting();
I configured the labels and environment variables for my pod according to the documentation:
- name: ORLEANS_SERVICE_ID # Required by Orleans
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['orleans/serviceId']
- name: ORLEANS_CLUSTER_ID # Required by Orleans
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['orleans/clusterId']
- name: POD_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['statefulset.kubernetes.io/pod-name']
- name: POD_NAMESPACE
  valueFrom:
    fieldRef:
      fieldPath: metadata.namespace
- name: POD_IP
  valueFrom:
    fieldRef:
      fieldPath: status.podIP
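It is a StatefulSet with only 1 pod, for testing. On the first start it runs fine. However, every time I restart the pod, a new entry is created in Consul, and the silo crashes on the subsequent starts.
The log says: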
System.AggregateException: One or more errors occurred. (Failed to get ping responses from 1 of 1 active silos. Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster. Successfully contacted: []. Failed to get response from: [S10.18.123.218:11111:361110184])
---> Orleans.Runtime.MembershipService.OrleansClusterConnectivityCheckFailedException: Failed to get ping responses from 1 of 1 active silos. Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster. Successfully contacted: []. Failed to get response from: [S10.18.123.218:11111:361110184]
at Orleans.Runtime.MembershipService.MembershipAgent.ValidateInitialConnectivity()
at Orleans.Runtime.MembershipService.MembershipAgent.BecomeActive()
at Orleans.Runtime.MembershipService.MembershipAgent.<>c__DisplayClass26_0.<<Orleans-ILifecycleParticipant<Orleans-Runtime-ISiloLifecycle>-Participate>g__OnBecomeActiveStart|6>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at Orleans.Runtime.SiloLifecycleSubject.MonitoredObserver.OnStart(CancellationToken ct)
at Orleans.LifecycleSubject.OnStart(CancellationToken ct)
at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute()
at Orleans.Runtime.Silo.StartAsync(CancellationToken cancellationToken)
at Orleans.Hosting.SiloHost.StartAsync(CancellationToken cancellationToken)
at Orleans.Hosting.SiloHostedService.StartAsync(CancellationToken cancellationToken)
at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
at UBS.OrleansServer.EntryPoint.Start() in /app/UBS/OrleansServer/EntryPoint.cs:line 102
--- End of inner exception stack trace ---
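I have to delete all the entries in Consul and then restart the pod; after that everything works.
The POD_NAME of a StatefulSet pod always stays the same. Is it correct that every pod restart creates a new entry in Consul?
What could be the cause?
Thanks in advance.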
Update: after several rounds of crashing and restarting, it finally stopped crashing. In the log I see the following message:
ProcessTableUpdate (called from DeclareDead) membership table: 5 silos, 1 are Active, 4 are Dead, Version=<31, 28123>. All silos: [SiloAddress=S10.18.123.244:11111:361163684 SiloName=ubs-job-dev-0 Status=Active, SiloAddress=S10.18.123.200:11111:361158057 SiloName=ubs-job-dev-0 Status=Dead, SiloAddress=S10.18.123.210:11111:361161905 SiloName=ubs-job-dev-0 Status=Dead, SiloAddress=S10.18.123.217:11111:361157424 SiloName=ubs-job-dev-0 Status=Dead, SiloAddress=S10.18.123.244:11111:361163558 SiloName=ubs-job-dev-0 Status=Dead]
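The SiloName never changes. There is only one pod in the StatefulSet, but it sees 5 silos, 4 of which are dead. It seems that every new pod is treated as a new silo, even though the pod name does not change. Is this expected?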
Solution
(Failed to get ping responses from 1 of 1 active silos.
Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster.
Successfully contacted: []. Failed to get response from: [S10.18.123.218:11111:361110184])
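It looks like your membership table (in Consul) thinks that you already have active silos in it. When your "new" silo comes up and reads the membership table, it sees those silos listed as active, along with their IP addresses.
To keep the cluster correct, a newly joining silo must be able to communicate with the existing silos. But if the membership table is incorrect (it holds IP addresses with status 3/Active that no longer exist), you hit this problem: the new silo tries to ping the active silos, cannot reach them, fails to join, and fails fast.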
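The check described above can be modeled in a few lines. This is a simplified sketch for illustration, not the Orleans implementation: MembershipRow, JoinCheck, and FindUnreachable are hypothetical names, the ping is a stub that always fails, and the 10-minute "recently alive" window is an assumed value.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum SiloStatus { Joining, Active, Dead }

// Hypothetical, simplified stand-in for one membership table row.
public sealed class MembershipRow
{
    public string Address { get; set; }
    public SiloStatus Status { get; set; }
    public DateTime IAmAliveTime { get; set; }
}

public static class JoinCheck
{
    // Returns the addresses of silos that the joining silo is required
    // to reach (Active and recently alive) but cannot ping.
    public static List<string> FindUnreachable(
        IEnumerable<MembershipRow> table,
        DateTime now,
        TimeSpan aliveWindow,
        Func<string, bool> ping)
    {
        return table
            .Where(row => row.Status == SiloStatus.Active)
            .Where(row => now - row.IAmAliveTime <= aliveWindow)
            .Where(row => !ping(row.Address))
            .Select(row => row.Address)
            .ToList();
    }

    public static void Main()
    {
        var now = DateTime.UtcNow;
        var table = new[]
        {
            // Left behind by the previous pod: still marked Active, but the IP is gone.
            new MembershipRow { Address = "10.18.123.218:11111", Status = SiloStatus.Active, IAmAliveTime = now.AddMinutes(-2) },
            // Already declared Dead, so it is ignored by the check.
            new MembershipRow { Address = "10.18.123.200:11111", Status = SiloStatus.Dead, IAmAliveTime = now.AddMinutes(-30) },
        };

        // Stub ping: the old pod IP no longer answers.
        var unreachable = FindUnreachable(table, now, TimeSpan.FromMinutes(10), address => false);

        Console.WriteLine(unreachable.Count == 0
            ? "join allowed"
            : "join refused: " + string.Join(", ", unreachable));
    }
}
```

In this model, a stale Active row stops blocking the join once its "I Am Alive" timestamp falls outside the window or the row is marked Dead, which would be consistent with the startup eventually succeeding after several restarts.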
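You have a couple of solutions:
- Clear the Consul table when you deploy the solution.
- Change the deployment ID (the ClusterId in Orleans 3.x) on every deployment.
You have apparently already found the first solution (clearing the table).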
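Clearing the table can be scripted as a pre-deployment step instead of being done by hand. The following is a minimal sketch against Consul's KV HTTP API (a DELETE with recurse=true). It assumes the membership entries live under the key prefix orleans/<ClusterId>, so check the actual prefix in your Consul KV browser before running anything like this. CONSUL_URL, CONSUL_TOKEN, ConsulMembershipCleanup, and the default prefix are all placeholders introduced for this example.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class ConsulMembershipCleanup
{
    // Builds the KV URL that deletes every key under keyPrefix.
    public static Uri BuildDeleteUri(Uri consulAddress, string keyPrefix)
    {
        var path = "v1/kv/" + keyPrefix.Trim('/') + "?recurse=true";
        return new Uri(consulAddress, path);
    }

    public static async Task Main(string[] args)
    {
        // Placeholder settings; adapt to your environment.
        var consul = new Uri(Environment.GetEnvironmentVariable("CONSUL_URL") ?? "http://localhost:8500/");
        var token = Environment.GetEnvironmentVariable("CONSUL_TOKEN");

        // ASSUMPTION: the membership keys sit under orleans/<ClusterId>.
        var prefix = args.Length > 0 ? args[0] : "orleans/my-cluster-id";

        using var http = new HttpClient();
        using var request = new HttpRequestMessage(HttpMethod.Delete, BuildDeleteUri(consul, prefix));
        if (!string.IsNullOrEmpty(token))
        {
            // ACL token header accepted by the Consul HTTP API.
            request.Headers.Add("X-Consul-Token", token);
        }

        using var response = await http.SendAsync(request);
        response.EnsureSuccessStatusCode();
        Console.WriteLine("Deleted keys under: " + prefix);
    }
}
```

Run this only while no silo of that cluster is up, because it removes the entries of live silos as well as stale ones.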