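hadoop - YARN Kerberos issue with a long-running application
Problem description
We are trying to set up a long-running YARN application that should run, in YARN cluster mode on a Hadoop cluster, for longer than the lifetime of the delegation tokens [7 days].
Following the YARN Security documentation, we have completed the following steps:
- Upload the keytab to HDFS from the YARN client.
- Pass the keytab path and principal information from the YARN client to the ApplicationMaster.
- Download the keytab to the node on which the ApplicationMaster runs.
Once 75% of the token [HDFS_DELEGATION_TOKEN] lifetime has elapsed, a new token is created by the following code: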
    UserGroupInformation.loginUserFromKeytab(props.getUserName(), keytabPath);
    if (UserGroupInformation.isSecurityEnabled()) {
        Credentials creds = UserGroupInformation.getCurrentUser().getCredentials();
        final Token<?> tokens[] = fs.addDelegationTokens("app", creds);
        for (Token<?> token : creds.getAllTokens()) {
            log.info(" " + token);
            if (token != null && token.getKind().toString().equals(HDFS_DELEGATION_TOKEN)) {
                DelegationTokenIdentifier id = (DelegationTokenIdentifier) token.decodeIdentifier();
                long diff = id.getMaxDate() - id.getIssueDate();
                long maxDiff = Math.round(Long.valueOf(diff).doubleValue() * RELOGIN_PERCENT);
                reloginTimestamp = id.getIssueDate() + maxDiff;
                issueTimestamp = id.getIssueDate();
                log.info("expiry date " + id.getMaxDate());
                log.info(RELOGIN_PERCENT * 100 + "% of expiry date " + reloginTimestamp);
            }
        }
        allTokens = IgniteYarnUtils.createTokenBuffer(creds);
    }
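For reference, the relogin-time arithmetic used above can be isolated into a small sketch that does not depend on Hadoop. The class name `ReloginSchedule`, its method names, and the sample issue date are hypothetical and chosen only for illustration; in the application the inputs would come from `id.getIssueDate()` and `id.getMaxDate()`, and the fraction from `RELOGIN_PERCENT` (assumed here to be 0.75).

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: mirrors the reloginTimestamp arithmetic from the code above.
final class ReloginSchedule {
    private ReloginSchedule() {}

    /** Returns the time (ms since epoch) at which `fraction` of the token lifetime has elapsed. */
    static long reloginTimestamp(long issueDateMs, long maxDateMs, double fraction) {
        if (maxDateMs < issueDateMs) {
            throw new IllegalArgumentException("maxDate is before issueDate");
        }
        if (fraction <= 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in (0, 1]");
        }
        long lifetimeMs = maxDateMs - issueDateMs;
        return issueDateMs + Math.round(lifetimeMs * fraction);
    }

    /** Returns how long to wait from `nowMs` until the relogin time, never negative. */
    static long delayUntilRelogin(long nowMs, long reloginAtMs) {
        return Math.max(0L, reloginAtMs - nowMs);
    }

    public static void main(String[] args) {
        // Assumed sample values: a token issued at an arbitrary instant with a 7-day maximum lifetime.
        long issueDateMs = Instant.parse("2018-07-09T17:27:07Z").toEpochMilli();
        long maxDateMs = issueDateMs + Duration.ofDays(7).toMillis();

        long reloginAtMs = reloginTimestamp(issueDateMs, maxDateMs, 0.75);
        System.out.println("relogin at " + Instant.ofEpochMilli(reloginAtMs));
        System.out.println("delay from issue (ms): " + delayUntilRelogin(issueDateMs, reloginAtMs));
    }
}
```

With a 7-day lifetime and a fraction of 0.75, the relogin point falls 5 days and 6 hours after the issue date. The delay returned by `delayUntilRelogin` could be passed to a scheduler of your choice.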
After some time we get the exception below:
18/07/16 17:27:07 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm116
18/07/16 17:27:07 INFO retry.RetryInvocationHandler: Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over rm116 after 645 fail over attempts. Trying to fail over after sleeping for 1996ms.
java.net.ConnectException: Call failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.GeneratedConstructorAccessor45.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
at org.apache.hadoop.ipc.Client.call(Client.java:1508)
at org.apache.hadoop.ipc.Client.call(Client.java:1441)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at com.sun.proxy.$Proxy15.allocate(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy16.allocate(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:277)
at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:648)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:744)
at org.apache.hadoop.ipc.Client$Connection.access$3000(Client.java:396)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1557)
at org.apache.hadoop.ipc.Client.call(Client.java:1480)
... 12 more
Any guidance would be appreciated.
Thanks in advance.