azure - Passing a Typesafe config file to a spark-submit job in Azure Databricks
Problem description
I am trying to pass a Typesafe config file to a spark-submit task and print the details from that config file.
import org.slf4j.{Logger, LoggerFactory}
import com.typesafe.config.{Config, ConfigFactory}
import org.apache.spark.sql.SparkSession

object Bootstrap extends MyLogging {

  val spark: SparkSession = SparkSession.builder.enableHiveSupport().getOrCreate()
  val config: Config = ConfigFactory.load("application.conf")

  def main(args: Array[String]): Unit = {
    val url: String = config.getString("db.url")
    val user: String = config.getString("db.user")
    println(url)
    println(user)
  }
}
The application.conf file:
db {
  url = "jdbc:postgresql://localhost:5432/test"
  user = "test"
}
I have uploaded the application.conf file to DBFS and used the same path when creating the job.
Spark-submit job JSON:
{
  "new_cluster": {
    "spark_version": "6.4.x-esr-scala2.11",
    "azure_attributes": {
      "availability": "ON_DEMAND_AZURE",
      "first_on_demand": 1,
      "spot_bid_max_price": -1
    },
    "node_type_id": "Standard_DS3_v2",
    "enable_elastic_disk": true,
    "num_workers": 1
  },
  "spark_submit_task": {
    "parameters": [
      "--class",
      "Bootstrap",
      "--conf",
      "spark.driver.extraClassPath=dbfs:/tmp/",
      "--conf",
      "spark.executor.extraClassPath=dbfs:/tmp/",
      "--files",
      "dbfs:/tmp/application.conf",
      "dbfs:/tmp/code-assembly-0.1.0.jar"
    ]
  },
  "email_notifications": {},
  "name": "application-conf-test",
  "max_concurrent_runs": 1
}
I created the spark-submit job with the JSON above and tried to run it using the Databricks CLI.
Error:
Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'db'
    at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:147)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
    at com.typesafe.config.impl.SimpleConfig.getString(SimpleConfig.java:206)
    at Bootstrap$.main(Test.scala:16)
    at Bootstrap.main(Test.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
I can see the following lines in the logs, but the file is not being loaded:
21/09/22 07:21:43 INFO SparkContext: Added file dbfs:/tmp/application.conf at dbfs:/tmp/application.conf with timestamp 1632295303654
21/09/22 07:21:43 INFO Utils: Fetching dbfs:/tmp/application.conf to /local_disk0/spark-20456b30-fddd-42d7-9b23-9e4c0d3c91cd/userFiles-ee199161-6f48-4c47-b1c7-763ce7c0895f/fetchFileTemp4713981355306806616.tmp
Please help me pass this Typesafe config file to the spark-submit job using the appropriate spark-submit parameters.
We also tried the following spark_submit_task parameters in the JSON above, but still hit the same issue:
[
  "--class",
  "Bootstrap",
  "--conf",
  "spark.driver.extraClassPath=/tmp/application.conf",
  "--files",
  "dbfs:/tmp/application.conf",
  "dbfs:/tmp/code-assembly-0.1.0.jar"
]
[
  "--class",
  "Bootstrap",
  "--conf",
  "spark.driver.extraClassPath=/tmp/",
  "--conf",
  "spark.executor.extraClassPath=/tmp/",
  "--files",
  "dbfs:/tmp/application.conf",
  "dbfs:/tmp/code-assembly-0.1.0.jar"
]
[
  "--class",
  "Bootstrap",
  "--conf",
  "spark.driver.extraClassPath=dbfs:/tmp/application.conf",
  "--conf",
  "spark.executor.extraClassPath=dbfs:/tmp/application.conf",
  "--files",
  "dbfs:/tmp/application.conf",
  "dbfs:/tmp/code-assembly-0.1.0.jar"
]
[
  "--class",
  "Bootstrap",
  "--conf",
  "spark.driver.extraClassPath=dbfs:/tmp/",
  "--conf",
  "spark.executor.extraClassPath=dbfs:/tmp/",
  "--files",
  "dbfs:/tmp/application.conf",
  "dbfs:/tmp/code-assembly-0.1.0.jar"
]
[
  "--class",
  "Bootstrap",
  "--conf",
  "spark.driver.extraClassPath=dbfs:./",
  "--conf",
  "spark.executor.extraClassPath=dbfs:./",
  "--files",
  "dbfs:/tmp/application.conf",
  "dbfs:/tmp/code-assembly-0.1.0.jar"
]
[
  "--class",
  "Bootstrap",
  "--driver-java-options",
  "-Dconfig.file=application.conf",
  "--conf",
  "spark.executor.extraJavaOptions=-Dconfig.file=application.conf",
  "--files",
  "dbfs:/tmp/application.conf",
  "dbfs:/tmp/code-assembly-0.1.0.jar"
]
[
  "--class",
  "Bootstrap",
  "--conf",
  "spark.driver.extraJavaOptions=-Dconfig.file=application.conf",
  "--conf",
  "spark.executor.extraJavaOptions=-Dconfig.file=application.conf",
  "--files",
  "dbfs:/tmp/application.conf",
  "dbfs:/tmp/code-assembly-0.1.0.jar"
]
Solution
It is easier to pass the file name explicitly as a job argument and refer to it as /dbfs/tmp/application.conf
(you will need to handle that argument in your code):
[
  "--class",
  "Bootstrap",
  "dbfs:/tmp/code-assembly-0.1.0.jar",
  "/dbfs/tmp/application.conf"
]
Or reference it via the extra Java options:
[
  "--class",
  "Bootstrap",
  "--conf",
  "spark.driver.extraJavaOptions=-Dconfig.file=/dbfs/tmp/application.conf",
  "dbfs:/tmp/code-assembly-0.1.0.jar"
]