How do I run a Glue job locally?

Problem description

I have a project set up as described here. However, this code:

import com.amazonaws.services.glue.{AWSGlueClientBuilder, GlueContext}
import org.apache.spark.SparkContext
import org.slf4j.LoggerFactory

object MyGlueJob {
  private val logger = LoggerFactory.getLogger(getClass)
  def main(sysArgs: Array[String]) {

    val spark: SparkContext = SparkContext.getOrCreate()
    val glueContext: GlueContext = new GlueContext(spark)
    val awsGlueClient = AWSGlueClientBuilder.defaultClient
  }
}

fails with the following error:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/11/21 15:40:32 INFO SparkContext: Running Spark version 2.4.3
19/11/21 15:40:33 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:368)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2544)
    at MyGlueJob$.main(MyGlueJob.scala:13)
    at MyGlueJob.main(MyGlueJob.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:66)
19/11/21 15:40:33 ERROR Utils: Uncaught exception in thread main
java.lang.NullPointerException
    at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$postApplicationEnd(SparkContext.scala:2416)
    at org.apache.spark.SparkContext$$anonfun$stop$1.apply$mcV$sp(SparkContext.scala:1931)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1930)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:585)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2544)
    at MyGlueJob$.main(MyGlueJob.scala:13)
    at MyGlueJob.main(MyGlueJob.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:66)
19/11/21 15:40:33 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:66)
Caused by: org.apache.spark.SparkException: A master URL must be set in your configuration
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:368)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2544)
    at MyGlueJob$.main(MyGlueJob.scala:13)
    at MyGlueJob.main(MyGlueJob.scala)
    ... 5 more

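Clearly the master URL needs to be set, but how can I set it from the command line or a system variable (that is, without touching the code)?

I have also read that the --master parameter should solve the problem, but adding it to the program arguments did nothing (here is the IntelliJ IDEA run configuration):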

[Image: IntelliJ IDEA run configuration]

The key question is whether a Glue job can be run locally and then run in AWS without touching the code. Is this possible?

Tags: scala, amazon-web-services, apache-spark, aws-glue

Solution


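You can create the Spark session explicitly and set whatever parameters you need. I can't say whether this will ultimately work in Glue, though. Below is the local session I use to test Spark jobs locally, even though I eventually run them in Glue. I only test pure Spark code this way.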

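  import org.apache.hadoop.security.UserGroupInformation
  import org.apache.spark.sql.SparkSession

  // Local session for tests; lazy, so Spark only starts on first use.
  lazy val spark: SparkSession = {
    // Make Hadoop calls run as a fixed user instead of the OS login user.
    UserGroupInformation.setLoginUser(UserGroupInformation.createRemoteUser("hduser"))
    SparkSession
      .builder()
      .master("local") // local mode with one worker thread, no cluster needed
      .appName("spark unit test")
      .getOrCreate()
  }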
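
As a sketch of how the same code could run both locally and in AWS, the master can be set only when nothing else has supplied one. `SparkConf.setIfMissing` leaves an existing `spark.master` untouched, and `new SparkConf()` loads any `spark.*` JVM system properties. The object name `JobSparkConf` and the `local[*]` fallback are illustrative assumptions, and it is assumed (not verified here) that the Glue runtime supplies its own master.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of Glue or Spark.
object JobSparkConf {
  // Builds a SparkConf that falls back to a local master only when
  // no spark.master was provided (e.g. by spark-submit or -Dspark.master).
  def build(appName: String): SparkConf = {
    val conf = new SparkConf().setAppName(appName) // loads spark.* system properties
    conf.setIfMissing("spark.master", "local[*]")  // assumption: local fallback
    conf
  }

  // Creates (or reuses) a session from the conf above.
  def session(appName: String): SparkSession =
    SparkSession.builder().config(build(appName)).getOrCreate()
}
```

Because `SparkConf` reads system properties, passing the JVM option `-Dspark.master=local[*]` (in IntelliJ IDEA, the VM options field of the run configuration rather than the program arguments) should also let the unmodified `SparkContext.getOrCreate()` call from the question start locally.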

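"The key question is whether a Glue job can be run locally and then run in AWS without touching the code. Is this possible?"

You can run any code using a development endpoint and Zeppelin. See the AWS documentation.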

