Apache Spark integration with on-prem S3

Problem description

I am using Spark with on-prem S3 (MinIO) and the Spark Operator. When we use Spark with S3 with SSL disabled, everything works fine: we can read data, write data, and also run the main application jar stored in S3. However, after enabling SSL and providing a self-signed certificate (for now), we started running into problems.

  1. When we test Spark connectivity with S3, we run a local job (master=local[*]) and provide our SSL certificate by adding it to a JKS truststore (a keytool sketch of that import follows the command below). This solves our problem.

Here is the command:

./spark-submit \
--master local[*] \
--name ml-pipeline-16 \
--conf spark.ssl.server.keystore.type=jks \
--conf spark.ssl.server.keystore.password=changeit \
--conf spark.ssl.server.keystore.location=/Library/Java/JavaVirtualMachines/jdk1.8.0_281.jdk/Contents/Home/jre/lib/security/cacerts \
--conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.keyStore=/Library/Java/JavaVirtualMachines/jdk1.8.0_281.jdk/Contents/Home/jre/lib/security/cacerts -Djavax.net.ssl.keyStorePassword=changeit" \
--class com.abc.dp.efg.DataSetGenerator \
/Users/ayush.goyal/IdeaProjects/test/target/SparkS3-1.0-SNAPSHOT-jar-with-dependencies.jar 
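
The "adding it to a JKS truststore" step mentioned in point 1 can be done with keytool along these lines; this is only a sketch, and minio.crt is a placeholder for the exported MinIO certificate:

# Import the self-signed MinIO certificate into the JDK cacerts truststore
# (minio.crt is a placeholder file name; the keystore path and password match the command above)
keytool -importcert -trustcacerts -noprompt \
  -alias minio \
  -file minio.crt \
  -keystore /Library/Java/JavaVirtualMachines/jdk1.8.0_281.jdk/Contents/Home/jre/lib/security/cacerts \
  -storepass changeit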


  2. When we try to run the job with the application jar provided in S3 itself, it fails to connect to S3 to start the job, although we have provided the certificate in the sparkConf parameters, and we get the following error:
Exception in thread "main" org.apache.hadoop.fs.s3a.AWSBadRequestException: doesBucketExist on minio21: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: null; S3 Extended Request ID: null; Proxy: null), S3 Extended Request ID: null:400 Bad Request: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: null; S3 Extended Request ID: null; Proxy: null)

Here is our Spark Operator YAML:


apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: minio-stg-test-4
spec:
  type: Java
  mode: cluster
  image: "stg-test/spark/spark-py:v3.1.1-h3-2"
  imagePullPolicy: IfNotPresent
  mainClass: com.abc.dp.efg.DataSetGenerator
  mainApplicationFile: "s3a://minio21/jars/SparkS3-1.0-SNAPSHOT-jar-with-dependencies.jar"
  sparkVersion: "3.1.1"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    configMaps:
    - name: minio-certificate
      path: /mnt/config-maps
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.1.1
    serviceAccount: spark-user
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    configMaps:
    - name: minio-certificate
      path: /mnt/config-maps
    cores: 1
    instances: 2
    memory: "512m"
    labels:
      version: 3.1.1
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  sparkConf:
    "spark.kubernetes.file.upload.path": "s3a://minio21/tmp"
    "spark.hadoop.fs.s3a.access.key": "fdgvbsgt"
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    "spark.hadoop.fs.s3a.fast.upload": "true"
    "spark.hadoop.fs.s3a.secret.key": "sfbdfbbsdrbh44q3#$"
    "spark.hadoop.fs.s3a.endpoint": "http://[278b:c1r0:0012:5ed3:b112:2::]:30000"
    "spark.hadoop.fs.s3a.path.style.access": "true"
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName": "OnDemand"
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass": "robin"
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit": "200Gi"
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path": "/tmp/spark-local-dir"
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly": "false"
    "spark.executor.extraJavaOptions": "-Djavax.net.ssl.keyStore=/mnt/config-maps/cacerts -Djavax.net.ssl.keyStorePassword=changeit"
    "spark.ui.port": "4041"
    "spark.ssl.server.keystore.type": "jks"
    "spark.ssl.server.keystore.password": "changeit"
    "spark.ssl.server.keystore.location": "/mnt/config-maps/cacerts"

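For context, the minio-certificate configmap referenced in the driver and executor sections holds the cacerts truststore mounted at /mnt/config-maps; such a configmap can be created with kubectl along these lines (the local file path and namespace are placeholders):

# Create the configmap that carries the JKS truststore mounted into the driver and executor pods
# (./cacerts and the namespace spark-jobs are placeholders for illustration)
kubectl create configmap minio-certificate \
  --from-file=cacerts=./cacerts \
  --namespace spark-jobs
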
Note: previously, with SSL disabled, we were able to run our job with the application jar provided in S3, just as we do in the YAML above.

How can we make our job run the way we are trying to? Is it possible?

Tags: apache-spark, ssl, amazon-s3, spark-operator

Solution

