Getting org/apache/hadoop/fs/StreamCapabilities when writing to Amazon S3 with PySpark

Problem description

Question:

I am trying to use hadoop-aws with pyspark so that I can read and write files on Amazon S3.

Approach

Installing the package

I passed the Maven coordinates of hadoop-aws and its dependencies to spark.jars.packages. However, I ran into the org/apache/hadoop/fs/StreamCapabilities error.
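For reference, passing the coordinate from PySpark can be sketched as follows. This is a minimal sketch, not the exact configuration used in the question: the helper names `hadoop_aws_coordinate` and `build_session`, the application name, and the bucket path are hypothetical, and it assumes the hadoop-aws version should be the same as the Hadoop version on Spark's classpath (2.7.5 in the build shown in this question).

```python
def hadoop_aws_coordinate(hadoop_version: str) -> str:
    """Return the Maven coordinate of hadoop-aws for the given Hadoop version."""
    return f"org.apache.hadoop:hadoop-aws:{hadoop_version}"


def build_session(hadoop_version: str):
    """Create a SparkSession that resolves hadoop-aws at the given version."""
    # Imported lazily so the helper above works without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("s3-write-check")  # hypothetical application name
        # spark.jars.packages takes comma-separated Maven coordinates;
        # transitive dependencies are resolved automatically.
        .config("spark.jars.packages", hadoop_aws_coordinate(hadoop_version))
        .getOrCreate()
    )


# Example usage (needs pyspark, network access, and AWS credentials):
#   spark = build_session("2.7.5")
#   spark.range(3).write.mode("overwrite").parquet("s3a://my-bucket/tmp/check")
```

Once a session exists, `spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()` should report the Hadoop version the JVM actually loaded. It goes through PySpark's internal `_jvm` handle, so treat it as a debugging aid rather than a stable API.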

Compiling Spark

./spark-3.0.2/dev/make-distribution.sh --name spark-3.0.2-bin-hadoop2.7.5 --pip --tgz -Phadoop-cloud -Dhadoop.version=2.7.5

When I use the compiled version, I get the same error: org/apache/hadoop/fs/StreamCapabilities

These are the contents of spark-3.0.2/jars:

JLargeArrays-1.5.jar                   commons-lang3-3.9.jar                        ivy-2.4.0.jar                           jsr305-3.0.0.jar                         shapeless_2.12-2.3.3.jar
JTransforms-3.1.jar                    commons-logging-1.1.3.jar                    jackson-annotations-2.10.0.jar          jul-to-slf4j-1.7.30.jar                  shims-0.7.45.jar
RoaringBitmap-0.7.45.jar               commons-math3-3.4.1.jar                      jackson-core-2.10.0.jar                 kryo-shaded-4.0.2.jar                    slf4j-api-1.7.30.jar
activation-1.1.1.jar                   commons-net-3.1.jar                          jackson-core-asl-1.9.13.jar             leveldbjni-all-1.8.jar                   slf4j-log4j12-1.7.30.jar
aircompressor-0.10.jar                 commons-text-1.6.jar                         jackson-databind-2.10.0.jar             log4j-1.2.17.jar                         snappy-java-1.1.8.2.jar
algebra_2.12-2.0.0-M2.jar              compress-lzf-1.0.3.jar                       jackson-dataformat-cbor-2.10.0.jar      lz4-java-1.7.1.jar                       spark-catalyst_2.12-3.0.2.jar
antlr4-runtime-4.7.1.jar               core-1.1.2.jar                               jackson-jaxrs-1.9.13.jar                machinist_2.12-0.6.8.jar                 spark-core_2.12-3.0.2.jar
aopalliance-repackaged-2.6.1.jar       curator-client-2.7.1.jar                     jackson-mapper-asl-1.9.13.jar           macro-compat_2.12-1.1.1.jar              spark-graphx_2.12-3.0.2.jar
apacheds-i18n-2.0.0-M15.jar            curator-framework-2.7.1.jar                  jackson-module-paranamer-2.10.0.jar     metrics-core-4.1.1.jar                   spark-hadoop-cloud_2.12-3.0.2.jar
apacheds-kerberos-codec-2.0.0-M15.jar  curator-recipes-2.7.1.jar                    jackson-module-scala_2.12-2.10.0.jar    metrics-graphite-4.1.1.jar               spark-kvstore_2.12-3.0.2.jar
api-asn1-api-1.0.0-M20.jar             flatbuffers-java-1.9.0.jar                   jackson-xc-1.9.13.jar                   metrics-jmx-4.1.1.jar                    spark-launcher_2.12-3.0.2.jar
api-util-1.0.0-M20.jar                 gson-2.2.4.jar                               jakarta.annotation-api-1.3.5.jar        metrics-json-4.1.1.jar                   spark-mllib-local_2.12-3.0.2.jar
arpack_combined_all-0.1.jar            guava-14.0.1.jar                             jakarta.inject-2.6.1.jar                metrics-jvm-4.1.1.jar                    spark-mllib_2.12-3.0.2.jar
arrow-format-0.15.1.jar                hadoop-annotations-2.7.5.jar                 jakarta.validation-api-2.0.2.jar        minlog-1.3.0.jar                         spark-network-common_2.12-3.0.2.jar
arrow-memory-0.15.1.jar                hadoop-auth-2.7.5.jar                        jakarta.ws.rs-api-2.1.6.jar             netty-all-4.1.47.Final.jar               spark-network-shuffle_2.12-3.0.2.jar
arrow-vector-0.15.1.jar                hadoop-aws-2.7.5.jar                         jakarta.xml.bind-api-2.3.2.jar          objenesis-2.5.1.jar                      spark-repl_2.12-3.0.2.jar
audience-annotations-0.5.0.jar         hadoop-azure-2.7.5.jar                       janino-3.0.16.jar                       opencsv-2.3.jar                          spark-sketch_2.12-3.0.2.jar
avro-1.8.2.jar                         hadoop-client-2.7.5.jar                      javassist-3.25.0-GA.jar                 orc-core-1.5.10.jar                      spark-sql_2.12-3.0.2.jar
avro-ipc-1.8.2.jar                     hadoop-common-2.7.5.jar                      javax.servlet-api-3.1.0.jar             orc-mapreduce-1.5.10.jar                 spark-streaming_2.12-3.0.2.jar
avro-mapred-1.8.2-hadoop2.jar          hadoop-hdfs-2.7.5.jar                        jaxb-api-2.2.2.jar                      orc-shims-1.5.10.jar                     spark-tags_2.12-3.0.2.jar
azure-storage-2.0.0.jar                hadoop-mapreduce-client-app-2.7.5.jar        jaxb-runtime-2.3.2.jar                  oro-2.0.8.jar                            spark-unsafe_2.12-3.0.2.jar
breeze-macros_2.12-1.0.jar             hadoop-mapreduce-client-common-2.7.5.jar     jcl-over-slf4j-1.7.30.jar               osgi-resource-locator-1.0.3.jar          spire-macros_2.12-0.17.0-M1.jar
breeze_2.12-1.0.jar                    hadoop-mapreduce-client-core-2.7.5.jar       jersey-client-2.30.jar                  paranamer-2.8.jar                        spire-platform_2.12-0.17.0-M1.jar
cats-kernel_2.12-2.0.0-M4.jar          hadoop-mapreduce-client-jobclient-2.7.5.jar  jersey-common-2.30.jar                  parquet-column-1.10.1.jar                spire-util_2.12-0.17.0-M1.jar
chill-java-0.9.5.jar                   hadoop-mapreduce-client-shuffle-2.7.5.jar    jersey-container-servlet-2.30.jar       parquet-common-1.10.1.jar                spire_2.12-0.17.0-M1.jar
chill_2.12-0.9.5.jar                   hadoop-openstack-2.7.5.jar                   jersey-container-servlet-core-2.30.jar  parquet-encoding-1.10.1.jar              stax-api-1.0-2.jar
commons-beanutils-1.9.4.jar            hadoop-yarn-api-2.7.5.jar                    jersey-hk2-2.30.jar                     parquet-format-2.4.0.jar                 stream-2.9.6.jar
commons-cli-1.2.jar                    hadoop-yarn-client-2.7.5.jar                 jersey-media-jaxb-2.30.jar              parquet-hadoop-1.10.1.jar                threeten-extra-1.5.0.jar
commons-codec-1.10.jar                 hadoop-yarn-common-2.7.5.jar                 jersey-server-2.30.jar                  parquet-jackson-1.10.1.jar               univocity-parsers-2.9.0.jar
commons-collections-3.2.2.jar          hadoop-yarn-server-common-2.7.5.jar          jetty-sslengine-6.1.26.jar              protobuf-java-2.5.0.jar                  xbean-asm7-shaded-4.15.jar
commons-compiler-3.0.16.jar            hive-storage-api-2.7.1.jar                   jetty-util-6.1.26.jar                   py4j-0.10.9.jar                          xercesImpl-2.12.0.jar
commons-compress-1.20.jar              hk2-api-2.6.1.jar                            jetty-util-9.4.34.v20201102.jar         pyrolite-4.30.jar                        xml-apis-1.4.01.jar
commons-configuration-1.6.jar          hk2-locator-2.6.1.jar                        joda-time-2.10.5.jar                    scala-collection-compat_2.12-2.1.1.jar   xmlenc-0.52.jar
commons-crypto-1.1.0.jar               hk2-utils-2.6.1.jar                          json4s-ast_2.12-3.6.6.jar               scala-compiler-2.12.10.jar               xz-1.5.jar
commons-digester-1.8.jar               htrace-core-3.1.0-incubating.jar             json4s-core_2.12-3.6.6.jar              scala-library-2.12.10.jar                zookeeper-3.4.14.jar
commons-httpclient-3.1.jar             httpclient-4.5.6.jar                         json4s-jackson_2.12-3.6.6.jar           scala-parser-combinators_2.12-1.1.2.jar  zstd-jni-1.4.4-3.jar
commons-io-2.4.jar                     httpcore-4.4.12.jar                          json4s-scalap_2.12-3.6.6.jar            scala-reflect-2.12.10.jar
commons-lang-2.6.jar                   istack-commons-runtime-3.0.8.jar             jsp-api-2.1.jar                         scala-xml_2.12-1.2.0.jar

Compiling Spark with only hadoop-cloud

./spark-3.0.2/dev/make-distribution.sh --name spark-3.0.2 --pip --tgz -Phadoop-cloud

When I try to save a file to Amazon S3, I get the following error: java.lang.NoClassDefFoundError: org/apache/hadoop/fs/store/EtagChecksum

The jars produced by this build:

JLargeArrays-1.5.jar                   commons-lang3-3.9.jar                        ivy-2.4.0.jar                           jsr305-3.0.0.jar                         shapeless_2.12-2.3.3.jar
JTransforms-3.1.jar                    commons-logging-1.1.3.jar                    jackson-annotations-2.10.0.jar          jul-to-slf4j-1.7.30.jar                  shims-0.7.45.jar
RoaringBitmap-0.7.45.jar               commons-math3-3.4.1.jar                      jackson-core-2.10.0.jar                 kryo-shaded-4.0.2.jar                    slf4j-api-1.7.30.jar
activation-1.1.1.jar                   commons-net-3.1.jar                          jackson-core-asl-1.9.13.jar             leveldbjni-all-1.8.jar                   slf4j-log4j12-1.7.30.jar
aircompressor-0.10.jar                 commons-text-1.6.jar                         jackson-databind-2.10.0.jar             log4j-1.2.17.jar                         snappy-java-1.1.8.2.jar
algebra_2.12-2.0.0-M2.jar              compress-lzf-1.0.3.jar                       jackson-dataformat-cbor-2.10.0.jar      lz4-java-1.7.1.jar                       spark-catalyst_2.12-3.0.2.jar
antlr4-runtime-4.7.1.jar               core-1.1.2.jar                               jackson-jaxrs-1.9.13.jar                machinist_2.12-0.6.8.jar                 spark-core_2.12-3.0.2.jar
aopalliance-repackaged-2.6.1.jar       curator-client-2.7.1.jar                     jackson-mapper-asl-1.9.13.jar           macro-compat_2.12-1.1.1.jar              spark-graphx_2.12-3.0.2.jar
apacheds-i18n-2.0.0-M15.jar            curator-framework-2.7.1.jar                  jackson-module-paranamer-2.10.0.jar     metrics-core-4.1.1.jar                   spark-hadoop-cloud_2.12-3.0.2.jar
apacheds-kerberos-codec-2.0.0-M15.jar  curator-recipes-2.7.1.jar                    jackson-module-scala_2.12-2.10.0.jar    metrics-graphite-4.1.1.jar               spark-kvstore_2.12-3.0.2.jar
api-asn1-api-1.0.0-M20.jar             flatbuffers-java-1.9.0.jar                   jackson-xc-1.9.13.jar                   metrics-jmx-4.1.1.jar                    spark-launcher_2.12-3.0.2.jar
api-util-1.0.0-M20.jar                 gson-2.2.4.jar                               jakarta.annotation-api-1.3.5.jar        metrics-json-4.1.1.jar                   spark-mllib-local_2.12-3.0.2.jar
arpack_combined_all-0.1.jar            guava-14.0.1.jar                             jakarta.inject-2.6.1.jar                metrics-jvm-4.1.1.jar                    spark-mllib_2.12-3.0.2.jar
arrow-format-0.15.1.jar                hadoop-annotations-2.7.4.jar                 jakarta.validation-api-2.0.2.jar        minlog-1.3.0.jar                         spark-network-common_2.12-3.0.2.jar
arrow-memory-0.15.1.jar                hadoop-auth-2.7.4.jar                        jakarta.ws.rs-api-2.1.6.jar             netty-all-4.1.47.Final.jar               spark-network-shuffle_2.12-3.0.2.jar
arrow-vector-0.15.1.jar                hadoop-aws-2.7.4.jar                         jakarta.xml.bind-api-2.3.2.jar          objenesis-2.5.1.jar                      spark-repl_2.12-3.0.2.jar
audience-annotations-0.5.0.jar         hadoop-azure-2.7.4.jar                       janino-3.0.16.jar                       opencsv-2.3.jar                          spark-sketch_2.12-3.0.2.jar
avro-1.8.2.jar                         hadoop-client-2.7.4.jar                      javassist-3.25.0-GA.jar                 orc-core-1.5.10.jar                      spark-sql_2.12-3.0.2.jar
avro-ipc-1.8.2.jar                     hadoop-common-2.7.4.jar                      javax.servlet-api-3.1.0.jar             orc-mapreduce-1.5.10.jar                 spark-streaming_2.12-3.0.2.jar
avro-mapred-1.8.2-hadoop2.jar          hadoop-hdfs-2.7.4.jar                        jaxb-api-2.2.2.jar                      orc-shims-1.5.10.jar                     spark-tags_2.12-3.0.2.jar
azure-storage-2.0.0.jar                hadoop-mapreduce-client-app-2.7.4.jar        jaxb-runtime-2.3.2.jar                  oro-2.0.8.jar                            spark-unsafe_2.12-3.0.2.jar
breeze-macros_2.12-1.0.jar             hadoop-mapreduce-client-common-2.7.4.jar     jcl-over-slf4j-1.7.30.jar               osgi-resource-locator-1.0.3.jar          spire-macros_2.12-0.17.0-M1.jar
breeze_2.12-1.0.jar                    hadoop-mapreduce-client-core-2.7.4.jar       jersey-client-2.30.jar                  paranamer-2.8.jar                        spire-platform_2.12-0.17.0-M1.jar
cats-kernel_2.12-2.0.0-M4.jar          hadoop-mapreduce-client-jobclient-2.7.4.jar  jersey-common-2.30.jar                  parquet-column-1.10.1.jar                spire-util_2.12-0.17.0-M1.jar
chill-java-0.9.5.jar                   hadoop-mapreduce-client-shuffle-2.7.4.jar    jersey-container-servlet-2.30.jar       parquet-common-1.10.1.jar                spire_2.12-0.17.0-M1.jar
chill_2.12-0.9.5.jar                   hadoop-openstack-2.7.4.jar                   jersey-container-servlet-core-2.30.jar  parquet-encoding-1.10.1.jar              stax-api-1.0-2.jar
commons-beanutils-1.9.4.jar            hadoop-yarn-api-2.7.4.jar                    jersey-hk2-2.30.jar                     parquet-format-2.4.0.jar                 stream-2.9.6.jar
commons-cli-1.2.jar                    hadoop-yarn-client-2.7.4.jar                 jersey-media-jaxb-2.30.jar              parquet-hadoop-1.10.1.jar                threeten-extra-1.5.0.jar
commons-codec-1.10.jar                 hadoop-yarn-common-2.7.4.jar                 jersey-server-2.30.jar                  parquet-jackson-1.10.1.jar               univocity-parsers-2.9.0.jar
commons-collections-3.2.2.jar          hadoop-yarn-server-common-2.7.4.jar          jetty-sslengine-6.1.26.jar              protobuf-java-2.5.0.jar                  xbean-asm7-shaded-4.15.jar
commons-compiler-3.0.16.jar            hive-storage-api-2.7.1.jar                   jetty-util-6.1.26.jar                   py4j-0.10.9.jar                          xercesImpl-2.12.0.jar
commons-compress-1.20.jar              hk2-api-2.6.1.jar                            jetty-util-9.4.34.v20201102.jar         pyrolite-4.30.jar                        xml-apis-1.4.01.jar
commons-configuration-1.6.jar          hk2-locator-2.6.1.jar                        joda-time-2.10.5.jar                    scala-collection-compat_2.12-2.1.1.jar   xmlenc-0.52.jar
commons-crypto-1.1.0.jar               hk2-utils-2.6.1.jar                          json4s-ast_2.12-3.6.6.jar               scala-compiler-2.12.10.jar               xz-1.5.jar
commons-digester-1.8.jar               htrace-core-3.1.0-incubating.jar             json4s-core_2.12-3.6.6.jar              scala-library-2.12.10.jar                zookeeper-3.4.14.jar
commons-httpclient-3.1.jar             httpclient-4.5.6.jar                         json4s-jackson_2.12-3.6.6.jar           scala-parser-combinators_2.12-1.1.2.jar  zstd-jni-1.4.4-3.jar
commons-io-2.4.jar                     httpcore-4.4.12.jar                          json4s-scalap_2.12-3.6.6.jar            scala-reflect-2.12.10.jar
commons-lang-2.6.jar                   istack-commons-runtime-3.0.8.jar             jsp-api-2.1.jar                         scala-xml_2.12-1.2.0.jar

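Intuition

I think the error comes from some internal mismatch between the hadoop-aws version and the hadoop-common version. However, I don't understand how to fix it by passing configuration to the SparkSession from pyspark, or how to compile Spark so that the problem goes away.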
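One way to test that suspicion is to check whether every Hadoop jar on the classpath carries the same version. The sketch below does that from file names alone. The function names are hypothetical, and the pattern only recognizes plain numeric versions such as 2.7.5, so jars with qualifiers in their version are silently skipped.

```python
import re
from collections import defaultdict

# Matches names such as "hadoop-aws-2.7.5.jar" -> ("hadoop-aws", "2.7.5").
_HADOOP_JAR = re.compile(r"^(hadoop-[a-z-]+?)-(\d+(?:\.\d+)*)\.jar$")


def hadoop_versions(jar_names):
    """Map each Hadoop version found to the sorted artifacts that use it."""
    by_version = defaultdict(list)
    for name in jar_names:
        match = _HADOOP_JAR.match(name)
        if match:
            artifact, version = match.groups()
            by_version[version].append(artifact)
    return {version: sorted(arts) for version, arts in by_version.items()}


def is_consistent(jar_names):
    """True when at most one Hadoop version appears among the jars."""
    return len(hadoop_versions(jar_names)) <= 1


# Example usage against a Spark distribution (path taken from the question):
#   import os
#   print(hadoop_versions(os.listdir("spark-3.0.2/jars")))
```

Jars fetched through spark.jars.packages are downloaded elsewhere (typically the local Ivy cache), so their file names would need to be added to the list for the check to cover them.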

Tags: amazon-web-services, apache-spark, amazon-s3, pyspark

Solution

