amazon-web-services - When writing to Amazon S3 with PySpark, I get org/apache/hadoop/fs/StreamCapabilities
Problem description
Problem:
I am trying to use hadoop-aws with pyspark so that I can read files from and write files to Amazon S3.
Approaches
Installing packages
By passing the Maven coordinates of hadoop-aws and its dependencies to spark.jars.packages. However, I run into the org/apache/hadoop/fs/StreamCapabilities error.
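A minimal sketch of the spark.jars.packages approach, assuming the fix is to pin hadoop-aws to exactly the Hadoop version that the Spark distribution ships with. The helper names (hadoop_aws_coordinate, build_session), the application name, and the version string "2.7.5" are illustrative assumptions, not taken from a working setup. The StreamCapabilities class appears to exist only in newer hadoop-common releases, so a hadoop-aws jar newer than the bundled hadoop-common is one plausible cause of this error.

```python
def hadoop_aws_coordinate(hadoop_version: str) -> str:
    """Return the Maven coordinate for hadoop-aws pinned to one Hadoop version."""
    return f"org.apache.hadoop:hadoop-aws:{hadoop_version}"


def build_session(hadoop_version: str, app_name: str = "s3a-example"):
    """Create a SparkSession that pulls in a version-matched hadoop-aws.

    hadoop_version is assumed to equal the version of the hadoop-common jar
    in the Spark distribution's jars/ directory (for example "2.7.5").
    """
    # Imported lazily so hadoop_aws_coordinate works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        # Transitive dependencies of hadoop-aws are resolved from Maven.
        .config("spark.jars.packages", hadoop_aws_coordinate(hadoop_version))
        .getOrCreate()
    )


if __name__ == "__main__":
    print(hadoop_aws_coordinate("2.7.5"))
```

Calling build_session("2.7.5") would request org.apache.hadoop:hadoop-aws:2.7.5. The setting only takes effect if it is applied before the JVM starts, so it has to be set on the first SparkSession created in the process.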
Compiling Spark
./spark-3.0.2/dev/make-distribution.sh --name spark-3.0.2-bin-hadoop2.7.5 --pip --tgz -Phadoop-cloud -Dhadoop.version=2.7.5
When I use the compiled version, I get the same org/apache/hadoop/fs/StreamCapabilities error.
Here are the contents of spark-3.0.2/jars:
JLargeArrays-1.5.jar commons-lang3-3.9.jar ivy-2.4.0.jar jsr305-3.0.0.jar shapeless_2.12-2.3.3.jar
JTransforms-3.1.jar commons-logging-1.1.3.jar jackson-annotations-2.10.0.jar jul-to-slf4j-1.7.30.jar shims-0.7.45.jar
RoaringBitmap-0.7.45.jar commons-math3-3.4.1.jar jackson-core-2.10.0.jar kryo-shaded-4.0.2.jar slf4j-api-1.7.30.jar
activation-1.1.1.jar commons-net-3.1.jar jackson-core-asl-1.9.13.jar leveldbjni-all-1.8.jar slf4j-log4j12-1.7.30.jar
aircompressor-0.10.jar commons-text-1.6.jar jackson-databind-2.10.0.jar log4j-1.2.17.jar snappy-java-1.1.8.2.jar
algebra_2.12-2.0.0-M2.jar compress-lzf-1.0.3.jar jackson-dataformat-cbor-2.10.0.jar lz4-java-1.7.1.jar spark-catalyst_2.12-3.0.2.jar
antlr4-runtime-4.7.1.jar core-1.1.2.jar jackson-jaxrs-1.9.13.jar machinist_2.12-0.6.8.jar spark-core_2.12-3.0.2.jar
aopalliance-repackaged-2.6.1.jar curator-client-2.7.1.jar jackson-mapper-asl-1.9.13.jar macro-compat_2.12-1.1.1.jar spark-graphx_2.12-3.0.2.jar
apacheds-i18n-2.0.0-M15.jar curator-framework-2.7.1.jar jackson-module-paranamer-2.10.0.jar metrics-core-4.1.1.jar spark-hadoop-cloud_2.12-3.0.2.jar
apacheds-kerberos-codec-2.0.0-M15.jar curator-recipes-2.7.1.jar jackson-module-scala_2.12-2.10.0.jar metrics-graphite-4.1.1.jar spark-kvstore_2.12-3.0.2.jar
api-asn1-api-1.0.0-M20.jar flatbuffers-java-1.9.0.jar jackson-xc-1.9.13.jar metrics-jmx-4.1.1.jar spark-launcher_2.12-3.0.2.jar
api-util-1.0.0-M20.jar gson-2.2.4.jar jakarta.annotation-api-1.3.5.jar metrics-json-4.1.1.jar spark-mllib-local_2.12-3.0.2.jar
arpack_combined_all-0.1.jar guava-14.0.1.jar jakarta.inject-2.6.1.jar metrics-jvm-4.1.1.jar spark-mllib_2.12-3.0.2.jar
arrow-format-0.15.1.jar hadoop-annotations-2.7.5.jar jakarta.validation-api-2.0.2.jar minlog-1.3.0.jar spark-network-common_2.12-3.0.2.jar
arrow-memory-0.15.1.jar hadoop-auth-2.7.5.jar jakarta.ws.rs-api-2.1.6.jar netty-all-4.1.47.Final.jar spark-network-shuffle_2.12-3.0.2.jar
arrow-vector-0.15.1.jar hadoop-aws-2.7.5.jar jakarta.xml.bind-api-2.3.2.jar objenesis-2.5.1.jar spark-repl_2.12-3.0.2.jar
audience-annotations-0.5.0.jar hadoop-azure-2.7.5.jar janino-3.0.16.jar opencsv-2.3.jar spark-sketch_2.12-3.0.2.jar
avro-1.8.2.jar hadoop-client-2.7.5.jar javassist-3.25.0-GA.jar orc-core-1.5.10.jar spark-sql_2.12-3.0.2.jar
avro-ipc-1.8.2.jar hadoop-common-2.7.5.jar javax.servlet-api-3.1.0.jar orc-mapreduce-1.5.10.jar spark-streaming_2.12-3.0.2.jar
avro-mapred-1.8.2-hadoop2.jar hadoop-hdfs-2.7.5.jar jaxb-api-2.2.2.jar orc-shims-1.5.10.jar spark-tags_2.12-3.0.2.jar
azure-storage-2.0.0.jar hadoop-mapreduce-client-app-2.7.5.jar jaxb-runtime-2.3.2.jar oro-2.0.8.jar spark-unsafe_2.12-3.0.2.jar
breeze-macros_2.12-1.0.jar hadoop-mapreduce-client-common-2.7.5.jar jcl-over-slf4j-1.7.30.jar osgi-resource-locator-1.0.3.jar spire-macros_2.12-0.17.0-M1.jar
breeze_2.12-1.0.jar hadoop-mapreduce-client-core-2.7.5.jar jersey-client-2.30.jar paranamer-2.8.jar spire-platform_2.12-0.17.0-M1.jar
cats-kernel_2.12-2.0.0-M4.jar hadoop-mapreduce-client-jobclient-2.7.5.jar jersey-common-2.30.jar parquet-column-1.10.1.jar spire-util_2.12-0.17.0-M1.jar
chill-java-0.9.5.jar hadoop-mapreduce-client-shuffle-2.7.5.jar jersey-container-servlet-2.30.jar parquet-common-1.10.1.jar spire_2.12-0.17.0-M1.jar
chill_2.12-0.9.5.jar hadoop-openstack-2.7.5.jar jersey-container-servlet-core-2.30.jar parquet-encoding-1.10.1.jar stax-api-1.0-2.jar
commons-beanutils-1.9.4.jar hadoop-yarn-api-2.7.5.jar jersey-hk2-2.30.jar parquet-format-2.4.0.jar stream-2.9.6.jar
commons-cli-1.2.jar hadoop-yarn-client-2.7.5.jar jersey-media-jaxb-2.30.jar parquet-hadoop-1.10.1.jar threeten-extra-1.5.0.jar
commons-codec-1.10.jar hadoop-yarn-common-2.7.5.jar jersey-server-2.30.jar parquet-jackson-1.10.1.jar univocity-parsers-2.9.0.jar
commons-collections-3.2.2.jar hadoop-yarn-server-common-2.7.5.jar jetty-sslengine-6.1.26.jar protobuf-java-2.5.0.jar xbean-asm7-shaded-4.15.jar
commons-compiler-3.0.16.jar hive-storage-api-2.7.1.jar jetty-util-6.1.26.jar py4j-0.10.9.jar xercesImpl-2.12.0.jar
commons-compress-1.20.jar hk2-api-2.6.1.jar jetty-util-9.4.34.v20201102.jar pyrolite-4.30.jar xml-apis-1.4.01.jar
commons-configuration-1.6.jar hk2-locator-2.6.1.jar joda-time-2.10.5.jar scala-collection-compat_2.12-2.1.1.jar xmlenc-0.52.jar
commons-crypto-1.1.0.jar hk2-utils-2.6.1.jar json4s-ast_2.12-3.6.6.jar scala-compiler-2.12.10.jar xz-1.5.jar
commons-digester-1.8.jar htrace-core-3.1.0-incubating.jar json4s-core_2.12-3.6.6.jar scala-library-2.12.10.jar zookeeper-3.4.14.jar
commons-httpclient-3.1.jar httpclient-4.5.6.jar json4s-jackson_2.12-3.6.6.jar scala-parser-combinators_2.12-1.1.2.jar zstd-jni-1.4.4-3.jar
commons-io-2.4.jar httpcore-4.4.12.jar json4s-scalap_2.12-3.6.6.jar scala-reflect-2.12.10.jar
commons-lang-2.6.jar istack-commons-runtime-3.0.8.jar jsp-api-2.1.jar scala-xml_2.12-1.2.0.jar
Compiling Spark with only hadoop-cloud
./spark-3.0.2/dev/make-distribution.sh --name spark-3.0.2 --pip --tgz -Phadoop-cloud
When I try to save a file to Amazon S3, I get the following error:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/store/EtagChecksum
Here are the jars produced by this build:
JLargeArrays-1.5.jar commons-lang3-3.9.jar ivy-2.4.0.jar jsr305-3.0.0.jar shapeless_2.12-2.3.3.jar
JTransforms-3.1.jar commons-logging-1.1.3.jar jackson-annotations-2.10.0.jar jul-to-slf4j-1.7.30.jar shims-0.7.45.jar
RoaringBitmap-0.7.45.jar commons-math3-3.4.1.jar jackson-core-2.10.0.jar kryo-shaded-4.0.2.jar slf4j-api-1.7.30.jar
activation-1.1.1.jar commons-net-3.1.jar jackson-core-asl-1.9.13.jar leveldbjni-all-1.8.jar slf4j-log4j12-1.7.30.jar
aircompressor-0.10.jar commons-text-1.6.jar jackson-databind-2.10.0.jar log4j-1.2.17.jar snappy-java-1.1.8.2.jar
algebra_2.12-2.0.0-M2.jar compress-lzf-1.0.3.jar jackson-dataformat-cbor-2.10.0.jar lz4-java-1.7.1.jar spark-catalyst_2.12-3.0.2.jar
antlr4-runtime-4.7.1.jar core-1.1.2.jar jackson-jaxrs-1.9.13.jar machinist_2.12-0.6.8.jar spark-core_2.12-3.0.2.jar
aopalliance-repackaged-2.6.1.jar curator-client-2.7.1.jar jackson-mapper-asl-1.9.13.jar macro-compat_2.12-1.1.1.jar spark-graphx_2.12-3.0.2.jar
apacheds-i18n-2.0.0-M15.jar curator-framework-2.7.1.jar jackson-module-paranamer-2.10.0.jar metrics-core-4.1.1.jar spark-hadoop-cloud_2.12-3.0.2.jar
apacheds-kerberos-codec-2.0.0-M15.jar curator-recipes-2.7.1.jar jackson-module-scala_2.12-2.10.0.jar metrics-graphite-4.1.1.jar spark-kvstore_2.12-3.0.2.jar
api-asn1-api-1.0.0-M20.jar flatbuffers-java-1.9.0.jar jackson-xc-1.9.13.jar metrics-jmx-4.1.1.jar spark-launcher_2.12-3.0.2.jar
api-util-1.0.0-M20.jar gson-2.2.4.jar jakarta.annotation-api-1.3.5.jar metrics-json-4.1.1.jar spark-mllib-local_2.12-3.0.2.jar
arpack_combined_all-0.1.jar guava-14.0.1.jar jakarta.inject-2.6.1.jar metrics-jvm-4.1.1.jar spark-mllib_2.12-3.0.2.jar
arrow-format-0.15.1.jar hadoop-annotations-2.7.4.jar jakarta.validation-api-2.0.2.jar minlog-1.3.0.jar spark-network-common_2.12-3.0.2.jar
arrow-memory-0.15.1.jar hadoop-auth-2.7.4.jar jakarta.ws.rs-api-2.1.6.jar netty-all-4.1.47.Final.jar spark-network-shuffle_2.12-3.0.2.jar
arrow-vector-0.15.1.jar hadoop-aws-2.7.4.jar jakarta.xml.bind-api-2.3.2.jar objenesis-2.5.1.jar spark-repl_2.12-3.0.2.jar
audience-annotations-0.5.0.jar hadoop-azure-2.7.4.jar janino-3.0.16.jar opencsv-2.3.jar spark-sketch_2.12-3.0.2.jar
avro-1.8.2.jar hadoop-client-2.7.4.jar javassist-3.25.0-GA.jar orc-core-1.5.10.jar spark-sql_2.12-3.0.2.jar
avro-ipc-1.8.2.jar hadoop-common-2.7.4.jar javax.servlet-api-3.1.0.jar orc-mapreduce-1.5.10.jar spark-streaming_2.12-3.0.2.jar
avro-mapred-1.8.2-hadoop2.jar hadoop-hdfs-2.7.4.jar jaxb-api-2.2.2.jar orc-shims-1.5.10.jar spark-tags_2.12-3.0.2.jar
azure-storage-2.0.0.jar hadoop-mapreduce-client-app-2.7.4.jar jaxb-runtime-2.3.2.jar oro-2.0.8.jar spark-unsafe_2.12-3.0.2.jar
breeze-macros_2.12-1.0.jar hadoop-mapreduce-client-common-2.7.4.jar jcl-over-slf4j-1.7.30.jar osgi-resource-locator-1.0.3.jar spire-macros_2.12-0.17.0-M1.jar
breeze_2.12-1.0.jar hadoop-mapreduce-client-core-2.7.4.jar jersey-client-2.30.jar paranamer-2.8.jar spire-platform_2.12-0.17.0-M1.jar
cats-kernel_2.12-2.0.0-M4.jar hadoop-mapreduce-client-jobclient-2.7.4.jar jersey-common-2.30.jar parquet-column-1.10.1.jar spire-util_2.12-0.17.0-M1.jar
chill-java-0.9.5.jar hadoop-mapreduce-client-shuffle-2.7.4.jar jersey-container-servlet-2.30.jar parquet-common-1.10.1.jar spire_2.12-0.17.0-M1.jar
chill_2.12-0.9.5.jar hadoop-openstack-2.7.4.jar jersey-container-servlet-core-2.30.jar parquet-encoding-1.10.1.jar stax-api-1.0-2.jar
commons-beanutils-1.9.4.jar hadoop-yarn-api-2.7.4.jar jersey-hk2-2.30.jar parquet-format-2.4.0.jar stream-2.9.6.jar
commons-cli-1.2.jar hadoop-yarn-client-2.7.4.jar jersey-media-jaxb-2.30.jar parquet-hadoop-1.10.1.jar threeten-extra-1.5.0.jar
commons-codec-1.10.jar hadoop-yarn-common-2.7.4.jar jersey-server-2.30.jar parquet-jackson-1.10.1.jar univocity-parsers-2.9.0.jar
commons-collections-3.2.2.jar hadoop-yarn-server-common-2.7.4.jar jetty-sslengine-6.1.26.jar protobuf-java-2.5.0.jar xbean-asm7-shaded-4.15.jar
commons-compiler-3.0.16.jar hive-storage-api-2.7.1.jar jetty-util-6.1.26.jar py4j-0.10.9.jar xercesImpl-2.12.0.jar
commons-compress-1.20.jar hk2-api-2.6.1.jar jetty-util-9.4.34.v20201102.jar pyrolite-4.30.jar xml-apis-1.4.01.jar
commons-configuration-1.6.jar hk2-locator-2.6.1.jar joda-time-2.10.5.jar scala-collection-compat_2.12-2.1.1.jar xmlenc-0.52.jar
commons-crypto-1.1.0.jar hk2-utils-2.6.1.jar json4s-ast_2.12-3.6.6.jar scala-compiler-2.12.10.jar xz-1.5.jar
commons-digester-1.8.jar htrace-core-3.1.0-incubating.jar json4s-core_2.12-3.6.6.jar scala-library-2.12.10.jar zookeeper-3.4.14.jar
commons-httpclient-3.1.jar httpclient-4.5.6.jar json4s-jackson_2.12-3.6.6.jar scala-parser-combinators_2.12-1.1.2.jar zstd-jni-1.4.4-3.jar
commons-io-2.4.jar httpcore-4.4.12.jar json4s-scalap_2.12-3.6.6.jar scala-reflect-2.12.10.jar
commons-lang-2.6.jar istack-commons-runtime-3.0.8.jar jsp-api-2.1.jar scala-xml_2.12-1.2.0.jar
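Intuition
I think the error is related to some internal mismatch between the hadoop-aws version and the hadoop-common version. However, I don't understand how to resolve it by passing configuration to the SparkSession from pyspark, or how to compile Spark so that these problems go away.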
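One way to test this intuition is to check whether every hadoop-* jar on the classpath carries the same version. The sketch below does that for a list of jar file names such as the listings above. The function names and the file-name pattern are assumptions made for illustration; it only inspects names and does not look inside the jars or at packages fetched through spark.jars.packages.

```python
import os
import re
from collections import defaultdict

# Matches names like "hadoop-aws-2.7.5.jar"; "spark-hadoop-cloud_..." is ignored
# because the pattern is anchored at the start of the file name.
HADOOP_JAR = re.compile(r"^(hadoop-[a-z-]+?)-(\d+(?:\.\d+)+)\.jar$")


def hadoop_versions(jar_names):
    """Group Hadoop artifact names by the version found in their file names."""
    versions = defaultdict(set)
    for name in jar_names:
        match = HADOOP_JAR.match(name)
        if match:
            artifact, version = match.groups()
            versions[version].add(artifact)
    return dict(versions)


def is_consistent(jar_names):
    """Return True when all Hadoop jars in the list share a single version."""
    return len(hadoop_versions(jar_names)) <= 1


def check_dir(jars_dir):
    """Report the Hadoop versions found in a directory such as spark-3.0.2/jars."""
    return hadoop_versions(os.listdir(jars_dir))


if __name__ == "__main__":
    sample = ["hadoop-aws-2.7.4.jar", "hadoop-common-2.7.4.jar", "guava-14.0.1.jar"]
    print(hadoop_versions(sample), is_consistent(sample))
```

If check_dir reports a single version but the error persists, the conflicting jar is probably coming from somewhere other than the jars directory, for example from a package added at session start or from a different PySpark installation on the Python path.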
Solution