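scala - How to store a Spark DataFrame as CSV in Azure Blob Storage
Question
I am trying to store a Spark DataFrame as a CSV on Azure Blob Storage from a local Spark cluster.
First, I set the configuration with the Azure account and account key (I am not sure which setting is the right one, so I set all of them):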
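// Note: getConf returns a copy of the SparkConf, so this call does not affect the running context.
sparkContext.getConf.set(s"fs.azure.account.key.${account}.blob.core.windows.net", accountKey)
// The dfs endpoint key is the one read by the ABFS driver (abfss:// paths).
sparkContext.hadoopConfiguration.set(s"fs.azure.account.key.${account}.dfs.core.windows.net", accountKey)
// The blob endpoint key is the one read by the WASB driver (wasbs:// paths, as used below).
sparkContext.hadoopConfiguration.set(s"fs.azure.account.key.${account}.blob.core.windows.net", accountKey)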
Then I try to store the CSV with the following:
val filePath = s"wasbs://${container}@${account}.blob.core.windows.net/${prefix}/${filename}"
dataFrame.coalesce(1)
  .write.format("csv")
  .options(Map(
    "header" -> (if (hasHeader) "true" else "false"),
    "sep" -> delimiter,
    "quote" -> quote
  ))
  .save(filePath)
However, the write then fails with "Job aborted" and the following stack trace:
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
When I look at the blob container I can see my file, but I cannot read it back into a Spark DataFrame. I get the error "Unable to infer schema for CSV. It must be specified manually.;" with the following stack trace:
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:185)
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:185)
scala.Option.getOrElse(Option.scala:121)
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:184)
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
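It seems the problem has already been reported on the Databricks forum.
What is the correct way to store a DataFrame on Azure Blob?
Answer
It turns out that an internal error occurs before the job is aborted: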
Caused by: java.lang.NoSuchMethodError: com.microsoft.azure.storage.blob.CloudBlob.startCopyFromBlob(Ljava/net/URI;Lcom/microsoft/azure/storage/AccessCondition;Lcom/microsoft/azure/storage/AccessCondition;Lcom/microsoft/azure/storage/blob/BlobRequestOptions;Lcom/microsoft/azure/storage/OperationContext;)Ljava/lang/String;
at org.apache.hadoop.fs.azure.StorageInterfaceImpl$CloudBlobWrapperImpl.startCopyFromBlob(StorageInterfaceImpl.java:399)
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2449)
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2372)
at org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsOutputStream.restoreKey(NativeAzureFileSystem.java:918)
at org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsOutputStream.close(NativeAzureFileSystem.java:819)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
at sun.nio.cs.StreamEncoder.implClose(StreamEncoder.java:320)
at sun.nio.cs.StreamEncoder.close(StreamEncoder.java:149)
at java.io.OutputStreamWriter.close(OutputStreamWriter.java:233)
at com.univocity.parsers.common.AbstractWriter.close(AbstractWriter.java:876)
... 18 more
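What happens is that, after the temporary file has been created with the actual data, the writer tries to move it to the location given by the user by calling CloudBlob.startCopyFromBlob. The Microsoft developers broke this by renaming that method to CloudBlob.startCopy, so the call fails with NoSuchMethodError when a newer azure-storage jar is on the classpath. That also explains why a file is visible in the container even though the job is aborted.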
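To confirm which azure-storage jar is actually loaded and which copy method it exposes, a small reflection check can help. This is only a diagnostic sketch: the object name ClasspathCheck and its helper methods are hypothetical, and it assumes the snippet runs with the same classpath as the Spark job.

```scala
import scala.util.Try

object ClasspathCheck {
  /** Jar or directory a class was loaded from, if it can be determined. */
  def codeSourceOf(className: String): Option[String] =
    for {
      cls      <- Try(Class.forName(className)).toOption
      domain   <- Option(cls.getProtectionDomain)
      source   <- Option(domain.getCodeSource) // null for JDK bootstrap classes
      location <- Option(source.getLocation)
    } yield location.toString

  /** Distinct names of public methods on the class that start with the prefix. */
  def methodsStartingWith(className: String, prefix: String): Seq[String] =
    Try(Class.forName(className)).toOption.toSeq
      .flatMap(_.getMethods.toSeq)
      .map(_.getName)
      .filter(_.startsWith(prefix))
      .distinct
      .sorted

  def main(args: Array[String]): Unit = {
    val blobClass = "com.microsoft.azure.storage.blob.CloudBlob"
    println(s"loaded from: ${codeSourceOf(blobClass).getOrElse("<not on classpath>")}")
    println(s"copy methods: ${methodsStartingWith(blobClass, "startCopy").mkString(", ")}")
  }
}
```

If the output lists startCopy but not startCopyFromBlob, the loaded azure-storage version is newer than the one the hadoop-azure code in the stack trace was compiled against.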
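I am using "org.apache.hadoop" % "hadoop-azure" % "3.2.1", which is the most recent hadoop-azure, and it still seems to rely on the older startCopyFromBlob, so I need to use an older azure-storage version that still has this method, probably 2.x.x.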
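In an sbt build, pinning the transitive dependency might look like the fragment below. It is a sketch under assumptions: the answer only says "probably 2.x.x", so the value 2.0.0 and the name azureStorageVersion are placeholders to verify against your own classpath, not a confirmed fix.

```scala
// build.sbt (sketch)
// Assumption: the answer only says "probably 2.x.x"; 2.0.0 is a placeholder to verify.
val azureStorageVersion = "2.0.0"

libraryDependencies += "org.apache.hadoop" % "hadoop-azure" % "3.2.1"

// Force the transitive azure-storage dependency to the chosen version.
dependencyOverrides += "com.microsoft.azure" % "azure-storage" % azureStorageVersion
```

After changing the version, rerun the write and check that the NoSuchMethodError no longer appears in the "Caused by" section of the trace.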