Is there a way to write a Spark DataFrame with an ArrayType column to Excel using the Crealytics spark-excel package?

Problem Description

I want to create an XLSX file from a Spark DataFrame. I am using Crealytics' spark-excel package (https://github.com/crealytics/spark-excel) for this.

Here is how I create the Spark DataFrame:

my_row = Row([1,2,3,4,5] ,True, decimal.Decimal(5.66) , float(10.111) , 5 , "some string" , datetime(2019, 10, 23, 8, 36, 49, tzinfo=tzlocal()))

my_schema = StructType([StructField("array_type", ArrayType(IntegerType()), True),
                       StructField("boolean_type", BooleanType(), True),
                       StructField("decimal_type", DecimalType(), True),
                       StructField("double_type", DoubleType(), True),
                       StructField("int_type", IntegerType(), True),
                       StructField("string_type", StringType(), True),
                       StructField("timestamp_type", TimestampType(), True)])

df = spark.createDataFrame([my_row] , my_schema)

Here is my DataFrame: [screenshot of the df.show() output]

Here is how I try to create the Excel file from the DF:

df.write.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("timestampformat", "yyyy-MM-dd hh:mm:ss") \
    .save("path_to_excel\\crealytics_output.xlsx")

Here is the error:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-14-83d86eb4a8f1> in <module>()
      1 # df = df.drop("array_type")
----> 2 df.write.format("com.crealytics.spark.excel").option("header","true").option("timestampformat","yyyy-MM-dd hh:mm:ss").save("file:///C:\\temp_folder\\crealytics_array.xlsx")

c:\python27\lib\site-packages\pyspark\sql\readwriter.pyc in save(self, path, format, mode, partitionBy, **options)
    736             self._jwrite.save()
    737         else:
--> 738             self._jwrite.save(path)
    739 
    740     @since(1.4)

c:\python27\lib\site-packages\py4j\java_gateway.pyc in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

c:\python27\lib\site-packages\pyspark\sql\utils.pyc in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

c:\python27\lib\site-packages\py4j\protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o278.save.
: scala.MatchError: WrappedArray(1, 2, 3, 4, 5) (of class scala.collection.mutable.WrappedArray$ofRef)
    at com.crealytics.spark.excel.AreaDataLocator$class.toCell(DataLocator.scala:117)
    at com.crealytics.spark.excel.CellRangeAddressDataLocator.toCell(DataLocator.scala:134)
    at com.crealytics.spark.excel.AreaDataLocator$$anonfun$7$$anonfun$8.apply(DataLocator.scala:107)
    at com.crealytics.spark.excel.AreaDataLocator$$anonfun$7$$anonfun$8.apply(DataLocator.scala:106)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at com.crealytics.spark.excel.AreaDataLocator$$anonfun$7.apply(DataLocator.scala:106)
    at com.crealytics.spark.excel.AreaDataLocator$$anonfun$7.apply(DataLocator.scala:104)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
    at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at scala.collection.AbstractIterator.to(Iterator.scala:1334)
    at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
    at scala.collection.AbstractIterator.toList(Iterator.scala:1334)
    at com.crealytics.spark.excel.AreaDataLocator$class.toSheet(DataLocator.scala:110)
    at com.crealytics.spark.excel.CellRangeAddressDataLocator.toSheet(DataLocator.scala:134)
    at com.crealytics.spark.excel.ExcelFileSaver.sheet$1(ExcelFileSaver.scala:36)
    at com.crealytics.spark.excel.ExcelFileSaver.writeToWorkbook$1(ExcelFileSaver.scala:40)
    at com.crealytics.spark.excel.ExcelFileSaver.save(ExcelFileSaver.scala:48)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:60)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
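
From the stack trace, the MatchError is raised in toCell (DataLocator.scala:117), which apparently has no conversion case for array values, so the ArrayType column seems to be what the writer cannot handle. As a workaround sketch (my own idea, not something documented by the package), the array column could be flattened into a delimited string before writing:

from pyspark.sql import functions as F

# Flatten the ArrayType column into a comma-delimited string so that every
# cell holds a plain scalar the Excel writer knows how to convert.
# concat_ws requires array<string>, hence the cast.
df_flat = df.withColumn(
    "array_type",
    F.concat_ws(",", F.col("array_type").cast("array<string>")),
)

df_flat.write.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .save("path_to_excel\\crealytics_output.xlsx")

This loses the array structure in the worksheet, though, and I would rather keep the column as-is if the package supports it.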

Here is the full code:

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import IntegerType, StructField, StructType, StringType, LongType, ArrayType, BooleanType, \
    TimestampType, DecimalType, DoubleType

from datetime import datetime
from dateutil.tz import tzlocal
import decimal

my_row = Row([1, 2, 3, 4, 5], True, decimal.Decimal(5.66), float(10.111), 5, "some string",
             datetime(2019, 10, 23, 8, 36, 49, tzinfo=tzlocal()))

my_schema = StructType([StructField("array_type", ArrayType(IntegerType()), True),
                        StructField("boolean_type", BooleanType(), True),
                        StructField("decimal_type", DecimalType(), True),
                        StructField("double_type", DoubleType(), True),
                        StructField("int_type", IntegerType(), True),
                        StructField("string_type", StringType(), True),
                        StructField("timestamp_type", TimestampType(), True)])

spark = SparkSession.builder.master("local").appName("Testing Excel").getOrCreate()

df = spark.createDataFrame([my_row], my_schema)

df.show()

df.write.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("timestampformat", "yyyy-MM-dd hh:mm:ss") \
    .save("path_to_excel\\crealytics_output.xlsx")

I converted the Spark DF to a Pandas DF and used Pandas' to_excel() method, which generated the Excel file correctly. But I need help producing the desired Excel file with Crealytics' spark-excel package itself.
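
For reference, this is roughly the Pandas workaround that did work for me (a minimal sketch; note that toPandas() collects all rows to the driver, so it only suits small DataFrames):

# Collect the Spark DataFrame to the driver and let Pandas write the workbook.
pandas_df = df.toPandas()

# Depending on the Excel engine (openpyxl, xlsxwriter, ...), list-valued
# cells may need to be converted to strings first (an assumption on my part;
# adjust to your engine's behavior).
pandas_df["array_type"] = pandas_df["array_type"].astype(str)

pandas_df.to_excel("path_to_excel\\pandas_output.xlsx", index=False)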

Additional details: Python version 2.7.12; Spark version 2.4.4; spark-excel version spark-excel_2.11:0.13.1.

Tags: python, excel, pandas, dataframe, apache-spark

Solution


Hmm, everything looks fine. There is an option named 'append' that you could try. It should look something like this:

import org.apache.spark.sql._

val df: DataFrame = ???
df.write
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .mode("append") // Optional, default: overwrite.
  .save("path_to_excel\\crealytics_output.xlsx")

Does that work for you?

