python - 有没有办法使用 Crealytics spark-excel 包将带有 ArrayType 列的 Spark 数据帧写入 Excel?
问题描述
我想从 Spark Dataframe 创建一个 XLSX 文件。我正在使用 Crealytics 的 spark-excel 包 ( https://github.com/crealytics/spark-excel ) 来实现相同的目的。
以下是我创建 spark 数据框的方法:
my_row = Row([1,2,3,4,5] ,True, decimal.Decimal(5.66) , float(10.111) , 5 , "some string" , datetime(2019, 10, 23, 8, 36, 49, tzinfo=tzlocal()))
my_schema = StructType([StructField("array_type", ArrayType(IntegerType()), True),
StructField("boolean_type", BooleanType(), True),
StructField("decimal_type", DecimalType(), True),
StructField("double_type", DoubleType(), True),
StructField("int_type", IntegerType(), True),
StructField("string_type", StringType(), True),
StructField("timestamp_type", TimestampType(), True)])
df = spark.createDataFrame([my_row] , my_schema)
这是我尝试从 DF 创建 Excel 文件的方法:
df.write.format("com.crealytics.spark.excel").option("header","true").option("timestampformat","yyyy-MM-dd hh:mm:ss").save("path_to_excel\\crealytics_output.xlsx")
这是错误:
Py4JJavaError Traceback (most recent call last)
<ipython-input-14-83d86eb4a8f1> in <module>()
1 # df = df.drop("array_type")
----> 2 df.write.format("com.crealytics.spark.excel").option("header","true").option("timestampformat","yyyy-MM-dd hh:mm:ss").save("file:///C:\\temp_folder\\crealytics_array.xlsx")
c:\python27\lib\site-packages\pyspark\sql\readwriter.pyc in save(self, path, format, mode, partitionBy, **options)
736 self._jwrite.save()
737 else:
--> 738 self._jwrite.save(path)
739
740 @since(1.4)
c:\python27\lib\site-packages\py4j\java_gateway.pyc in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
c:\python27\lib\site-packages\pyspark\sql\utils.pyc in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
c:\python27\lib\site-packages\py4j\protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o278.save.
: scala.MatchError: WrappedArray(1, 2, 3, 4, 5) (of class scala.collection.mutable.WrappedArray$ofRef)
at com.crealytics.spark.excel.AreaDataLocator$class.toCell(DataLocator.scala:117)
at com.crealytics.spark.excel.CellRangeAddressDataLocator.toCell(DataLocator.scala:134)
at com.crealytics.spark.excel.AreaDataLocator$$anonfun$7$$anonfun$8.apply(DataLocator.scala:107)
at com.crealytics.spark.excel.AreaDataLocator$$anonfun$7$$anonfun$8.apply(DataLocator.scala:106)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at com.crealytics.spark.excel.AreaDataLocator$$anonfun$7.apply(DataLocator.scala:106)
at com.crealytics.spark.excel.AreaDataLocator$$anonfun$7.apply(DataLocator.scala:104)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1334)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
at scala.collection.AbstractIterator.toList(Iterator.scala:1334)
at com.crealytics.spark.excel.AreaDataLocator$class.toSheet(DataLocator.scala:110)
at com.crealytics.spark.excel.CellRangeAddressDataLocator.toSheet(DataLocator.scala:134)
at com.crealytics.spark.excel.ExcelFileSaver.sheet$1(ExcelFileSaver.scala:36)
at com.crealytics.spark.excel.ExcelFileSaver.writeToWorkbook$1(ExcelFileSaver.scala:40)
at com.crealytics.spark.excel.ExcelFileSaver.save(ExcelFileSaver.scala:48)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:60)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
这是整个代码:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import IntegerType, StructField, StructType, StringType, LongType, ArrayType, BooleanType, \
TimestampType, DecimalType, DoubleType
from datetime import datetime
from dateutil.tz import tzlocal
import decimal
my_row = Row([1, 2, 3, 4, 5], True, decimal.Decimal(5.66), float(10.111), 5, "some string",
datetime(2019, 10, 23, 8, 36, 49, tzinfo=tzlocal()))
my_schema = StructType([StructField("array_type", ArrayType(IntegerType()), True),
StructField("boolean_type", BooleanType(), True),
StructField("decimal_type", DecimalType(), True),
StructField("double_type", DoubleType(), True),
StructField("int_type", IntegerType(), True),
StructField("string_type", StringType(), True),
StructField("timestamp_type", TimestampType(), True)])
spark = SparkSession.builder.master("local").appName("Testing Excel").getOrCreate()
df = spark.createDataFrame([my_row], my_schema)
df.show()
df.write.format("com.crealytics.spark.excel").option("header", "true").option("timestampformat",
"yyyy-MM-dd hh:mm:ss").save(
"path_to_excel\\crealytics_output.xlsx")
我将 Spark DF 转换为 Pandas DF 并使用 Pandas 的to_excel()
方法,它能够正确生成 excel 文件。我需要帮助来使用 crealytics 的 spark-excel 包创建所需的 excel 文件。
其他详细信息:Python 版本:2.7.12 Spark 版本:2.4.4 spark-excel 版本:spark-excel_2.11:0.13.1
解决方案
嗯,一切看起来都很好。有一个选项 ncamed '.append()' 你可以试试吗?所以……应该是这样的……
import org.apache.spark.sql._
val df: DataFrame = ???
df.write
.format("com.crealytics.spark.excel")
.option("header", "true")
.mode("append") // Optional, default: overwrite.
.save("path_to_excel\\crealytics_output.xlsx")
那对你有用吗?
推荐阅读
- python - 如何在 GridSearchCV 中使用 TransformedTargetRegressor?
- ios - UISplitViewController 内的 iOS 13 UITabBarController 未正确折叠
- r - R:查找“for”循环
- python - Python/Django:尝试导入 urls.py 时没有名为“xx.xx”的模块
- text - Odoo 12 CE,Qweb-text 报告包含所有 HTML 模板代码
- javascript - 从文件服务器中的文件创建 https:// 引用
- ios - compileKotlinIos FAILED - 找不到标准库
- python - 如何创建一个numpy数组?
- javascript - .save 不是函数, .map 不是更新文档时的函数
- python - 我的 A* 寻路算法并不总是得到最短路径