Cast a column using another column's value in a Spark dataframe

Problem description

I have a dataframe like this:

rdd1 = sc.parallelize([(100,2,1234.5678),(101,3,1234.5678)])
df = spark.createDataFrame(rdd1,(['id','dec','val']))

+---+---+---------+
| id|dec|      val|
+---+---+---------+
|100|  2|1234.5678|
|101|  3|1234.5678|
+---+---+---------+

Based on the value available in the column dec, I want to cast the column val. For example, if dec = 2, I want val cast to DecimalType(7,2).

I tried the following, but it doesn't work:

df.select(col('id'), col('dec'), col('val'), col('val').cast(DecimalType(7, col('dec'))).cast(StringType()).alias('modVal')).show()

Error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/column.py", line 419, in cast
    jdt = spark._jsparkSession.parseDataType(dataType.json())
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 69, in json
    return json.dumps(self.jsonValue(),
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 225, in jsonValue
    return "decimal(%d,%d)" % (self.precision, self.scale)
TypeError: %d format: a number is required, not Column

If I hardcode the scale to a specific number, the same cast works; that part is straightforward:

df.select(col('id'),col('dec'),col('val'),col('val').cast(DecimalType(7,3)).cast(StringType()).alias('modVal')).show()

+---+---+---------+--------+
| id|dec|      val|  modVal|
+---+---+---------+--------+
|100|  2|1234.5678|1234.568|
|101|  3|1234.5678|1234.568|
+---+---+---------+--------+

Please help me solve this.

Tags: python, sql, apache-spark, pyspark, apache-spark-sql

Solution


Columns in Spark (or any relational system, for that matter) have to be homogeneous: every row must share one type. An operation like this, where each row is cast to a different DecimalType, is not only unsupported but also doesn't make much sense. That is why the traceback complains that DecimalType's scale must be a plain number, not a Column.
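If you still want per-row precision in the output, one workaround (a sketch, not part of this answer) is to produce a single homogeneous StringType column instead. The helper below, `format_val`, is a hypothetical name; in PySpark you would wrap it with `pyspark.sql.functions.udf` and apply it to the two columns:

```python
def format_val(val, dec):
    """Round val to `dec` decimal places and render it as a string,
    mimicking cast(DecimalType(7, dec)).cast(StringType()) per row."""
    return f"{round(val, dec):.{dec}f}"

# Assumed PySpark usage (requires a running SparkSession):
# from pyspark.sql.functions import udf, col
# from pyspark.sql.types import StringType
# format_udf = udf(format_val, StringType())
# df.withColumn('modVal', format_udf(col('val'), col('dec'))).show()

print(format_val(1234.5678, 2))  # scale 2 for rows where dec = 2
print(format_val(1234.5678, 3))  # scale 3 for rows where dec = 3
```

Because every row ends up as a string, the resulting column has one consistent type, which is what Spark requires. If only a handful of dec values occur, chaining `when(col('dec') == 2, ...).when(col('dec') == 3, ...)` over hardcoded casts is a UDF-free alternative.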

