Using the split() function in PySpark

Problem description

I am trying to write PySpark code in a Jupyter Notebook and am running into a problem when using the split() function with a DataFrame. This is what I am using:

import_csv=spark.read.csv("F:\\Learning\\PySpark\\DATA\\Iris.csv",header="true")
import_csv.show()
+---+-------------+------------+-------------+------------+-----------+
| Id|SepalLengthCm|SepalWidthCm|PetalLengthCm|PetalWidthCm|    Species|
+---+-------------+------------+-------------+------------+-----------+
|  1|          5.1|         3.5|          1.4|         0.2|Iris-setosa|
|  2|          4.9|         3.0|          1.4|         0.2|Iris-setosa|
|  3|          4.7|         3.2|          1.3|         0.2|Iris-setosa|
|  4|          4.6|         3.1|          1.5|         0.2|Iris-setosa|
|  5|          5.0|         3.6|          1.4|         0.2|Iris-setosa|
|  6|          5.4|         3.9|          1.7|         0.4|Iris-setosa|
|  7|          4.6|         3.4|          1.4|         0.3|Iris-setosa|
|  8|          5.0|         3.4|          1.5|         0.2|Iris-setosa|
|  9|          4.4|         2.9|          1.4|         0.2|Iris-setosa|
| 10|          4.9|         3.1|          1.5|         0.1|Iris-setosa|
| 11|          5.4|         3.7|          1.5|         0.2|Iris-setosa|
| 12|          4.8|         3.4|          1.6|         0.2|Iris-setosa|
| 13|          4.8|         3.0|          1.4|         0.1|Iris-setosa|
| 14|          4.3|         3.0|          1.1|         0.1|Iris-setosa|
| 15|          5.8|         4.0|          1.2|         0.2|Iris-setosa|
| 16|          5.7|         4.4|          1.5|         0.4|Iris-setosa|
| 17|          5.4|         3.9|          1.3|         0.4|Iris-setosa|
| 18|          5.1|         3.5|          1.4|         0.3|Iris-setosa|
| 19|          5.7|         3.8|          1.7|         0.3|Iris-setosa|
| 20|          5.1|         3.8|          1.5|         0.3|Iris-setosa|
+---+-------------+------------+-------------+------------+-----------+
only showing top 20 rows

Trying to split each row of the RDD on ',' (comma):

csv_split= import_csv.rdd.map(lambda var1: var1.split(','))
print(csv_split.collect())

Getting the error that 'split' is not in list:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 65.0 failed 1 times, most recent failure: Lost task 0.0 in stage 65.0 (TID 65, DESKTOP-NPEMBC9, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\spark-3.0.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\types.py", line 1595, in __getattr__
    idx = self.__fields__.index(item)
ValueError: 'split' is not in list

Tags: python, apache-spark, pyspark

Solution


The error in the question happens because import_csv.rdd is an RDD of Row objects, not strings: looking up .split on a Row searches the DataFrame's column names (as the traceback's self.__fields__.index(item) shows), which is why it raises ValueError: 'split' is not in list. At the DataFrame level, use pyspark.sql.functions.split() instead. In Spark, split() splits a string column into multiple parts based on a delimiter and returns a column of ArrayType:

from pyspark.sql import functions as F

df_b = spark.createDataFrame([('1', 'ABC-07-DEF')], ["ID", "col1"])
df_b = df_b.withColumn('post_split', F.split(F.col('col1'), "-"))
df_b.show()

+---+----------+--------------+
| ID|      col1|    post_split|
+---+----------+--------------+
|  1|ABC-07-DEF|[ABC, 07, DEF]|
+---+----------+--------------+

You can also use getItem() to pull individual columns out of that array column, as shown below:

df_b = (df_b
    .withColumn('split_col1', F.col('post_split').getItem(0))
    .withColumn('split_col2', F.col('post_split').getItem(1))
    .withColumn('split_col3', F.col('post_split').getItem(2)))
df_b.show()

+---+----------+--------------+----------+----------+----------+
| ID|      col1|    post_split|split_col1|split_col2|split_col3|
+---+----------+--------------+----------+----------+----------+
|  1|ABC-07-DEF|[ABC, 07, DEF]|       ABC|        07|       DEF|
+---+----------+--------------+----------+----------+----------+
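As explained above, the original code fails because import_csv.rdd yields Row objects rather than strings. If an RDD of comma-split values is really what you want, read the file as plain text instead of through spark.read.csv. A minimal sketch, assuming the same file path and SparkSession (spark) as in the question:

# textFile() yields an RDD of raw lines (plain strings),
# so str.split(',') works here, unlike on the Row objects of df.rdd
raw_lines = spark.sparkContext.textFile("F:\\Learning\\PySpark\\DATA\\Iris.csv")
csv_split = raw_lines.map(lambda line: line.split(','))
print(csv_split.take(3))

Note that with textFile() the header line comes through as ordinary data and may need to be filtered out; in most cases the DataFrame route with F.split() shown above is the simpler option.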
