PySpark: split a column of lists into multiple columns

Problem description

This question is similar to one already asked for Pandas here. I am running a function on a Google Cloud Dataproc cluster, so I cannot convert the data to pandas.

I want to convert the following:

+----+----------------------------------+-----+---------+------+--------------------+-------------+
| key|                             value|topic|partition|offset|           timestamp|timestampType|
+----+----------------------------------+-----+---------+------+--------------------+-------------+
|null|["sepal_length","sepal_width",...]| iris|        0|   289|2021-04-11 22:32:...|            0|
|null|["5.0","3.5","1.3","0.3","setosa"]| iris|        0|   290|2021-04-11 22:32:...|            0|
|null|["4.5","2.3","1.3","0.3","setosa"]| iris|        0|   291|2021-04-11 22:32:...|            0|
|null|["4.4","3.2","1.3","0.2","setosa"]| iris|        0|   292|2021-04-11 22:32:...|            0|
|null|["5.0","3.5","1.6","0.6","setosa"]| iris|        0|   293|2021-04-11 22:32:...|            0|
|null|["5.1","3.8","1.9","0.4","setosa"]| iris|        0|   294|2021-04-11 22:32:...|            0|
|null|["4.8","3.0","1.4","0.3","setosa"]| iris|        0|   295|2021-04-11 22:32:...|            0|
+----+----------------------------------+-----+---------+------+--------------------+-------------+

into this:

+--------------+-------------+--------------+-------------+-------+
| sepal_length | sepal_width | petal_length | petal_width | class |
+--------------+-------------+--------------+-------------+-------+
| 5.0          | 3.5         | 1.3          | 0.3         | setosa| 
| 4.5          | 2.3         | 1.3          | 0.3         | setosa| 
| 4.4          | 3.2         | 1.3          | 0.2         | setosa| 
| 5.0          | 3.5         | 1.6          | 0.6         | setosa| 
| 5.1          | 3.8         | 1.9          | 0.4         | setosa| 
| 4.8          | 3.0         | 1.4          | 0.3         | setosa| 
+--------------+-------------+--------------+-------------+-------+

How can I do this? Any help would be greatly appreciated!

Tags: python, pandas, google-cloud-platform, pyspark

Solution


I took the long way round since I am fairly new to PySpark. Happy to hear if there is a shorter way (one possible sketch is appended after the output below).

  1. Recreate your dataframe in pandas

    import pandas as pd

    df = pd.DataFrame({"value": ['["sepal_length","sepal_width","petal_length","petal_width","class"]',
                                 '["5.0","3.5","1.3","0.3","setosa"]',
                                 '["4.5","2.3","1.3","0.3","setosa"]',
                                 '["4.4","3.2","1.3","0.2","setosa"]']})

  2. Convert the pandas dataframe to a Spark DataFrame (sdf)

    sdf = spark.createDataFrame(df)

  3. Strip out the square brackets and the " characters

    from pyspark.sql import functions as f

    # remove [, ] and " so value becomes a plain comma-separated string
    sdf = sdf.withColumn('value', f.regexp_replace(f.col('value'), '[\\[\\"\\]]', ""))
    sdf.show(truncate=False)

  4. Split the string on the commas

    df_split = sdf.select(f.split(sdf.value, ",")).rdd.flatMap(lambda x: x).toDF(
        schema=["sepal_length", "sepal_width", "petal_length", "petal_width", "class"])

  5. Filter out the non-numeric header row

    df_split = df_split.filter(df_split.sepal_length != "sepal_length")
    df_split.show()


+------------+-----------+------------+-----------+------+
|sepal_length|sepal_width|petal_length|petal_width| class|
+------------+-----------+------------+-----------+------+
|         5.0|        3.5|         1.3|        0.3|setosa|
|         4.5|        2.3|         1.3|        0.3|setosa|
|         4.4|        3.2|         1.3|        0.2|setosa|
+------------+-----------+------------+-----------+------+
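As a possibly shorter route, here is an untested sketch that starts from the sdf created in step 2: split the cleaned string once into an array column and pull each element out with getItem(), which avoids the round trip through the RDD. The column list and the name df_short are just illustrative; the regexp_replace is the same one used in step 3 and is harmless if the column has already been cleaned.

    from pyspark.sql import functions as f

    cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

    # strip brackets/quotes and split once into an array column
    arr = f.split(f.regexp_replace(f.col("value"), '[\\[\\"\\]]', ""), ",")

    # one select pulls each array element into its own named column
    df_short = sdf.select([arr.getItem(i).alias(c) for i, c in enumerate(cols)])

    # drop the header row, whose first value is the literal column name
    df_short = df_short.filter(f.col("sepal_length") != "sepal_length")
    df_short.show()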
