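python - PySpark: split a column of lists into multiple columns
Question
This question is similar to one already asked for Pandas here. I am using a Google Cloud DataProc cluster to execute a function, so I cannot convert the data to pandas.
I want to convert the following: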
+----+----------------------------------+-----+---------+------+--------------------+-------------+
| key| value|topic|partition|offset| timestamp|timestampType|
+----+----------------------------------+-----+---------+------+--------------------+-------------+
|null|["sepal_length","sepal_width",...]| iris| 0| 289|2021-04-11 22:32:...| 0|
|null|["5.0","3.5","1.3","0.3","setosa"]| iris| 0| 290|2021-04-11 22:32:...| 0|
|null|["4.5","2.3","1.3","0.3","setosa"]| iris| 0| 291|2021-04-11 22:32:...| 0|
|null|["4.4","3.2","1.3","0.2","setosa"]| iris| 0| 292|2021-04-11 22:32:...| 0|
|null|["5.0","3.5","1.6","0.6","setosa"]| iris| 0| 293|2021-04-11 22:32:...| 0|
|null|["5.1","3.8","1.9","0.4","setosa"]| iris| 0| 294|2021-04-11 22:32:...| 0|
|null|["4.8","3.0","1.4","0.3","setosa"]| iris| 0| 295|2021-04-11 22:32:...| 0|
+----+----------------------------------+-----+---------+------+--------------------+-------------+
into this:
+--------------+-------------+--------------+-------------+-------+
| sepal_length | sepal_width | petal_length | petal_width | class |
+--------------+-------------+--------------+-------------+-------+
| 5.0 | 3.5 | 1.3 | 0.3 | setosa|
| 4.5 | 2.3 | 1.3 | 0.3 | setosa|
| 4.4 | 3.2 | 1.3 | 0.2 | setosa|
| 5.0 | 3.5 | 1.6 | 0.6 | setosa|
| 5.1 | 3.8 | 1.9 | 0.4 | setosa|
| 4.8 | 3.0 | 1.4 | 0.3 | setosa|
+--------------+-------------+--------------+-------------+-------+
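How can I do this? Any help would be greatly appreciated!
Solution
I took the long way round because I am relatively new to PySpark. I would be glad to know if there is a shorter way.
Recreate your dataframe in pandas: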
import pandas as pd
df = pd.DataFrame({"value":['["sepal_length","sepal_width","petal_length","petal_width","class"]','["5.0","3.5","1.3","0.3","setosa"]','["4.5","2.3","1.3","0.3","setosa"]','["4.4","3.2","1.3","0.2","setosa"]']})
Convert the pandas dataframe to a Spark dataframe (sdf):
sdf = spark.createDataFrame(df)
I strip the square brackets and the double quotes ("):
from pyspark.sql.functions import col, regexp_replace
sdf = sdf.withColumn('value', regexp_replace(col('value'), '[\\[\\"\\]]', ""))
sdf.show(truncate=False)
I split on the comma (,) and name the resulting columns:
import pyspark.sql.functions as f
df_split = sdf.select(f.split(sdf.value, ",")).rdd.flatMap(lambda x: x).toDF(schema=["sepal_length","sepal_width","petal_length","petal_width","class"])
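The strip-and-split logic above can be checked outside Spark. The following is a minimal sketch in plain Python, not part of the original answer: it uses the standard `re` module rather than Spark's `regexp_replace` (which uses Java regular expressions), on the assumption that this simple character class behaves the same in both. The variable names `sample`, `cleaned`, and `fields` are illustrative only.

```python
import re

# One message value copied from the question.
sample = '["5.0","3.5","1.3","0.3","setosa"]'

# The character class matches any single '[', ']' or '"' character;
# replacing each match with "" leaves a bare comma-separated string.
cleaned = re.sub(r'[\[\]"]', "", sample)

# Splitting on "," then yields one element per target column.
fields = cleaned.split(",")

print(cleaned)
print(fields)
```

Note that this approach depends on the values themselves never containing a comma, a bracket, or a double quote; a value such as `"a,b"` would be split into two fields.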
Filter out the non-numeric header row:
df_split = df_split.filter(df_split.sepal_length != "sepal_length")
df_split.show()
+------------+-----------+------------+-----------+------+
|sepal_length|sepal_width|petal_length|petal_width| class|
+------------+-----------+------------+-----------+------+
| 5.0| 3.5| 1.3| 0.3|setosa|
| 4.5| 2.3| 1.3| 0.3|setosa|
| 4.4| 3.2| 1.3| 0.2|setosa|
+------------+-----------+------------+-----------+------+