python - PySpark - Combine DF columns into named StructType
Problem description
I'm looking to combine multiple columns of a PySpark DataFrame into one column of type StructType.
Let's say I have a data frame like so:
columns = ['id', 'dogs', 'cats']
vals = [(1, 2, 0),(2, 0, 1)]
df = sqlContext.createDataFrame(vals, columns)
I'd like the resulting data frame to resemble this (not as it would actually be printed but to give an idea of what I mean if you aren't already familiar with StructType):
id | animals
1 | dogs=2, cats=0
2 | dogs=0, cats=1
Right now I am able to accomplish what I want by putting this:
StructType(
    [StructField('dogs', IntegerType(), True),
     StructField('cats', IntegerType(), True)]
)
at the end of my UDFs. However, I'd rather just do it with a single function. I'd be surprised if one doesn't exist.
Solution
If you need a map column: create literal columns with the column names as keys, then use the create_map function to construct the map column you need:
from pyspark.sql.functions import create_map, lit
new_df = df.select(
'id',
create_map(lit('dogs'), 'dogs', lit('cats'), 'cats').alias('animals')
# key : val, key : val
)
new_df.show(2, False)
#+---+----------------------+
#|id |animals |
#+---+----------------------+
#|1 |[dogs -> 2, cats -> 0]|
#|2 |[dogs -> 0, cats -> 1]|
#+---+----------------------+
new_df.printSchema()
#root
# |-- id: long (nullable = true)
# |-- animals: map (nullable = false)
# | |-- key: string
# | |-- value: long (valueContainsNull = true)
If you need a struct column: use the struct function:
from pyspark.sql.functions import struct
new_df = df.select('id', struct('dogs', 'cats').alias('animals'))
new_df.show(2, False)
#+---+-------+
#|id |animals|
#+---+-------+
#|1 |[2, 0] |
#|2 |[0, 1] |
#+---+-------+
new_df.printSchema()
#root
# |-- id: long (nullable = true)
# |-- animals: struct (nullable = false)
# | |-- dogs: long (nullable = true)
# | |-- cats: long (nullable = true)