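apache-spark - Spark: how to get the pivoted column names from a DataFrame?
Problem description
I pivot on a column, which produces several new columns.
I want to take those columns and pack them under a single field.
The code below gives me the result I want.
However, I am selecting col("search"), col("main"), col("theme") by hand, and I would like to know whether there is a way to select all of these columns dynamically (can I call them pivoted columns?).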
import pandas as pd
from pyspark.sql import Row
from pyspark.sql.functions import avg, col, count, lit, struct

# I'm going to pivot on the 2nd column ("origin")
mylist = [
    [1, 'search', 3, 1],
    [1, 'search', 3, 2],
    [1, 'main', 5, 3],
    [1, 'main', 6, 4],
    [2, 'search', 4, 10],
    [2, 'search', 4, 11],
    [2, 'main', 6, 12],
    [2, 'main', 6, 13],
    [2, 'theme', 6, 14],
    [3, 'search', 4, 5],
    [3, 'main', 6, 6],
    [3, 'main', 6, 7],
    [3, 'theme', 6, 8],
]
df = pd.DataFrame(mylist, columns=['id', 'origin', 'time', 'screen_index'])
mylist = df.to_dict('records')
# get_spark_session() is my own helper (not shown); it returns a SparkSession
spark_session = get_spark_session()
df = spark_session.createDataFrame(Row(**x) for x in mylist)
df_wanted = df.groupBy("id").pivot('origin').agg(
    struct(count(lit(1)).alias('count'), avg("time").alias('avg_time'))
).withColumn(
    # Here I'm selecting the columns manually, but I want to grab them
    # dynamically because I don't know beforehand what they are going to be.
    "origin_info", struct(col("search"), col("main"), col("theme"))
).select("id", "origin_info")
df_wanted.printSchema()
root
 |-- id: long (nullable = true)
 |-- origin_info: struct (nullable = false)
 |    |-- search: struct (nullable = false)
 |    |    |-- count: long (nullable = false)
 |    |    |-- avg_time: double (nullable = true)
 |    |-- main: struct (nullable = false)
 |    |    |-- count: long (nullable = false)
 |    |    |-- avg_time: double (nullable = true)
 |    |-- theme: struct (nullable = false)
 |    |    |-- count: long (nullable = false)
 |    |    |-- avg_time: double (nullable = true)
Solution
Actually, I think I figured it out, although I don't know how well it performs.
I got the hint from https://stackoverflow.com/a/41011195/433570
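# df_wanted is assumed here to be the pivoted DataFrame, i.e. the result of
# groupBy("id").pivot("origin").agg(...) before "origin_info" is added.
names = df_wanted.schema.names.copy()
names.remove("id")  # everything except the grouping key is a pivoted column
columns = [col(name) for name in names]
df_wanted = df_wanted.withColumn(
    "origin_info", struct(*columns)
).select("id", "origin_info")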
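The same idea can be sketched as a small reusable helper. This is only a sketch under stated assumptions, not the answer's own code: the names pivoted_names and pack_pivoted are hypothetical, and it assumes that every column other than the grouping keys came from the pivot. It filters the column list instead of copying and mutating it, so it also works with more than one grouping key.

```python
from typing import Iterable, List


def pivoted_names(all_columns: Iterable[str], group_keys: Iterable[str]) -> List[str]:
    """Return the columns that are not grouping keys, preserving their order."""
    keys = set(group_keys)
    return [name for name in all_columns if name not in keys]


def pack_pivoted(df, group_keys, packed_name="origin_info"):
    """Pack every non-key column of a pivoted Spark DataFrame into one struct column.

    Hypothetical helper: assumes all non-key columns were produced by the pivot.
    """
    # Imported lazily so that pivoted_names works without pyspark installed.
    from pyspark.sql.functions import col, struct

    keys = list(group_keys)
    names = pivoted_names(df.columns, keys)
    packed = struct(*[col(name) for name in names]).alias(packed_name)
    return df.select(*keys, packed)


# Name selection alone, on a column list like the one the pivot in the question yields.
print(pivoted_names(["id", "main", "search", "theme"], ["id"]))
```

With the pivoted DataFrame from the question, the call would be pack_pivoted(pivoted_df, ["id"]), where pivoted_df stands for the result of groupBy("id").pivot("origin").agg(...).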