Adding a resetting index to a Spark DataFrame

Question

I have a DataFrame, and I want to add an index column that resets whenever the value in one of the columns changes.

--------------------
|  ColA   |  ColB  |
====================
|  G1     |  10    |
--------------------
|  G1     |  20    |
--------------------
|  G2     |  50    |
--------------------
|  G2     |  10    |
--------------------
|  G2     |  70    |
--------------------

I want the result to be:

-----------------------------
|  ColA   |  ColB  |  ColC  |
=============================
|  G1     |  10    |   1    |
-----------------------------
|  G1     |  20    |   2    |
-----------------------------
|  G2     |  50    |   1    |   <== reset because ColA changed
-----------------------------
|  G2     |  10    |   2    |
-----------------------------
|  G2     |  70    |   3    |
-----------------------------

Is there something suitable for this, along the lines of df.withColumn("id", monotonicallyIncreasingId)?

Tags: scala, dataframe, apache-spark

Answer

Use a Window partitioned by the column ColA:

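import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Number rows within each ColA group, ordered by ColB (so G2 is numbered in the order 10, 50, 70)
val w = Window.partitionBy("ColA").orderBy("ColB")
df.withColumn("id", row_number().over(w))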

Or, if you want to keep the original row order within each group instead of sorting by ColB:

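import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// "temp" records the current row order; number rows per ColA group by it, then drop it
val w = Window.partitionBy("ColA").orderBy("temp")
df.withColumn("temp", monotonically_increasing_id())
  .withColumn("id", row_number().over(w))
  .drop("temp")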
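
To try the order-preserving variant end to end, here is a minimal sketch that builds the sample data from the question and applies it. The object name ResetIndexExample, the helper withResetIndex, the local[1] master, and the app name are assumptions made for illustration; the new column is called ColC here only to match the table in the question. It also assumes the DataFrame has no existing column named temp.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

object ResetIndexExample {
  // Hypothetical helper: numbers rows 1, 2, 3, ... within each ColA group,
  // following the order in which the rows currently sit in the DataFrame.
  def withResetIndex(df: DataFrame): DataFrame = {
    val w = Window.partitionBy("ColA").orderBy("temp")
    df.withColumn("temp", monotonically_increasing_id()) // increasing with current row order
      .withColumn("ColC", row_number().over(w))          // restarts at 1 for each ColA value
      .drop("temp")
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed setup)
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("reset-index-example")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question
    val df = Seq(("G1", 10), ("G1", 20), ("G2", 50), ("G2", 10), ("G2", 70))
      .toDF("ColA", "ColB")

    // A window does not guarantee output order, so sort explicitly for display
    withResetIndex(df).orderBy("ColA", "ColC").show()

    spark.stop()
  }
}
```

Note that monotonically_increasing_id follows the DataFrame's current physical layout, so it only reflects the "original" order if nothing has shuffled the rows before it is applied.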
