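scala - How to populate a new Spark DataFrame column with the value from a specific row of another column. Advice needed
Problem description
My question is as follows:
I have a Spark DataFrame that looks like this: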
+-----------+---------------+
|         id|           name|
+-----------+---------------+
|          1|         Total:|
|          2|          Male:|
|          3|  Under 5 years|
|          4|   5 to 9 years|
|          5| 10 to 14 years|
|          6|        Female:|
|          7|  Under 5 years|
|          8|   5 to 9 years|
|          9| 10 to 14 years|
+-----------+---------------+
I want to create a new DF with an added column that will look like this:
+-----------+---------------+---------------------+
|         id|           name|             new_name|
+-----------+---------------+---------------------+
|          1|         Total:|               Total:|
|          2|          Male:|                Male:|
|          3|  Under 5 years|  Male: Under 5 years|
|          4|   5 to 9 years|  Male: Under 5 years|
|          5| 10 to 14 years|  Male: Under 5 years|
|          6|        Female:|              Female:|
|          7|  Under 5 years|Female: Under 5 years|
|          8|   5 to 9 years|Female: Under 5 years|
|          9| 10 to 14 years|Female: Under 5 years|
+-----------+---------------+---------------------+
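I don't have any code worth showing; I'm looking for a way to approach the problem. I think it would be something like this:
val dfB = dfA.withColumn(row => aUDF(row))
I assume the solution needs some kind of UDF. I assume it has to loop or map over the rows and update a "prefix" val whenever it finds a row with ":" in the name field. But I don't know how to do that. Any ideas would be greatly appreciated.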
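The "update a prefix whenever a row contains ':'" idea can be sketched on an ordinary Scala collection to make the logic concrete. This is only an illustration under stated assumptions: the object name `PrefixFill` and the function `fill` are hypothetical, a header row is assumed to be any name ending in ":", and the rows are assumed to be sorted by id already. It runs on a local `Seq`, not on a distributed DataFrame.

```scala
// Plain-Scala sketch of carrying the most recent header forward.
// Assumption: a header row is any name that ends with ":".
object PrefixFill {
  // Takes (id, name) pairs sorted by id and returns (id, name, newName) triples.
  def fill(rows: Seq[(Int, String)]): Seq[(Int, String, String)] = {
    // scanLeft threads the latest header seen so far through the sequence.
    val prefixes = rows
      .scanLeft(Option.empty[String]) { case (current, (_, name)) =>
        if (name.endsWith(":")) Some(name) else current
      }
      .tail // drop the initial seed so prefixes line up with rows

    rows.zip(prefixes).map { case ((id, name), prefix) =>
      val newName = prefix match {
        case Some(p) if p != name => s"$p $name" // detail row: prepend the header
        case _                    => name        // header row, or no header seen yet
      }
      (id, name, newName)
    }
  }
}
```

Because the running prefix depends on row order, the same idea on a DataFrame needs an explicit ordering (for example by id) rather than a row-at-a-time UDF.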
Solution
With Spark 2.4.3, you can achieve this using the split function together with the last window function.
scala> import org.apache.spark.sql.expressions.Window
scala> val df = spark.createDataFrame(Seq((1, "Total:"), (2, "Male:"), (3, "Under 5 years"), (4, "5 to 9 years"), (5, "10 to 14 years"), (6, "Female:"), (7, "Under 5 years"), (8, "5 to 9 years"), (9, "10 to 14 years"))).toDF("id", "name")
scala> df.show
+---+--------------+
| id|          name|
+---+--------------+
|  1|        Total:|
|  2|         Male:|
|  3| Under 5 years|
|  4|  5 to 9 years|
|  5|10 to 14 years|
|  6|       Female:|
|  7| Under 5 years|
|  8|  5 to 9 years|
|  9|10 to 14 years|
+---+--------------+
scala> val win = Window.orderBy(col("id"))
scala> val df2 = df.withColumn("name_1", last(when(split($"name", ":")(1) === "", $"name"), true).over(win))
scala> df2.withColumn("name", when($"name" === $"name_1", $"name").otherwise(concat($"name_1", $"name"))).drop($"name_1").show(false)
+---+---------------------+
|id |name                 |
+---+---------------------+
|1  |Total:               |
|2  |Male:                |
|3  |Male:Under 5 years   |
|4  |Male:5 to 9 years    |
|5  |Male:10 to 14 years  |
|6  |Female:              |
|7  |Female:Under 5 years |
|8  |Female:5 to 9 years  |
|9  |Female:10 to 14 years|
+---+---------------------+
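The transcript above overwrites name and joins the prefix without a space. If you would rather keep name and write the result to a separate new_name column with a space after the colon, a variant could look like the sketch below. The names `NewNameColumn` and `withNewName` are hypothetical, and header rows are assumed to be exactly those whose name ends in ":" (a simpler test than the split trick). Note that a window with orderBy and no partitionBy makes Spark move all rows into a single partition, so this suits small lookup-style tables rather than large data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, concat_ws, last, when}

object NewNameColumn {
  // Adds a "new_name" column; expects columns "id" and "name".
  def withNewName(df: DataFrame): DataFrame = {
    // Assumption: a header row is any row whose name ends with ":".
    val isHeader = col("name").endsWith(":")

    // Ordered by id with no partitionBy, so all rows end up in one partition.
    val win = Window.orderBy(col("id"))

    // Non-header rows yield null here; last(..., ignoreNulls = true) then
    // picks the most recent header at or before the current row.
    val header = last(when(isHeader, col("name")), ignoreNulls = true).over(win)

    df.withColumn(
      "new_name",
      // concat_ws skips nulls, so rows before any header keep their plain name.
      when(isHeader, col("name")).otherwise(concat_ws(" ", header, col("name")))
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("new-name-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, "Total:"), (2, "Male:"), (3, "Under 5 years"), (4, "5 to 9 years"),
      (5, "10 to 14 years"), (6, "Female:"), (7, "Under 5 years"),
      (8, "5 to 9 years"), (9, "10 to 14 years")
    ).toDF("id", "name")

    withNewName(df).show(false)
    spark.stop()
  }
}
```

In spark-shell, where spark and the implicits are already in scope, only the body of withNewName is needed.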
I think this is what you want to achieve. If it solves your problem, please accept the answer. Happy Hadoop