首页 > 解决方案 > 在 Spark 中添加新记录

问题描述

我有一个数据框:

| ID | TIMESTAMP | VALUE |
  1     15:00:01    3
  1     17:04:02    2

当值为 2 时,我想在相同的时间减去 1 秒之前使用 Spark-Scala 添加一条新记录。

输出将是:

| ID | TIMESTAMP | VALUE |
  1     15:00:01    3
  1     17:04:01    2
  1     17:04:02    2

谢谢

标签: scalaapache-spark

解决方案


您可以引入一个新的列数组 - 当 value = 2 然后 Array(-1,0) 否则 Array(0) 时,然后分解该列并将时间戳添加为秒。下面的一个应该适合你。看一下这个:

scala> val df = Seq((1,"15:00:01",3),(1,"17:04:02",2)).toDF("id","timestamp","value")
df: org.apache.spark.sql.DataFrame = [id: int, timestamp: string ... 1 more field]

scala> val df2 = df.withColumn("timestamp",'timestamp.cast("timestamp"))
df2: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 1 more field]

scala> df2.show(false)
+---+-------------------+-----+
|id |timestamp          |value|
+---+-------------------+-----+
|1  |2019-03-04 15:00:01|3    |
|1  |2019-03-04 17:04:02|2    |
+---+-------------------+-----+


scala> val df3 = df2.withColumn("newc", when($"value"===lit(2),lit(Array(-1,0))).otherwise(lit(Array(0))))
df3: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 2 more fields]

scala> df3.show(false)
+---+-------------------+-----+-------+
|id |timestamp          |value|newc   |
+---+-------------------+-----+-------+
|1  |2019-03-04 15:00:01|3    |[0]    |
|1  |2019-03-04 17:04:02|2    |[-1, 0]|
+---+-------------------+-----+-------+


scala> val df4 = df3.withColumn("c_explode",explode('newc)).withColumn("timestamp2",to_timestamp(unix_timestamp('timestamp)+'c_explode))
df4: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 4 more fields]

scala> df4.select($"id",$"timestamp2",$"value").show(false)
+---+-------------------+-----+
|id |timestamp2         |value|
+---+-------------------+-----+
|1  |2019-03-04 15:00:01|3    |
|1  |2019-03-04 17:04:01|2    |
|1  |2019-03-04 17:04:02|2    |
+---+-------------------+-----+


scala>

如果你想要单独的时间部分,那么你可以这样做

scala> df4.withColumn("timestamp",from_unixtime(unix_timestamp('timestamp2),"HH:mm:ss")).select($"id",$"timestamp",$"value").show(false)
+---+---------+-----+
|id |timestamp|value|
+---+---------+-----+
|1  |15:00:01 |3    |
|1  |17:04:01 |2    |
|1  |17:04:02 |2    |
+---+---------+-----+

推荐阅读