首页 > 解决方案 > 从 spark 数据框中的列和两个新列中提取值

问题描述

我有一个看起来像这样的火花数据框:

    +----+------+-------------+
    |user| level|value_pair   |
    +----+------+-------------+
    | A  | 25   |(23.52,25.12)|
    | A  | 6    |(0,0)        |
    | A  | 2    |(11,12.12)   |
    | A  | 32   |(17,16.12)   |
    | B  | 22   |(19,57.12)   |
    | B  | 42   |(10,3.2)     |
    | B  | 43   |(32,21.0)    |
    | C  | 33   |(12,0)       |
    | D  | 32   |(265.21,19.2)|
    | D  | 62   |(57.12,50.12)|
    | D  | 32   |(75.12,57.12)|
    | E  | 63   |(0,0)        |
    +----+------+-------------+

如何提取value_pair列中的值并将它们添加到两个名为value1and的新列value2中,使用逗号作为分隔符。

    +----+------+-------------+-------+
    |user| level|value1       |value2 |
    +----+------+-------------+-------+
    | A  | 25   |23.52        |25.12  |
    | A  | 6    |0            |0      |
    | A  | 2    |11           |12.12  |
    | A  | 32   |17           |16.12  |
    | B  | 22   |19           |57.12  |
    | B  | 42   |10           |3.2    |
    | B  | 43   |32           |21.0   |
    | C  | 33   |12           |0      |
    | D  | 32   |265.21       |19.2   |
    | D  | 62   |57.12        |50.12  |
    | D  | 32   |75.12        |57.12  |
    | E  | 63   |0            |0      |
    +----+------+-------------+-------+

我知道我可以像这样分隔值:

    df = df.withColumn('value1', pyspark.sql.functions.split(df['value_pair'], ',')[0]
    df = df.withColumn('value2', pyspark.sql.functions.split(df['value_pair'], ',')[1]

但是我如何也摆脱这些parantheses?

标签: apache-sparkpysparkapache-spark-sql

解决方案


对于括号,如注释中所示,您可以使用regexp_replace,但您还需要包含\。反斜杠\是正则表达式的转义字符。

另外,我相信您需要先删除括号,然后再展开列。

from pyspark.sql.functions import split
from pyspark.sql.functions import regexp_replace

df = df.withColumn('value_pair', regexp_replace(df.value_pair, "\(",""))
df = df.withColumn('value_pair', regexp_replace(df.value_pair, "\)",""))

df = df.withColumn('value1', split(df['value_pair'], ',').getItem(0)) \
       .withColumn('value2', split(df['value_pair'], ',').getItem(1))

>>> df.show(truncate=False)

+----+-----+-----------+------+---------+
|user|level|value_pair |value1|value2   |
+----+-----+-----------+------+---------+
| A  |25   |23.52,25.12|23.52 |25.12    |
| A  |6    |0,0        |0     |0        |
| A  |2    |11,12.12   |11    |12.12    |
| A  |32   |17,16.12   |17    |16.12    |
| B  |22   |19,57.12   |19    |57.12    |
| B  |42   |10,3.2     |10    |3.2      |
| B  |43   |32,21.0    |32    |21.0     |
| C  |33   |12,0       |12    |0        |
| D  |32   |265.21,19.2|265.21|19.2     |
| D  |62   |57.12,50.12|57.12 |50.12    |
| D  |32   |75.12,57.12|75.12 |57.12    |
| E  |63   |0,0        |0     |0        |
+----+-----+-----------+------+---------+

如前所述,我稍微更改了您如何获取 2 个项目的代码。

更多信息可以在这里找到


推荐阅读