Decoding decimal values into binary in PySpark

Problem description

I have a question about decoding decimal values into binary in PySpark. This is how I do it in plain Python:

a = 28
b = format(a, "09b")  # "09b": binary, zero-padded to a width of 9
print(b)

-> 000011100

Here is the example DataFrame I want to convert:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([Row(a=1, b='28', c='11', d='foo'),
                            Row(a=2, b='28', c='44', d='bar'),
                            Row(a=3, b='28', c='22', d='foo')])

+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  1| 28| 11|foo|
|  2| 28| 44|bar|
|  3| 28| 22|foo|
+---+---+---+---+

I would like column 'b' to be decoded to:

+---+---------+---+---+
|  a|        b|  c|  d|
+---+---------+---+---+
|  1|000011100| 11|foo|
|  2|000011100| 44|bar|
|  3|000011100| 22|foo|
+---+---------+---+---+

Thanks for your help!

Tags: python, apache-spark, pyspark, apache-spark-sql

Solution

You can use the bin and lpad functions to achieve the same output:

import pyspark.sql.functions as f
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([Row(a=1, b='28', c='11', d='foo'),
                            Row(a=2, b='28', c='44', d='bar'),
                            Row(a=3, b='28', c='22', d='foo')])

# f.bin() returns the binary string of the value; f.lpad() left-pads it with '0' to width 9
df = df.withColumn('b', f.lpad(f.bin(df['b']), 9, '0'))
df.show()
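
Note that f.bin() here relies on Spark implicitly casting the string column 'b' to a number. If the column might hold non-numeric strings, an explicit cast makes that assumption visible; a minimal sketch, which gives the same result for this data:

# Variant with an explicit cast; with Spark's default (non-ANSI) cast behavior,
# values that fail to parse become NULL instead of raising an error
df = df.withColumn('b', f.lpad(f.bin(f.col('b').cast('long')), 9, '0'))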

Using a UDF

import pyspark.sql.functions as f
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([Row(a=1, b='28', c='11', d='foo'),
                            Row(a=2, b='28', c='44', d='bar'),
                            Row(a=3, b='28', c='22', d='foo')])


# With no explicit return type, f.udf() defaults to StringType,
# which matches the string that format() produces
@f.udf()
def to_binary(value):
    return format(int(value), "09b")


df = df.withColumn('b', to_binary(df['b']))
df.show()
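
A plain Python UDF processes one value at a time, so on larger DataFrames a vectorized pandas UDF is usually faster. A minimal sketch, assuming pandas and PyArrow are installed (the name to_binary_vec is my own):

import pandas as pd
from pyspark.sql.functions import pandas_udf


@pandas_udf('string')
def to_binary_vec(values: pd.Series) -> pd.Series:
    # Receives whole batches of rows as a pandas Series instead of single values
    return values.map(lambda v: format(int(v), "09b"))


df = df.withColumn('b', to_binary_vec(df['b']))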

Output:

+---+---------+---+---+
|  a|        b|  c|  d|
+---+---------+---+---+
|  1|000011100| 11|foo|
|  2|000011100| 44|bar|
|  3|000011100| 22|foo|
+---+---------+---+---+
