Filtering on values in a column using Spark

Problem Description

I have a Spark dataframe in my Jupyter notebook. I want to filter on specific values in the "Keywords" column and return only the rows that contain one or more of those matching values.

This is what the column I need to filter on looks like:

+--------------------+
|            Keywords|
+--------------------+
|      ["apocalypse"]|
|["nuclear","physi...|
|                null|
|["childhood","imm...|
|["canned tomatoes...|
|                null|
|["american","beef...|
|["runway","ethose...|
|["taylor swift st...|
|["beauty","colleg...|
|                null|
|["curly hair|coil...|
|["glossier|shoppi...|
|["stacey abrams",...|
|["quentin taranti...|
|                null|
|["Mexican|Cinco D...|
|["Bridal Spring 2...|
|                null|
|["everyday athlet...|
+--------------------+

I want to create a new dataframe that only has rows where Keywords contains "beauty" or "runway". How do I do that? I was going to write a Python for loop, but I don't know how to apply that to a Spark dataframe... Any help would be greatly appreciated.

Tags: apache-spark, pyspark

Solution


Since the expected output is not fully specified, here is an approach based on what I have understood so far.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('test').getOrCreate()

# Sample data: a single array<string> column, similar to the "Keywords" column
df = spark.createDataFrame(
    [[["apocalypse"]], [[None]], [["beauty", "test"]], [["runway", "beauty"]]]
).toDF("testcol")
df.show()
+----------------+
|         testcol|
+----------------+
|    [apocalypse]|
|              []|
|  [beauty, test]|
|[runway, beauty]|
+----------------+


# Keep only rows whose array contains "beauty" or "runway"
df.filter(F.array_contains(F.col("testcol"), "beauty") | F.array_contains(F.col("testcol"), "runway")).show()
+----------------+
|         testcol|
+----------------+
|  [beauty, test]|
|[runway, beauty]|
+----------------+
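
If there is a longer list of keywords to match against, an alternative sketch (assuming Spark 2.4+, where arrays_overlap is available) keeps any row whose array shares at least one element with a literal array of the wanted values; on the original dataframe the same call would reference the "Keywords" column instead of "testcol":

# Hypothetical list of keywords to match, built as a literal array column
wanted = F.array(F.lit("beauty"), F.lit("runway"))

# arrays_overlap is true when the two arrays share at least one non-null element;
# rows where the column is null evaluate to null and are dropped by the filter,
# just as with array_contains above
df.filter(F.arrays_overlap(F.col("testcol"), wanted)).show()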
