How to retrieve specific values from the result of df.collect() in PySpark?

Problem description

I have the following DataFrame df in PySpark.

import pyspark.sql.functions as func

df = spark\
        .read \
        .format("org.elasticsearch.spark.sql") \
        .load("my_index/my_mapping") \
        .groupBy(["id", "type"]) \
        .agg(
            func.count(func.lit(1)).alias("number_occurrences"),
            func.countDistinct("host_id").alias("number_hosts")
        )

ds = df.collect()

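I use collect because the data after grouping and aggregation is always small and fits in memory. I also need collect because I pass ds as an argument to a udf function. The collect function returns an array (a Python list of Row objects). How can I run the following query on this array: for a given id and type, return number_occurrences and number_hosts?

For example, suppose df contains the following rows: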

id   type   number_occurrences   number_hosts
1    xxx    11                   3
2    yyy    10                   4 

After calling df.collect(), how can I retrieve number_occurrences and number_hosts for id equal to 1 and type equal to xxx? The expected result is:

number_occurrences = 11
number_hosts = 3

Update:

Maybe there is a more elegant solution?

    id = 1
    type = "xxx"
    # Defaults used when no row matches
    number_occurrences = 0
    number_hosts = 0
    for row in ds:
        # Plain boolean "and" instead of the bitwise "&" operator
        if row["id"] == id and row["type"] == type:
            number_occurrences = row["number_occurrences"]
            number_hosts = row["number_hosts"]
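
The same scan can be written more compactly with next() and a generator expression. This is only a sketch: plain dicts stand in for the collected Row objects so that it runs without Spark, and the helper name find_row is hypothetical. Real Row objects support the same row["column"] access.

```python
# Stand-in for ds = df.collect(); dicts are an assumption made so the sketch
# runs without Spark. Row objects allow the same row["column"] lookups.
ds = [
    {"id": 1, "type": "xxx", "number_occurrences": 11, "number_hosts": 3},
    {"id": 2, "type": "yyy", "number_occurrences": 10, "number_hosts": 4},
]


def find_row(rows, wanted_id, wanted_type):
    """Return the first row matching both keys, or None if nothing matches."""
    return next(
        (row for row in rows
         if row["id"] == wanted_id and row["type"] == wanted_type),
        None,
    )


match = find_row(ds, 1, "xxx")
# Fall back to 0 when no row matches, like the defaults in the loop above.
number_occurrences = match["number_occurrences"] if match else 0
number_hosts = match["number_hosts"] if match else 0
```

Unlike the loop, this stops at the first matching row, which makes no difference when each (id, type) pair appears once, as it does after the groupBy.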

Tags: python, apache-spark, pyspark, apache-spark-sql

Solution


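If your id is unique (which should be the case for an id), you can sort the array by id. Sorting only guarantees the order. If your ids are also consecutive and start at 1, you can access a record directly at index id minus 1.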

test_df = spark.createDataFrame([
    (1, "xxx", 11, 3),
    (2, "yyyy", 10, 4),
], ("id", "type", "number_occurrences", "number_hosts"))
id = 1
type = "xxx"
# Sort by id; Python 3's sorted() takes key= (the cmp= argument was removed)
sorted_list = sorted(test_df.collect(), key=lambda row: row["id"])
# Valid only if ids are consecutive and start at 1
sorted_list[id-1]["number_occurrences"], sorted_list[id-1]["number_hosts"]

Result:

(11, 3)
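
If the ids are not consecutive, or the lookup has to match on type as well, one alternative is to index the collected rows in a dictionary keyed by (id, type). The sketch below is not part of the original answer. It uses plain dicts as a stand-in for the collected Row objects so that it runs without Spark, and the name lookup is hypothetical.

```python
# Stand-in for ds = df.collect(); dicts are an assumption made so the sketch
# runs without Spark. Row objects allow the same row["column"] lookups.
ds = [
    {"id": 1, "type": "xxx", "number_occurrences": 11, "number_hosts": 3},
    {"id": 2, "type": "yyy", "number_occurrences": 10, "number_hosts": 4},
]

# Build the index once: (id, type) -> (number_occurrences, number_hosts).
lookup = {
    (row["id"], row["type"]): (row["number_occurrences"], row["number_hosts"])
    for row in ds
}

# Default to (0, 0) when the pair is absent, as in the loop from the question.
number_occurrences, number_hosts = lookup.get((1, "xxx"), (0, 0))
print(number_occurrences, number_hosts)  # prints: 11 3
```

Building the dictionary costs one pass over the rows, after which every lookup is a constant-time operation. Because the groupBy on id and type yields one row per pair, the keys do not collide.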
