Extracting a nested struct from a Spark SQL Row

Problem description

I have data stored in a variable df with the following Spark schema.

root
 |-- id: string (nullable = true)
 |-- mid: integer (nullable = true)
 |-- relationships: struct (nullable = true)
 |    |-- cmg: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- cid: string (nullable = true)
 |    |    |    |-- state: struct (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)

I want to run a map function in which I apply some logic to the state value field, and then run a further reduce operation.

My code is as follows.

df.map(mapFunc, encoder).write().format("parquet")
            .option("path", "....")
            .mode(SaveMode.Overwrite).save();

MapFunction<Row, String> mapFunc = (MapFunction<Row, String>) value -> {
            String id = value.getAs("id").toString();
            int mid = value.getAs("mid");
            Relationships relationships = value.getAs("relationships");
            return id + ", " + mid  + ", " + relationships.getCmgList().toString();
};

public class Relationships {
    @Getter
    @Setter
    private List<CMG> cmgList;
}

class CMG {
    @Getter
    private String cid;
    @Getter
    private State state;
}

class State {
    @Getter
    private String value;
}

When I run the Spark job, it fails with

Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to com.snapshot.spark.parquet.relationship.Relationships

at the line Relationships relationships = value.getAs("relationships");

How do I extract the values stored in the relationships column (preferably into an object)?

Tags: apache-spark

Solution
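
Spark does not deserialize a struct column into a user-defined class when you call getAs on a Row; the column arrives as a GenericRowWithSchema, which is why the cast to Relationships fails. One way around this, staying with the Row API, is to read the nested struct as a Row and the array of structs as a List<Row>. The following is a minimal sketch (field names taken from the schema above; the output formatting is only illustrative):

import java.util.List;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Row;

MapFunction<Row, String> mapFunc = (MapFunction<Row, String>) value -> {
    String id = value.getAs("id").toString();
    int mid = value.getAs("mid");
    // The struct column comes back as a Row, not as a POJO.
    Row relationships = value.getStruct(value.fieldIndex("relationships"));
    if (relationships == null) {
        return id + ", " + mid;
    }
    // The array<struct> column comes back as a List of Rows.
    List<Row> cmg = relationships.getList(relationships.fieldIndex("cmg"));
    StringBuilder out = new StringBuilder(id).append(", ").append(mid);
    for (Row element : cmg) {
        String cid = element.getAs("cid");
        // state is itself a nested struct with a single "value" field.
        Row state = element.getStruct(element.fieldIndex("state"));
        String stateValue = (state == null) ? null : state.getAs("value");
        out.append(", ").append(cid).append("=").append(stateValue);
    }
    return out.toString();
};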


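Alternatively, if the goal is to get the data "into an object" as the question asks, a typed Dataset via a bean encoder may work, provided the bean field names match the column names exactly and every field has a getter and a setter. This is only a sketch, assuming your Spark version's bean encoder handles the nested array of structs; the class names here are hypothetical, and note the list field is named cmg (matching the schema) rather than cmgList:

import java.io.Serializable;
import java.util.List;
import lombok.Getter;
import lombok.Setter;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

@Getter @Setter
public class RootBean implements Serializable {
    private String id;
    private Integer mid;
    private Relationships relationships;
}

@Getter @Setter
class Relationships implements Serializable {
    private List<CMG> cmg;   // field name must match the column name "cmg"
}

@Getter @Setter
class CMG implements Serializable {
    private String cid;
    private State state;
}

@Getter @Setter
class State implements Serializable {
    private String value;
}

// Convert the untyped DataFrame into a typed Dataset of beans.
Dataset<RootBean> typed = df.as(Encoders.bean(RootBean.class));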