首页 > 解决方案 > 合并数组中的 Apache Spark 列和结构数组中的结构

问题描述

这是传入数据流的模式。我使用 spark 2.3.2 流式处理数据。

val schema = StructType(Seq(
            StructField("status", StringType),
            StructField("data", StructType(Seq(
                StructField("resultType", StringType),
                StructField("result", ArrayType(StructType(Array(
                    StructField("metric", StructType(Seq(StructField("application", StringType),
                                                         StructField("component", StringType),
                                                         StructField("instance", StringType)))), 
                    StructField("value", ArrayType(StringType))
                ))))
             )
         )))) 

这是我将模式应用于 dstream 的 rdd 的方式。

  val df = rdd.toDS()                        
                    .selectExpr("cast (value as string) as myData") 
                    .select(from_json($"myData", schema).as("myData"))               
                    .select($"myData.data.*")
                    .select("result")

上面的代码产生以下输出

{"result":[{"metric":{"application":"A","component":"S","instance":"tp01.net:9072"},"value":["1.542972576979E9","237006995456"]},
       {"metric":{"application":"A","component":"S","instance":"tp02.net:9072"},"value":["1.542972576979E9","237006995456"]},
       {"metric":{"application":"A","component":"S","instance":"tp03.net:9072"},"value":["1.542972576979E9","237006995456"]},
       {"metric":{"application":"B","component":"S","instance":"bp03.net:9072"},"value":["1.542972576979E9","270860144640"]},
       {"metric":{"application":"B","component":"S","instance":"bp04.net:9072"},"value":["1.542972576979E9","270860144640"]},
       {"metric":{"application":"B","component":"S","instance":"ps01.net:9072"},"value":["1.542972576979E9","135177400320"]},
 ]}

但是为了提取特征,我需要将上面的转换为下面的数据框

application     component       instance            value1              value2
A               S               tp01.net:9072       1.542972576979E9    237006995456
A               S               tp02.net:9072       1.542972576979E9    237006995456
A               S               tp03.net:9072       1.542972576979E9    237006995456
B               S               bp03.net:9072       1.542972576979E9    270860144640
B               S               bp04.net:9072       1.542972576979E9    270860144640
B               S               ps01.net:9072       1.542972576979E9    135177400320

如您所见,每一行已经是分解的行。关于如何将数组值和结构选择到单个数据框中的任何想法?

谢谢

标签: scalaapache-sparkapache-spark-sql

解决方案


推荐阅读