Dynamic dataframe with n columns and m rows

Problem description

I'm reading data from JSON (with a dynamic schema) and loading it into a DataFrame.

Example Dataframe:

scala> import spark.implicits._
import spark.implicits._

scala> val DF = Seq(
     (1, "ABC"),
     (2, "DEF"),
     (3, "GHIJ")
     ).toDF("id", "word")
DF: org.apache.spark.sql.DataFrame = [id: int, word: string]

scala> DF.show
+------+-----+
|id    | word|
+------+-----+
|     1|  ABC|
|     2|  DEF|
|     3| GHIJ|
+------+-----+

Requirement: The column count and names can be anything. I want to read the rows in a loop and fetch each column one by one, so the values can be processed in subsequent flows. I need both the column name and the value. I'm using Scala.

Python:
for i, j in df.iterrows(): 
    print(i, j) 

I need the same functionality in Scala, with the column name and value fetched separately.

Kindly help.

Tags: scala, apache-spark

Solution


df.iterrows is not from PySpark, but from pandas. In Spark, you can use foreach:

import org.apache.spark.sql.Row

DF.foreach { _ match { case Row(id: Int, word: String) => println(id, word) } }

Result:

(2,DEF)
(3,GHIJ)
(1,ABC)
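
Note that foreach runs on the executors, so on a real cluster the println output lands in the executor logs rather than the driver console. If the DataFrame is small enough to fit on the driver, a common pattern is to collect it first; a minimal sketch:

import org.apache.spark.sql.Row

// Only safe for small DataFrames: collect() pulls every row to the driver,
// so the println output appears on the driver console.
DF.collect().foreach { case Row(id: Int, word: String) => println(id, word) }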

If you don't know the number of columns, you cannot use unapply (pattern matching) on Row; just do:

DF
  .foreach(row => println(row))

Result:

[1,ABC]
[2,DEF]
[3,GHIJ]

Then operate on each row using its methods, such as getAs, etc.
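
Since the requirement was to get the column name and value separately for an arbitrary schema, here is a minimal sketch combining the row's schema with getAs (it assumes the rows carry their schema, which DataFrame rows do; the printed format is just an illustration):

// For a dynamic schema: pair every column name with its value in each row.
DF.foreach { row =>
  row.schema.fieldNames.foreach { name =>
    val value = row.getAs[Any](name) // Any, since the column types are not known up front
    println(s"$name -> $value")
  }
}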

