count occurrences of each distinct value of all columns (300 columns) in a spark dataframe

Problem Description

I have a Spark dataframe with 300 columns, and each column has 10 distinct values. I need to count the occurrences of each distinct value for all 300 columns.

  -----------------------------------------------------------------
     col1    |   col2   |   col3   | ...... |  col299   |  col300
  -----------------------------------------------------------------
   value11   | value21  | value31  | ...... | value300  | value301
   value12   | value22  | value32  | ...... | value300  | value301
   value11   | value22  | value33  | ...... | value301  | value302
   value12   | value21  | value33  | ...... | value301  | value302

For a single column, I can calculate it using the code below:

import org.apache.spark.sql.functions.count

// Count how many times each distinct value of col1 occurs.
df.groupBy("col1").agg(count("col1")).show

But how can I calculate this efficiently for all 300 columns? Please help!
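
For reference, a minimal self-contained sketch (with illustrative data and column names, not the real 300-column table) of how this single-column pattern could simply be repeated over df.columns; it launches one aggregation job per column, which is exactly the inefficiency the question asks about:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

object PerColumnValueCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("per-column-value-counts").master("local[*]").getOrCreate()
    import spark.implicits._

    // Small stand-in for the real 300-column dataframe.
    val df = Seq(
      ("value11", "value21", "value31"),
      ("value12", "value22", "value32"),
      ("value11", "value22", "value33"),
      ("value12", "value21", "value33")
    ).toDF("col1", "col2", "col3")

    // Repeat the single-column aggregation for every column: one job per column.
    df.columns.foreach { c =>
      df.groupBy(c).agg(count(c).as("occurrences")).show()
    }

    spark.stop()
  }
}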

Tags: apache-spark, apache-spark-sql

Solution


You can do it as shown below.

First, collect all column names and map each one to an aggregate expression, as key-value pairs, like below:

val exprs = df.columns.map((_ -> "approx_count_distinct")).toMap

Then a simple df.groupBy("col1").agg(exprs) will give you the approximate number of distinct values in every column.
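
Putting it together, a minimal self-contained sketch of this approach (the dataframe and column names are illustrative; note that approx_count_distinct returns an approximate count of distinct values per column, not the occurrences of each individual value):

import org.apache.spark.sql.SparkSession

object ApproxDistinctCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("approx-distinct-counts").master("local[*]").getOrCreate()
    import spark.implicits._

    // Small stand-in for the real 300-column dataframe.
    val df = Seq(
      ("value11", "value21", "value31"),
      ("value12", "value22", "value32"),
      ("value11", "value22", "value33"),
      ("value12", "value21", "value33")
    ).toDF("col1", "col2", "col3")

    // Map every column name to the approx_count_distinct aggregate function.
    val exprs = df.columns.map(_ -> "approx_count_distinct").toMap

    // Approximate distinct-value count of every column within each col1 group.
    df.groupBy("col1").agg(exprs).show()

    // Without grouping: one row with the approximate distinct count of every column.
    df.agg(exprs).show()

    spark.stop()
  }
}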

Reference: https://spark.apache.org/docs/2.3.1/api/sql/index.html#approx_count_distinct

