apache-spark - count occurrences of each distinct value of all columns (300 columns) in a spark dataframe
Question
I have a Spark dataframe with 300 columns, and each column has 10 distinct values. I need to count the occurrences of each distinct value in all 300 columns.
---------------------------------------------------------
col1    | col2    | col3    | ... | col299   | col300
---------------------------------------------------------
value11 | value21 | value31 | ... | value300 | value301
value12 | value22 | value32 | ... | value300 | value301
value11 | value22 | value33 | ... | value301 | value302
value12 | value21 | value33 | ... | value301 | value302
For a single column, I can calculate this with the code below:
import org.apache.spark.sql.functions.count
df.groupBy("col1").agg(count("col1")).show
But how can I calculate this efficiently for all 300 columns? Please help!
Solution
You can do this as follows. First, map every column name to an aggregate function name:

val exprs = df.columns.map(_ -> "approx_count_distinct").toMap

Then a simple

df.agg(exprs)

returns the (approximate) number of distinct values in every column in a single pass. Note that the `agg` call is made directly on the dataframe, not after a `groupBy`, since we want one aggregate per column over the whole dataset.
Reference: https://spark.apache.org/docs/2.3.1/api/sql/index.html#approx_count_distinct
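Note that `approx_count_distinct` returns the *number* of distinct values per column, not the occurrence count of each value. If, as in the single-column `groupBy` example, you need a count for every (column, value) pair, one common approach is to unpivot the dataframe into (column, value) rows and aggregate once. The sketch below assumes all columns can be cast to string; the `column`/`value` names are illustrative:

```scala
import org.apache.spark.sql.functions._

// Turn each row into an array of (column-name, value) structs, explode it,
// then count occurrences of every (column, value) pair in a single pass.
val pairs = df.columns.map(c =>
  struct(lit(c).as("column"), col(c).cast("string").as("value")))

val valueCounts = df
  .select(explode(array(pairs: _*)).as("kv"))
  .select(col("kv.column"), col("kv.value"))
  .groupBy("column", "value")
  .count()

valueCounts.show()
```

With 300 columns and 10 distinct values each, the result has at most 3000 rows, and the whole computation is a single shuffle rather than 300 separate `groupBy` jobs.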