Get distinct rows from an RDD[type] in Scala Spark

Problem description

Suppose I have an RDD of type RDD[employee], with sample data like this:

FName,LName,Department,Salary
dubert,tomasz ,paramedic i/c,91080.00,
edwards,tim p,lieutenant,114846.00,
edwards,tim p,lieutenant,234846.00,
edwards,tim p,lieutenant,354846.00,
elkins,eric j,police,104628.00,
estrada,luis f,police officer,96060.00,
ewing,marie a,clerk,53076.00,
ewing,marie a,clerk,13076.00,
ewing,marie a,clerk,63076.00,
finn,sean p,firefighter,87006.00,
fitch,jordan m,law clerk,14.51
fitch,jordan m,law clerk,14.51

Expected output:

dubert,tomasz ,paramedic i/c,91080.00,
edwards,tim p,lieutenant,354846.00,
elkins,eric j,police,104628.00,
estrada,luis f,police officer,96060.00,
ewing,marie a,clerk,63076.00,
finn,sean p,firefighter,87006.00,
fitch,jordan m,law clerk,14.51

I want a single row for each distinct FName.

Tags: scala, apache-spark, apache-spark-sql

Solution


I think you want to do something like this:

// requires: import spark.implicits._ and org.apache.spark.sql.functions.first
df
  .groupBy('FName)
  .agg(
    first('LName),
    first('Department),
    first('Salary)
  )
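
For completeness, here is a minimal, self-contained sketch of the same idea starting from the RDD. The employee case class, the spark SparkSession value, the inline sample data, and the column aliases are assumptions added for illustration, not taken from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

// hypothetical case class matching the question's columns
case class employee(FName: String, LName: String, Department: String, Salary: Double)

val spark = SparkSession.builder().appName("distinct-by-fname").getOrCreate()
import spark.implicits._

// a tiny inline sample standing in for the real RDD[employee]
val rdd = spark.sparkContext.parallelize(Seq(
  employee("edwards", "tim p", "lieutenant", 114846.00),
  employee("edwards", "tim p", "lieutenant", 354846.00),
  employee("finn", "sean p", "firefighter", 87006.00)
))

// convert the RDD to a DataFrame, then keep one row per FName
val df = rdd.toDF()
val deduped = df
  .groupBy('FName)
  .agg(
    first('LName) as "LName",
    first('Department) as "Department",
    first('Salary) as "Salary"
  )

deduped.show()

Note that first() does not guarantee which duplicate survives. If, as the expected output suggests, you want the highest Salary per FName, replace first('Salary) with max('Salary) (also in org.apache.spark.sql.functions).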

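If you would rather stay at the RDD level instead of converting to a DataFrame, a keyBy/reduceByKey sketch (reusing the hypothetical rdd defined above) keeps one arbitrary record per FName:

// pure-RDD alternative: key by FName and keep one record per key
val dedupedRdd = rdd
  .keyBy(_.FName)            // (FName, employee) pairs
  .reduceByKey((a, _) => a)  // keep whichever record is reduced first per key
  .values

dedupedRdd.collect().foreach(println)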