Select distinct does not work in an Apache Spark DataFrame

Question

I am running the following code to define a case class:

scala> case class AadharDetails(DateType: Int, Registrar: String, PrivateAgency: String, State: String, District: String, SubDistrict: String, PinCode: Int, Gender: String, Age: Int, AadharGenerated: Int, Rejected: Int, MobileNo: Int, email_id: Int)

defined class AadharDetails

Creating a DataFrame using the case class:

scala> val df = spark.read.textFile("/home/anil/spark-2.0.2-bin-hadoop2.6/aadhaar_data.csv").map(_.split(",")).map(attributes => AadharDetails(attributes(0).trim.toInt, attributes(1), attributes(2), attributes(3), attributes(4), attributes(5), attributes(6).trim.toInt, attributes(7), attributes(8).trim.toInt, attributes(9).trim.toInt, attributes(10).trim.toInt, attributes(11).trim.toInt, attributes(12).trim.toInt)).toDF()

df: org.apache.spark.sql.DataFrame = [DateType: int, Registrar: string ... 11 more fields]

scala> df.printSchema()
root
|-- DateType: integer (nullable = true)
|-- Registrar: string (nullable = true)
|-- PrivateAgency: string (nullable = true)
|-- State: string (nullable = true)
|-- District: string (nullable = true)
|-- SubDistrict: string (nullable = true)
|-- PinCode: integer (nullable = true)
|-- Gender: string (nullable = true)
|-- Age: integer (nullable = true)
|-- AadharGenerated: integer (nullable = true)
|-- Rejected: integer (nullable = true)
|-- MobileNo: integer (nullable = true)
|-- email_id: integer (nullable = true)


scala> df.createOrReplaceTempView("data")


scala> spark.sql("select distinct DateType from data").show()
This throws an error. Why does distinct not work here?

Sample data:

20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Delhi,South Delhi,Defence Colony,110025,F,49,1,0,0,1
20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Delhi,South Delhi,Defence Colony,110025,F,65,1,0,0,0

Tags: apache-spark, apache-spark-sql

Answer


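This is most likely caused by values in the DateType column that are incompatible with the declared type. The column may contain empty values or strings that are not valid integer representations, in which case the toInt call throws a NumberFormatException. Because Spark evaluates the map lazily, the exception surfaces only when an action such as show() runs, not when the DataFrame is defined.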
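The conversion failure can be reproduced without Spark. The following is a minimal sketch, assuming the field is parsed with trim.toInt as in the question; the object name ToIntFailureDemo, the helper parseField, and the two bad sample values are hypothetical and only serve as illustration.

```scala
import scala.util.Try

object ToIntFailureDemo {
  // Same conversion the question applies to each integer field,
  // wrapped in Try so a failure is captured instead of thrown.
  def parseField(raw: String): Try[Int] = Try(raw.trim.toInt)

  def main(args: Array[String]): Unit = {
    // A clean value from the sample data, then two hypothetical bad
    // values: an empty field and a non-numeric cell such as a header.
    val samples = Seq("20150420", "", "DateType")
    samples.foreach { s =>
      println(s"'$s' -> ${parseField(s)}")
    }
  }
}
```

Only the first value parses; the other two produce a Failure wrapping a NumberFormatException, which is the kind of exception an unguarded toInt would raise inside the Spark job.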

You will probably get the same error when displaying the DataFrame:

scala> df.show()

Check your source data to confirm the data type mismatch.
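One way to check the source data is to test each line against the same conversions the case class requires. The sketch below assumes the 13-field layout from the question, with integer fields at positions 0, 6, 8, 9, 10, 11 and 12; the object name MalformedLineCheck, the helper isMalformed, and the second input row are hypothetical.

```scala
import scala.util.Try

object MalformedLineCheck {
  // Zero-based positions of the fields declared as Int in AadharDetails.
  val intFieldIndices: Seq[Int] = Seq(0, 6, 8, 9, 10, 11, 12)
  val expectedFieldCount: Int = 13

  // True when the line has the wrong number of fields or when any
  // integer field fails the trim.toInt conversion used in the question.
  def isMalformed(line: String): Boolean = {
    // limit = -1 keeps trailing empty fields, which split(",") would drop.
    val fields = line.split(",", -1)
    fields.length != expectedFieldCount ||
      intFieldIndices.exists(i => Try(fields(i).trim.toInt).isFailure)
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      // First row of the sample data from the question.
      "20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Delhi,South Delhi,Defence Colony,110025,F,49,1,0,0,1",
      // Hypothetical bad row: the DateType field is empty.
      ",Allahabad Bank,A-Onerealtors Pvt Ltd,Delhi,South Delhi,Defence Colony,110025,F,65,1,0,0,0"
    )
    // Print only the lines that would break the DataFrame.
    lines.filter(isMalformed).foreach(println)
  }
}
```

In spark-shell the same predicate could presumably be applied to the raw file, for example spark.read.textFile(path).filter(line => MalformedLineCheck.isMalformed(line)).show(20, false), where path stands for the location of the CSV file. The field-count check is stricter than the original code, and it also flags lines where an extra comma inside a field would shift the columns.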

