apache-spark - Select distinct not working on an Apache Spark DataFrame
Problem description
I am running the following code to define a case class:
scala> case class AadharDetails (DateType: Int, Registrar: String,PrivateAgency: String, State: String, District: String, SubDistrict :String, PinCode: Int, Gender: String, Age: Int, AadharGenerated : Int, Rejected: Int, MobileNo: Int,email_id: Int)
defined class AadharDetails
Creating a DataFrame using the case class:
scala> val df = spark.read.textFile("/home/anil/spark-2.0.2-bin-hadoop2.6/aadhaar_data.csv").map(_.split(",")).map(attributes=>AadharDetails (attributes(0).trim.toInt, attributes(1), attributes(2), attributes(3), attributes(4), attributes(5), attributes(6).trim.toInt, attributes(7),attributes(8).trim.toInt, attributes(9).trim.toInt, attributes(10).trim.toInt, attributes(11).trim.toInt, attributes(12).trim.toInt)).toDF()
df: org.apache.spark.sql.DataFrame = [DateType: int, Registrar: string ... 11 more fields]
scala> df.printSchema()
root
|-- DateType: integer (nullable = true)
|-- Registrar: string (nullable = true)
|-- PrivateAgency: string (nullable = true)
|-- State: string (nullable = true)
|-- District: string (nullable = true)
|-- SubDistrict: string (nullable = true)
|-- PinCode: integer (nullable = true)
|-- Gender: string (nullable = true)
|-- Age: integer (nullable = true)
|-- AadharGenerated: integer (nullable = true)
|-- Rejected: integer (nullable = true)
|-- MobileNo: integer (nullable = true)
|-- email_id: integer (nullable = true)
df.createOrReplaceTempView("data")
scala> spark.sql("select distinct DateType from data").show()
This throws an error. Please let me know why distinct does not work here.
Sample data:
20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Delhi,South Delhi,Defence Colony,110025,F,49,1,0,0,1
20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Delhi,South Delhi,Defence Colony,110025,F,65,1,0,0,0
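Solution
This is most likely caused by incompatible data in the values of the DateType column. The column may contain empty values or strings that cannot be parsed as valid integers. Because the map over the file is evaluated lazily, the attributes(0).trim.toInt conversion only runs, and therefore only fails, once an action such as show() is executed.
You would probably get the same error just by displaying the DataFrame: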
scala> df.show()
Check your source data to confirm the data type mismatch.
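One way to carry out that check is to test each line for integer fields that do not parse, before building the DataFrame. The following is only a minimal sketch in plain Scala, not part of the original answer: the object name FindBadRows and the helper badColumns are hypothetical, the column positions are taken from the case class in the question, and the second input line is a made-up bad row (empty Age) used purely for illustration.

```scala
import scala.util.Try

object FindBadRows {
  // Zero-based positions of the fields that the question's case class parses with toInt:
  // DateType, PinCode, Age, AadharGenerated, Rejected, MobileNo, email_id.
  val intColumns: Seq[Int] = Seq(0, 6, 8, 9, 10, 11, 12)

  // Returns the positions of the integer columns that cannot be parsed in one CSV line.
  // A field that is missing because the row is too short counts as a failure too.
  def badColumns(line: String): Seq[Int] = {
    val fields = line.split(",", -1) // limit -1 keeps trailing empty fields
    intColumns.filter { i =>
      i >= fields.length || Try(fields(i).trim.toInt).isFailure
    }
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      // Sample row from the question.
      "20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Delhi,South Delhi,Defence Colony,110025,F,49,1,0,0,1",
      // Hypothetical malformed row: the Age field is empty.
      "20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Delhi,South Delhi,Defence Colony,110025,F,,1,0,0,0"
    )
    // Report every line that has at least one unparseable integer column.
    lines.zipWithIndex.foreach { case (line, n) =>
      val bad = badColumns(line)
      if (bad.nonEmpty) println(s"line ${n + 1}: unparseable integer columns ${bad.mkString(", ")}")
    }
  }
}
```

Assuming the object has been pasted into spark-shell, the same helper could be applied to the real file, for example by filtering the Dataset returned by spark.read.textFile with l => FindBadRows.badColumns(l).nonEmpty and showing the result. Note that a header line in the CSV, if there is one, would be reported as well, since its text fields do not parse as integers.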