scala - Filter and count on a Spark DataFrame
Problem description
I have two DataFrames, shown below. LogicDF is read from a MySQL table.
LogicDF:
slNo | filterCondtion |
-----------------------
1 | age > 100 |
2 | age > 50 |
3 | age > 10 |
4 | age > 20 |
InputDF (read from a file):
age | name |
------------------------
11 | suraj |
22 | surjeth |
33 | sam |
43 | ram |
I want to apply each filter statement from LogicDF to InputDF and add the number of matching rows as a new column.
Desired output:
slNo | filterCondtion | count |
------------------------------
1 | age > 100 | 10 |
2 | age > 50 | 2 |
3 | age > 10 | 5 |
4 | age > 20 | 6 |
-------------------------------
Code I tried:
val LogicDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testDB")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "logic_table")
  .option("user", "root")
  .option("password", "password")
  .load()
def filterCount(str: String): Long = {
  val counte = inputDF.where(str).count()
  counte
}
val filterCountUDF = udf[Long, String](filterCount)
LogicDF.withColumn("count", filterCountUDF(col("filterCondtion")))
Error trace:
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (string) => bigint)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.Dataset.where(Dataset.scala:1525)
at filterCount(<console>:28)
at $anonfun$1.apply(<console>:25)
at $anonfun$1.apply(<console>:25)
... 21 more
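Any alternative approach is fine too. Thanks in advance.
Solution
A solution without a UDF
The NullPointerException comes from calling inputDF.where inside the UDF: a UDF runs on the executors, where a driver-side DataFrame such as inputDF cannot be used. The approach below avoids the UDF entirely. It works as long as your logicDF is small enough to be collected to the driver.
Step 1
Collect your logic into an Array[(Int, String)], like this: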
import org.apache.spark.sql.Row
// Bring every (slNo, filterCondtion) pair to the driver
val rules = logicDF.collect().map { r: Row =>
  val slNo = r.getAs[Int](0)
  val condition = r.getAs[String](1)
  (slNo, condition)
}
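Once the rules are on the driver, there is also a simpler alternative to the steps that follow. It is not the approach of this answer: it runs one count job per rule from the driver, which is essentially what the UDF in the question was trying to do. The sketch below is a minimal illustration under stated assumptions: the sample rows stand in for the MySQL table and the input file, the local master and the names counted and countsDF are made up for the example, and the rules array is hard-coded instead of collected.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell this returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("driver-loop-sketch").getOrCreate()
import spark.implicits._

// Stand-in for the DataFrame read from the file
val inputDF = Seq((11, "suraj"), (22, "surjeth"), (33, "sam"), (43, "ram")).toDF("age", "name")
// Stand-in for the rules collected from logicDF in step 1
val rules = Array((1, "age > 100"), (2, "age > 50"), (3, "age > 10"), (4, "age > 20"))

// Cache the input because it is scanned once per rule
inputDF.cache()

// Runs on the driver: each where(...).count() launches its own Spark job
val counted = rules.map { case (slNo, cond) => (slNo, cond, inputDF.where(cond).count()) }

val countsDF = counted.toSeq.toDF("slNo", "filterCondtion", "count")
countsDF.show()
```

This variant is probably only reasonable for a handful of rules, since the number of jobs grows with the number of rules. The single-pass approach in the remaining steps avoids that.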
Step 2
Build a new Column of conditional values by chaining these rules into a single when column. A Scala fold does this, for example:
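import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
// Seed for the fold: a when(...) that never matches, so .when can be chained onto it
val unused = when(lit(false), lit(false))
val filters: Column = rules.foldLeft(unused) {
  case (acc: Column, (slNo: Int, cond: String)) =>
    acc.when(col("slNo") === slNo, expr(cond))
}
// With the rules from the question, filters is equivalent to:
// when(lit(false), lit(false))
//   .when(col("slNo") === 1, expr("age > 100"))
//   .when(col("slNo") === 2, expr("age > 50"))
//   ...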
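To see what the chained column evaluates to, the following minimal sketch applies it to a few hand-written (slNo, age) pairs. The pairs DataFrame, the matches column name and the local session are assumptions made for illustration, and the rules array is hard-coded rather than collected from logicDF.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, expr, lit, when}

// Assumption: a local session for illustration; in spark-shell this returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("when-chain-sketch").getOrCreate()
import spark.implicits._

// Stand-in for the rules collected in step 1
val rules = Array((1, "age > 100"), (2, "age > 50"), (3, "age > 10"), (4, "age > 20"))

// Same fold as in step 2
val filters: Column = rules.foldLeft(when(lit(false), lit(false))) {
  case (acc, (slNo, cond)) => acc.when(col("slNo") === slNo, expr(cond))
}

// Hypothetical rule/row pairs, as the join in the next step would produce them
val pairs = Seq((1, 11), (3, 11), (4, 11), (4, 22)).toDF("slNo", "age")

// For each row, only the branch whose slNo matches is evaluated against age
val evaluated = pairs.withColumn("matches", filters)
evaluated.show()
```

A row whose slNo matches none of the rules falls through every branch and evaluates to null, which a filter treats the same as false.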
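Step 3
Take the Cartesian product of the two DataFrames with a join, so that every rule can be applied to every row of the data:
// crossJoin states the Cartesian product explicitly; on Spark 2.x a join on lit(true)
// may be rejected unless spark.sql.crossJoin.enabled is set
val joinDF = logicDF.crossJoin(inputDF)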
Step 4
Filter the joined DataFrame using the conditional Column built in step 2.
val withRulesDF = joinDF.filter(filters)
Step 5
Group and count:
val resultDF = withRulesDF
  .groupBy("slNo", "filterCondtion")
  .agg(count("*") as "count")
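One thing to be aware of: because the rows are filtered before grouping, a rule that matches no row does not appear in the result at all. With the sample data from the question that would be the case for age > 100 and age > 50. The sketch below puts the steps together on the sample data and, as a variation that is not part of the original answer, replaces the filter with a conditional sum so that such rules are kept with a count of 0. The in-memory DataFrames stand in for the MySQL table and the input file, and the local session is an assumption for illustration.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, expr, lit, sum, when}

// Assumption: a local session for illustration; in spark-shell this returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("filter-count-sketch").getOrCreate()
import spark.implicits._

// Stand-ins for the MySQL table and the input file
val logicDF = Seq((1, "age > 100"), (2, "age > 50"), (3, "age > 10"), (4, "age > 20"))
  .toDF("slNo", "filterCondtion")
val inputDF = Seq((11, "suraj"), (22, "surjeth"), (33, "sam"), (43, "ram")).toDF("age", "name")

// Steps 1 and 2: collect the rules and chain them into one conditional column
val rules = logicDF.collect().map(r => (r.getInt(0), r.getString(1)))
val filters: Column = rules.foldLeft(when(lit(false), lit(false))) {
  case (acc, (slNo, cond)) => acc.when(col("slNo") === slNo, expr(cond))
}

// Steps 3 to 5, with the filter turned into a conditional sum:
// every (rule, row) pair contributes 1 if the rule holds for the row, else 0
val resultDF = logicDF.crossJoin(inputDF)
  .groupBy("slNo", "filterCondtion")
  .agg(sum(when(filters, 1).otherwise(0)).as("count"))
  .orderBy("slNo")

resultDF.show()
```

For the four sample rows this gives counts of 0, 0, 4 and 3 for rules 1 to 4. Note that the cross join produces one row per (rule, input row) pair, so the intermediate size grows with the number of rules.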