regex - 如何在spark sql中提取where子句作为数组?
问题描述
我正在尝试从 SQL 查询中提取 where 子句。where 子句中的多个条件应在表单数组中。请帮我。
示例输入字符串:
select * from table where col1=1 and (col2 between 1 and 10 or col2 between 190 and 200) and col2 is not null
预期输出:
Array("col1=1", "(col2 between 1 and 10 or col2 between 190 and 200)", "col2 is not null")
提前致谢。
编辑:
我的问题是......我想将所有条件拆分为单独的项目......假设我的查询就像
select * from table where col1=1 and (col2 between 1 and 10 or col2 between 190 and 200) and col2 is not null
我期待的输出就像
List("col1=1", "col2 between 1 and 10", "col2 between 190 and 200", "col2 is not null")
问题是查询可能有多个级别的条件,例如
select * from table where col1=1 and (col2 =2 or(col3 between 1 and 10 or col3 is between 190 and 200)) and col4='xyz'
在输出中,每个条件都应该是一个单独的项目
List("col1=1","col2=2", "col3 between 1 and 10", "col3 between 190 and 200", "col4='xyz'")
解决方案
我不会为此使用正则表达式。这是根据 Catalyst 的逻辑计划提取条件的另一种方法:
val plan = df.queryExecution.logical
val predicates: Seq[Expression] = plan.children.collect{case f: Filter =>
f.condition.productIterator.flatMap{
case And(l,r) => Seq(l,r)
case o:Predicate => Seq(o)
}
}.toList.flatten
println(predicates)
输出 :
List(('col1 = 1), ((('col2 >= 1) && ('col2 <= 10)) || (('col2 >= 190) && ('col2 <= 200))), isnotnull('col2))
这里的谓词仍然是Expressions
并保存信息(树表示)。
编辑:正如评论中所问,这里是谓词的字符串(我希望用户友好)表示:)
val plan = df.queryExecution.logical
val predicates: Seq[Expression] = plan.children.collect{case f: Filter =>
f.condition.productIterator.flatMap{
case o:Predicate => Seq(o)
}
}.toList.flatten
def stringifyExpressions(expression: Expression): Seq[String] = {
expression match{
case And(l,r) => (l,r) match {
case (gte: GreaterThanOrEqual,lte: LessThanOrEqual) => Seq(s"""${gte.left.toString} between ${gte.right.toString} and ${lte.right.toString}""")
case (_,_) => Seq(l,r).flatMap(stringifyExpressions)
}
case Or(l,r) => Seq(Seq(l,r).flatMap(stringifyExpressions).mkString("(",") OR (", ")"))
case eq: EqualTo => Seq(s"${eq.left.toString} = ${eq.right.toString}")
case inn: IsNotNull => Seq(s"${inn.child.toString} is not null")
case p: Predicate => Seq(p.toString)
}
}
val stringRepresentation = predicates.flatMap{stringifyExpressions}
println(stringRepresentation)
新输出:
List('col1 = 1, ('col2 between 1 and 10) OR ('col2 between 190 and 200), 'col2 is not null)
stringifyExpressions
如果要自定义输出,可以继续使用递归方法。
编辑2:响应您自己的编辑:
您可以将Or
/EqualTo
情况更改为以下
def stringifyExpressions(expression: Expression): Seq[String] = {
expression match{
case And(l,r) => (l,r) match {
case (gte: GreaterThanOrEqual,lte: LessThanOrEqual) => Seq(s"""${gte.left.toString} between ${gte.right.toString} and ${lte.right.toString}""")
case (_,_) => Seq(l,r).flatMap(stringifyExpressions)
}
case Or(l,r) => Seq(l,r).flatMap(stringifyExpressions)
case EqualTo(l,r) =>
val prettyLeft = if(l.resolved && l.dataType == StringType) s"'${l.toString}'" else l.toString
val prettyRight = if(r.resolved && r.dataType == StringType) s"'${r.toString}'" else r.toString
Seq(s"$prettyLeft=$prettyRight")
case inn: IsNotNull => Seq(s"${inn.child.toString} is not null")
case p: Predicate => Seq(p.toString)
}
}
这给出了 4 个元素 List :
List('col1=1, 'col2 between 1 and 10, 'col2 between 190 and 200, 'col2 is not null)
对于第二个例子:
select * from table where col1=1 and (col2 =2 or (col3 between 1 and 10 or col3 between 190 and 200)) and col4='xyz'
你会得到这个输出(List[String]
有 5 个元素):
List('col1=1, 'col2=2, 'col3 between 1 and 10, 'col3 between 190 and 200, 'col4='xyz')
附加说明:如果要打印不带起始引号的属性名称,可以通过打印 this 而不是 toString 来处理它:
node.asInstanceOf[UnresolvedAttribute].name
推荐阅读
- javascript - 加入和减少对象数组
- javascript - 机器人不会加入语音频道
- sql - 在单个 SQL 查询中添加多个 IN 条件和汇总时间段
- python - Tkinter 按钮禁用背景
- c - 如何解决 10.4 MISRA 2012 规则?
- android - 当用户单击 mMap.setOnMarkerClickListener 时如何发送值?
- c# - 一对多关系无法获取与会者列表
- curl - Imgur API 破坏包含表情符号的有效负载
- android-databinding - 在 ViewPager2 之前附加的 TabLayoutMediator 有一个适配器
- python - 将 Docker 目录挂载到主机目录不起作用