首页 > 解决方案 > 如何在spark sql中提取where子句作为数组?

问题描述

我正在尝试从 SQL 查询中提取 where 子句。where 子句中的多个条件应在表单数组中。请帮我。

示例输入字符串:

select * from table where col1=1 and (col2 between 1 and 10 or col2 between 190 and 200) and col2 is not null 

预期输出:

Array("col1=1", "(col2 between 1 and 10 or col2 between 190 and 200)", "col2 is not null")

提前致谢。

编辑:

我的问题是......我想将所有条件拆分为单独的项目......假设我的查询就像

select * from table where col1=1 and (col2 between 1 and 10 or col2 between 190 and 200) and col2 is not null

我期待的输出就像

List("col1=1", "col2 between 1 and 10", "col2 between 190 and 200", "col2 is not null")

问题是查询可能有多个级别的条件,例如

select * from table where col1=1 and (col2 =2 or(col3 between 1 and 10 or col3 is between 190 and 200)) and col4='xyz'

在输出中,每个条件都应该是一个单独的项目

List("col1=1","col2=2", "col3 between 1 and 10", "col3 between 190 and 200", "col4='xyz'")

标签: regexscalaapache-sparkapache-spark-sql

解决方案


我不会为此使用正则表达式。这是根据 Catalyst 的逻辑计划提取条件的另一种方法:

val plan = df.queryExecution.logical
val predicates: Seq[Expression] = plan.children.collect{case f: Filter =>
    f.condition.productIterator.flatMap{
      case And(l,r) => Seq(l,r)
      case o:Predicate => Seq(o)
    }
}.toList.flatten

println(predicates)

输出 :

List(('col1 = 1), ((('col2 >= 1) && ('col2 <= 10)) || (('col2 >= 190) && ('col2 <= 200))), isnotnull('col2))

这里的谓词仍然是Expressions并保存信息(树表示)。

编辑:正如评论中所问,这里是谓词的字符串(我希望用户友好)表示:)

val plan = df.queryExecution.logical
val predicates: Seq[Expression] = plan.children.collect{case f: Filter =>
    f.condition.productIterator.flatMap{
      case o:Predicate => Seq(o)
    }
}.toList.flatten

def stringifyExpressions(expression: Expression): Seq[String] = {
  expression match{
    case And(l,r) => (l,r) match {
      case (gte: GreaterThanOrEqual,lte: LessThanOrEqual) => Seq(s"""${gte.left.toString} between ${gte.right.toString} and ${lte.right.toString}""")
      case (_,_) => Seq(l,r).flatMap(stringifyExpressions)
    }
    case Or(l,r) => Seq(Seq(l,r).flatMap(stringifyExpressions).mkString("(",") OR (", ")"))
    case eq: EqualTo => Seq(s"${eq.left.toString} = ${eq.right.toString}")
    case inn: IsNotNull => Seq(s"${inn.child.toString} is not null")
    case p: Predicate => Seq(p.toString)
  }
}

val stringRepresentation = predicates.flatMap{stringifyExpressions}

println(stringRepresentation)

新输出:

List('col1 = 1, ('col2 between 1 and 10) OR ('col2 between 190 and 200), 'col2 is not null)

stringifyExpressions如果要自定义输出,可以继续使用递归方法。

编辑2:响应您自己的编辑:

您可以将Or/EqualTo情况更改为以下

def stringifyExpressions(expression: Expression): Seq[String] = {
  expression match{
    case And(l,r) => (l,r) match {
      case (gte: GreaterThanOrEqual,lte: LessThanOrEqual) => Seq(s"""${gte.left.toString} between ${gte.right.toString} and ${lte.right.toString}""")
      case (_,_) => Seq(l,r).flatMap(stringifyExpressions)
    }
    case Or(l,r) => Seq(l,r).flatMap(stringifyExpressions)
    case EqualTo(l,r) =>
      val prettyLeft = if(l.resolved && l.dataType == StringType) s"'${l.toString}'" else l.toString
      val prettyRight = if(r.resolved && r.dataType == StringType) s"'${r.toString}'" else r.toString
      Seq(s"$prettyLeft=$prettyRight")
    case inn: IsNotNull => Seq(s"${inn.child.toString} is not null")
    case p: Predicate => Seq(p.toString)
  }
}

这给出了 4 个元素 List :

List('col1=1, 'col2 between 1 and 10, 'col2 between 190 and 200, 'col2 is not null)

对于第二个例子:

select * from table where col1=1 and (col2 =2 or (col3 between 1 and 10 or col3 between 190 and 200)) and col4='xyz'

你会得到这个输出(List[String]有 5 个元素):

List('col1=1, 'col2=2, 'col3 between 1 and 10, 'col3 between 190 and 200, 'col4='xyz')

附加说明:如果要打印不带起始引号的属性名称,可以通过打印 this 而不是 toString 来处理它:

node.asInstanceOf[UnresolvedAttribute].name

推荐阅读