
Problem description

Given a table T with columns X and Y, I want to get pairs of X and Y values, but once a value has been used in a pair, I don't want it to appear again. Example:

X   Y
-----
A   1
A   2
A   3
B   1
B   2
B   3
C   4
D   5
D   6
E   5
E   6

I want the SQL to output:

X   Y
-----
A   1
B   2
C   4
D   5
E   6

Note that the value 3 in Y has to be skipped, because no value remains in column X for it to pair with. Order must be preserved, and the first matching pair wins.
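The rule is a greedy scan: walk the rows ordered by (X, Y), give each X the first Y not yet taken, and skip the remaining rows of an already matched X. A minimal sketch in plain Scala (my illustration, not part of the question):

```scala
// Greedy one-to-one pairing: rows ordered by (x, y); each x gets the first
// unused y, and a y is never reused once assigned to some x.
def pairOnce(rows: Seq[(String, Int)]): Seq[(String, Int)] = {
  val (pairs, _) = rows.sorted.foldLeft((Vector.empty[(String, Int)], Set.empty[Int])) {
    case ((acc, used), (x, y)) =>
      if (acc.exists(_._1 == x) || used(y)) (acc, used) // x matched already or y taken
      else (acc :+ ((x, y)), used + y)
  }
  pairs
}

val data = Seq(
  "A" -> 1, "A" -> 2, "A" -> 3,
  "B" -> 1, "B" -> 2, "B" -> 3,
  "C" -> 4,
  "D" -> 5, "D" -> 6,
  "E" -> 5, "E" -> 6)

// pairOnce(data) == Vector(("A",1), ("B",2), ("C",4), ("D",5), ("E",6))
```

The linear scan through `acc` keeps the sketch short; a production version would track matched Xs in a set as well.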

Tags: sql, apache-spark-sql

Solution


In pure SQL, this can only be solved using advanced features of specific databases (such as Oracle). The task is also a good example of when specific SQL features really matter.

Below I demonstrate various approaches using Oracle and Spark. Test data:

create table t(x, y) as
(select 'A',   1 from dual
union all select 'A',   2 from dual
union all select 'A',   3 from dual
union all select 'B',   1 from dual
union all select 'B',   2 from dual
union all select 'B',   3 from dual
union all select 'C',   4 from dual
union all select 'D',   5 from dual
union all select 'D',   6 from dual
union all select 'E',   5 from dual
union all select 'E',   6 from dual);

1. Oracle model clause (version 10gR1 and later)

SQL> select x, y from
  2  (
  3    select *
  4    from t
  5    model
  6    dimension by (x, y)
  7    measures (0 z)
  8    (z[any, any] order by x, y =
  9     decode(max(z)[x<=cv(x),cv(y)]+max(z)[cv(x),y<=cv(y)],0,1,0))
 10  )
 11  where z = 1;

X          Y
- ----------
A          1
B          2
C          4
D          5
E          6

2. Oracle recursive with (version 11gR2 and later)

with t0 as(select t.*, row_number() over (order by x, y) rn from t)
, r(rn, x, y, z, c, flag) as
(
select t0.rn, t0.x, t0.y, 1, ku$_objnumset(t0.y), 1
from t0
where rn = 1
union all
select t0.rn, t0.x, t0.y
, case when (r.x <> t0.x or flag = 0) and t0.y not member of c then 1 else 0 end
, case when (r.x <> t0.x or flag = 0) and t0.y not member of c then c multiset union ku$_objnumset(t0.y) else c end
, case when (r.x <> t0.x or flag = 0) and t0.y not member of c or (flag = 1 and r.x = t0.x) then 1 else 0 end
from t0
join r on r.rn + 1 = t0.rn
)
select x, y
from r
where z = 1
order by 1;

3. Oracle recursive with + cross apply (version 12cR1 and later)

(The query below assumes a nested-table collection type, e.g. create type numbers as table of number.)

with
t0 as
 (select t.*, dense_rank() over(order by x) rnk from t),
r(rnk, x, y, c) as
 (select 1, t0.x, min(t0.y), numbers(min(t0.y))
    from t0
   where rnk = 1
   group by t0.rnk, t0.x
  union all
  select r.rnk + 1, t0.x, t0.y, decode(t0.y, null, r.c, r.c multiset union numbers(t0.y))
    from r
   cross apply (select min(t0.x) x, min(t0.y) y
                  from t0
                 where r.rnk + 1 = t0.rnk
                   and t0.y not member of r.c) t0
   where rnk < (select max(rnk) from t0)
 ) cycle x set dummy to 1 default 0
select x, y from r where y is not null order by 1;

Oracle has some issues detecting cycles when cross apply is used in the recursive member, so the dummy column was added to avoid an exception (even though there is no actual cycle).

In all of the approaches above, a Y can only be selected if it was not already selected for a previous X. If, however, the goal is to select the next min Y greater than the current one, then the following will also work.
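To make the difference concrete, here is an illustrative plain-Scala sketch of that alternative rule (names are mine, not from the answer). On the sample data it happens to produce the same five pairs, but it diverges as soon as a later X only has Ys smaller than the last chosen one:

```scala
// Alternative greedy rule: walk the X groups in order and, for each X, pick
// the minimum Y strictly greater than the previously chosen Y (carrying the
// previous Y forward when the group has no such value).
def pairIncreasing(rows: Seq[(String, Int)]): Seq[(String, Int)] =
  rows.sorted.groupBy(_._1).toSeq.sortBy(_._1)
    .foldLeft((Vector.empty[(String, Int)], Int.MinValue)) {
      case ((acc, prev), (x, grp)) =>
        grp.map(_._2).filter(_ > prev).sorted.headOption match {
          case Some(y) => (acc :+ ((x, y)), y)
          case None    => (acc, prev)   // no admissible y: this x is skipped
        }
    }._1

val sample = Seq(
  "A" -> 1, "A" -> 2, "A" -> 3, "B" -> 1, "B" -> 2, "B" -> 3,
  "C" -> 4, "D" -> 5, "D" -> 6, "E" -> 5, "E" -> 6)

pairIncreasing(sample)                 // same five pairs as the set-based rule
pairIncreasing(sample :+ ("F" -> 3))   // "F" gets nothing: 3 < last chosen 6
```

Under the set-based rule, ("F", 3) would be selected, since 3 was never assigned; under the increasing rule it cannot be.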

4. Oracle recursive with, without a collection tracking the assigned Ys

with
t0 as
 (select t.*, dense_rank() over(order by x) rnk from t),
r(rnk,y) as
 (
 select 1, min(t0.y)
   from t0
  where rnk = 1
 union all
 select r.rnk + 1,
        nvl((select min(t0.y) y
              from t0
             where r.rnk + 1 = t0.rnk
               and t0.y > r.y), r.y)
   from r
  where rnk < (select max(rnk) from t0)
 )
select x, y
from t0
join (select y, min(rnk) rnk from r group by y) r using (rnk, y) 
order by 1;

5. Oracle connect by clause (the concatenated string is limited to 4000 characters)

select regexp_substr(m, '~([^#]+)#\d+', 1, level, null, 1) x,
       regexp_substr(m, '~[^#]+#(\d+)', 1, level, null, 1) y
  from (select min(p) keep(dense_rank first order by l desc, rn) m
          from (select rownum rn,
                       level l,
                       sys_connect_by_path(x || '#' || y, '~') p
                  from t
                 start with (x, y) in (select min(x), min(y) from t)
                connect by prior x < x
                       and prior y < y
                       and rownum <= (select count(distinct x) from t)))
connect by regexp_substr(m, '~[^#]+#\d', 1, level) is not null;

And finally Spark + pure functional Scala! (not Spark SQL, though)

val q = """
          |select "A" x, 1 y
          |union all select "A", 2
          |union all select "A", 3
          |union all select "B", 1
          |union all select "B", 2
          |union all select "B", 3
          |union all select "C", 4
          |union all select "D", 5
          |union all select "D", 6
          |union all select "E", 5
          |union all select "E", 6
        """
val df = spark.sql(q)

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.DataFrame
import scala.annotation.tailrec
import spark.implicits._  // for the $"..." column syntax

The DataFrame transformation:

def filterRows(df: DataFrame) = {
  val schema = StructType(df.schema.fields :+ StructField("flag", IntegerType))
  implicit val encoder = RowEncoder(schema)

  def getFlag(chk: Boolean, y: Int, a: List[Int]): Integer =
    if (chk && a.forall(_ != y)) 1 else null
  def updAllocated(f: Integer, y: Int, l: List[Int]) = if (f == 1) y :: l else l

  @tailrec def doIt_(iter: Iterator[Row], prevX: String, doCheck: Boolean,
                     allocated: List[Int], res: Iterator[Row]): Iterator[Row] = {
    if (iter.hasNext) {
      val r = iter.next
      val curX: String = r.getAs("x")
      val curY: Int = r.getAs("y")
      val doCheck_ = doCheck || curX != prevX
      val flag = getFlag(doCheck_, curY, allocated)
      doIt_(iter, curX, doCheck_ && flag != 1, updAllocated(flag, curY, allocated),
        res ++ Iterator(Row.fromSeq(r.toSeq :+ flag)))
    } else res
  }

  def doIt(iter: Iterator[Row]): Iterator[Row] =
    doIt_(iter, "", true, List[Int](), Iterator[Row]())

  df.repartition(1).sortWithinPartitions($"x", $"y").mapPartitions(doIt).
    filter("flag is not null").select($"x", $"y")
}

Testing in the Spark shell:

scala> filterRows(df).show
+---+---+
|  x|  y|
+---+---+
|  A|  1|
|  B|  2|
|  C|  4|
|  D|  5|
|  E|  6|
+---+---+

PS. Even the most straightforward CTE adaptation (option 4 ported to the MSSQL dialect) does not work in MSSQL 2017, because aggregate functions are not allowed in a scalar subquery inside the recursive member:

GROUP BY, HAVING, or aggregate functions are not allowed in the recursive part of a recursive common table expression 'r'.

After replacing min(t0.y) with top 1 (t0.y), it throws another exception:

The TOP or OFFSET operator is not allowed in the recursive part of a recursive common table expression 'r'.

However, option 2 can be adapted to MSSQL if we track the assigned Ys in a concatenated string instead of a collection.
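As a language-neutral illustration of that substitution (a sketch of mine, not from the original answer), the same flag bookkeeping can be written in plain Scala, with a substring search on a delimited string standing in for the not member of test:

```scala
// Same greedy pairing as option 2, but the set of assigned Ys is kept in a
// '#'-delimited string (as an MSSQL recursive CTE would, lacking collections).
def pairOnceStr(rows: Seq[(String, Int)]): Seq[(String, Int)] = {
  val init = (Vector.empty[(String, Int)], "#", "", false) // (pairs, c, prevX, matched)
  rows.sorted.foldLeft(init) { case ((acc, c, prevX, matched), (x, y)) =>
    // a row qualifies if its x is not matched yet and its y is unused so far
    if ((x != prevX || !matched) && !c.contains(s"#$y#"))
      (acc :+ ((x, y)), c + s"$y#", x, true)
    else
      (acc, c, x, matched && x == prevX)
  }._1
}
```

In the SQL version, `c.contains(s"#$y#")` would become something like charindex on the accumulated string column, which is legal in a recursive member.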

