sql - SQL 将列配对一次
问题描述
给定一个包含 X 和 Y 列的表 T,我想获得 X 和 Y 列的对,但是一旦在对中使用了一个值,我不希望它再次出现。例子:
X Y
-----
A 1
A 2
A 3
B 1
B 2
B 3
C 4
D 5
D 6
E 5
E 6
我希望 SQL 输出:
X Y
-----
A 1
B 2
C 4
D 5
E 6
请注意,需要忽略 Y 中的 3 的值,因为在 X 列上没有其他匹配值可供它使用。需要保留顺序并且第一个匹配对是有效的。
解决方案
这只能在纯 SQL 中通过特定数据库(如 Oracle)的高级特性来解决。此任务也是一个很好的示例,可以演示特定 SQL 功能何时真正重要。
下面我将演示使用 Oracle 和 Spark 的各种方法。
create table t(x, y) as
(select 'A', 1 from dual
union all select 'A', 2 from dual
union all select 'A', 3 from dual
union all select 'B', 1 from dual
union all select 'B', 2 from dual
union all select 'B', 3 from dual
union all select 'C', 4 from dual
union all select 'D', 5 from dual
union all select 'D', 6 from dual
union all select 'E', 5 from dual
union all select 'E', 6 from dual);
1. Oracle model clause
(版本 10gR1 及更高版本)
SQL> select x, y from
2 (
3 select *
4 from t
5 model
6 dimension by (x, y)
7 measures (0 z)
8 (z[any, any] order by x, y =
9 decode(max(z)[x<=cv(x),cv(y)]+max(z)[cv(x),y<=cv(y)],0,1,0))
10 )
11 where z = 1;
X Y
- ----------
A 1
B 2
C 4
D 5
E 6
2. Oracle recursive with
(版本 11gR2 及更高版本)
with t0 as(select t.*, row_number() over (order by x, y) rn from t)
, r(rn, x, y, z, c, flag) as
(
select t0.rn, t0.x, t0.y, 1, ku$_objnumset(t0.y), 1
from t0
where rn = 1
union all
select t0.rn, t0.x, t0.y
, case when (r.x <> t0.x or flag = 0) and t0.y not member of c then 1 else 0 end
, case when (r.x <> t0.x or flag = 0) and t0.y not member of c then c multiset union ku$_objnumset(t0.y) else c end
, case when (r.x <> t0.x or flag = 0) and t0.y not member of c or (flag = 1 and r.x = t0.x) then 1 else 0 end
from t0
join r on r.rn + 1 = t0.rn
)
select x, y
from r
where z = 1
order by 1;
3. Oracle recursive with + cross apply
(版本 12cR1 及更高版本)
with
t0 as
(select t.*, dense_rank() over(order by x) rnk from t),
r(rnk, x, y, c) as
(select 1, t0.x, min(t0.y), numbers(min(t0.y))
from t0
where rnk = 1
group by t0.rnk, t0.x
union all
select r.rnk + 1, t0.x, t0.y, decode(t0.y, null, r.c, r.c multiset union numbers(t0.y))
from r
cross apply (select min(t0.x) x, min(t0.y) y
from t0
where r.rnk + 1 = t0.rnk
and t0.y not member of r.c) t0
where rnk < (select max(rnk) from t0)
) cycle x set dummy to 1 default 0
select x, y from r where y is not null order by 1;
当在递归成员中使用交叉应用时,Oracle 在检测循环方面存在一些问题,因此添加了虚拟列以避免异常(即使没有实际循环)。
在所有上述方法中,只有在之前的 sy
中没有选择它时才能选择它。X
但是,如果目标是选择y
大于 current的 next min y
,那么下面也将起作用。
4. Oracle 递归有和没有集合来跟踪分配Y
的 s
with
t0 as
(select t.*, dense_rank() over(order by x) rnk from t),
r(rnk,y) as
(
select 1, min(t0.y)
from t0
where rnk = 1
union all
select r.rnk + 1,
nvl((select min(t0.y) y
from t0
where r.rnk + 1 = t0.rnk
and t0.y > r.y), r.y)
from r
where rnk < (select max(rnk) from t0)
)
select x, y
from t0
join (select y, min(rnk) rnk from r group by y) r using (rnk, y)
order by 1;
5. Oracleconnect by
子句(连接字符串的限制为 4000 个字符)
select regexp_substr(m, '~([^#]+)#\d+', 1, level, null, 1) x,
regexp_substr(m, '~[^#]+#(\d+)', 1, level, null, 1) y
from (select min(p) keep(dense_rank first order by l desc, rn) m
from (select rownum rn,
level l,
sys_connect_by_path(x || '#' || y, '~') p
from t
start with (x, y) in (select min(x), min(y) from t)
connect by prior x < x
and prior y < y
and rownum <= (select count(distinct x) from t)))
connect by regexp_substr(m, '~[^#]+#\d', 1, level) is not null;
最后是 Spark + 纯函数式 Scala!(虽然不是 Spark SQL)
val q = """
|select "A" x, 1 y
|union all select "A", 2
|union all select "A", 3
|union all select "B", 1
|union all select "B", 2
|union all select "B", 3
|union all select "C", 4
|union all select "D", 5
|union all select "D", 6
|union all select "E", 5
|union all select "E", 6
"""
val df = spark.sql(q)
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.DataFrame
import scala.annotation.tailrec
数据框的转换。
def filterRows(df: DataFrame) = {
val schema = StructType(df.schema.fields :+ StructField("flag", IntegerType))
implicit val encoder = RowEncoder(schema)
def getFlag(chk: Boolean, y: Int, a: List[Int]): Integer =
if (chk && a.forall(_ != y)) 1 else null
def updAllocated(f: Integer, y: Int, l: List[Int]) = if (f == 1) y :: l else l
@tailrec def doIt_(iter: Iterator[Row], prevX: String, doCheck: Boolean,
allocated: List[Int], res: Iterator[Row]): Iterator[Row] = {
if (iter.hasNext) {
val r = iter.next
val curX: String = r.getAs("x")
val curY: Int = r.getAs("y")
val doCheck_ = doCheck || curX != prevX
val flag = getFlag(doCheck_, curY, allocated)
doIt_(iter, curX, doCheck_ && flag != 1, updAllocated(flag, curY, allocated),
res ++ Iterator(Row.fromSeq(r.toSeq :+ flag)))
} else res
}
def doIt(iter: Iterator[Row]): Iterator[Row] =
doIt_(iter, "", true, List[Int](), Iterator[Row]())
df.repartition(1).sortWithinPartitions($"x", $"y").mapPartitions(doIt).
filter("flag is not null").select($"x", $"y")
}
在 Spark shell 中测试
scala> filterRows(df).show
+---+---+
| x| y|
+---+---+
| A| 1|
| B| 2|
| C| 4|
| D| 5|
| E| 6|
+---+---+
PS。即使是最直接的 CTE 修改(适用于 MSSQL 方言的选项 4)在 MSSQL 2017 中也不起作用。因为递归成员的标量子查询中不允许使用组函数。
GROUP BY, HAVING, or aggregate functions are not allowed in the recursive part of a recursive common table expression 'r'.
min(t0.y)
用它替换后top 1 (t0.y)
抛出异常
The TOP or OFFSET operator is not allowed in the recursive part of a recursive common table expression 'r'.
Y
但是,如果我们在串联字符串而不是集合中跟踪分配的 s,则选项 2 可以适用于 MSSQL 。