首页 > 技术文章 > CodeForces 1098F. Ж-function

TinyWong 2019-02-11 20:33 原文

题目简述:给定字符串$s[1 \dots n](n \leq 2 \times 10^5)$,以及$Q \leq 2 \times 10^5$个询问,每个询问有两个参数$1 \leq l \leq r \leq n$,求

$$ \sum_{i=l}^r \operatorname{lcp}(s[l \dots r], s[i \dots r]), $$

其中$\operatorname{lcp}(s, t)$表示字符串$s$和$t$的最长公共前缀(Longest Common Prefix)的长度。

解:

code

 

后缀自动机(Suffix Automaton)与后缀树(Suffix Tree)

后缀自动机学习资料:

0. 论文:E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3): 249-260 (1995)

1. CSDN

2. SAISUMIT

3. CodeForces

后缀树学习资料:

1. cnblogs

 

为了方便起见,我们在这里再次简要申明后缀自动机和后缀树的定义,以防止混淆各种不同定义的细微差别。

后缀自动机

一个(确定)有限状态自动机((Deterministic) Finite-State Automaton)$A = (Q, \Sigma, \delta, q_0, F)$,其中

  ·$Q$是一个有限的状态集合。

  ·$\Sigma$是一个有限的字符集。

  ·$\delta: Q \times \Sigma \to Q$是转移函数。若$\delta(p,a)=q$,则可看作是$p$至$q$有一条权值为$a$的边。

  ·$q_0 \in Q$是初始状态。

  ·$F \subseteq Q$是接受状态的集合。

一个非空字符串$s$的后缀自动机$\text{SA}(s)$是一个有限状态自动机$A = (Q, \Sigma, \delta, q_0, F)$,其中

  ·$Q$和$\delta$构成了一个连通的有向无环图(Directed Acyclic Graph)。

  ·$F = \{ f_1, f_2, \dots, f_{|s|} \}$,其中$f_{k}$对应后缀$s[k\dots |s|]$。若对$q \in Q$,定义$ \text{str}(q) = \{ t \in \Sigma^*: \delta(q_0, t) = q \} $,则$\text{str}(f_k) \supseteq \{ s[k \dots |s|] \}$。

这样,被$\text{SA}(s)$接受的字符串即为$s$的后缀,严格的说,$\mathcal{L}(\text{SA}(s)) = \text{suffix}(s)$,其中$\text{suffix}(s) = \{ s[k \dots |s|]: 1 \leq k \leq |s| \}$。

状态$q$的代表字符串是$\text{str}(q)$中的最长串,为方便起见仍然记作$q$。对任意$q \in Q$,存在$1 \leq k \leq |q|$,使得$\text{str}(q) = \{ q[i\dots |q|]: 1 \leq i \leq k \}$。定义状态$q$的后缀链接(或者称为失败指针)$\text{link}(q)$为$|p|$最长的$p$,使得$p \in \text{suffix}(q)$且$p \notin \text{str}(q)$,即$\text{link}(q) = q[k+1 \dots |q|]$。注意到$|q| > |\text{link}(q)|$,从而$Q$和$\text{link}$构成了一棵有根树,树根为$q_0$,记这棵树为$T(\text{SA}(s))$。

定理:存在$O(n)$算法,对给定字符串$s$,构造其后缀自动机$\text{SA}(s)$。

后缀树

一个非空字符串$s$的后缀树$\text{ST}(s)$是一棵(最简)字典树(Trie),这棵字典树的单词表为$\text{suffix}(s)$,即$s$的所有后缀。$\text{ST}(s)$的每条边上都有$s$的某个子串作为其权值,从而$\text{ST}(s)$每个节点$v$都对应一个$s$的子串(即从根到这个节点路径上所有边权的连接),记作$\text{str}(v)$。每个$s$的后缀$s[k \dots |s|]$都对应$\text{ST}(s)$的一个叶子节点。而每个节点$v$的父节点$w$满足$\text{str}(w)$是$\text{str}(v)$的前缀,且$\text{str}(w)$长度最长。

定理:$T(\text{SA}(s)) = \text{ST}(\text{rev}(s))$,其中$\text{rev}(s)$表示$s$的翻转字符串。

因此,可以通过先构造字符串$\text{rev}(s)$的后缀自动机,来构造$s$的后缀树。(这是因为代码量较短)

 

问题转化

定义$\text{ST}(s)$中,节点$v$的深度为$\text{depth}(v) = |\text{str}(v)|$。对$1 \leq l \leq |s| = n$,我们把$s[l\dots |s|]$对应的后缀树节点记作$l$,即$\text{str}(l) = s[l \dots n]$。令$L = r-l+1$,则

$$
\begin{aligned}
\sum_{i=l}^r \operatorname{lcp}(s[l \dots r], s[i \dots r])
& = \sum_{k=1}^{L} \operatorname{lcp}(s[l \dots r], s[l+k-1 \dots r]) \\
& = \sum_{k=1}^{L} \min\{\text{depth}(\text{LCA}(l, l+k-1)), L-k+1 \},
\end{aligned}
$$

其中$\text{LCA}(u, v)$表示节点$u$和$v$的最近公共祖先(Least/Lowest Common Ancestor)。令$v(k)$表示节点$v$的祖先中,其父节点深度$< k$的最近祖先,即:若$u = v(k)$,则$\text{depth}(u) \geq k$且$\text{depth}(\text{father}(u)) < k$。令$\text{leaf}(v) \subseteq [n]$表示节点$v$后代中的所有叶子节点,其中$[n] = \{1,2,\dots,n\}$。从而,

$$
\begin{aligned}
& = \sum_{k=1}^L \left| \text{leaf}(l(k)) \cap [l, l+L-k] \right| \\
& = \sum_{k=1}^L \left| \text{leaf}(l(k)) \cap [l+L-k] \right| - \sum_{k=1}^L \left| \text{leaf}(l(k)) \cap [l-1] \right| \\
& = S_2(L, l, l+L) - S_1(L, l, l-1)
\end{aligned}
$$

其中$S_1(L, l, M) = \sum_{k=1}^L \left| \text{leaf}(l(k)) \cap [M] \right|$,$S_2(L, l, M) = \sum_{k=1}^L \left| \text{leaf}(l(k)) \cap [M-k] \right|$。

树链剖分

令$\text{son}(v)$表示节点$v$的所有儿子节点,$\text{size}(v)$表示以节点$v$为根的子树的大小。定义$\text{heavy}(v) = \arg \max_{u \in \text{son}(v)} \text{size}(u)$表示节点$v$具有最大子树的儿子,他们之间的边称为重边,而与其余儿子之间的边称为轻边。

设根节点为$r$,对任意节点$v$,节点$v$到根节点$r$的路径$P = \text{path}(v \to r)$可划分成若干段子路径$P_1, P_2, \dots, P_m$,其中$P_i = \text{path}(x_i \to y_i)$,满足$x_1 = v$,$y_m = r$且$\text{father}(y_{i}) = x_{i+1}$,每段子路径$P_i$都仅由重边组成。我们称$P_1, P_2, \dots, P_m$是路径$P$的剖分。我们把$P_i = (\mathit{belong}_i, \mathit{no}_i, \mathit{len}_i)$记作$(y_i, v, \text{depth}(x_i)-\text{depth}(\text{father}(y_i)))$,表示该路径的根为$\mathit{belong}_i$,是路径$\text{path}(\mathit{no}_i, r)$的一段子路径,且这段路径的长度为$\mathit{len}_i$。

定理:子路径段数$m \leq \lfloor \log_2 |V| \rfloor+1$,其中$|V|$表示节点个数。

记$\text{route}(v)$表示路径$\text{path}(v \to r)$的剖分,记$\text{route}_{\leq L}(v)$表示路径$\text{path}(v \to r)$中深度$\leq L$部分的剖分,即

$$ \text{route}_{\leq L}(v) = \{ (\mathit{belong}, \mathit{no}, \min\{\mathit{len},L-\text{depth}(\text{father}(\mathit{belong}))\}): (\mathit{belong}, \mathit{no}, \mathit{len}) \in \text{route}(v) \land \text{depth}(\text{father}(\mathit{belong}))<L \}. $$

特别地,令$\text{route}_{=L}(v) = \text{route}_{\leq L}(v) - \text{route}_{\leq L-1}(v)$。

计算$S_1(L,l,M)$

于是

$$
\begin{aligned}
S_1(L, l, M) 
& = \sum_{k=1}^L \sum_{v = 1}^{M} [v \in \text{leaf}(l(k))] \\
& = \sum_{b} \sum_{k=1}^L \sum_{(b, l, \mathit{len}_q) \in \text{route}_{= k}(l)} \sum_{v=1}^{M} \sum_{(b, v, \mathit{len}_p) \in \text{route}(v)} [\mathit{len}_q \leq \mathit{len}_p] \\
& = \sum_{(b, l, \mathit{len}_q) \in \text{route}_{\leq L}(l)} \sum_{v=1}^{M} \sum_{(b, v, \mathit{len}_p) \in \text{route}(v)} \min\{\mathit{len}_q, \mathit{len}_p\}
\end{aligned}
$$

在固定$b$和$\mathit{len}_q$后,令

$$ a[i] = \sum_{v=1}^M \sum_{(b,v,i)\in\text{route}(v)} 1. $$

$$
\begin{aligned}
\sum_{v=1}^{M} \sum_{(b, v, \mathit{len}_p) \in \text{route}(v)} \min\{\mathit{len}_q, \mathit{len}_p\}
& = \sum_{i} a[i] \min\{\mathit{len}_q, i\} \\
& = \mathit{len}_q \sum_{i > \mathit{len}_q} a[i] + \sum_{i=1}^{\mathit{len}_q} ia[i],
\end{aligned}
$$

可通过维护$a[i]$的前缀和以及$ia[i]$的前缀和来计算。

我们可以在枚举$b$之后,从小到大枚举$v$(这就是扫描线),并维护$a[i]$及其相关前缀和,当枚举到$v = M$时,可计算$S_1(L,l,M)$在固定$b$后的贡献值。

计算$S_2(L,l,m)$

$$
\begin{aligned}
S_2(L, l, M) 
& = \sum_{k=1}^L \sum_{v = 1}^{M-k} [v \in \text{leaf}(l(k))] \\
& = \sum_{b} \sum_{k=1}^L \sum_{(b, l, \mathit{len}_q) \in \text{route}_{\leq k}(l)} \sum_{v=1}^{M-k} \sum_{(b, v, \mathit{len}_p) \in \text{route}(v)} [\mathit{len}_q \leq \mathit{len}_p] \\
& = \sum_{(b, l, \mathit{len}_q) \in \text{route}_{\leq L}(l)} \sum_{k=1}^{\mathit{len}_q} \sum_{v=1}^n \sum_{(b, v, \mathit{len}_p) \in \text{route}(v)} [k \leq \mathit{len}_p \land \mathit{pos}_p+k-1 \leq M],
\end{aligned}
$$

其中$\mathit{pos}_p = v+\text{depth}(\text{father}(b))+1$。

固定$b$和$\mathit{len}_q$后,讨论$\mathit{pos}_p+\mathit{len}_p$与$M+1$的大小关系,$S_2(L,l,M)$求和可分为两部分:

$$
\begin{aligned} & \quad \sum_{k=1}^{\mathit{len}_q} \sum_{v=1}^n \sum_{(b, v, \mathit{len}_p) \in \text{route}(v)} [k \leq \mathit{len}_p \land \mathit{pos}_p+k-1 \leq M] \\
& = \sum_{k=1}^{\mathit{len}_q} \sum_{v=1}^n \sum_{(b, v, \mathit{len}_p) \in \text{route}(v)} [k \leq \mathit{len}_p \land \mathit{pos}_p+k-1 \leq M \land \mathit{pos}_p+\mathit{len}_p \leq M+1] \\ & \quad + \sum_{k=1}^{\mathit{len}_q} \sum_{v=1}^n \sum_{(b, v, \mathit{len}_p) \in \text{route}(v)} [k \leq \mathit{len}_p \land \mathit{pos}_p+k-1 \leq M \land \mathit{pos}_p+\mathit{len}_p > M+1] \end{aligned}
$$

Part 1

若$\mathit{pos}_p+\mathit{len}_p \leq M+1$,则条件$\mathit{pos}_p+k-1 \leq M$是冗余的,于是求和可简化为

$$ \begin{aligned} & \quad \sum_{k=1}^{\mathit{len}_q} \sum_{v=1}^n \sum_{(b, v, \mathit{len}_p) \in \text{route}(v)} [k \leq \mathit{len}_p \land \mathit{pos}_p+k-1 \leq M \land \mathit{pos}_p+\mathit{len}_p \leq M+1] \\ & = \sum_{k=1}^{\mathit{len}_q} \sum_{v=1}^n \sum_{(b, v, \mathit{len}_p) \in \text{route}(v)} [k \leq \mathit{len}_p \land \mathit{pos}_p+\mathit{len}_p \leq M+1] \\ & = \sum_{v=1}^n \sum_{(b, v, \mathit{len}_p) \in \text{route}(v)} [\mathit{pos}_p+\mathit{len}_p \leq M+1] \min\{\mathit{len}_q, \mathit{len}_p\} \end{aligned} $$

类似于$S_1$的计算方式,令

$$ a[i] = \sum_{v=1}^n \sum_{(b,v,\mathit{len}_p)\in\text{route}(v)} [\mathit{len}_p+\mathit{pos}_p \leq M+1 \land \mathit{len}_p = i]. $$

$$\begin{aligned}
 \sum_{v=1}^n \sum_{(b, v, \mathit{len}_p) \in \text{route}(v)} [\mathit{pos}_p+\mathit{len}_p \leq M+1] \min\{\mathit{len}_q, \mathit{len}_p\} 
& = \sum_{i} a[i] \min\{\mathit{len}_q, i\} \\
& = \mathit{len}_q \sum_{i > \mathit{len}_q} a[i] + \sum_{i=1}^{\mathit{len}_q} ia[i],
\end{aligned}$$

我们以$\mathit{pos}_p+\mathit{len}_p$从小到大做扫描线,维护$a[i]$及其相关前缀和,当枚举到$\mathit{pos}_p+\mathit{len}_p = M+1$时,可计算 Part 1 的贡献值。

Part 2

若$\mathit{pos}_p+\mathit{len}_p > M+1$,则$1+\mathit{pos}_p \leq k+\mathit{pos}_p \leq M+1 < \mathit{len}_p+\mathit{pos}_p$。

$$
\begin{aligned}
& \quad \sum_{k=1}^{\mathit{len}_q} \sum_{v=1}^n \sum_{(b, v, \mathit{len}_p) \in \text{route}(v)} [k \leq \mathit{len}_p \land \mathit{pos}_p+k-1 \leq M \land \mathit{pos}_p+\mathit{len}_p > M+1] \\
& = \sum_{k=1}^{\mathit{len}_q} \sum_{v=1}^n \sum_{(b, v, \mathit{len}_p) \in \text{route}(v)} [1+\mathit{pos}_p \leq k+\mathit{pos}_p \leq M+1 \land \mathit{pos}_p+\mathit{len}_p > M+1] \\
& = \sum_{k=1}^{\mathit{len}_q} \sum_{v=1}^n \sum_{(b, v, \mathit{len}_p) \in \text{route}(v)} [\mathit{pos}_p \leq M-k+1 \land \mathit{pos}_p+\mathit{len}_p > M+1]
\end{aligned}
$$

$$ a[i] = \sum_{v=1}^n \sum_{(b,v,\mathit{len}_p)\in\text{route}(v)} [\mathit{len}_p+\mathit{pos}_p > M+1 \land \mathit{pos}_p = i]. $$

$$
\begin{aligned}
& \quad \sum_{k=1}^{\mathit{len}_q} \sum_{v=1}^n \sum_{(b, v, \mathit{len}_p) \in \text{route}(v)} [\mathit{pos}_p \leq M-k+1 \land \mathit{pos}_p+\mathit{len}_p > M+1] \\
& = \sum_{k=1}^{\mathit{len}_q} s[M-k+1] \\
& = \mathit{len}_q s[M-\mathit{len}_q+1] + (M+1) (s[M]-s[M-\mathit{len}_q+1]) - (t[M]-t[M-\mathit{len}_q+1]),
\end{aligned}
$$

其中$s[i] = a[1]+\dots+a[i], t[i] = a[1]+2a[2]+\dots+ia[i]$。

我们以$\mathit{pos}_p+\mathit{len}_p$从大到小做扫描线,维护$a[i]$及其相关前缀和,当枚举到$\mathit{pos}_p+\mathit{len}_p = M+2$时,可计算 Part 2 的贡献值。

 

时间复杂度

设字符串$s$的长度为$|s| = n$,构建后缀树的时间为$O(n)$。根据树链剖分,子问题一共有$O((n+q) \log n)$个,每个子问题需要离线做扫描线利用树状数组求解,故总的时间复杂度为$O((n+q)\log^2 n)$。

推荐阅读