
curtisxiao 2019-05-07 21:20

Beam Search

Greedy decoding: generate (or “decode”) the target sentence by taking the argmax at each step of the decoder (a minimal sketch of this loop follows the example below).

Problem with greedy decoding:

  • Greedy decoding has no way to undo decisions!
    • Input: il a m’entarté (he hit me with a pie)
    • → he ____
    • → he hit ____
    • → he hit a ____ (whoops! no going back now…)
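
A minimal sketch of the greedy loop (my own illustration, not from the original notes): `model.step(src, prefix)` is a hypothetical interface that returns a dict mapping each vocabulary token to \(P(y_t | y_1, \ldots, y_{t-1}, x)\); the decoder simply commits to the argmax at every step and can never revisit that choice.

```python
def greedy_decode(model, src, max_len=50, end_token="<END>"):
    """Greedily decode by taking the argmax at every step.

    `model.step(src, prefix)` is a hypothetical interface returning a dict
    that maps each vocabulary token to P(token | prefix, src).
    """
    prefix = []
    for _ in range(max_len):
        probs = model.step(src, prefix)
        # Commit to the single most probable token -- there is no way to undo
        # this choice later, which is exactly the problem illustrated above.
        best_token = max(probs, key=probs.get)
        prefix.append(best_token)
        if best_token == end_token:
            break
    return prefix
```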

Exhaustive search decoding

Ideally we want to find a (length T) translation y that maximizes :

\[\begin{aligned} P(y | x) &=P\left(y_{1} | x\right) P\left(y_{2} | y_{1}, x\right) P\left(y_{3} | y_{1}, y_{2}, x\right) \ldots P\left(y_{T} | y_{1}, \ldots, y_{T-1}, x\right) \\ &=\prod_{t=1}^{T} P\left(y_{t} | y_{1}, \ldots, y_{t-1}, x\right) \end{aligned} \]

We could try computing all possible sequences y:

  • This means that on each step t of the decoder, we’re tracking \(V^t\) possible partial translations, where \(V\) is vocab size
  • This \(O(V^{T})\) total complexity is far too expensive (see the worked example below)!
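
To get a feel for the numbers (a back-of-the-envelope example added here, not from the original notes): even with a modest vocabulary of \(V = 50{,}000\) and a short target length of \(T = 10\), exhaustive search would have to score

\[V^{T} = 50{,}000^{10} \approx 9.8 \times 10^{46} \]

candidate sequences, which is clearly infeasible.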

beam search

  • Core idea: On each step of the decoder, keep track of the k most probable partial translations (which we call hypotheses), where \(k\) is the beam size (in practice around 5 to 10)

  • A hypothesis \(y_1,\cdots,y_t\) has a score which is its log probability:

    \[\operatorname{score}\left(y_{1}, \ldots, y_{t}\right)=\log P_{\mathrm{LM}}\left(y_{1}, \ldots, y_{t} | x\right)=\sum_{i=1}^{t} \log P_{\mathrm{LM}}\left(y_{i} | y_{1}, \ldots, y_{i-1}, x\right) \]

    • Scores are all negative, and higher score is better
    • We search for high-scoring hypotheses, tracking top \(k\) on each step
  • Beam search is not guaranteed to find the optimal solution

  • But it is much more efficient than exhaustive search! (A minimal one-step sketch follows below.)
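
A minimal sketch (my own illustration, reusing the same hypothetical `model.step` interface as in the greedy example) of a single beam-search step: every one of the \(k\) current hypotheses is extended by every vocabulary token, each extension is scored by the cumulative log-probability defined above, and only the \(k\) highest-scoring extensions are kept.

```python
import math

def beam_step(model, src, beam, k):
    """Expand every hypothesis in `beam` by one token and keep the k best.

    `beam` is a list of (tokens, score) pairs, where `score` is the sum of
    log-probabilities accumulated so far (scores are <= 0; higher is better).
    """
    candidates = []
    for tokens, score in beam:
        probs = model.step(src, tokens)            # P(y_t | y_1, ..., y_{t-1}, x)
        for token, p in probs.items():
            candidates.append((tokens + [token], score + math.log(p)))
    # Keep only the k most probable partial translations as the new beam.
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:k]
```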

Beam search decoding: stopping criterion

  • In greedy decoding, usually we decode until the model produces an <END> token
  • In beam search decoding, different hypotheses may produce <END> tokens on different timesteps
    • When a hypothesis produces <END>, that hypothesis is complete.
    • Place it aside and continue exploring other hypotheses via beam search.
  • Usually we continue beam search until:
    • We reach timestep T (where T is some pre-defined cutoff), or
    • We have at least n completed hypotheses (where n is a pre-defined cutoff); a rough sketch of this loop is given below
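
Putting the stopping rules together, the outer decoding loop might look roughly like the sketch below (again my own illustration, reusing the hypothetical `beam_step` and `model.step` from above): hypotheses that produce <END> are set aside as completed, and decoding stops once the timestep cutoff T is reached or n hypotheses have completed.

```python
def beam_search(model, src, k=5, max_steps=50, n_completed=5, end_token="<END>"):
    beam = [([], 0.0)]                 # one empty hypothesis with log-probability 0
    completed = []
    for _ in range(max_steps):         # pre-defined cutoff T
        beam = beam_step(model, src, beam, k)
        still_open = []
        for tokens, score in beam:
            if tokens[-1] == end_token:
                completed.append((tokens, score))   # place completed hypothesis aside
            else:
                still_open.append((tokens, score))
        beam = still_open
        if len(completed) >= n_completed or not beam:
            break
    return completed if completed else beam
```

For simplicity this sketch lets the beam shrink as hypotheses complete; real implementations usually keep the beam full by taking the top k among the still-open extensions.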

Beam search decoding: finishing up

  • We have our list of completed hypotheses.
  • How do we select the top one, i.e. the one with the highest score?
  • Each hypothesis \(y_1,\cdots,y_t\) on our list has a score

    \[\operatorname{score}\left(y_{1}, \ldots, y_{t}\right)=\log P_{\mathrm{LM}}\left(y_{1}, \ldots, y_{t} | x\right)=\sum_{i=1}^{t} \log P_{\mathrm{LM}}\left(y_{i} | y_{1}, \ldots, y_{i-1}, x\right) \]

Problem with this: longer hypotheses have lower scores, simply because each additional term in the sum adds another negative log-probability.

Fix: normalize by length. Use the length-normalized score below to select the top one instead (a small selection sketch follows the formula):

\[\frac{1}{t} \sum_{i=1}^{t} \log P_{\mathrm{LM}}\left(y_{i} | y_{1}, \ldots, y_{i-1}, x\right) \]
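
A small sketch of this final selection step (my illustration): divide each completed hypothesis's summed log-probability by its length \(t\), and return the hypothesis whose length-normalized score is highest.

```python
def select_best(completed):
    """Pick the completed hypothesis with the highest length-normalized score.

    `completed` is a list of (tokens, score) pairs, where `score` is the
    summed log-probability of that hypothesis.
    """
    return max(completed, key=lambda c: c[1] / len(c[0]))
```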
