
curtisxiao 2019-05-07 21:20

Beam Search

Greedy decoding: generate (or “decode”) the target sentence by taking the argmax at each step of the decoder (a minimal sketch of this loop follows the example below).

Problem with greedy decoding:

  • Greedy decoding has no way to undo decisions!
    • Input: il a m’entarté (he hit me with a pie)
    • → he ____
    • → he hit ____
    • → he hit a ____ (whoops! no going back now…)
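
A minimal sketch of the greedy loop (my own illustration, not from the original notes): `model.step(src, prefix)` is a hypothetical interface that returns a dict mapping each vocabulary token to \(P(y_t | y_1, \ldots, y_{t-1}, x)\); the decoder simply commits to the argmax at every step and can never revisit that choice.

```python
def greedy_decode(model, src, max_len=50, end_token="<END>"):
    """Greedily decode by taking the argmax at every step.

    `model.step(src, prefix)` is a hypothetical interface returning a dict
    that maps each vocabulary token to P(token | prefix, src).
    """
    prefix = []
    for _ in range(max_len):
        probs = model.step(src, prefix)
        # Commit to the single most probable token -- there is no way to undo
        # this choice later, which is exactly the problem illustrated above.
        best_token = max(probs, key=probs.get)
        prefix.append(best_token)
        if best_token == end_token:
            break
    return prefix
```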

Exhaustive search decoding

Ideally we want to find a (length T) translation y that maximizes :

\[\begin{aligned} P(y | x) &=P\left(y_{1} | x\right) P\left(y_{2} | y_{1}, x\right) P\left(y_{3} | y_{1}, y_{2}, x\right) \ldots P\left(y_{T} | y_{1}, \ldots, y_{T-1}, x\right) \\ &=\prod_{t=1}^{T} P\left(y_{t} | y_{1}, \ldots, y_{t-1}, x\right) \end{aligned} \]

We could try computing all possible sequences y:

  • This means that on each step t of the decoder, we’re tracking \(V^t\) possible partial translations, where \(V\) is vocab size
  • This \(O(V^{T})\) total complexity is far too expensive (see the worked example below)!
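
To get a feel for the numbers (a back-of-the-envelope example added here, not from the original notes): even with a modest vocabulary of \(V = 50{,}000\) and a short target length of \(T = 10\), exhaustive search would have to score

\[V^{T} = 50{,}000^{10} \approx 9.8 \times 10^{46} \]

candidate sequences, which is clearly infeasible.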

beam search

  • Core idea: On each step of the decoder, keep track of the k most probable partial translations (which we call hypotheses), where \(k\) is the beam size (in practice around 5 to 10)

  • A hypothesis \(y_1,\cdots,y_t\) has a score which is its log probability:

    \[\operatorname{score}\left(y_{1}, \ldots, y_{t}\right)=\log P_{\mathrm{LM}}\left(y_{1}, \ldots, y_{t} | x\right)=\sum_{i=1}^{t} \log P_{\mathrm{LM}}\left(y_{i} | y_{1}, \ldots, y_{i-1}, x\right) \]

    • Scores are all negative, and higher score is better
    • We search for high-scoring hypotheses, tracking top \(k\) on each step
  • Beam search is not guaranteed to find the optimal solution

  • But it is much more efficient than exhaustive search! (A minimal one-step sketch follows below.)
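
A minimal sketch (my own illustration, reusing the same hypothetical `model.step` interface as in the greedy example) of a single beam-search step: every one of the \(k\) current hypotheses is extended by every vocabulary token, each extension is scored by the cumulative log-probability defined above, and only the \(k\) highest-scoring extensions are kept.

```python
import math

def beam_step(model, src, beam, k):
    """Expand every hypothesis in `beam` by one token and keep the k best.

    `beam` is a list of (tokens, score) pairs, where `score` is the sum of
    log-probabilities accumulated so far (scores are <= 0; higher is better).
    """
    candidates = []
    for tokens, score in beam:
        probs = model.step(src, tokens)            # P(y_t | y_1, ..., y_{t-1}, x)
        for token, p in probs.items():
            candidates.append((tokens + [token], score + math.log(p)))
    # Keep only the k most probable partial translations as the new beam.
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:k]
```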

Beam search decoding: stopping criterion

  • In greedy decoding, usually we decode until the model produces an <END> token
  • In beam search decoding, different hypotheses may produce <END> tokens on different timesteps
    • When a hypothesis produces <END>, that hypothesis is complete.
    • Place it aside and continue exploring other hypotheses via beam search.
  • Usually we continue beam search until:
    • We reach timestep T (where T is some pre-defined cutoff), or
    • We have at least n completed hypotheses (where n is a pre-defined cutoff); a rough sketch of this loop is given below
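
Putting the stopping rules together, the outer decoding loop might look roughly like the sketch below (again my own illustration, reusing the hypothetical `beam_step` and `model.step` from above): hypotheses that produce <END> are set aside as completed, and decoding stops once the timestep cutoff T is reached or n hypotheses have completed.

```python
def beam_search(model, src, k=5, max_steps=50, n_completed=5, end_token="<END>"):
    beam = [([], 0.0)]                 # one empty hypothesis with log-probability 0
    completed = []
    for _ in range(max_steps):         # pre-defined cutoff T
        beam = beam_step(model, src, beam, k)
        still_open = []
        for tokens, score in beam:
            if tokens[-1] == end_token:
                completed.append((tokens, score))   # place completed hypothesis aside
            else:
                still_open.append((tokens, score))
        beam = still_open
        if len(completed) >= n_completed or not beam:
            break
    return completed if completed else beam
```

For simplicity this sketch lets the beam shrink as hypotheses complete; real implementations usually keep the beam full by taking the top k among the still-open extensions.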

Beam search decoding: finishing up

  • We have our list of completed hypotheses.
  • How do we select the top one, i.e. the one with the highest score?
  • Each hypothesis \(y_1,\cdots,y_t\) on our list has a score

    \[\operatorname{score}\left(y_{1}, \ldots, y_{t}\right)=\log P_{\mathrm{LM}}\left(y_{1}, \ldots, y_{t} | x\right)=\sum_{i=1}^{t} \log P_{\mathrm{LM}}\left(y_{i} | y_{1}, \ldots, y_{i-1}, x\right) \]

Problem with this: longer hypotheses have lower scores, simply because each additional term in the sum adds another negative log-probability.

Fix: normalize by length. Use the length-normalized score below to select the top one instead (a small selection sketch follows the formula):

\[\frac{1}{t} \sum_{i=1}^{t} \log P_{\mathrm{LM}}\left(y_{i} | y_{1}, \ldots, y_{i-1}, x\right) \]
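
A small sketch of this final selection step (my illustration): divide each completed hypothesis's summed log-probability by its length \(t\), and return the hypothesis whose length-normalized score is highest.

```python
def select_best(completed):
    """Pick the completed hypothesis with the highest length-normalized score.

    `completed` is a list of (tokens, score) pairs, where `score` is the
    summed log-probability of that hypothesis.
    """
    return max(completed, key=lambda c: c[1] / len(c[0]))
```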
