首页 > 解决方案 > Convert regex pattern to LL1 parser

问题描述

Background: I'm trying to solve this leetcode problem: regular-expression-matching. My approach is to implement a LL(1) parser generator, might be overkilling for this problem but it's just a brain exercise.

I only have entry level knowledge of parser theory, so excuse me if this is a silly question that I ask.

So, there's this testcase that I fail. Requirement is regex pattern should match the whole string

Regex: a*a
Input: aaa

I cannot wrap my mind about how to convert this pattern into a LL(1) parser.

The way I see it, a*a pattern can be converted into production rules:

S -> Aa      # a*a
A -> aA | ε  # a*

Parse table:

a $
S S -> Aa
A A -> aA A -> ε

Here's the steps to parse:

0: S$    aaa$  # use [S,a]
1: Aa$   aaa$  # use [A,a]
2: aAa$  aaa$  # eat 'a'
3: Aa$   aa$   # use [A,a]
4: aAa$  aa$   # eat 'a'
5: Aa$   a$    # use [A,a]
6: aAa$  a$    # eat 'a'
7: Aa$   $     # use [A,$]
8: a$    $     # Error!

The correct matching should be:

a* -> aa
a -> a

But what I get instead is:

a* -> aaa
a -> Error!

I don't know which part I'm missing .

标签: regexparsing

解决方案


At parsing step 5, the parser should have used [A,$] instead of [A,a]. But it could only make this decision by looking ahead, or with backtracking. Both of which are beyond LL(1).

The cause of the problem is in your production rules. There is a FIRST/FOLLOW conflict: FIRST(A) contains ε, and FIRST(A) and FOLLOW(A) both contain a. This is explained with an example here:

https://inst.eecs.berkeley.edu/~cs164/sp18/discussion/04/04-solutions.pdf

In more intuitive terms, you are demanding the parser to look ahead, beyond the a in S -> Aa, to be able to decide how many terminals a should be consumed by A -> aA | ε.

Basically, an LL(1) parser cannot parse a*a. It would have to be rewritten to aa* first. There may be algorithms that are successful at rewriting something as simple as a*a to aa*, but it is impossible to rewrite every possible regex to an LL(1) grammar.


推荐阅读