A few weeks ago I read this article about writing a simple regular expression parser. That article does a really good job of explaining the theory behind regular expression. It then goes step by step into how to write a program (he uses C++) to parse a regular expression, convert it into a NFA, convert that into a DFA and then use that DFA to match strings.
After reading that I decided to write my own simple regular expression parser using Haskell. I saw it as a challenge to try to see how you deal with a more complex program in a pure functional language. After a couple weeks ( Grand Theft Auto 4 kind of ruined my progress for a while ) I have some results.
I split the project into 3 modules.
- RegexToNFA – Provides functionality to parse a simple regular expression and return a NFA.
- This modules also define the FiniteMachine type which is a general structure for finite state automata.
- NFAtoDFA – Providers functionality to convert a NFA into a DFA.
- This module uses the same FiniteMachine type from RegexToNFA
- SimpleRegex – Provides the functionality to give take a regular expression and a string and return what it matches (if it matches anything).
- This modules uses RegexToNFA and sends its results to NFAtoDFA and then uses the resulting DFA to match against a string.
This is a very simple and limited regular expression parser. It supports only union(|), concatenation, closure(*) and parenthesis. In addition, I don’t preserve information after the NFA is created about the location of the parenthesis. This means you can’t pull out sub-matches when a entire expression matches.
In my next three I will talk about each module and point out interesting parts of them. There is nothing too complex but shows how to approach it in Haskell (making heavy use of the State monad).
If you want to see a much more complex and full featured regular expression parser written in Haskell take a look at this.
Click here to continue to Part 2.