Writing a Regular Expression parser in Haskell: Part 2

The first module in my simple regular expression parse is called RegexToNFA. This module exposes the types that make up a finite state machine and also the functions to convert a regular expression string into a finite state machine.

My structure for a FSM follows closely from the mathematical definition:

data FiniteMachine = FiniteMachine{ table :: [Transition],
                                    alphabet :: Set Char,
                                    start :: Node,
                                    final :: Set Node
                                  } deriving (Show, Eq)

-- NFA node
type Node = Integer

-- The value for an edge in a NFA
type TransitionValue = Maybe Char

-- A transition in a NFA is a tuple of
-- StartNode , DestinationNode, Value to transition on
type Transition = (Node,Node,TransitionValue)

-- The value of the edge in the NFA is a Maybe Char
-- Where Nothing is the epsilon transition
-- therefore lets just rename Nothing to epsilon
epsilon = Nothing

I have the value which you transition on as a Maybe Char (which I alias as TransitionValue). This allowed me to define epsilon as Nothing data constructor.

With this structure defined my goal now is to convert a regular expression pattern such as: (a|b)* into a FiniteMachine. In order to do this there is a lot of state that I need to keep track of which naturally leads to the use of the State monad. To do this I set up a structure for what data I want to be kept track of and then create a state monad using that structure:

-- The state that gets passed around which we used to build up the NFA
data ParseContext = Context
                    {
                      nodeList :: [Node],
                      transitions :: [Transition],
                      operators :: OperatorList,
                      nextNode :: Node,
                      values :: Set Char
                    } deriving (Show, Eq)

-- Alias the State data constructor with a more friendly name
type RegexParseState a = State ParseContext a

This structure is passed between functions to allow them to see the current state of the parsing and create a new state. I define many functions, each which deal with a piece of the puzzle of converting the input string into a FSM. I am not going to address them all but I will point out some which is note worthy:

convertToNFA – This is the top level function, it is exposed externally and lets you convert a regex to a NFA.

processOperator – This function determines when we should execute an operator given its precedence. We assign each operator a precedence which lets us determine when we should execute an operator. For example in the expression a|b*, we want to execute star before we execute union.

Last but not least are the methods which execute the operators. For example, there is one called doConcat, which performs the concatenation of two values in the regular expression. doConcat isn’t pretty, since its doing the dirty work of examining the state and create a new state to reflect a partially completed FSM.

-- Execute the concat operator
doConcat :: RegexParseState ()
doConcat = do
  st <- get
  let nodes = nodeList st
      newNodes = (nodes !! 0) : (nodes !! 3) : (drop 4  nodes)
      newTransitions = transitions st ++ [(nodes !! 2, nodes !! 1, epsilon)]
      newOperators = tail $ operators st
  put $ st { nodeList = newNodes,
             transitions = newTransitions ,
             operators = newOperators}

With all this in place, lets finally see what this module actually outputs.

> convertToNFA "a|(bc)*"

FiniteMachine {table =
[(4,5,Just 'c'),(2,3,Just 'b'),(0,1,Just'a'),
(3,4,Nothing),(6,2,Nothing),(6,0,Nothing),
(1,7,Nothing),(5,7,Nothing),(8,6,Nothing),
(8,7,Nothing),(7,9,Nothing),(9,8,Nothing)],
alphabet = fromList "abc",
start = 8,
final = fromList [9]}

If you examine the table list in the output, you will see all the transitions for the NFA that accepts “a|(bc)*” and that the start state is node 8 and the accept state is node 9.

You can find this code in the RegexToNFA.hs file inside the [download id=”6″ format=”3″] package. I tried to comment it a good amount and I feel it should be pretty easy to read and understand.

In the next part I will talk about the next modules: NFAtoDFA

Matthew Manela

Building high quality software and teams

Writing a Regular Expression parser in Haskell: Part 2