Archive

Archive for the ‘Regular Expression’ Category

Inline Regular Expression Options

January 1st, 2009
Comments Off

I was using attributes from the System.ComponentModel.DataAnnotations   namespace for model validation.  This namespace includes a few very useful validation attributes such as

    1. Required Attribute – Validates the field has a value
    2. Range Attribute – Validates the field is within a given range
    3. RegularExpression Attribute – Validates the field matches a given regular expression.

 

The regular expression attribute is very useful since you can describe exactly what format you want a string property to be in.  While using this though I ran into a problem.  The attribute doesn’t let you specify RegexOptions.  This was an issue for me since I wanted to use the regex to validate that the users input was between 5 and 200 characters long , so I had attribute that property as such:

   1: [RegularExpression(@".{5,200}")]
   2: public string Text{get;set;}

 

However this doesn’t work since by default the wildcard . does not match new lines (which I am allowing in the input).  The way to fix this is to specify the RegexOptions.SingleLine option to either the Regex constructor or Match function.  The problem is I have no way of doing that here, and there is no argument on the attribute constructor to specify those options.  I was considering overriding the attribute to create one that allows specifying the attribute but then I stumbled upon this:

Regular Expression Options

You are able to specify the regex option inside of the regular expression text! (which I thought was a huge discovery until my co-worker said he knew this all along but never let me know!). 

So I just changed the expression to look like this:

   1: [RegularExpression(@"(?s).{5,200}")]
   2: public string Text{get; set;}

The (?s) is the inline regex option definition so say I want this in SingleLine mode! And now the validation works the way I wanted!

Author: MattManela Categories: C#, Regular Expression Tags:

Inline Regular Expression Options

January 1st, 2009
Comments Off

I was using attributes from the System.ComponentModel.DataAnnotations   namespace for model validation.  This namespace includes a few very useful validation attributes such as

    1. Required Attribute – Validates the field has a value
    2. Range Attribute – Validates the field is within a given range
    3. RegularExpression Attribute – Validates the field matches a given regular expression.

 

The regular expression attribute is very useful since you can describe exactly what format you want a string property to be in.  While using this though I ran into a problem.  The attribute doesn’t let you specify RegexOptions.  This was an issue for me since I wanted to use the regex to validate that the users input was between 5 and 200 characters long , so I had attribute that property as such:

   1: [RegularExpression(@".{5,200}")]
   2: public string Text{get;set;}

 

However this doesn’t work since by default the wildcard . does not match new lines (which I am allowing in the input).  The way to fix this is to specify the RegexOptions.SingleLine option to either the Regex constructor or Match function.  The problem is I have no way of doing that here, and there is no argument on the attribute constructor to specify those options.  I was considering overriding the attribute to create one that allows specifying the attribute but then I stumbled upon this:

Regular Expression Options

You are able to specify the regex option inside of the regular expression text! (which I thought was a huge discovery until my co-worker said he knew this all along but never let me know!). 

So I just changed the expression to look like this:

   1: [RegularExpression(@"(?s).{5,200}")]
   2: public string Text{get; set;}

The (?s) is the inline regex option definition so say I want this in SingleLine mode! And now the validation works the way I wanted!

Author: MattManela Categories: C#, Regular Expression Tags:

Writing a Regular Expression parser in Haskell: Part 4

March 11th, 2008

With the previous two modules in place we are now set up to use a DFA to match against a string.  In my implementation I support either a greedy match or an short match.  In a full featured regular expression engine this ability to choose greedy or not would be per operator but for simplicity I have it for the overall match. 

 

To do the matching I have a general function which will create a list of all matches.  Then the difference between short and greedy matching is which of the candidate solutions does it choose.

This is the method:

   1: doMatch func machine st [] = doAccept  machine st []
   2: doMatch func machine st string =  func $ map (\f -> doMatch’ st f []) (tails string)
   3:     where
   4:       doMatch’ state [] soFar = doAccept machine st soFar
   5:       doMatch’ state (s:str) soFar = 
   6:           case findTransition machine s state of
   7:             Nothing -> doAccept machine state soFar
   8:             Just (from, to, val) -> case doMatch’ to str (soFar ++ [s]) of
   9:                                       (False,_) -> case canAccept machine to of
  10:                                                     True -> (True, soFar ++ [s])
  11:                                                     False -> doMatch’ to str (soFar ++ [s])
  12:                                       (True,res) -> (True,res)

 

This creates the list of matches and uses the passed in function to determine how to filter to either the shortest or longest match.

For short or long matches I pass in one of these two functions:

   1: — Get the shortest match
   2: shortest matches = case  filter (\s->fst s) (sort matches) of
   3:                      [] -> (False,”")
   4:                      ms -> head ms
   5:  
   6: — Get the longest match
   7: longest matches = last.sort $ matches

 

I created aliases for the functions to make it more handy:

   1: (=~) = greedyMatch
   2: (=~?) = shortMatch

 

And then the final result:

   1: SimpleRegex> “hiphiphiphorray” =~? “hip(hip)
   2: (True,”hip”)
   3:  
   4: SimpleRegex> “hiphiphiphorray” =~ “hip(hip)
   5: (True,”hiphiphip”)

 

 

I attached a zip of all the files for this project.

Enjoy!

Author: Matt Categories: Haskell, Regular Expression Tags:

Writing a Regular Expression parser in Haskell: Part 3

March 11th, 2008

The third module in the simple regular expression parser is called: NFAtoDFA.  Which as you might have guessed, takes the NFA that resulted from the first module and converts it into a DFA.  The structure that the DFA uses is the same that the NFA uses since they are both finite state machines.

 

Converting an NFA to a DFA requires mapping sets of nodes in the NFA to a single node in the DFA.  Many nodes in a NFA will correspond to one node in the DFA.  Making this change requires updating transitions to point to and from sets of nodes.  To manage this transformation I create a state monad using the following context:

 

   1: — The state which we pass to build the DFA
   2: data ConvertContext = ConvertContext { nfa :: FiniteMachine,
   3:                                        trans :: [Transition],
   4:                                        setMap :: Map.Map (Set Node) Integer,
   5:                                        setStack :: [Set Node],
   6:                                        begin :: Node,
   7:                                        accept :: Set Node,
   8:                                        nextNode :: Node
   9:                                      } deriving (Show, Eq)
  10: type ConvertState a = State ConvertContext a

 

Most of the code in this module is just managing this context and updating it according to two operation:

  1. Epsilon Closure
  2. Set Move

These are explained in more detail in this article.

Basically, epsilon closure is the process of taking a set of initial nodes and returning a new set of all nodes you can traverse to purely on epsilon transitions.  To help with this I created some smaller methods to build up to an epsilon closure.

First are a couple methods (findToNodes and closure):

   1: closure trans value nodes oldSet = Set.union (findToNodes trans value nodes) oldSet
   2:  
   3: — Search the table of transitions to find all nodes you can reach given an initial set of nodes
   4: findToNodes trans value fromNodes = foldr match Set.empty trans
   5:     where 
   6:       match (from, to, val) nodes
   7:           | (from == fromNodes) && (val == value) = Set.insert to nodes
   8:           | otherwise = nodes

findToNodes searches a transition table for all nodes which go from any node in fromNodes on value.  It will builds up a set with all the to  nodes that match. 

closure wraps findToNodes to let us easily union together an initial set and the nodes we can reach from that set.

With this in hand we can write clearly a epsilon closure function:

   1: — Given an initial set of nodes, find the set of all nodes you can reach by taking 
   2: — transitions on epsilon only
   3: epsilonClosure trans nodes = foldUntilRepeat Set.union Set.empty $
   4:                              iterate (Set.fold (closure trans epsilon) Set.empty) nodes
   5:  
   6:  

 

This function takes full advantage of the lazy nature of Haskell.  It repeats the closure on epsilon over and over and streams its results into our function foldUntilRepeat.  This method does what it says, it will fold the values that are streamed in until it sees the same value twice. 

 

The set move is just combination of what you have already seen:

   1: — Given a starting set of nodes the set of all nodes that you can reach on a given value
   2: — This includes epislonClosure on the desitination nodes
   3: moveClosure trans value nodes = epsilonClosure trans $ 
   4:                                 Set.fold (closure trans value) Set.empty nodes
   5:  

 

With these functions in hand, this module just becomes calling them and updating the context until we have no more nodes in the NFA to process.

 

In the next installment I will discuss using the output of this modules to match a regex against a string.

Also, once again the code is attached.

Author: Matt Categories: Haskell, Regular Expression Tags:

Writing a Regular Expression parser in Haskell: Part 2

March 11th, 2008

The first module in my simple regular expression parse is called RegexToNFA.  This module exposes the types that make up a finite state machine and also the functions to convert a regular expression string into a finite state machine.

My structure for a FSM follows closely from the mathematical definition:

   1: data FiniteMachine = FiniteMachine{  table :: [Transition],
   2:                                      alphabet :: Set Char,
   3:                                      start :: Node,
   4:                                      final :: Set Node
   5:  
   6:  
   7: — NFA node
   8: type Node = Integer
   9:  
  10: — The value for an edge in a NFA
  11: type TransitionValue = Maybe Char
  12:  
  13: — A transition in a NFA is a tuple of
  14: — StartNode , DestinationNode, Value to transition on
  15: type Transition = (Node,Node,TransitionValue)
  16:  
  17: — The value of the edge in the NFA is a Maybe Char 
  18: — Where Nothing is the epsilon transition
  19: — therefore lets just rename Nothing to epsilon

 

I have the value which you transition on as a Maybe Char (which I alias as TransitionValue).  This allowed me to define epsilon as Nothing data constructor. 

 

With this structure defined my goal now is to convert a regular expression pattern such as: (a|b)* into a FiniteMachine.  In order to do this there is a lot of state that I need to keep track of which naturally leads to the use of the State monad.  To do this I set up a structure for what data I want to be kept track of and then create a state monad using that structure:

   1: — The state that gets passed around which we used to build up the NFA
   2: data ParseContext = Context 
   3:                     {
   4:                       nodeList :: [Node],
   5:                       transitions :: [Transition],
   6:                       operators :: OperatorList,
   7:                       nextNode :: Node,
   8:                       values :: Set Char
   9:                     } deriving (Show, Eq)
  10:  
  11: — Alias the State data constructor with a more friendly name
  12: type RegexParseState a = State ParseContext a

 

This structure is passed between functions to allow them to see the current state of the parsing and create a new state.  I define many functions, each which deal with a piece of the puzzle of converting the input string into a FSM.  I am not going to address them all but I will point out some which is note worth:

convertToNFA – This is the top level function, it is exposed externally and lets you convert a regex to a NFA.

processOperator – This function determines when we should execute an operator given its precedence.  We assign each operator a precedence which lets us determine when we should execute an operator.  For example in the expression a|b*, we want to execute star before we execute union. 

 

Last but not least are the methods which execute the operators.  For example, there is one called doConcat, which performs the concatenation of two values in the regular expression. doConcat isn’t pretty, since its doing the dirty work of examining the state and create a new state to reflect a partially completed FSM.

   1: — Execute the concat operator
   2: doConcat :: RegexParseState ()
   3: doConcat = do
   4:   st <- get
   5:   let nodes = nodeList st
   6:       newNodes = (nodes !! 0) : (nodes !! 3) : (drop 4  nodes)
   7:       newTransitions = transitions st ++ [(nodes !! 2, nodes !! 1, epsilon)]
   8:       newOperators = tail $ operators st
   9:   put $ st { nodeList = newNodes,
  10:              transitions = newTransitions ,
  11:              operators = newOperators}

 

With all this in place, lets finally see what this module actually outputs.

> convertToNFA “a|(bc)*”

FiniteMachine {table =
[(4,5,Just 'c'),(2,3,Just 'b'),(0,1,Just'a'),
(3,4,Nothing),(6,2,Nothing),(6,0,Nothing),
(1,7,Nothing),(5,7,Nothing),(8,6,Nothing),
(8,7,Nothing),(7,9,Nothing),(9,8,Nothing)],
alphabet = fromList “abc”,
start = 8,
final = fromList [9]}

If you examine the table list in the output, you will see all the transitions for the NFA that accepts “a|(bc)*” and that the start state is node 8 and the accept state is node 9. 

I uploaded the RegexToNFA.hs file for your examination.  I tried to comment it a good amount and I feel it should be pretty easy to read and understand.

 

In the next part I will talk about the next modules: NFAtoDFA

Author: Matt Categories: Haskell, Regular Expression Tags:

Writing a Regular Expression parser in Haskell: Part 1

March 11th, 2008

A few weeks ago I read this article about writing a simple regular expression parser.  That article does a really good job of explaining the theory behind regular expression.  It then goes step by step into how to write a program (he uses C++) to parse a regular expression, convert it into a NFA, convert that into a DFA and then use that DFA to match strings.

After reading that I decided to write my own simple regular expression parser using Haskell.  I saw it as a challenge to try to see how you deal with a more complex program in a pure functional language.  After a couple weeks ( Grand Theft Auto 4 kind of ruined my progress for a while ) I have some results.

I split the project into 3 modules.

  1. RegexToNFA – Provides functionality to parse a simple regular expression and return a NFA.
    1. This modules also define the FiniteMachine type which is a general structure for finite state automata.

  2. NFAtoDFA – Providers functionality to convert a NFA into a DFA.
    1. This module uses the same FiniteMachine type from RegexToNFA

  3. SimpleRegex – Provides the functionality to give take a regular expression and a string and return what it matches (if it matches anything).
    1. This modules uses RegexToNFA and sends its results to NFAtoDFA and then uses the resulting DFA to match against a string.

 

This is a very simple and limited regular expression parser.  It supports only union(|), concatenation, closure(*) and parenthesis.  In addition, I don’t preserve information after the NFA is created about the location of the parenthesis.  This means you can’t pull out sub-matches when a entire expression matches.

In my next three I will talk about each module and point out interesting parts of them.  There is nothing too complex but shows how to approach it in Haskell (making heavy use of the State monad).

 

If you want to see a much more complex and full featured regular expression parser written in Haskell take a look at this.

 

Click here to continue to Part 2.

Author: Matt Categories: Haskell, Regular Expression Tags: