Published: Wednesday, February 26, 2003

Regular Expressions in .NET

By Darren Neimke


The purpose of this article is to build upon the existing pool of regular expression articles by providing an overview of the new regular expression features found in .NET and to offer some guidelines as to when and how to use them. The reader of this article should be familiar with what regular expressions are and their base features.

For More on Regular Expressions...
If you are new to regular expressions, check out the Regular Expressions Article Index. A great beginner-level article on RegExs can be found at: An Introduction to Regular Expressions with VBScript. There is also a Regular Expressions FAQ Category over at ASPFAQs.com.

Introduction


Although I was familiar enough with the basic concepts of regular expressions to use them in VBScript and JScript, I noticed that I was struggling to understand many regular expressions I found in examples and documentation. Some of the new features such as lookaround and named capturing left me feeling more than a little overwhelmed. In addition, the documentation for regular expressions was scant and often included little or no sample code. Because of this, I initially steered away from using regular expressions in my .NET projects altogether.

In this article I hope to highlight some of these new features and de-mystify them so that you won't find yourself in the position that I did.

Matching: Groups and Named Captures


From previous regular expression authoring you will likely be familiar with the concept of referencing parenthesized captures via the $1...$N notation - these are referred to as backreferences. To demonstrate this, consider the following VB.NET sample:

Dim userName As String = "Neimke, Darren"
Dim re As New Regex( "(\w+),\s(\w+)" )
userName = re.Replace( userName, "$2 $1" )
Response.Write( userName )

The above pattern matches two words separated by a comma and a space, captures the surname and the firstname of a user and formats them in firstname, surname order. The result is that the value "Darren Neimke" would be displayed in the browser.

In the Replace statement the $N notation refers to the Nth pair of parentheses (the Nth capture). An important point to note is that, in .NET, the zeroth element ($0) refers to the entire matched text - "Neimke, Darren" in the case of the above example.
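To see the difference between $0 and the numbered captures, here is a minimal sketch. It assumes a console application (so it writes with Console.WriteLine rather than Response.Write), and the module name is just an illustration:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module BackreferenceDemo
    Sub Main()
        Dim userName As String = "Neimke, Darren"
        Dim re As New Regex("(\w+),\s(\w+)")

        ' $1 and $2 refer to the first and second parenthesized captures.
        Console.WriteLine(re.Replace(userName, "$2 $1"))   ' Darren Neimke

        ' $0 refers to the entire matched text.
        Console.WriteLine(re.Replace(userName, "[$0]"))    ' [Neimke, Darren]
    End Sub
End Module
```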

The Regex class now offers some convenient shared (static) members that allow simple statements to be in-lined, thus reducing the need for unnecessarily bulky code structures such as the one shown above. The useful static members are: IsMatch, Match, Matches, Replace and Split. Using this syntax allows for the previous code to be reduced to:

Dim userName As String = "Neimke, Darren"
userName = Regex.Replace( userName, "(\w+),\s(\w+)", "$2 $1" )

The benefit of the reduced code can be seen in another example, which uses IsMatch() to ensure that a string is a decimal number before executing some code. The pattern is anchored with ^ and $ so that the whole string must match, and the fractional part is optional:

If Regex.IsMatch( userInputString, "^\d+(\.\d+)?$" ) Then
    ' perform some conversion and math operations here
End If
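The other shared members follow the same in-line style. The following is a small sketch of Split, Match and Matches; the sample strings are made up for illustration, and a console application is assumed:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StaticMembersDemo
    Sub Main()
        ' Split on commas, ignoring any whitespace around them.
        Dim parts() As String = Regex.Split("apples, pears ,plums", "\s*,\s*")
        Console.WriteLine(String.Join("|", parts))                       ' apples|pears|plums

        ' Match returns the first match only.
        Console.WriteLine(Regex.Match("Order 66 shipped", "\d+").Value)  ' 66

        ' Matches returns every non-overlapping match.
        Console.WriteLine(Regex.Matches("a1b22c333", "\d+").Count)       ' 3
    End Sub
End Module
```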

Prior to .NET, a regular expression Match object exposed its captures as SubMatches. This has remained the same in .NET, although they are now referred to as Groups. Groups is a collection property of a Match object and each captured group can be accessed via its index (remembering that index 0 refers to the entire match), like so:

Dim userName As String = "Neimke, Darren"
Response.Write( Regex.Match( userName, "(\w+),\s(\w+)" ).Groups(2).ToString() )

This would display the text "Darren" as it is the captured Group at index 2.
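To make the indexing concrete, this sketch walks the whole Groups collection for the same match. It assumes a console application, and the module name is only illustrative:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupsDemo
    Sub Main()
        Dim m As Match = Regex.Match("Neimke, Darren", "(\w+),\s(\w+)")

        If m.Success Then
            ' Index 0 is the entire match; 1 and 2 are the parenthesized captures.
            For i As Integer = 0 To m.Groups.Count - 1
                Console.WriteLine("Groups({0}) = {1}", i, m.Groups(i).Value)
            Next
        End If
        ' Groups(0) = Neimke, Darren
        ' Groups(1) = Neimke
        ' Groups(2) = Darren
    End Sub
End Module
```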

Named Captures


Additionally, groups can be assigned names via the new (?<nameOfGroup>...) or (?'nameOfGroup'...) syntax. For consistency with other flavors of regular expressions - such as Perl - I prefer the first syntax and it is the one that is most commonly used. Assigning names to groups helps to make your code more self-describing and can lead to improved maintainability. Here's an example of naming the two captures:

Dim userName As String = "Neimke, Darren"
Dim pattern As String = "(?<surname>\w+),\s(?<firstname>\w+)"
Response.Write( Regex.Match( userName, pattern ).Groups("firstname").ToString() )

Displays "Darren".
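Named groups can also be referenced from a replacement string with the ${name} syntax, which reads more clearly than $1 and $2. Here is a brief sketch, again assuming a console application:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NamedCaptureDemo
    Sub Main()
        Dim pattern As String = "(?<surname>\w+),\s(?<firstname>\w+)"

        ' ${name} in the replacement string refers to the named group.
        Dim result As String = _
            Regex.Replace("Neimke, Darren", pattern, "${firstname} ${surname}")

        Console.WriteLine(result)   ' Darren Neimke
    End Sub
End Module
```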

Non-Capturing


While captures provide a lot of power, they come at a performance cost. With regular expressions in VBScript and JScript, capturing occurred whenever you used parentheses in a regular expression pattern. Sometimes, though, you need to use parentheses, but you don't need capturing. For example, if you wanted to match either "Let's go this way" or "Let's go that way" you could use the following regular expression:

Let\'s go th(is|at) way

The parentheses with the pipe indicate an option. The pattern matches either "is" or "at" after the "th". Unfortunately, this regular expression incurs an unneeded performance hit because the captured text (either "is" or "at") is remembered via a backreference.

Fortunately, .NET regular expressions provide the (?:...) syntax, which allows for grouping to be done without incurring the performance hit of captured text being "remembered" as a backreference. Using this syntax, the above regular expression could be changed to:

Let\'s go th(?:is|at) way

That pattern would match either:

  • "Let's go this way"
  • "Let's go that way"

But it would contain only one group, referenced as Groups(0), which holds the entire match. Skipping unneeded captures reduces the work the engine has to do, which can add up when complex patterns are applied to even moderately large bodies of text.
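The difference is easy to observe by comparing Groups.Count for the two patterns. This is a minimal sketch that assumes a console application; it shows the group counts only and does not measure performance:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NonCapturingDemo
    Sub Main()
        Dim input As String = "Let's go that way"

        Dim capturing As Match = Regex.Match(input, "Let's go th(is|at) way")
        Dim nonCapturing As Match = Regex.Match(input, "Let's go th(?:is|at) way")

        ' The capturing version holds the whole match plus one captured group.
        Console.WriteLine(capturing.Groups.Count)      ' 2
        Console.WriteLine(capturing.Groups(1).Value)   ' at

        ' The non-capturing version holds only the whole match.
        Console.WriteLine(nonCapturing.Groups.Count)   ' 1
    End Sub
End Module
```

If a pattern contains many parentheses that are used only for grouping, the RegexOptions.ExplicitCapture option may be worth a look: it treats unnamed parentheses as non-capturing, so only named groups are captured.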

Lookaround


Lookaround is a feature that is partially implemented in JScript but not in VBScript. There are two directions of lookaround - lookahead and lookbehind - and two flavors of each direction - positive assertion and negative assertion. The syntax for each is:

  • (?=...) - Positive lookAHEAD
  • (?!...) - Negative lookAHEAD
  • (?<=...) - Positive lookBEHIND
  • (?<!...) - Negative lookBEHIND

Understanding look(ahead|behind) requires an understanding of the difference between matching text and matching position. To help with this understanding I should state first that lookaround assertions are non-consuming. To see what I mean, let's look at the following simple example.

pattern = "test"
text = "testing"

When the above pattern is applied to the text, the "context" of the parser ends up at a position in the text between the "t" and the "i" in the word testing. This is because the regular expression parser bumps along the string as it gets a match, like so:

  1. Start - ^testing
  2. Match "t" - t^esting
  3. Match "e" - te^sting
  4. Match "s" - tes^ting
  5. Match "t" - test^ing

Once the parser has consumed part of the text, that text is no longer available to the rest of the pattern. To understand where this causes difficulty, consider this: what if you needed to match the word "test", but only when it was contained in the word "tested" and not in any other possible combination such as "tester"? With lookahead you can simply assert that condition like so: (?=tested\b)test

This works because, with lookaround, the parser is not bumped along the string. This can be especially useful for finding a position in a document by combining a lookahead assertion with a lookbehind assertion. To demonstrate, let's say that we need to match the string "test" when it is contained within the string "protested" but not "detested". To do this you can use a negative lookbehind assertion on "de" and a positive lookahead assertion on "tested", like this: (?<!de)(?=tested\b)test

In other words, you are matching a position at which to start matching text. The above pattern would set the parser at the following position in the string "protested":

  1. Start - pro^tested
  2. Match "t" - prot^ested
  3. Match "e" - prote^sted
  4. Match "s" - protes^ted
  5. Match "t" - protest^ed
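The following sketch runs the combined pattern against a few sample words to show where it does and does not match. The word list is my own, and a console application is assumed:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LookaroundDemo
    Sub Main()
        Dim pattern As String = "(?<!de)(?=tested\b)test"
        Dim words() As String = {"protested", "detested", "protester", "tested"}

        For Each word As String In words
            Dim m As Match = Regex.Match(word, pattern)
            If m.Success Then
                ' Only "test" is consumed; the lookarounds match positions, not text.
                Console.WriteLine("{0}: matched ""{1}"" at index {2}", word, m.Value, m.Index)
            Else
                Console.WriteLine("{0}: no match", word)
            End If
        Next
        ' protested: matched "test" at index 3
        ' detested: no match
        ' protester: no match
        ' tested: matched "test" at index 0
    End Sub
End Module
```

Note that a bare "tested" also matches, because the lookbehind only rules out a preceding "de".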

Another good example of using lookaround would be to validate "special" password conditions such as: "Password must be between 8 and 20 characters, must contain at least 2 letter characters and at least 2 digit characters. It can only contain either letter or digit characters."

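For such a password constraint, the following expression would probably do quite nicely: ^(?=.*?\d.*?\d)(?=.*?[a-zA-Z].*?[a-zA-Z])[a-zA-Z\d]{8,20}$ Note that the letters are matched with [a-zA-Z] rather than \w, because \w also matches digits and the underscore and so would not enforce the letter rules.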
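Here is a quick sketch that checks a handful of candidate passwords against these rules. It assumes that "letter" means an unaccented A-Z or a-z character, so it uses [a-zA-Z] rather than \w (which would also accept digits and the underscore). The candidate values are invented for illustration:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PasswordDemo
    Sub Main()
        ' Two lookaheads assert "at least 2 digits" and "at least 2 letters"
        ' without consuming text; the final class enforces length and content.
        Dim pattern As String = _
            "^(?=.*?\d.*?\d)(?=.*?[a-zA-Z].*?[a-zA-Z])[a-zA-Z\d]{8,20}$"

        Dim candidates() As String = _
            {"abc12345", "12345678", "abcdefgh", "ab12", "pass_word12"}

        For Each candidate As String In candidates
            Console.WriteLine("{0}: {1}", candidate, Regex.IsMatch(candidate, pattern))
        Next
        ' abc12345: True
        ' 12345678: False      (no letters)
        ' abcdefgh: False      (no digits)
        ' ab12: False          (too short)
        ' pass_word12: False   (underscore is not a letter or digit)
    End Sub
End Module
```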

Readability and Maintainability


One of my personal favorite new features is the ability to have embedded comments in regular expressions. Most of us will have, at one time or another, come across a regular expression that looks somewhat like this:

Dim re As New Regex( "(?<=(#|@))(?=\w+)\w+\b", RegexOptions.Multiline )

If you are lucky you might find a comment that alludes to the purpose of the regular expression, but when the time comes to maintain the expression you are undoubtedly left with a sense of anxiety and, more often than not, a complete re-write is undertaken as opposed to some minor maintenance operation. .NET allows regular expression patterns to be authored with embedded comments via the RegexOptions.IgnorePatternWhitespace option and the (?#...) syntax embedded within each line of the pattern string. Be aware that, with this option set, an unescaped # in the pattern starts a comment that runs to the end of the line, so a literal # must be written as \#.

This allows for pseudo-code-like comments to be embedded in each line and has the following effect on readability:

Dim re As New Regex ( _
    "(?<=       (?# Start a positive lookBEHIND assertion ) " & _
    "(\#|@)     (?# Find a # or a @ symbol; the # is escaped because whitespace mode is on ) " & _
    ")          (?# End the lookBEHIND assertion ) " & _
    "(?=        (?# Start a positive lookAHEAD assertion ) " & _
    "   \w+     (?# Find at least one word character ) " & _
    ")          (?# End the lookAHEAD assertion ) " & _
    "\w+\b      (?# Match multiple word characters leading up to a word boundary )", _
    RegexOptions.Multiline Or RegexOptions.IgnorePatternWhitespace _
)
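With RegexOptions.IgnorePatternWhitespace set, the engine also accepts end-of-line comments that start with an unescaped # character, provided the pattern string contains real line breaks. The sketch below is a simplified variation on the pattern above (a character class inside a lookbehind, and no lookahead), not the author's original; the sample text is made up and a console application is assumed:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CommentedPatternDemo
    Sub Main()
        ' Each pattern line ends with a newline so that the # comment stops there.
        ' The literal # inside the character class is escaped to be safe.
        Dim pattern As String = _
            "(?<=[\#@])   # preceded by a # or a @ symbol" & Environment.NewLine & _
            "\w+\b        # word characters up to a word boundary"

        Dim re As New Regex(pattern, RegexOptions.IgnorePatternWhitespace)

        For Each m As Match In re.Matches("Contact @darren about #dotnet")
            Console.WriteLine(m.Value)
        Next
        ' darren
        ' dotnet
    End Sub
End Module
```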

Delegates


Finally, a really useful addition to the .NET Framework is that the Regex.Replace() method allows the use of a delegate as the "replacement" argument. To understand what I'm talking about, consider the following snippet:

Dim myString As String = Regex.Replace( "a true taste of the temperature", "t.*?e\b", "a" )

After the replace operation has occurred, the value of myString will be "a a a of a a" and it's fairly obvious what happened. Every time the regular expression parser found a match within the string it replaced it with the letter "a". That's all nice and easy if all you need to do is a straight replace, but what if you need to implement some sort of business logic in the check, or you need to "touch" the sub-matches in some way and re-build the replaced string?

A good enough example is converting all words within a body of text to proper case (i.e. first letter capitalized). To do this your first instincts might be to create a pattern like so: \b(\w)(\w+)?\b. You could then enumerate the matches, convert the first sub-match to its uppercase version, join the sub-matches and re-append them to a StringBuilder instance, like so:

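Dim re As New Regex( "\b(\w)(\w+)?\b" )
Dim sb As New System.Text.StringBuilder()
Dim mc As MatchCollection = re.Matches( bodyOfText )
Dim m As Match
For Each m In mc
   ' Upper-case the first letter and append the rest of the word unchanged
   sb.AppendFormat("{0}{1}", m.Groups(1).Value.ToUpper(), m.Groups(2).Value)
Next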

That would work fine if your string contained only word characters, but what if it looked like this: ~~~ This %%% is ### a chunk of text. After the operation you would end up with the string ThisIsAChunkOfText, because all of the characters that didn't participate in the matches (including the spaces) were dropped. There are ways around it, mostly by building bigger, more complex patterns and doing more string building inside the match collection iteration.

A more elegant solution is to wire up a MatchEvaluator delegate. You can think of a MatchEvaluator as an event handler that fires when an "OnMatch event" occurs. You provide the MatchEvaluator with a pointer (reference) to a handler function and that function will be called each time a match is encountered. The function must take a Match parameter as its single argument and must return a String back to the regular expression Replace method that invoked it. This method of replacement gives you the flexibility to do all sorts of operations transparently to the Replace method itself, and because it is all handled within the Replace method call, you are not left with having to re-build a string as in the previous example.

A demonstration is in order - let's re-write our previous failed attempt at converting a string to proper case using delegates:

Sub Page_Load(sender As Object, e As EventArgs)
    Dim myDelegate As New MatchEvaluator( AddressOf MatchHandler )
    Dim bodyOfText As String = _
        "~~~ This %%% is ### a chunk of text."
        
    Dim pattern As String = "\b(\w)(\w+)?\b"
    Dim re As New Regex( _
        pattern, RegexOptions.Multiline Or _
        RegexOptions.IgnoreCase _
    )
    Dim newString As String = re.Replace(bodyOfText, myDelegate)
        
    ' Writes the original text, a rule, then "~~~ This %%% Is ### A Chunk Of Text."
    Response.Write( bodyOfText & "<hr>" & newString )
End Sub

' Called once per match; the returned string replaces the matched text.
Private Function MatchHandler( ByVal m As Match ) As String
    Return m.Groups(1).Value.ToUpper() & m.Groups(2).Value
End Function

As you can see, the separation is much cleaner and having the replacement logic handled in a separate handler method allows you to implement very complicated operations without affecting readability, maintainability or - and most importantly - data integrity as a result of missing data in a string re-building operation.
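The handler is not limited to string manipulation. As a hedged illustration of putting "business logic" into the replacement, this sketch doubles every number found in a piece of text. The handler name DoubleNumber and the sample text are hypothetical, and a console application is assumed:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module EvaluatorDemo
    ' Hypothetical handler: parses the matched digits and returns double the value.
    Function DoubleNumber(ByVal m As Match) As String
        Return (Integer.Parse(m.Value) * 2).ToString()
    End Function

    Sub Main()
        Dim input As String = "3 apples and 12 pears"

        ' The shared Replace overload accepts a MatchEvaluator in place of
        ' a replacement string; text outside the matches is left untouched.
        Dim result As String = _
            Regex.Replace(input, "\d+", New MatchEvaluator(AddressOf DoubleNumber))

        Console.WriteLine(result)   ' 6 apples and 24 pears
    End Sub
End Module
```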

Conclusion


Novice programmers often tend to rely heavily on inelegant, unwieldy, or slow solutions that focus heavily on string handling operations; programmers with a stronger command of their languages are more commonly turning to regular expressions to manage and manipulate chunks of text.

The .NET flavor of regular expressions allows regular expressions to be written in a more efficient and maintainable manner. While learning and mastering regular expressions takes time, the ultimate reward is an increased ability to provide accurate solutions efficiently.

There is a sample ASP.NET Web page that uses many of the advanced features discussed in this article that you can try out. Specifically, the sample Web page retrieves the HTML from a remote Web server and then prefixes a URL to all hyperlinks that do not start with http://.

Happy Programming!
