Published: Wednesday, March 15, 2000
Utilizing Regular Expressions, Part 2
Read Part 1
In Part 1 we gave a high-level overview of using the RegExp object in VBScript and
covered the special regular expression characters that can be used for position matching.
In this part, we'll examine other special characters for regular expressions!
Character Classes
Regular expressions contain several special characters to help search for a particular
set of characters. The most versatile of these are the square brackets ([]).
Brackets allow the matching of a specific set of characters. For example, if you wanted
to determine if a string contained any vowels, you could use the following regular
expression:
[aeiou]
Note that the brackets match a single character. The characters listed within
the brackets define which characters are valid, but, again, only one character
is matched at a time. If you want to match a character that is not in the set,
place a caret (^) immediately after the opening bracket, before listing any characters.
For example, to match any single character that is not a vowel, the following regular
expression could be used:
[^aeiou]
Keep in mind that this matches one non-vowel character; on its own it does not establish
that the entire string is free of vowels.
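The difference between "contains a vowel," "contains a non-vowel," and "contains no vowels" can be sketched in VBScript along the following lines. This is a minimal illustration that assumes the RegExp object from Part 1; the helper name HasMatch and the sample strings are hypothetical, and matching is case-sensitive because IgnoreCase is left at its default.

```vbscript
' Hypothetical helper: True if strText contains at least one match for strPattern.
Function HasMatch(strPattern, strText)
    Dim objRegExp
    Set objRegExp = New RegExp
    objRegExp.Pattern = strPattern
    HasMatch = objRegExp.Test(strText)
End Function

Dim blnHasVowel, blnHasNonVowel, blnNoVowels

blnHasVowel    = HasMatch("[aeiou]", "rhythm")      ' False: no vowel anywhere
blnHasNonVowel = HasMatch("[^aeiou]", "audio")      ' True: "d" is not a vowel
blnNoVowels    = Not HasMatch("[aeiou]", "rhythm")  ' True: negate the vowel test
```

To check that a string has no vowels at all, negating the result of the [aeiou] test is usually simpler than relying on the negated set.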
Within brackets, the hyphen character can be used to denote a range of characters. For example, if you
wanted to determine if a string contained any uppercase alphabetical characters, you could use
the following regular expression:
[A-Z]
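As a rough sketch of a range in use, the snippet below collects every uppercase letter in a string. It assumes the RegExp object's Global property and Execute method; the variable names and the sample text are made up for illustration.

```vbscript
Dim objRegExp, objMatches, objMatch, strFound

Set objRegExp = New RegExp
objRegExp.Pattern = "[A-Z]"   ' any one character from A through Z
objRegExp.Global = True       ' find every match, not just the first

Set objMatches = objRegExp.Execute("Four Guys From Rolla")

strFound = ""
For Each objMatch In objMatches
    strFound = strFound & objMatch.Value
Next
' strFound is now "FGFR"
```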
To match any single character other than a new-line, use the period (.). For example,
4.uys would match strings like 4Guys, 45uys,
and 4zuys. To match a single "word character," use \w.
A word character is defined as any alphanumeric character or an underscore
(that is, [a-zA-Z_0-9]). \W is the inverse of \w,
matching any character that is not a word character (that is, [^a-zA-Z_0-9]).
As mentioned earlier in this article, \d matches any single digit, and
is synonymous with [0-9]. Its inverse, \D, matches any non-digit,
and is synonymous with [^0-9]. \s matches any whitespace
character, such as a space, a new-line character (which is represented as \n), a
carriage return (\r) or a tab (\t). \S, on the
other hand, matches any non-whitespace character.
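One way to see what each shorthand class matches is to replace its matches with a visible marker. The sketch below assumes the RegExp object's Replace method; the sample string and variable names are hypothetical.

```vbscript
Dim objRegExp, strInput, strDigits, strSpaces, strNonWord

Set objRegExp = New RegExp
objRegExp.Global = True       ' replace every match, not just the first

strInput = "Room 42" & vbTab & "is_free"

objRegExp.Pattern = "\d"      ' digits
strDigits = objRegExp.Replace(strInput, "#")
' strDigits is "Room ##" & vbTab & "is_free"

objRegExp.Pattern = "\s"      ' whitespace: the space and the tab
strSpaces = objRegExp.Replace(strInput, "_")
' strSpaces is "Room_42_is_free"

objRegExp.Pattern = "\W"      ' non-word characters; the underscore is a word character
strNonWord = objRegExp.Replace(strInput, "-")
' strNonWord is "Room-42-is_free"
```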
Character classes are really useful for matching complex patterns. For example, imagine
that we wanted to determine if a phone number was valid or not. If we required that phone
numbers be in the format (###) ###-####, we could use the following regular
expression:
^\(\d\d\d\) \d\d\d-\d\d\d\d$
Note the \( and \). These match a literal
left and right parenthesis, respectively. If you wish to match a character literally
when it also has special meaning (like a parenthesis, a period, a bracket, etc.), you must
prefix that character with a backslash.
This is useful in form validation (more on this a little later).
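A validation routine built around this pattern might look like the sketch below. The function name IsValidPhone and the sample numbers are hypothetical; only the pattern itself comes from the text above.

```vbscript
' Hypothetical helper: True only if strPhone is exactly in the form (###) ###-####.
Function IsValidPhone(strPhone)
    Dim objRegExp
    Set objRegExp = New RegExp
    ' \( and \) match literal parentheses; ^ and $ anchor the whole string
    objRegExp.Pattern = "^\(\d\d\d\) \d\d\d-\d\d\d\d$"
    IsValidPhone = objRegExp.Test(strPhone)
End Function

Dim blnGood, blnNoParens, blnTrailing

blnGood     = IsValidPhone("(555) 123-4567")         ' True
blnNoParens = IsValidPhone("555-123-4567")           ' False: area code lacks parentheses
blnTrailing = IsValidPhone("(555) 123-4567 ext. 9")  ' False: $ rejects trailing text
```

Because a backslash has no special meaning inside a VBScript string literal, the pattern can be written in code exactly as it appears in the text.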
Repetition
There are several special symbols that can be used to search for repeating substrings
or patterns. The curly braces ({n}) search for exactly n repetitions of
the character or group they follow. For example, to search for three consecutive digits, the
following regular expression could be used:
\d{3}
The curly braces can also accept a second parameter, like {n,p}. When using
two parameters with the curly braces, you are indicating that you are willing to accept
a certain range of repetitions: the first parameter is the lower bound, while the second
parameter is the upper bound. So, to match four to six vowels in succession, the following
regular expression could be used:
[aeiou]{4,6}
If the second parameter - the upper bound - is left off (but the comma is kept), the regular
expression searches for n or more occurrences. For example, if we wanted to match four or more
successive vowels, the regular expression would be adjusted to [aeiou]{4,}.
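The three forms of the curly braces can be compared with a short sketch like the one below. The helper name FirstMatch and the sample strings are hypothetical; the helper simply returns the text of the first match, or an empty string if there is none.

```vbscript
' Hypothetical helper: returns the text of the first match, or "" if nothing matches.
Function FirstMatch(strPattern, strText)
    Dim objRegExp, objMatches
    Set objRegExp = New RegExp
    objRegExp.Pattern = strPattern
    Set objMatches = objRegExp.Execute(strText)
    If objMatches.Count > 0 Then
        FirstMatch = objMatches(0).Value
    Else
        FirstMatch = ""
    End If
End Function

Dim strExact, strRange, strCapped, strOpen, strNone

strExact  = FirstMatch("\d{3}", "Order 12 of 3456")   ' "345": "12" is too short
strRange  = FirstMatch("[aeiou]{4,6}", "queueing")    ' "ueuei": five vowels in a row
strCapped = FirstMatch("[aeiou]{4,6}", "eeeeeeee")    ' "eeeeee": stops at the upper bound
strOpen   = FirstMatch("[aeiou]{4,}", "eeeeeeee")     ' "eeeeeeee": no upper bound
strNone   = FirstMatch("[aeiou]{4,}", "boot")         ' "": only two vowels in a row
```

The capped and open-ended results differ because repetition takes as many characters as its bounds allow.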
The question mark (?) matches zero or one occurrence, equivalent to
{0,1}. The asterisk (*) matches zero or more occurrences
({0,}) while the plus sign (+) matches one or more occurrences
({1,}).
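A few quick checks illustrate how the three shorthand symbols differ. This is only a sketch: the helper name IsMatch and the sample patterns are hypothetical, and the ^ and $ anchors from Part 1 are used so that the whole string must match.

```vbscript
' Hypothetical helper: True if strText contains a match for strPattern.
Function IsMatch(strPattern, strText)
    Dim objRegExp
    Set objRegExp = New RegExp
    objRegExp.Pattern = strPattern
    IsMatch = objRegExp.Test(strText)
End Function

Dim blnA, blnB, blnC, blnD, blnE, blnF

blnA = IsMatch("^colou?r$", "color")    ' True: zero occurrences of "u"
blnB = IsMatch("^colou?r$", "colour")   ' True: one occurrence of "u"
blnC = IsMatch("^colou?r$", "colouur")  ' False: ? allows at most one "u"
blnD = IsMatch("^\d+$", "2000")         ' True: one or more digits
blnE = IsMatch("^\d+$", "")             ' False: + requires at least one digit
blnF = IsMatch("^\d*$", "")             ' True: * accepts zero digits
```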
Repetition matching is another powerful facet of regular expressions. For example, in
an earlier example we demonstrated how to match a phone number. What if we wanted to
make the area code optional? We could adjust the regular expression to:
^(\(\d\d\d\) )?\d\d\d-\d\d\d\d$
Note that the parentheses around the area code group the entire \(\d\d\d\) (plus the trailing
space) so the regular expression parser knows what the ? special symbol applies to.
Of course, with the curly braces we could pretty up the regular expression a bit to:
^(\(\d{3}\) )?\d{3}-\d{4}$
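As a sanity check that the long and short forms of the optional-area-code pattern behave the same way, the two can be run side by side over a handful of inputs. The sample numbers and variable names below are hypothetical.

```vbscript
Dim objLong, objShort, arrSamples, strSample, blnAgree

Set objLong = New RegExp
objLong.Pattern = "^(\(\d\d\d\) )?\d\d\d-\d\d\d\d$"

Set objShort = New RegExp
objShort.Pattern = "^(\(\d{3}\) )?\d{3}-\d{4}$"

' The first two samples are accepted by both patterns, the last two by neither.
arrSamples = Array("(555) 123-4567", "123-4567", "(555)123-4567", "5551234567")

blnAgree = True
For Each strSample In arrSamples
    If objLong.Test(strSample) <> objShort.Test(strSample) Then
        blnAgree = False
    End If
Next
' blnAgree is still True: the two patterns give the same answer for every sample
```

Note that "(555)123-4567" is rejected because the optional group includes the space after the closing parenthesis.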
Read Part 3