Stripping HTML Tags using Regular Expressions
By Scott Mitchell
|To Learn More about Regular Expressions...|
|This article examines how to strip HTML tags using regular expressions. Regular expressions are nifty little buggers that can be used to perform advanced string pattern matching and replacing. To learn more about regular expressions be sure to check out the Regular Expressions Article Index. Also, your regular expressions questions can be answered at the Regular Expressions Forum at ASPMessageboard.com.|
Have you ever wanted to strip all of the HTML tags from a string? There are many reasons you may want to do this. For example, if you provide a feature on your site where a user can have the contents of a Web page emailed to them, you may wish to strip all of the HTML tags from the particular article for those users whose email client does not support HTML-formatted email.
A previous article on 4Guys by Abd Shomad, Stripping HTML Tags, provided a function that accomplished this task. However, that function relied on a smattering of VBScript's string functions, such as Mid. While this is efficient and straightforward, personally I find the approach both messy and incredibly hard to read. Additionally, Abd's function was incomplete: if the input contained an unmatched less-than sign, the function entered an infinite loop!
This new function utilizes regular expressions, making for a much cleaner function. Specifically, it relies on non-greedy matching. To learn more about the non-greedy regular expression (.*?), be sure to read Picking Out Delimited Text with Regular Expressions and the FAQ How can I count the number of words that appear in a string? Non-greedy matching requires version 5.5 (or greater) of the server-side scripting engines. To determine what server-side scripting language version you are using on your Web site, check out: Determining the Server-Side Scripting Language and Version.
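To make the greedy versus non-greedy distinction concrete, here is a small sketch. It assumes you run it as a .vbs file under Windows Script Host (cscript); in an ASP page you would swap WScript.Echo for Response.Write. The sample string and variable names are made up for illustration.

```vbscript
Option Explicit

Dim objRegExp, strSample
Set objRegExp = New RegExp
objRegExp.Global = True    ' replace every match, not just the first

strSample = "<b>bold</b> and <i>italic</i>"

' Greedy: .* grabs as much as it can, so one match runs from the
' first < to the last > and the entire string is removed.
objRegExp.Pattern = "<.*>"
WScript.Echo "[" & objRegExp.Replace(strSample, "") & "]"
' Output: []

' Non-greedy: .*? stops at the nearest >, so each tag is matched on its own.
objRegExp.Pattern = "<.*?>"
WScript.Echo "[" & objRegExp.Replace(strSample, "") & "]"
' Output: [bold and italic]
```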
Phew! Now that we've got that out of the way, let's look at our StripHTML function. This function accepts a single string input (the string whose HTML tags are to be stripped). The regular expression pattern <(.|\n)+?> matches each run of text that starts with a <, ends at the nearest following >, and has at least one character in between; the (.|\n) alternation lets a tag span multiple lines. The Replace method of the regular expression object is then used to replace all such matches with an empty string (""). Finally, all remaining < and > signs are replaced with their respective HTML-encoded forms, &lt; and &gt;.
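The steps just described can be sketched in VBScript roughly as follows. Treat this as a reconstruction from the description above rather than the article's original listing; the local variable names are assumptions, and WScript.Echo assumes Windows Script Host (use Response.Write in an ASP page).

```vbscript
Function StripHTML(strHTML)
    ' Strips HTML tags from strHTML and encodes any leftover < or >.
    Dim objRegExp, strOutput

    Set objRegExp = New RegExp
    objRegExp.Global = True              ' match every tag, not just the first
    objRegExp.Pattern = "<(.|\n)+?>"     ' non-greedy: requires VBScript 5.5+

    ' Replace every tag match with the empty string
    strOutput = objRegExp.Replace(strHTML, "")

    ' Encode whatever unmatched < and > characters remain
    strOutput = Replace(strOutput, "<", "&lt;")
    strOutput = Replace(strOutput, ">", "&gt;")

    StripHTML = strOutput
    Set objRegExp = Nothing
End Function

WScript.Echo StripHTML("<p>Hello, <b>world</b>!</p>")
' Output: Hello, world!
```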
Pretty simple, really, eh? The regular expression pattern and the Replace method do all the hard work for you - no need to do any messy string operations. If you would rather not use regular expressions, you can take a clever alternative approach built on the split and join functions. (To learn more about split be sure to read: Parsing with split.) The approach was sent in courtesy of alert 4Guys reader Lonnie W. Kraemer.
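Here is one way the split and join idea might look. This is my own minimal sketch of the technique, not Lonnie's original listing; the function name StripHTMLSplitJoin and its variable names are hypothetical. As before, WScript.Echo assumes Windows Script Host.

```vbscript
Function StripHTMLSplitJoin(strHTML)
    Dim aryPieces, i, intClose

    ' Split on "<": every piece after the first was preceded by a "<"
    aryPieces = Split(strHTML, "<")

    For i = 1 To UBound(aryPieces)
        intClose = InStr(aryPieces(i), ">")
        If intClose > 0 Then
            ' A closing ">" exists: drop everything up to and including it
            aryPieces(i) = Mid(aryPieces(i), intClose + 1)
        Else
            ' No closing ">": put the "<" back so the text is preserved
            aryPieces(i) = "<" & aryPieces(i)
        End If
    Next

    ' Glue the pieces back together and encode any leftover < or >
    StripHTMLSplitJoin = Replace(Replace(Join(aryPieces, ""), "<", "&lt;"), ">", "&gt;")
End Function

WScript.Echo StripHTMLSplitJoin("<p>Hello, <b>world</b>!</p>")
' Output: Hello, world!
```

Because each piece is inspected on its own, a < with no > before the next < is kept as text, which is where this sketch and the regular expression version can differ.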
Using either of these two approaches you can easily snip out the HTML tags in a string! Again, this technique would be very useful, among other things, in sending users the contents of a Web page in text format.
There are two caveats when using either of these methods. First, if the script finds a < followed at some later point by a >, it will assume that the two delimit an HTML tag. That is, if you have the HTML:
Did you know that 5 < 8 and 9 > 3?
the StripHTML function will assume that < 8 and 9 > is an HTML tag and strip it out, resulting in:
Did you know that 5 3?
However, if there is not a matching > for the <, this will not occur. That is:
Did you know that 5 < 8<br>?
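is handled differently by the two methods. The split and join method leaves the < between the 5 and 8 in place, since no > follows it before the next <, and strips only the <br> tag. The regular expression method is less forgiving: because . also matches a < character, the pattern matches from the first < all the way through the > of <br>, so < 8<br> is removed as a single "tag." Only when no > appears anywhere after a < does the regular expression leave that < alone.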
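If you want to see exactly what the regular expression does with stray less-than signs, the following sketch runs the pattern alone (without the final &lt; and &gt; encoding step) against three sample strings. The third string is my own addition for contrast. It assumes Windows Script Host; use Response.Write in an ASP page.

```vbscript
Option Explicit

Dim objRegExp
Set objRegExp = New RegExp
objRegExp.Global = True
objRegExp.Pattern = "<(.|\n)+?>"

' "< 8 and 9 >" is treated as one tag and removed.
WScript.Echo objRegExp.Replace("Did you know that 5 < 8 and 9 > 3?", "")
' Output: Did you know that 5  3?    (two spaces are left between 5 and 3)

' The match starts at the first < and ends at the > of <br>.
WScript.Echo objRegExp.Replace("Did you know that 5 < 8<br>?", "")
' Output: Did you know that 5 ?

' With no > anywhere after the <, nothing matches and the text is unchanged.
WScript.Echo objRegExp.Replace("Did you know that 5 < 8?", "")
' Output: Did you know that 5 < 8?
```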
Another caveat: if you have ASP script tags nested inside HTML tags (for example, an img tag whose src attribute is filled in by an ASP <%= ... %> block), attempting to strip the HTML tags will, using the regular expression method, leave the trailing "> while using the split and join method will leave: <img src="">. I plan on writing up another article in the future that accounts for this known "bug."
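The regular expression half of this caveat can be reproduced with a short sketch. The img tag and the variable name strImagePath inside it are hypothetical stand-ins, and the snippet assumes Windows Script Host (use Response.Write in an ASP page).

```vbscript
Option Explicit

Dim objRegExp, strHTML
Set objRegExp = New RegExp
objRegExp.Global = True
objRegExp.Pattern = "<(.|\n)+?>"

' Build an img tag whose src value is an ASP output block that writes
' the made-up variable strImagePath. The ASP delimiters are assembled
' from pieces so this code can also be pasted into an ASP page without
' a literal closing delimiter ending the script block early.
strHTML = "<img src=""<" & "%=strImagePath%" & ">"">"

' The non-greedy match ends at the first >, which belongs to the ASP
' block, so the real end of the img tag is left behind.
WScript.Echo objRegExp.Replace(strHTML, "")
' Output: ">
```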