Published: Wednesday, April 25, 2001
Stripping HTML Tags using Regular Expressions
By Scott Mitchell
To Learn More about Regular Expressions...
This article examines how to strip HTML tags using regular expressions. Regular expressions are nifty little
buggers that can be used to perform advanced string pattern matching and replacing. To learn more about
regular expressions, be sure to check out the Regular Expressions Article Index. Also, your regular expression
questions can be answered at the Regular Expressions Forum at ASPMessageboard.com.
Have you ever wanted to strip all of the HTML tags from a string? There are many reasons you may want to
do this. For example, if you provide a feature on your site where a user can have the contents of a Web page
emailed to them, you may wish to strip all of the HTML tags from the particular article for those users whose email
client does not support HTML-formatted email.
A previous article on 4Guys by Abd Shomad, Stripping HTML Tags, provided a function
that accomplished this task. However, this function used a smattering of VBScript's various string functions:
Left, InStr, and Mid. While this is efficient and straightforward, personally
I find the approach both messy and incredibly hard to read. Additionally, Abd's function was incomplete in that
if the input contained an unmatched less-than sign, the function entered an infinite loop!
This new function utilizes regular expressions, making for a much cleaner function. Specifically, non-greedy
matching is used. To learn more about the non-greedy quantifiers (such as .*? and .+?), be sure to read Picking
Out Delimited Text with Regular Expressions and the FAQ How can I count the number of words that appear in a
string? Non-greedy matching requires version 5.5 (or greater) of the server-side scripting
engines. To determine what server-side scripting language version you are using on your Web site, check out:
Determining the Server-Side Scripting Language and Version.
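To see what non-greedy matching changes, here is a minimal sketch that runs the same replacement twice, once with a greedy pattern and once with its non-greedy counterpart. It assumes the code sits inside an ASP script block (so Response.Write is available) and that version 5.5 or later of the scripting engine is installed; the strSample value is made up for illustration.

```vbscript
Dim objRegExp, strSample
strSample = "<b>bold</b> and <i>italic</i>"

Set objRegExp = New RegExp
objRegExp.Global = True

'Greedy: .+ grabs as much as it can, so the single match runs
'from the first < all the way to the last > and nothing is left
objRegExp.Pattern = "<.+>"
Response.Write "[" & objRegExp.Replace(strSample, "") & "]<br>"
'the text between the brackets is empty

'Non-greedy: .+? stops at the first > it reaches, so each tag
'is matched (and removed) on its own
objRegExp.Pattern = "<.+?>"
Response.Write "[" & objRegExp.Replace(strSample, "") & "]<br>"
'the text between the brackets is: bold and italic

Set objRegExp = Nothing
```

The greedy version swallows the text between the tags along with the tags themselves, which is why tag stripping needs the non-greedy form.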
Phew! Now that we've got that out of the way, let's look at our stripHTML function. This function
accepts a single string input (the string whose HTML tags are to be stripped). The regular expression pattern
<(.|\n)+?> matches each < followed by one or more characters (line breaks included) up to the nearest >.
The Replace method of the regular expression object is then used to replace all
matches with an empty string (""). Finally, all remaining < and > signs are replaced with
their respective HTML encoded forms: &lt; and &gt;. The stripHTML function
is shown below:
Function stripHTML(strHTML)
    'Strips the HTML tags from strHTML

    Dim objRegExp, strOutput
    Set objRegExp = New RegExp

    objRegExp.IgnoreCase = True
    objRegExp.Global = True
    objRegExp.Pattern = "<(.|\n)+?>"

    'Replace all HTML tag matches with the empty string
    strOutput = objRegExp.Replace(strHTML, "")

    'Replace all remaining < and > with &lt; and &gt;
    strOutput = Replace(strOutput, "<", "&lt;")
    strOutput = Replace(strOutput, ">", "&gt;")

    stripHTML = strOutput    'Return the value of strOutput

    Set objRegExp = Nothing
End Function
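As a quick illustration, the function might be called as in the sketch below. It assumes stripHTML is defined on the same ASP page; the strSingle and strMulti variables and their contents are hypothetical sample inputs, not part of the original article.

```vbscript
Dim strSingle, strMulti

strSingle = "<p>Hello, <b>World</b>!</p>"

'A tag whose attributes continue on a second line; the \n alternative
'in the pattern is what lets the match cross the line break
strMulti = "<a" & vbLf & "href=""faq.asp"">FAQ</a>"

Response.Write stripHTML(strSingle) & "<br>"    'Hello, World!
Response.Write stripHTML(strMulti) & "<br>"     'FAQ
```

The second call shows why the pattern uses (.|\n) rather than a bare dot: the dot alone does not match a line feed, so a tag split across lines would otherwise be missed.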
Pretty simple, really, eh? The regular expression pattern and the Replace method do all the hard work
for you - no need to do any messy string operations. If you would rather not use regular expressions, you
can use the split and join functions in this clever approach. (To learn more about
join and split, be sure to read: Parsing with join and split.) The function below was sent in
by alert 4Guys reader Lonnie W. Kraemer:
Function stripHTML(strHTML)
    'Strips the HTML tags from strHTML using split and join

    'Ensure that strHTML contains something
    If len(strHTML) = 0 then
        stripHTML = strHTML
        Exit Function
    End If

    dim arysplit, i, j, strOutput

    arysplit = split(strHTML, "<")

    'If strHTML does not begin with a <, the first element is plain
    'text, so start iterating from the 2nd array position
    if len(arysplit(0)) > 0 then j = 1 else j = 0

    'Loop through each element of the array
    for i = j to ubound(arysplit)
        'Do we find a matching > sign?
        if instr(arysplit(i), ">") then
            'If so, snip out all the text between the start of the string
            'and the > sign
            arysplit(i) = mid(arysplit(i), instr(arysplit(i), ">") + 1)
        else
            'Ah, the < was nonmatching, so put it back
            arysplit(i) = "<" & arysplit(i)
        end if
    next

    'Rejoin the array into a single string
    strOutput = join(arysplit, "")

    'Snip out the first < (only present when j = 0)
    strOutput = mid(strOutput, 2-j)

    'Convert < and > to &lt; and &gt;
    strOutput = replace(strOutput, ">", "&gt;")
    strOutput = replace(strOutput, "<", "&lt;")

    stripHTML = strOutput
End Function
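The heart of this approach is what Split does when it cuts the string at every < character. The short sketch below prints the pieces for one made-up input so you can see why snipping everything up to the first > in each piece removes the tags. It assumes it runs inside an ASP page, since it uses Response.Write and Server.HTMLEncode; the aryParts name is just for illustration.

```vbscript
Dim aryParts, i
aryParts = Split("Hi <b>there</b>", "<")

'aryParts now holds three pieces:
'  aryParts(0) = "Hi "       (plain text before the first <)
'  aryParts(1) = "b>there"   (tag name, the closing >, then text)
'  aryParts(2) = "/b>"       (closing tag, no text after it)
For i = 0 To UBound(aryParts)
    'HTMLEncode so the > is displayed by the browser rather than parsed
    Response.Write i & ": [" & Server.HTMLEncode(aryParts(i)) & "]<br>"
Next

'Keeping only what follows the first > in pieces 1 and 2
'gives "there" and "", so the joined result is "Hi there"
```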
Using either of these two approaches you can easily snip out the HTML tags in a string! Again, this technique
would be very useful, among other things, in sending users the contents of a Web page in text format.
There are two caveats when using either of these methods. First, if the script finds a < followed, at any
later point, by a >, it will assume that the two delimit an HTML tag. That is, if you have the HTML:
Did you know that 5 < 8 and 9 > 3?
the StripHTML function will assume that the < 8 and 9 > is an HTML tag and strip
it out, resulting in:
Did you know that 5 3?
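However, if there is no matching > for the <, this will not occur. What counts as "matching" differs between
the two methods, though. Given:
Did you know that 5 < 8<br>?
the split and join method leaves the < between the 5 and 8 alone, since no > appears before the next <, and
strips only the <br> tag. The regular expression method treats everything from the first < through the > that
closes <br> as a single tag, since its pattern lets any character (including another <) appear inside a tag,
so it strips < 8<br> entirely.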
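One possible way to soften this first caveat, offered here only as a sketch and not as part of the original functions, is to tighten the pattern so that a < counts as the start of a tag only when it is immediately followed by a letter, by a slash and a letter, or by an exclamation point. The function name stripHTMLStrict and its pattern are assumptions made for illustration.

```vbscript
Function stripHTMLStrict(strHTML)
    'Hypothetical variant of stripHTML: a < only starts a tag when it is
    'immediately followed by a letter, a slash plus a letter, or a !
    Dim objRegExp, strOutput
    Set objRegExp = New RegExp

    objRegExp.IgnoreCase = True
    objRegExp.Global = True
    objRegExp.Pattern = "<(/?[a-z]|!)[^>]*>"

    'Remove everything that looks like a tag under the stricter rule
    strOutput = objRegExp.Replace(strHTML, "")

    'Encode whatever < and > characters survived
    strOutput = Replace(strOutput, "<", "&lt;")
    strOutput = Replace(strOutput, ">", "&gt;")

    stripHTMLStrict = strOutput
    Set objRegExp = Nothing
End Function
```

With this variant, "Did you know that 5 < 8 and 9 > 3?" comes back with both signs kept (as &lt; and &gt;), and "Did you know that 5 < 8<br>?" loses only the <br> tag. It is still a heuristic: text such as a<b and c>d, with no space after the <, would still be taken for a tag, and it does nothing for the nested ASP script tags discussed next.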
Another caveat: if you have ASP script tags nested inside HTML tags like:
<img src="<%=Request("Value")%>">
then the regular expression method will leave the trailing ">, while the split and join
method will leave <img src="">. I plan
on writing up another article in the future that accounts for this known "bug."
Happy Programming!