Published: Wednesday, April 25, 2001

Stripping HTML Tags using Regular Expressions

By Scott Mitchell


To Learn More about Regular Expressions...
This article examines how to strip HTML tags using regular expressions. Regular expressions are nifty little buggers that can be used to perform advanced string pattern matching and replacing. To learn more about regular expressions be sure to check out the Regular Expressions Article Index. Also, your regular expressions questions can be answered at the Regular Expressions Forum at ASPMessageboard.com.


Have you ever wanted to strip all of the HTML tags from a string? There are many reasons you may want to do this. For example, if you provide a feature on your site where a user can have the contents of a Web page emailed to them, you may wish to strip all of the HTML tags from the particular article for those users whose email client does not support HTML-formatted email.

A previous article on 4Guys by Abd Shomad, Stripping HTML Tags, provided a function that accomplished this task. However, this function used a smattering of VBScript's various string functions: Left, InStr, and Mid. While this is efficient, personally I find the approach both messy and hard to read. Additionally, Abd's function was incomplete: if the string contained an unmatched less-than sign, the function entered an infinite loop!

This new function utilizes regular expressions, making for a much cleaner function. Specifically, it uses non-greedy matching. To learn more about non-greedy matching (.*? and .+?), be sure to read Picking Out Delimited Text with Regular Expressions and the FAQ How can I count the number of words that appear in a string? Non-greedy matching requires version 5.5 (or greater) of the server-side scripting engines. To determine what server-side scripting language version you are using on your Web site, check out: Determining the Server-Side Scripting Language and Version.
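To make the greedy versus non-greedy distinction concrete, here is a minimal sketch using the VBScript RegExp object. The sample string and the variable names are made up for illustration; the two patterns differ only by the trailing ? on the quantifier.

```vbscript
' Greedy vs. non-greedy matching with the VBScript RegExp object
' (needs scripting engine 5.5 or later, as noted above).
Dim objRegExp, strSample, strGreedy, strNonGreedy

strSample = "<b>bold</b> and <i>italic</i>"   ' hypothetical sample input

Set objRegExp = New RegExp
objRegExp.Global = True

' Greedy: .* grabs as much as it can, so the single match runs from
' the first < to the last > and the whole sample is removed.
objRegExp.Pattern = "<.*>"
strGreedy = objRegExp.Replace(strSample, "")      ' ""

' Non-greedy: .*? stops at the first > it can reach, so each tag is
' matched on its own and the text between the tags survives.
objRegExp.Pattern = "<.*?>"
strNonGreedy = objRegExp.Replace(strSample, "")   ' "bold and italic"

Set objRegExp = Nothing
```

The non-greedy form is the one you want for tag stripping, since the greedy form throws away the text along with the tags.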

Phew! Now that we've got that out of the way, let's look at our StripHTML function. This function accepts a string input (the string whose HTML tags are to be stripped). The regular expression pattern <(.|\n)+?> matches a <, then one or more characters of any kind (the \n alternative lets a tag span line breaks, since . on its own does not match a newline), and stops at the first > it reaches. The Replace method of the regular expression object is then used to replace all such matches with an empty string (""). Finally, all remaining < and > signs are replaced with their respective HTML encoded forms: &lt; and &gt;. The StripHTML function is shown below:

Function stripHTML(strHTML)
'Strips the HTML tags from strHTML

  Dim objRegExp, strOutput
  Set objRegExp = New Regexp

  objRegExp.IgnoreCase = True
  objRegExp.Global = True
  objRegExp.Pattern = "<(.|\n)+?>"

  'Replace all HTML tag matches with the empty string
  strOutput = objRegExp.Replace(strHTML, "")
  
  'Replace all < and > with &lt; and &gt;
  strOutput = Replace(strOutput, "<", "&lt;")
  strOutput = Replace(strOutput, ">", "&gt;")
  
  stripHTML = strOutput    'Return the value of strOutput

  Set objRegExp = Nothing
End Function

Pretty simple, really, eh? The regular expression pattern and the Replace method do all the hard work for you - no need to do any messy string operations. If you would rather not use regular expressions, you can use the split and join functions in this clever approach. (To learn more about join and split be sure to read: Parsing with join and split.) The function below was sent in courtesy of alert 4Guys reader Lonnie W. Kraemer:

Function stripHTML(strHTML)
'Strips the HTML tags from strHTML using split and join

  'Ensure that strHTML contains something
  If len(strHTML) = 0 then
    stripHTML = strHTML
    Exit Function
  End If

  dim arysplit, i, j, strOutput

  arysplit = split(strHTML, "<")
 
  'If there is text before the first <, it sits in arysplit(0) and
  'holds no tag, so start iterating from the 2nd array position
  if len(arysplit(0)) > 0 then j = 1 else j = 0

  'Loop through each instance of the array
  for i=j to ubound(arysplit)
     'Do we find a matching > sign?
     if instr(arysplit(i), ">") then
       'If so, snip out all the text between the start of the string
       'and the > sign
       arysplit(i) = mid(arysplit(i), instr(arysplit(i), ">") + 1)
     else
       'Ah, the < was nonmatching
       arysplit(i) = "<" & arysplit(i)
     end if
  next

  'Rejoin the array into a single string
  strOutput = join(arysplit, "")
  
  'If the string began with a < (j = 0), the loop put a stray <
  'in front of the output; snip it out
  strOutput = mid(strOutput, 2-j)
  
  'Convert < and > to &lt; and &gt;
  strOutput = replace(strOutput,">","&gt;")
  strOutput = replace(strOutput,"<","&lt;")

  stripHTML = strOutput
End Function

Using either of these two approaches you can easily snip out the HTML tags in a string! Again, this technique would be very useful, among other things, in sending users the contents of a Web page in text format.
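As a quick usage sketch, here is how a call might look. The function body is a condensed copy of the regular expression version above, repeated only so the snippet runs on its own; the sample markup and the names strPage and strPlain are made up for illustration.

```vbscript
' Condensed copy of the regular expression version shown earlier,
' repeated here only so this snippet is self-contained.
Function stripHTML(strHTML)
  Dim objRegExp, strOutput
  Set objRegExp = New RegExp
  objRegExp.Global = True
  objRegExp.Pattern = "<(.|\n)+?>"
  strOutput = objRegExp.Replace(strHTML, "")
  strOutput = Replace(strOutput, "<", "&lt;")
  strOutput = Replace(strOutput, ">", "&gt;")
  stripHTML = strOutput
  Set objRegExp = Nothing
End Function

Dim strPage, strPlain

' Hypothetical page fragment, spread over two lines
strPage = "<h1>Sale!</h1>" & vbCrLf & _
          "<p>Widgets are <b>half price</b> &amp; shipping is free.</p>"

strPlain = stripHTML(strPage)
' strPlain now holds:
'   Sale!
'   Widgets are half price &amp; shipping is free.
```

Note that only the tags are removed. Line breaks are kept, and character entities such as &amp; are left as they are, so a plain-text email would likely need a further decoding pass.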

There are two caveats when using either of these methods. First, if the script finds a < followed at some later point by a >, it will assume that these represent an HTML tag. That is, if you have the HTML:

  Did you know that 5 < 8 and 9 > 3?

the StripHTML function will assume that the < 8 and 9 > is an HTML tag and strip it out, resulting in:

  Did you know that 5  3?

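However, if no > appears anywhere after the <, this will not occur. That is:

  Did you know that 5 < 8?

will not have the < between the 5 and 8 stripped, since there is no > for it to pair with. Be careful when a real tag follows the stray <, though. Given:

  Did you know that 5 < 8<br>?

the split and join method strips only the <br> tag and keeps the < between the 5 and 8, but the regular expression method matches from that first < through the > that closes the <br>, so it strips < 8<br> as if it were a single tag.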
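The first caveat is easy to reproduce with the pattern on its own. The sketch below applies the same regular expression directly (without the surrounding function) to two sample sentences; the variable names are illustrative only.

```vbscript
Dim objRegExp, strBoth, strLoneLessThan

Set objRegExp = New RegExp
objRegExp.Global = True
objRegExp.Pattern = "<(.|\n)+?>"

' A < that is eventually followed by a > is treated as a tag
strBoth = objRegExp.Replace("Did you know that 5 < 8 and 9 > 3?", "")
' strBoth is "Did you know that 5  3?" (two spaces between 5 and 3)

' A < with no > anywhere after it cannot match, so it is left alone
strLoneLessThan = objRegExp.Replace("Did you know that 5 < 8?", "")
' strLoneLessThan is still "Did you know that 5 < 8?"

Set objRegExp = Nothing
```

If your content is likely to contain comparison signs like these, consider HTML encoding them (&lt; and &gt;) before the text ever reaches the stripping function.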

Another caveat: if you have ASP script tags nested inside HTML tags like:

<img src="<%=Request("Value")%>">

attempting to strip the HTML tags will, using the regular expression method, leave the trailing ">, while using the split and join method will leave: <img src="">. I plan on writing up another article in the future that accounts for this known "bug."
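Until then, one possible workaround (a rough sketch, not the fix promised above) is to remove any ASP script blocks in a first pass and strip the HTML tags in a second. The pre-pass pattern and the variable names are my own assumptions. Note that the <% and %> delimiters are built by concatenation, because a literal %> inside a string would end the script block early when this code runs in an ASP page.

```vbscript
Dim objRegExp, strSource, strNoScript, strResult

' Builds: <img src="<%=Request("Value")%>"> Logo
strSource = "<img src=""" & "<" & "%=Request(""Value"")%" & ">" & """> Logo"

Set objRegExp = New RegExp
objRegExp.Global = True

' Pass 1 (assumed pre-pass): remove ASP blocks, lazily, up to the
' first closing delimiter. The pattern is <%(.|\n)*?%>
objRegExp.Pattern = "<" & "%(.|\n)*?%" & ">"
strNoScript = objRegExp.Replace(strSource, "")
' strNoScript is: <img src=""> Logo

' Pass 2: the same tag pattern used by stripHTML
objRegExp.Pattern = "<(.|\n)+?>"
strResult = objRegExp.Replace(strNoScript, "")
' strResult is " Logo"

Set objRegExp = Nothing
```

This only handles script blocks that do not themselves contain the closing delimiter inside a string, so treat it as a starting point rather than a complete solution.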

Happy Programming!
