Translating from a Custom Markup to HTML using Regular Expressions
By Scott Mitchell
Introduction
If you run a messageboard-type site, you may want to allow users to enter certain HTML tags, but not others.
For example, you may want users to be able to use tags like <b> and
<u>, while any other HTML tag is displayed as literal text rather than rendered. That is, if a user were to
enter as a message:
Hello. I am <b>really</b> confused when it comes to using the <SCRIPT> tag. Help please!
You'd like it to display with "really" in bold and the <SCRIPT> tag shown as literal text:
Hello. I am really confused when it comes to using the <SCRIPT> tag. Help please!
Furthermore, you may wish to add custom tags. For example, if the user enters <highlight>some text</highlight>,
the resulting HTML would become <span style="background-color: yellow;">some text</span>,
which renders "some text" with a yellow background. In fact, jperkins007
asked this exact question via an ASPMessageboard.com
post. This article examines how to convert from a custom markup language to HTML using
regular expressions. (If you are new to regular expressions, I highly recommend that you first read
An Introduction to Regular Expression with VBScript.)
Finding Delimited Text
This problem essentially boils down to finding delimited text. If we have a custom tag like highlight,
we want to find all instances of <highlight>, followed by some text, followed by
</highlight>. Fortunately, there's an existing article on 4Guys that answers this
question: Picking Out Delimited Text with Regular Expressions. Simply put,
the following regular expression pattern is used:
startingDelimiter((.|\n)*?)endingDelimiter
The ((.|\n)*?) translates to: "search for the minimum number of characters between
startingDelimiter and endingDelimiter." In our highlight example, the starting and ending
delimiters would be, respectively: <highlight> and </highlight>.
The *? specifies non-greedy repetition matching, and is further discussed in
Picking Out Delimited Text with Regular Expressions. Note that
we use (.|\n) rather than just . because . matches any character except
the new-line character (\n). To let the delimited text span several lines, the pattern has to
accept either any character or the new-line character.
Note that this approach requires VBScript version 5.5 or later, since non-greedy pattern matching wasn't available before then. Again, read Picking Out Delimited Text with Regular Expressions for more information.
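To see the difference non-greedy matching makes, here is a minimal sketch that runs both the non-greedy and the greedy form of the pattern over the same text. It assumes the script is run under Windows Script Host (hence WScript.Echo; in an ASP page you would use Response.Write instead). The variable names and the sample text are made up for illustration.

```vbscript
Option Explicit

Dim oRegExp, sampleText, oMatch

' Sample input (hypothetical): two highlighted spans, the second one
' spanning two lines.
sampleText = "<highlight>one</highlight> and <highlight>two" & vbLf & _
             "lines</highlight>"

Set oRegExp = New RegExp
oRegExp.Global = True        ' find every occurrence, not just the first
oRegExp.IgnoreCase = True    ' treat <HIGHLIGHT> the same as <highlight>

' Non-greedy: each match stops at the nearest closing delimiter.
oRegExp.Pattern = "<highlight>((.|\n)*?)</highlight>"
For Each oMatch In oRegExp.Execute(sampleText)
    WScript.Echo "Non-greedy: " & oMatch.SubMatches(0)
Next

' Greedy: the match runs on to the LAST closing delimiter in the text.
oRegExp.Pattern = "<highlight>((.|\n)*)</highlight>"
For Each oMatch In oRegExp.Execute(sampleText)
    WScript.Echo "Greedy: " & oMatch.SubMatches(0)
Next
```

The non-greedy pattern should report two separate matches, the second one spanning the line break. The greedy pattern should report a single match that swallows everything from the first <highlight> to the last </highlight>.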
Replacing Custom Markup Tags with Valid HTML Tags
Now that we know how to find the custom markup tags using regular expressions, we need to work out how to
replace such tags with valid HTML tags. Fortunately, the regular expression object contains a Replace
method that allows us to search a string for a particular regular expression and replace what it finds with
some string. So, to search for uses of our custom highlight tag, we first create our regular expression object
and set its pattern:
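A minimal sketch of that setup follows. The name oRegExp matches the object used in the Replace call later in this article. Setting Global and IgnoreCase is an assumption here: it makes the pattern apply to every occurrence in the message and accept tags typed in any letter case.

```vbscript
Dim oRegExp
Set oRegExp = New RegExp

' The delimited-text pattern, with the highlight tag as both delimiters.
oRegExp.Pattern = "<highlight>((.|\n)*?)</highlight>"

oRegExp.Global = True       ' assumption: replace every occurrence
oRegExp.IgnoreCase = True   ' assumption: also accept <HIGHLIGHT>
```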
Great! Now, imagine that we have a variable called userEnteredText, which is a string containing
the message the user posted. At this point, we'd like to replace all instances of the matched pattern with
the proper HTML: <span style="background-color: yellow;">some text</span>.
We can use the Replace method to do this, back-referencing the text found between the
delimiters with the special sequence $1:
userEnteredText = oRegExp.Replace(userEnteredText, _
    "<span style=""background-color: yellow;"">$1</span>")
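The same Replace technique can also handle the scenario from the introduction, where <b> and <u> are allowed and every other tag is shown as literal text. One way to do it, sketched here under several assumptions, is to HTML-encode the entire message first and then turn the encoded forms of the allowed tags back into real tags. HtmlEncode and AllowBasicTags are hypothetical names for this sketch. In an ASP page, the built-in Server.HTMLEncode could take the place of the hand-written HtmlEncode, and Response.Write the place of WScript.Echo.

```vbscript
Option Explicit

' Hypothetical stand-in for Server.HTMLEncode so the sketch runs outside ASP.
Function HtmlEncode(ByVal text)
    text = Replace(text, "&", "&amp;")   ' must come first
    text = Replace(text, "<", "&lt;")
    text = Replace(text, ">", "&gt;")
    HtmlEncode = Replace(text, """", "&quot;")
End Function

' Hypothetical helper: encodes everything, then restores only <b> and <u>.
Function AllowBasicTags(ByVal text)
    Dim oRegExp
    Set oRegExp = New RegExp
    oRegExp.Global = True
    oRegExp.IgnoreCase = True
    ' Matches an encoded <b>, </b>, <u>, or </u>.
    ' $1 is the optional slash, $2 is the tag name.
    oRegExp.Pattern = "&lt;(/?)(b|u)&gt;"
    AllowBasicTags = oRegExp.Replace(HtmlEncode(text), "<$1$2>")
End Function

WScript.Echo AllowBasicTags("I am <b>really</b> confused by the <SCRIPT> tag.")
' Prints: I am <b>really</b> confused by the &lt;SCRIPT&gt; tag.
```

If you combine this with custom tags such as highlight, keep in mind that after encoding, the delimiters to search for are &lt;highlight&gt; and &lt;/highlight&gt; rather than <highlight> and </highlight>.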
Note that for each tag in our customized markup language, we need to reapply the above steps. If we have a small language of custom markup tags, this can be hardcoded. But if you want to allow for a large number of tags, or allow the custom markup language to change easily over time, your best bet is to use an approach I employed when working on my latest project, WebForums.NET.
WebForums.NET is an online forum system for ASP.NET Web sites. Among its many cool features is one that lets the administrator easily define, via a text file, which HTML tags and which custom tags to allow. For example, an administrator could have a text file like:
<highlight><CONTENTS></highlight> <span style="background-color: yellow;"><CONTENTS></span>
<important><CONTENTS></important> <b style="font-size: 24pt;"><CONTENTS></b>
That file would replace all highlight tags with the span markup we examined earlier, and all
important tags with bold tags in a larger font. To convert the user's post containing the custom
markup to standard HTML, I open the file and loop through its contents, applying the code
snippet shown above to each entry. I'll leave it at that, and leave the implementation as an exercise to the reader! :)
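As a starting point for that exercise, here is a rough sketch of the loop. It is not the WebForums.NET implementation. ApplyCustomMarkup and tagPairs are hypothetical names, and the pairs are hard-coded in an array instead of being read from a file, since the file's exact layout isn't specified here. The sketch also assumes the custom tags contain no characters that are special in regular expressions, and that it runs under Windows Script Host.

```vbscript
Option Explicit

' Hypothetical helper: tagPairs holds alternating entries of
' (custom markup, HTML replacement), each using <CONTENTS> as a placeholder.
Function ApplyCustomMarkup(ByVal text, ByVal tagPairs)
    Dim oRegExp, i, pattern, replacement
    Set oRegExp = New RegExp
    oRegExp.Global = True
    oRegExp.IgnoreCase = True

    For i = 0 To UBound(tagPairs) Step 2
        ' <CONTENTS> becomes the non-greedy capture group in the pattern...
        pattern = Replace(tagPairs(i), "<CONTENTS>", "((.|\n)*?)")
        ' ...and the $1 back-reference in the replacement.
        replacement = Replace(tagPairs(i + 1), "<CONTENTS>", "$1")

        oRegExp.Pattern = pattern
        text = oRegExp.Replace(text, replacement)
    Next

    ApplyCustomMarkup = text
End Function

Dim tagPairs
tagPairs = Array( _
    "<highlight><CONTENTS></highlight>", _
    "<span style=""background-color: yellow;""><CONTENTS></span>", _
    "<important><CONTENTS></important>", _
    "<b style=""font-size: 24pt;""><CONTENTS></b>")

WScript.Echo ApplyCustomMarkup( _
    "This is <important>very <highlight>urgent</highlight></important>.", _
    tagPairs)
```

Because each pair is applied in turn over the whole message, one custom tag nested inside a different one (as in the sample call) is converted too. Nesting a tag inside itself would not work with this simple pattern, since the non-greedy match stops at the first closing delimiter.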
Conclusion
This article demonstrated how to allow a user to post a message in a customized markup language, and have that
language translated to HTML. Of course, the same translation could be done via XML, using XSL as the translation language,
and I encourage you to explore that avenue as well. Note that this article has a lot of pre-reading
material, so hopefully you took the time to read An Introduction to Regular Expression with VBScript,
if needed, and Picking Out Delimited Text with Regular Expressions. Also,
you can find more information about regular expressions at the
Regular Expressions Article Index.
Happy Programming!




