The XHTML Way
By Vlad Alexander
Introduction
If you're starting a new Web project or enhancing an existing Web site, you face the same dilemma: Which version of
HTML should I use? Many developers shy away from the latest Web standards because they think it's an either/or
choice and aren't ready to commit 100% to the new standards. This article will give you the background you need to
make an informed decision and show you how you can gradually transition to the latest Web standards by successfully
combining HTML 4 with the latest XHTML.
A Brief History
I don't like to look back at old technology, but putting XHTML into its proper context requires a quick trip down
memory lane. The timeline below shows key milestones in the evolution of HTML. With each milestone, there have
been significant changes to the standard. In 2000, a change to the standard caused a change in the name
(XHTML) and version number (1.0). Some saw this as the death of HTML and the birth of a new markup syntax. Others
(including myself) see this as just another milestone in the evolution of HTML.
HTML 4
HTML 4 is the markup syntax we are all familiar with. Along with new features like scripting, HTML 4 introduced
Cascading Style Sheets (CSS) and made it easier to write more accessible code for users with disabilities. Since
the language was very easy to write, a wave of WYSIWYG editors sprang up, permitting non-technical users to author
rich content for the first time. However, because the language was so easy to write, it also encouraged mistakes,
and - in their rush to imitate word processors - WYSIWYG editors generated markup that was considered "dirty."
The problem stems from the fact that HTML itself does not enforce any formatting or structuring rules. Add to this the fact that browsers will gleefully render sloppy and malformed HTML, and you have a recipe for disaster. Instead of tackling the problem at its source and making sure that the markup these editors generated was clean, developers used tools like HTML Tidy to clean up dirty markup after the fact.
XHTML 1.0
What was missing in HTML 4 was a sense of professionalism - a mechanism that would enforce the rules of the language
and prevent WYSIWYG authoring tools from generating bad code in the first place. XML,
a sister standard to HTML, provided this mechanism. If the syntax of an XML document is incorrect, for example if tags are
improperly nested or closing tags are missing, the document is not well formed and an XML parser will reject it.
When the rigorous rules of XML were applied to HTML, the result was a reformulation of HTML: XHTML 1.0.
(For more information on XML and its formatting rules, be sure to read the FAQ "What is XML?")
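As a rough illustration of the well-formedness rule, the sketch below feeds three fragments to a standard XML parser. It uses Python's built-in xml.etree.ElementTree module; the helper name is_well_formed and the sample fragments are made up for this example and are not part of any standard.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample fragments, chosen only to illustrate the rule.
good = "<p>Hello <b>world</b></p>"
bad_nesting = "<p>Hello <b>world</p></b>"   # tags closed in the wrong order
missing_close = "<p>Hello <b>world</b>"     # <p> is never closed

for fragment in (good, bad_nesting, missing_close):
    print(fragment, "->", is_well_formed(fragment))
```

An HTML 4 browser would typically render all three fragments without complaint; an XML parser accepts only the first.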
Confusingly, XHTML 1.0 came in three flavors, and you could specify which flavor of the language you were using by inserting a line (the doctype declaration) at the beginning of the document.
- "XHTML 1.0 Strict" excluded elements like <font> and <basefont>, which HTML 4 had already declared outdated ("deprecated"), and allowed formatting only through Cascading Style Sheets, whether external, embedded or inline.
- "XHTML 1.0 Transitional" was less strict and retained most of the formatting model of HTML, including the use of the <font> element.
- "XHTML 1.0 Frameset" was similar to XHTML 1.0 Transitional but also permitted the use of frames. (Frames are used to partition a browser's window into sections, with each section displaying content from a different Web page.)
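For reference, the line that selects a flavor is the doctype declaration. The sketch below collects the three XHTML 1.0 doctypes as published by the W3C and prepends one to a page body; the dictionary name, the with_doctype helper and the sample body are assumptions made for this illustration.

```python
# The three XHTML 1.0 doctype declarations, keyed by flavor.
XHTML_1_0_DOCTYPES = {
    "strict": (
        '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
        '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    ),
    "transitional": (
        '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
        '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
    ),
    "frameset": (
        '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" '
        '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">'
    ),
}


def with_doctype(flavor: str, page: str) -> str:
    """Prepend the doctype for the chosen flavor (hypothetical helper)."""
    return XHTML_1_0_DOCTYPES[flavor] + "\n" + page


# Hypothetical minimal page body, used only to show the result.
page = '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Demo</title></head><body><p>Hi</p></body></html>'
print(with_doctype("strict", page))
```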
One advantage of XHTML 1.0 was that it displayed pages in Web browsers much faster than HTML 4 pages, the difference being most apparent in very long documents. This was because XHTML 1.0 followed the rules of XML, so parsing Web pages became much easier and required fewer CPU resources. Also, browsers did not need to clean up the structure of the code before displaying the Web page, because Web pages written in XHTML 1.0 were well formed. Some WYSIWYG editors that natively generated HTML 4 were able to convert their code to XHTML 1.0 using clean-up tools like HTML Tidy.
XHTML 1.1
Apart from loading Web pages into browsers faster, most developers saw few other benefits to adopting XHTML 1.0.
However, XHTML 1.1 offers developers one very significant benefit – it cleanly separates data from formatting. It does
this by deprecating the style attribute and thus eliminating inline formatting. Instead, formatting is
permitted only through CSS rules in external or embedded style sheets, which content typically references through the class attribute.
For developers of medium to large Web sites, the benefits of separating data from formatting are huge. First, in its "raw" state data becomes immediately more available to a wide range of devices and applications. Second, separating data from formatting has significant advantages for Web design. For instance, if you have ever maintained a Web site with many contributing authors, you know that some can't tell the difference between Arial and Times Roman. Some like an 11 point font while others prefer putting everything in 14 point. And if you give a non-technical user a color-picker, you can be sure that no color on the palette will go unused. Since XHTML 1.1 does not permit random inline formatting of this type, but regulates presentation through external or embedded CSS, it is much easier to maintain the common look and feel of Web sites. Modifying the look and feel of entire Web pages or Web sites is also much simpler. Both can be achieved by making a few simple changes to one or more CSS files.
True, XHTML 1.1 requires a change in the way that Web pages are served, but the change is slight. It involves the
"media type" information that is normally returned to the browser by the Web server when a page is requested.
For HTML Web pages, the media type is text/html. For XHTML 1.1 Web pages, the media type should be
application/xhtml+xml. For the many browsers that don't yet recognize this new media type, a W3C Note
allows the continued use of the old text/html media type. However, rather than serving up XHTML 1.1
with the old media type, it's better practice to keep the old media type and serve up XHTML 1.1 content as XHTML 1.0
Strict. Do this by changing the doctype.
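One way to picture this fallback is a small server-side check of the browser's Accept request header. The following is only a sketch under stated assumptions: the function name choose_serving_mode is hypothetical, and the check ignores q-values, which a production implementation would need to honor.

```python
XHTML_1_1_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
    '"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">'
)
XHTML_1_0_STRICT_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
)


def choose_serving_mode(accept_header: str) -> tuple[str, str]:
    """Return (media type, doctype) for a request (hypothetical helper).

    Simplification: only checks whether application/xhtml+xml is listed
    in the Accept header; q-values are ignored.
    """
    accepted = [part.split(";")[0].strip().lower()
                for part in accept_header.split(",")]
    if "application/xhtml+xml" in accepted:
        # Browser recognizes the new media type: serve true XHTML 1.1.
        return "application/xhtml+xml", XHTML_1_1_DOCTYPE
    # Otherwise keep the old media type and label the page XHTML 1.0 Strict.
    return "text/html", XHTML_1_0_STRICT_DOCTYPE


media_type, doctype = choose_serving_mode("text/html,application/xhtml+xml;q=0.9")
print(media_type)
```

The page content itself stays the same in both cases; only the media type and the doctype line differ.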
Content-managed Web Sites
Before deciding which version of HTML is right for you, let's look at how content-managed Web sites are built.
Typically, they are built using a set of layout templates (ASP, PHP, ColdFusion, etc). These templates provide the
general look and feel and navigation for the site, with placeholders for content (script that fetches data). When
a site visitor requests a page, the layout template is combined with the data to produce the HTML Web page (see the
diagram below).
This is a solid and time-tested approach and virtually all content-managed sites are built in this way. Some store content in the database, others store it in XML documents on the file system or in plain text files, but the approach is essentially the same. However, over time, content usually needs to be re-purposed, syndicated and inserted into different page layouts. So while the way in which data is presented will change over time, content itself needs to remain highly available to any layout that needs it. The diagram below demonstrates this point.
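The template-plus-data approach can be sketched in a few lines. This example uses Python's string.Template in place of ASP, PHP or ColdFusion; the two layouts, the placeholder names and the sample content are all invented for illustration.

```python
from string import Template

# Two hypothetical layouts with placeholders for the same content.
FULL_PAGE_LAYOUT = Template(
    "<html><head><title>$title</title></head>"
    '<body><h1>$title</h1><div class="content">$body</div></body></html>'
)
TEASER_LAYOUT = Template(
    '<div class="teaser"><h3>$title</h3>$body</div>'
)

# Content as it might come back from a database or an XML file.
content = {
    "title": "Site News",
    "body": '<p><span class="person">John Smith</span> joined the team.</p>',
}


def render(layout: Template, data: dict) -> str:
    """Combine a layout template with content to produce a page fragment."""
    return layout.substitute(data)


print(render(FULL_PAGE_LAYOUT, content))
print(render(TEASER_LAYOUT, content))
```

Because the content carries no formatting of its own, the same record drops into either layout unchanged.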
Only content that is free from formatting can be easily re-used in this way. In theory, every version from HTML 4 through XHTML 1.1 supports the separation of data from formatting, but only XHTML 1.1 actually enforces it. In practice, most WYSIWYG editors still generate code that fuses data and formatting together, which makes data more difficult to parse and reuse. Here is a simple illustration. Let's say that an author decides to present people's names, within a news article, in the color green. This will generate the following code:
<font color="green">John Smith</font>
or
<span style="color: green">John Smith</span>
Problem: what if another Web site's policy is to display people's names in blue? On the surface, the solution seems easy – a simple "search and replace" on the word "green" within a color or a style attribute. But what if green is also being used to colorize something else? How confident would you be that your search and replace has not mistakenly replaced something it was not supposed to?
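The risk is easy to demonstrate. In the sketch below, the sample markup is hypothetical: it colors a person's name green, but also colors an unrelated word green, and a naive search and replace changes both.

```python
# Hypothetical article fragment: green is used for a name AND for something else.
markup = (
    '<p><font color="green">John Smith</font> praised the '
    '<span style="color: green">recycling</span> program.</p>'
)

# Naive "search and replace" intended to recolor people's names only.
recolored = markup.replace("green", "blue")

print(recolored)
print("occurrences changed:", recolored.count("blue"))
```

Nothing in the markup distinguishes "green because this is a person" from "green for some other reason", so the replacement cannot tell them apart.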
A far better approach is to author content in such a way that the data is not compromised by inline formatting, by using an external or embedded style sheet. For example:
<span class="person">John Smith</span>
Each Web site that uses the data "John Smith" is now free to define the CSS rule that formats the person
class in a way that meets its own common look and feel policy. For example:
span.person {color: blue}
Taking this one step further, what if a Web site for some reason wants to revert to using the <font>
tag, instead of using CSS? Even this is quite easy to do by using an XSLT rule that transforms
<span class="person"> to a <font> tag:
<xsl:template match="span[@class = 'person']">
This example reveals one self-evident truth: it is possible to convert semantically rich markup to semantically barren markup, but not vice versa.
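The span-to-font conversion described above can also be sketched outside XSLT. The example below is not the article's XSLT rule; it swaps in Python's built-in xml.etree.ElementTree to perform an equivalent transformation. The function name, the default color and the sample fragment are assumptions.

```python
import xml.etree.ElementTree as ET


def spans_to_font(fragment: str, css_class: str = "person",
                  color: str = "blue") -> str:
    """Rewrite <span class="..."> elements as <font color="..."> (hypothetical helper)."""
    root = ET.fromstring(fragment)  # requires well-formed markup
    for element in root.iter("span"):
        if element.get("class") == css_class:
            element.tag = "font"           # semantic element becomes presentational
            del element.attrib["class"]    # the meaning ("person") is discarded here
            element.set("color", color)
    return ET.tostring(root, encoding="unicode")


source = '<p><span class="person">John Smith</span> spoke today.</p>'
print(spans_to_font(source))
# prints: <p><font color="blue">John Smith</font> spoke today.</p>
```

Going the other way is not possible in general: once the class is gone, nothing in the font element records that the text was a person's name.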
Fortunately, there are XHTML 1.1-compliant WYSIWYG editors, ones that enforce the separation of content and style. In Part 2 we'll look at one XHTML WYSIWYG editor in particular, and look at some general rules you can apply to your HTML markup today to help prepare it for a future of XHTML.




