Have you ever performed a search on 4Guys? If not, take a moment to visit the search page and try a search. I've received a number of questions from users asking how to search a Web site, so I thought it would make a great article to describe how I search 4Guys!
Originally, when 4Guys was a lot smaller and received less traffic on a daily basis, my search engine used the FileSystemObject to search through the text of each file on the Web server whenever a search was performed. In fact, I wrote up an article on how to do this back in December, 1999: Searching Through the Text of Each File on a WebSite.
There are two common techniques used for content-rich sites like 4Guys. One method is to have the contents of each article stored in a database and to create a single ASP page to display each of these. Sites like ASPWatch.com and SQLTeam.com use this technique. The other approach is to have a Web page for each article. Sites like 4Guys and ASP101.com follow this model.
On 4Guys, each article exists as its own file; this makes a textual search using the FileSystemObject a plausible solution. However, as the number of articles and visitors on 4Guys grew (as of 11/28/00 there are over 725 total articles and over 100,000 daily page views), the FileSystemObject approach slowed down considerably. I looked at using Index Server, but had fits getting it set up; also, I wanted to create some sort of custom database repository of the content on 4Guys. I then looked at using a product like XCache. For those unfamiliar with XCache, it is an application that allows a Web master to build a database of the site's content. Then, on a regular schedule, XCache will go through the database and turn it into a series of static HTML pages. This approach is useful for enhancing performance, since you remove all database calls (and all ASP execution time) from the site.
Rather than go with any of these solutions, I decided to create my own. Since I already (at the time) had about 300 articles (and I was very comfortable with the process I had for adding new content to the site), I didn't want to make any changes that would disrupt existing content (or my methodologies for adding new articles). Therefore, I decided to do more or less the inverse of what XCache does: rather than creating a database of my site's content and scheduling the creation of a static version of the site, I decided to write a script that would go through my existing (and future) static content and build up a database of this information.
With that in mind, I created a database table, tblArticleIndex, with the following format:
For each article on 4Guys, I'd add a row to the table. I automated this process by creating a simple script that would iterate through the ASP pages that make up the articles on 4Guys and use the FileSystemObject to populate each of the columns. I then used the task scheduler to schedule this script to execute once a day, late at night. Each time it ran, it obliterated all of the contents of the tblArticleIndex table and then rebuilt the entire table by iterating through all of the articles. While this may seem like a waste of time and resources, I've found it to be no big deal, seeing as the entire operation takes under fifteen seconds. (So, yes, for ~15 seconds in the middle of the night, searching the 4Guys site may not return all of the results that are really there, since they are still being populated into the table.)
The script that builds up the database each night borrows a lot of its code from Searching Through the Text of Each File on a WebSite. The same code presented in Part 3 of that article is used in the database-building script. Some code has been added, though, to insert a row into the database for each article found. I am not going to go into detail on how the database-building script works, for I think it is pretty self-explanatory if you've thoroughly read Searching Through the Text of Each File on a WebSite. The database-building script's source can be viewed here.
Please do take a moment and check out the database-building script.
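To give a rough feel for the nightly rebuild described above, here is a minimal sketch written in Python rather than the ASP/VBScript that the actual script uses. It is not the real database-building script: the `path` and `contents` columns, the SQLite backend, and the `rebuild_article_index` function name are all assumptions made for illustration.

```python
import os
import sqlite3


def rebuild_article_index(conn, root_dir, extensions=(".asp",)):
    """Wipe tblArticleIndex and repopulate it from the files under root_dir.

    Returns the number of articles indexed.
    """
    # Hypothetical two-column layout; the real table has its own format.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tblArticleIndex ("
        "path TEXT PRIMARY KEY, contents TEXT)"
    )

    # Walk the site folder and read every article file into memory.
    rows = []
    for folder, _subdirs, files in os.walk(root_dir):
        for name in files:
            if name.lower().endswith(extensions):
                full_path = os.path.join(folder, name)
                with open(full_path, encoding="utf-8", errors="replace") as fh:
                    rows.append((os.path.relpath(full_path, root_dir), fh.read()))

    # Obliterate the old contents and insert the fresh rows in one transaction.
    with conn:
        conn.execute("DELETE FROM tblArticleIndex")
        conn.executemany(
            "INSERT INTO tblArticleIndex (path, contents) VALUES (?, ?)", rows
        )
    return len(rows)
```

One design choice differs from the script described above: this sketch gathers all rows first and swaps them in inside a single transaction, so a search running during the rebuild would see either the old table or the new one instead of a half-populated table.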
It is important to realize that each article on 4Guys has an HTML header. Go ahead and do a View/Source on this article and you will see what I mean. The title of the article is wedged between one pair of HTML comment delimiters (each ending in -->), while the description for an article is slapped between another pair. (This stems back from the days when a search blasted through the entire contents of each file: the reason the titles and descriptions were stored at the top of the files was so that, when a match was found in the contents of a file, I could intelligently list the title and description of the article in the search results.) Note that the database-building script picks out the included title and description and stuffs those into the title and Description columns, respectively.
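As an illustration of how a script might pick the title and description out of such a header, here is a small Python sketch (the real script is ASP/VBScript). The marker names `<!--TITLE-->`, `<!--END TITLE-->`, `<!--DESCRIPTION-->`, and `<!--END DESCRIPTION-->` are hypothetical stand-ins, as is the `extract_header` function; substitute whatever comment delimiters your own pages use.

```python
import re

# Hypothetical marker names: stand-ins for the HTML comment delimiters
# that a View/Source on an article would reveal.
TITLE_RE = re.compile(
    r"<!--TITLE-->(.*?)<!--END TITLE-->", re.DOTALL | re.IGNORECASE
)
DESC_RE = re.compile(
    r"<!--DESCRIPTION-->(.*?)<!--END DESCRIPTION-->", re.DOTALL | re.IGNORECASE
)


def extract_header(page_text):
    """Return (title, description); either is None if its markers are missing."""

    def first_match(pattern):
        # Non-greedy match, so only the text between the first pair is taken.
        match = pattern.search(page_text)
        return match.group(1).strip() if match else None

    return first_match(TITLE_RE), first_match(DESC_RE)
```

A page without the markers simply yields `None` for the missing piece, which lets the indexing script decide whether to skip the file or store a blank value.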
Now that we have the tblArticleIndex table built up on a nightly basis, all we need is an ASP page that will accept some search terms and intelligently search through this database table, returning paged results. We'll examine this page in Part 2 of this article!