Did you know that you can instruct various search engines not to index certain parts of your Web site? Perhaps you have a premium content section that you don't want people to access directly, or a set of pages that shouldn't be accessed until a previous page has been visited. To help accommodate these requirements you can use the techniques illustrated in Simple Authentication and Two Ways to Protect your ASP Pages. You should also instruct search engine robots not to index these particular subsections of your Web site.
What, exactly, is a robot?
A robot is a program that runs automatically to carry out a specific task, usually gathering information.
A Web robot, as defined by the
Web Robots FAQ,
"is a program that automatically traverses the Web's hypertext structure by retrieving a document, and
recursively retrieving all documents that are referenced." At any given time, your web site might be visited by
a Web robot. There are many Web robots out there, poking through your Web site's documents. The Web robots
that most Web developers are concerned about, though, are those from the large search engine sites.
Instructing Robots that Visit your Site
When a search engine robot arrives at your site, it searches for a file called robots.txt in your
root directory. robots.txt should contain information on what directories you don't want the
search engine robot to search. In fact, you can be very specific, indicating that certain search engines
can index certain parts of your site, while others can't!
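To make the lookup concrete, here is a minimal Python sketch of how a robot could work out where to look for robots.txt, given any page on a site. The function name robots_txt_url and the www.example.com address are made up for illustration; real robots may implement this differently.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the URL where a robot would look for robots.txt (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host: robots.txt lives in the site's root
    # directory, no matter how deep the page being visited is.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# www.example.com is a placeholder site, not one taken from the article.
print(robots_txt_url("http://www.example.com/premium/articles/page2.asp"))
# prints http://www.example.com/robots.txt
```

Because only the root location is checked, a robots.txt file placed in a subdirectory would not be found by this lookup.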
The remainder of the article will show a quick overview of how to use the robots.txt file to
instruct various robots on how to index your site.
For a full description of the robots.txt standard, be sure to check out this very informative
site:
The following lines can be used in the robots.txt file to exclude robots from indexing particular
parts of your Web site:
- One or more User-Agent lines. The User-Agent line can contain the name of a specific robot, or an asterisk to include all robots.
- One or more Disallow lines after each User-Agent line. The Disallow line tells the robot which files or directories it is not allowed to visit.
- In the robots.txt file, use a pound sign (#) to denote a comment.
So, here is an example of a valid robots.txt file that tells all robots not to visit any URL starting
with /stayAwayFromHere:
    User-agent: *
    Disallow: /stayAwayFromHere
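If you want to check how a rule like this is interpreted, Python's standard urllib.robotparser module can evaluate it. The following is only a sketch: the helper name allowed, the robot name ExampleBot, and the page paths are assumptions for illustration, and individual search engine robots may interpret robots.txt somewhat differently.

```python
from urllib.robotparser import RobotFileParser

# The rule described in the text: no robot may visit /stayAwayFromHere.
ROBOTS_TXT = """\
# Keep every robot out of /stayAwayFromHere
User-agent: *
Disallow: /stayAwayFromHere
"""


def allowed(robots_txt: str, robot_name: str, url_path: str) -> bool:
    """Return True if robot_name may visit url_path (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(robot_name, url_path)


# ExampleBot and the page names below are made up for this example.
print(allowed(ROBOTS_TXT, "ExampleBot", "/stayAwayFromHere/secret.asp"))  # False
print(allowed(ROBOTS_TXT, "ExampleBot", "/index.asp"))                    # True
```

Note that the Disallow value is matched as a prefix, so every URL beginning with /stayAwayFromHere is covered, not just a single page.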
Note that the asterisk in the User-agent line makes the rule apply to all robots. To
disallow a particular robot from a particular page or directory, you can specify the robot's name
in the User-agent line. For a complete list of robots and their User-agent names, see
The Web Robots Database.
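As a sketch of a robot-specific rule, the example below keeps one named robot out of a directory while leaving the site open to everyone else, and again uses Python's urllib.robotparser to check the result. The robot names ExampleBot and OtherBot and the /premium directory are hypothetical; substitute the real User-agent name of the robot you want to restrict.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# ExampleBot is a hypothetical robot name used only for illustration
User-agent: ExampleBot
Disallow: /premium

# Every other robot may visit everything (an empty Disallow allows all)
User-agent: *
Disallow:
"""


def may_visit(robot_name: str, url_path: str) -> bool:
    """Evaluate ROBOTS_TXT for one robot and one path (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(robot_name, url_path)


print(may_visit("ExampleBot", "/premium/report.asp"))  # False
print(may_visit("OtherBot", "/premium/report.asp"))    # True
print(may_visit("ExampleBot", "/index.asp"))           # True
```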
Here is one more example of a robots.txt file, one that prevents every robot from visiting or
indexing any URL on your site:
    User-agent: *
    Disallow: /
Using META Tags Instead of robots.txt
Finally, know that you can place Robot information in the META tag of a particular HTML web page.
Using the META tag approach, you can inform a robot what to do when it reaches a Web page.
Note that not all robots support the META tag instructions, though. For more information be sure
to read the HTML Author's Guide
to the Robots META tag.
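To give a feel for what a robot does with such a tag, here is a small Python sketch that reads the Robots META tag from a page using the standard html.parser module. The class name RobotsMetaParser and the sample page are made up for illustration, and the sketch assumes the commonly used form of the tag, with NAME set to ROBOTS and CONTENT holding comma-separated values such as NOINDEX and NOFOLLOW.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in a page's Robots META tag (hypothetical class)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)  # attribute names arrive lowercased
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Directives are comma separated and case-insensitive.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


# A made-up page that asks robots not to index it or follow its links.
PAGE = """<html><head>
<title>Premium content</title>
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
</head><body>...</body></html>"""

parser = RobotsMetaParser()
parser.feed(PAGE)
print(sorted(parser.directives))  # ['nofollow', 'noindex']
```

Unlike robots.txt, this tag is only seen once the robot has already retrieved the page, so it controls indexing and link following rather than the visit itself.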
Happy Programming!




