Did you know that you can instruct various search engines not to index certain parts of your Web site? Perhaps you have a premium content section that you don't want people to access directly, or you have a certain set of pages that you don't want accessed until a previous page has been visited. To help accommodate these requirements you can use the techniques illustrated in Simple Authentication and Two Ways to Protect your ASP Pages. You should also instruct search engine robots not to index these particular subsections of your Web site.
What, exactly, is a robot?
A robot is a program that runs automatically, with a specific task, usually to scour information. A Web robot, as defined by the Web Robots FAQ, "is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced." At any given time, your web site might be visited by a Web robot. There are many Web robots out there, poking through your Web site's documents. The Web robots that most Web developers are concerned about, though, are those from the large search engine sites.
Instructing Robots that Visit your Site
When a search engine robot arrives at your site, it searches for a file called robots.txt in your site's root directory. robots.txt should contain information on which directories you don't want the search engine robot to search. In fact, you can be very specific, indicating that certain search engines can index certain parts of your site, while others can't!
The remainder of the article gives a quick overview of how to use the robots.txt file to instruct various robots on how to index your site.
For a full description, be sure to consult the robots.txt standard itself.
The following commands can be used in the robots.txt file to exclude robots from indexing particular parts of your Web site:
- One or more User-Agent lines. User-Agent can contain the name of a specific robot, or an asterisk to include all robots.
- One or more Disallow lines after each User-Agent line. Each Disallow line instructs the robot what files or directories it is not allowed to visit.
- In the robots.txt file, use a pound sign (#) to denote a comment.
So, a valid robots.txt file can, for example, state that no robots should visit any URL starting with a particular path.
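As a rough sketch of such a file, the snippet below holds a small robots.txt in a string and checks how Python's standard urllib.robotparser module interprets it. The /premium/ directory, the www.example.com host, and the robot name AnyBot are assumptions made for illustration; they do not come from the article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: /premium/ is an assumed directory name.
ROBOTS_TXT = """\
# Keep every robot out of the premium section
User-agent: *
Disallow: /premium/
"""

# Parse the text directly instead of fetching it from a live site.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "AnyBot" is a made-up robot name; the asterisk record applies to it.
blocked = parser.can_fetch("AnyBot", "http://www.example.com/premium/report.asp")
allowed = parser.can_fetch("AnyBot", "http://www.example.com/default.asp")

print(blocked, allowed)  # False True
```

Because Disallow matches by prefix, every URL whose path begins with /premium/ is excluded, while the rest of the site stays open to robots.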
Note that the asterisk in the User-agent line serves as a message to all robots. To disallow a particular robot from a particular page or directory, you can specify the robot name in the User-Agent line. For a complete list of robots and their User-agent names, see The Web Robots Database.
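To make the per-robot case concrete, here is a minimal sketch that names a single robot in the User-agent line. ExampleBot, OtherBot, the /drafts/ directory, and www.example.com are all invented for this illustration; real robot names should be looked up rather than guessed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: only the made-up robot "ExampleBot" is restricted.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "http://www.example.com/drafts/outline.asp"

# The named robot is told to stay out of /drafts/ ...
named_robot_ok = parser.can_fetch("ExampleBot", url)
# ... while a robot with a different (also made-up) name has no record
# that applies to it, so nothing is disallowed for it.
other_robot_ok = parser.can_fetch("OtherBot", url)

print(named_robot_ok, other_robot_ok)  # False True
```

This is how one search engine can be kept out of part of a site while others are still allowed to index it.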
Here is one more example: a robots.txt file that pairs a User-agent line containing an asterisk with a Disallow line containing a single slash (/) tells every robot not to visit or index any URL on your site.
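The site-wide exclusion can be sketched the same way. The two-line file below uses only the asterisk and a single slash; the host name and the robot name AnyBot are placeholders assumed for the check.

```python
from urllib.robotparser import RobotFileParser

# A single slash matches every path on the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "AnyBot" and www.example.com are placeholders for illustration only.
home_ok = parser.can_fetch("AnyBot", "http://www.example.com/")
page_ok = parser.can_fetch("AnyBot", "http://www.example.com/articles/index.asp")

print(home_ok, page_ok)  # False False
```

Keep in mind that robots.txt is only a request: it keeps well-behaved robots away, but it is not a substitute for real access control on protected pages.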
META Tags Instead of robots.txt
Finally, know that you can place robot information in the META tag of a particular HTML Web page. With the META tag approach, you can inform a robot what to do when it reaches that Web page.
Note that not all robots support the META tag instructions, though. For more information be sure to read the HTML Author's Guide to the Robots META tag.
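As an illustration of the META tag approach, the sketch below embeds a small HTML page containing a robots META tag and reads its directives the way a cooperative robot might. The page content and the RobotsMetaParser class are hypothetical; the noindex and nofollow values are the commonly documented directives for this tag.

```python
from html.parser import HTMLParser

# Hypothetical page: asks robots not to index it or follow its links.
PAGE = """\
<html>
<head>
<title>Premium report</title>
<meta name="robots" content="noindex, nofollow">
</head>
<body>Members only.</body>
</html>
"""

class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None when an attribute has no value.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)

meta_parser = RobotsMetaParser()
meta_parser.feed(PAGE)

# A robot honoring the tag would skip indexing and link-following here.
may_index = "noindex" not in meta_parser.directives
may_follow = "nofollow" not in meta_parser.directives

print(sorted(meta_parser.directives), may_index, may_follow)
```

Unlike robots.txt, this instruction lives in the page itself, so it only takes effect once a robot has already retrieved the page.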