Published: Tuesday, February 01, 2000

Restricting Search Engine Robots that Index your Site


Did you know that you can instruct various search engines not to index certain parts of your Web site? Perhaps you have a premium content section that you don't want people to access directly, or you have a certain set of pages that you don't want accessed until a previous page has been visited. To help accommodate these requirements, you can use the techniques illustrated in Simple Authentication and Two Ways to Protect your ASP Pages. You should also instruct search engine robots not to index these particular subsections of your Web site.


What, exactly, is a robot?
A robot is a program that runs automatically to perform a specific task, usually gathering information. A Web robot, as defined by the Web Robots FAQ, "is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced." At any given time, your Web site might be visited by one of the many Web robots out there, poking through your site's documents. The Web robots that most Web developers are concerned about, though, are those from the large search engine sites.

Instructing Robots that Visit your Site
When a search engine robot arrives at your site, it searches for a file called robots.txt in your root directory. robots.txt should contain information on what directories you don't want the search engine robot to search. In fact, you can be very specific, indicating that certain search engines can index certain parts of your site, while others can't!
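
To make "in your root directory" concrete, here is a minimal Python sketch of how a robot might work out where to look for robots.txt, given the URL of any page on a site. The function name robots_txt_url and the example.com address are made up for illustration; real robots may handle this differently.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the URL of robots.txt in the root directory of page_url's site."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path with /robots.txt and
    # drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # Hypothetical page deep inside a site.
    print(robots_txt_url("http://www.example.com/premium/report.asp?id=3"))
    # http://www.example.com/robots.txt
```

Whatever page the robot starts from, it ends up requesting the same file at the top of the site, which is why a robots.txt placed in a subdirectory has no effect.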

The remainder of the article will show a quick overview of how to use the robots.txt file to instruct various robots on how to index your site. For a full description of the robots.txt standard, be sure to check out this very informative site:

  • The Web Robots Page (very useful: a FAQ, a list of common robots, a comprehensive list of exclusion commands, related web sites, and several articles on the issue! Very impressive!)

    The following commands can be used in the robots.txt to exclude robots from indexing particular parts of your Web site:

    • One or more User-Agent lines. The User-Agent can contain the name of a specific robot, or an asterisk to include all robots.
    • One or more Disallow lines after each User-Agent line. The Disallow line instructs the robot what files or directories it is not allowed to visit.
    • In the robots.txt file, use a pound sign (#) to denote a comment.

    So, here is an example of a valid robots.txt file that tells all robots not to visit any URL starting with /stayAwayFromHere:

    # This file will tell all robots to avoid URLs
    # starting with /stayAwayFromHere
    
    User-agent: *
    Disallow: /stayAwayFromHere
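
If you want to check how a well-behaved robot would read this file, one option is Python's standard urllib.robotparser module. The sketch below is only an illustration: the robot name ExampleBot and the page paths are made up, and real search engine robots may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt file from the example above.
ROBOTS_TXT = """\
# This file will tell all robots to avoid URLs
# starting with /stayAwayFromHere

User-agent: *
Disallow: /stayAwayFromHere
"""


def is_allowed(robots_txt, user_agent, path):
    """Return True if user_agent may fetch path under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # ExampleBot is a hypothetical robot name.
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "/index.asp"))                  # True
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "/stayAwayFromHere/page.asp"))  # False
    # Disallow matches by prefix, so this URL is blocked as well.
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "/stayAwayFromHereToo.asp"))    # False
```

Note that the Disallow value is a URL prefix, not a directory name, so it also covers files whose names merely begin with the same characters.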
    

    Note that the asterisk in the User-agent line serves as a message to all robots. To disallow a particular robot from a particular page or directory, you can specify the robot name in the User-Agent line. For a complete list of robots and their User-agent names, see The Web Robots Database.
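
As a sketch of per-robot rules, the snippet below embeds a robots.txt with one section for a single robot and one for everyone else, then checks it with Python's standard urllib.robotparser. The robot names ExampleBot and OtherBot and the /premium path are hypothetical; substitute a real User-agent name from the robots database.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one section for a named robot, one for all others.
ROBOTS_TXT = """\
# Keep ExampleBot out of the premium section
User-agent: ExampleBot
Disallow: /premium

# Every other robot must avoid /stayAwayFromHere
User-agent: *
Disallow: /stayAwayFromHere
"""


def is_allowed(robots_txt, user_agent, path):
    """Return True if user_agent may fetch path under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "/premium/report.asp"))      # False
    print(is_allowed(ROBOTS_TXT, "OtherBot", "/premium/report.asp"))        # True
    print(is_allowed(ROBOTS_TXT, "OtherBot", "/stayAwayFromHere/a.asp"))    # False
    # A robot with its own section follows only that section, not the * one.
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "/stayAwayFromHere/a.asp"))  # True
```

The last line shows a common surprise: once a robot has a section of its own, the rules in the asterisk section no longer apply to it, so repeat any shared Disallow lines in the robot's own section.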

    Here is one more example of a robots.txt that prevents any robot from visiting or indexing any URL on your site:

    # Don't index my site!
    
    User-agent: *
    Disallow: /
    

    Using META Tags Instead of robots.txt
    Finally, know that you can place Robot information in the META tag of a particular HTML web page. Using the META tag approach, you can inform a robot what to do when it reaches a Web page. Note that not all robots support the META tag instructions, though. For more information be sure to read the HTML Author's Guide to the Robots META tag.
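
For illustration, the Robots META tag takes the form shown in the sample page below, where NOINDEX asks the robot not to index the page and NOFOLLOW asks it not to follow the page's links. The Python sketch that reads the tag back uses the standard html.parser module; the class and function names, and the sample page itself, are made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical page that asks robots not to index it or follow its links.
PAGE = """\
<html>
<head>
<title>Premium Content</title>
<meta name="robots" content="noindex, nofollow">
</head>
<body>Members only.</body>
</html>
"""


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # The name and the directives are treated as case-insensitive here.
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",") if d.strip()]


def robots_directives(html):
    """Return the list of robots META directives in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    print(robots_directives(PAGE))  # ['noindex', 'nofollow']
```

Unlike robots.txt, the tag lives in each page, so it is handy when you cannot edit files in the site's root directory.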

    Happy Programming!
