By placing a file called "/robots.txt" at the top level of a web site, a site
administrator can control where robots can go. To exclude all robots, the
/robots.txt file looks like:

# Prevent all robots from visiting this site:
User-agent: *
Disallow: /
To exclude just one directory (and its subdirectories), say /images, the
file looks like:

# Prevent all robots from visiting the /images directory:
User-agent: *
Disallow: /images
Web site administrators can allow or disallow specific robots from visiting
part or all of their site. The robot collecting data to be archived
identifies itself as ia_archiver, so to allow ia_archiver to visit (while
preventing all others):

# robots.txt --- Exclude search engines but allow crawling.
# Let Alexa Internet retain a copy.
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
To prevent ia_archiver from visiting (while allowing all others):

# robots.txt --- Index my site for search but don't archive it.
# Tell Alexa Internet not to keep a copy.
User-agent: ia_archiver
Disallow: /
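These policies can be sanity-checked with Python's standard urllib.robotparser
module. The sketch below reproduces the two example files described above; the
"SomeOtherBot" user-agent string is just a placeholder for any other crawler.

```python
from urllib import robotparser

# Example 1: allow ia_archiver, exclude everyone else.
allow_archiver = """\
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
""".splitlines()

# Example 2: exclude ia_archiver, allow everyone else.
block_archiver = """\
User-agent: ia_archiver
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(allow_archiver)
print(rp.can_fetch("ia_archiver", "/index.html"))   # True
print(rp.can_fetch("SomeOtherBot", "/index.html"))  # False

rp = robotparser.RobotFileParser()
rp.parse(block_archiver)
print(rp.can_fetch("ia_archiver", "/index.html"))   # False
print(rp.can_fetch("SomeOtherBot", "/index.html"))  # True
```

Note that an empty "Disallow:" line means "disallow nothing", i.e. the named
robot may fetch everything.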
More about robot exclusions:
Using the NOINDEX, NOARCHIVE, and NOFOLLOW tags. (Excerpted from the
Frequently Asked Questions.)
Sometimes you cannot make a /robots.txt file because you don't administer
the entire server. All is not lost: there is a standard for using HTML META
tags to keep robots out of your documents.
The basic idea is that if you include tags like:
<META NAME="ROBOTS" CONTENT="NOINDEX">
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
in the "<HEAD>"
element of your HTML document, that document will not be indexed and will
not be archived. The purpose of the NOARCHIVE tag is to allow content
developers to permit indexing but forbid archiving. If you include:
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
the links in that document will not be followed by the robot.