By placing a file called "/robots.txt" at the top level of a web site, a site
administrator can control where robots can go. To exclude all robots, the
/robots.txt file looks like:

# Prevent all robots from visiting this site:
User-agent: *
Disallow: /
To exclude just one directory (and its subdirectories), say /images, the
file looks like:

# Prevent all robots from visiting the /images directory:
User-agent: *
Disallow: /images
Web site administrators can allow or disallow specific robots from visiting
part or all of their site. The robot collecting data to be archived
identifies itself as ia_archiver, so to allow ia_archiver to visit (while
preventing all others):

# robots.txt --- Exclude search engines but allow crawling.
# Let Alexa Internet retain a copy.
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
To prevent ia_archiver from visiting (while allowing all others):

# robots.txt --- Index my site for search but don't archive it.
# Tell Alexa Internet not to keep a copy.
User-agent: ia_archiver
Disallow: /
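These policies can be sanity-checked with Python's standard urllib.robotparser
module. The sketch below reproduces the two example files described above; the
"SomeOtherBot" user-agent string is just a placeholder for any other crawler.

```python
from urllib import robotparser

# Example 1: allow ia_archiver, exclude everyone else.
allow_archiver = """\
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
""".splitlines()

# Example 2: exclude ia_archiver, allow everyone else.
block_archiver = """\
User-agent: ia_archiver
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(allow_archiver)
print(rp.can_fetch("ia_archiver", "/index.html"))   # True
print(rp.can_fetch("SomeOtherBot", "/index.html"))  # False

rp = robotparser.RobotFileParser()
rp.parse(block_archiver)
print(rp.can_fetch("ia_archiver", "/index.html"))   # False
print(rp.can_fetch("SomeOtherBot", "/index.html"))  # True
```

Note that an empty "Disallow:" line means "disallow nothing", i.e. the named
robot may fetch everything.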
More about robot exclusions:
Using the NOINDEX, NOARCHIVE, and NOFOLLOW tags. (Excerpted from the
Frequently Asked Questions.)
Sometimes you cannot make a /robots.txt file because you don't administer
the entire server. All is not lost: there is a standard for using HTML META
tags to keep robots out of your documents.
The basic idea is that if you include tags like:
<META NAME="ROBOTS" CONTENT="NOINDEX">
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
in the "<HEAD>"
element of your HTML document, that document will not be indexed and will
not be archived. The purpose of the NOARCHIVE tag is to allow content
developers to permit indexing but forbid archiving. If you include:
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
the links in that document will not be followed by the robot.