By placing a file called "/robots.txt" at the top level of a web site, a site
administrator can control where robots can go. To exclude all robots, the
/robots.txt file looks like:

# Prevent all robots from visiting this site:
User-agent: *
Disallow: /
To exclude just one directory (and its subdirectories), say /images, the
file looks like:

# Prevent all robots from visiting the /images directory:
User-agent: *
Disallow: /images
Web site administrators can allow or disallow specific robots from visiting
part or all of their site. The robot collecting data to be archived
identifies itself as ia_archiver, so to allow ia_archiver to visit (while
preventing all others):

# robots.txt --- Exclude search engines but allow crawling.
# Let Alexa Internet retain a copy.
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
To prevent ia_archiver from visiting (while allowing all others):

# robots.txt --- Index my site for search but don't archive it.
# Tell Alexa Internet not to keep a copy.
User-agent: ia_archiver
Disallow: /
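These policies can be sanity-checked with Python's standard urllib.robotparser
module. The sketch below reproduces the two example files described above; the
"SomeOtherBot" user-agent string is just a placeholder for any other crawler.

```python
from urllib import robotparser

# Example 1: allow ia_archiver, exclude everyone else.
allow_archiver = """\
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
""".splitlines()

# Example 2: exclude ia_archiver, allow everyone else.
block_archiver = """\
User-agent: ia_archiver
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(allow_archiver)
print(rp.can_fetch("ia_archiver", "/index.html"))   # True
print(rp.can_fetch("SomeOtherBot", "/index.html"))  # False

rp = robotparser.RobotFileParser()
rp.parse(block_archiver)
print(rp.can_fetch("ia_archiver", "/index.html"))   # False
print(rp.can_fetch("SomeOtherBot", "/index.html"))  # True
```

Note that an empty "Disallow:" line means "disallow nothing", i.e. the named
robot may fetch everything.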
More about robot exclusions:
Using the NOINDEX, NOARCHIVE, and NOFOLLOW tags. (Excerpted from the
Frequently Asked Questions.)
Sometimes you cannot make a /robots.txt file because you don't administer
the entire server. All is not lost: there is a standard for using HTML META
tags to keep robots out of your documents.
The basic idea is that if you include tags like:
<META NAME="ROBOTS" CONTENT="NOINDEX">
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
in the "<HEAD>"
element of your HTML document, that document will not be indexed and will
not be archived. The purpose of the NOARCHIVE tag is to allow content
developers to permit indexing but forbid archiving. If you include:
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
the links in that document will not be followed by the robot.