Robot Control Code Generation Tool
If you know of a robot that should be added to this list, please contact us and we will verify and add it.
Latest Update: Naver Crawler added. Googlebot-Mobile added.
Newest Additions: XML Sitemap Auto Discovery directive
Introduction to Robots.txt

The robots.txt is a very simple text file that is placed in your root directory. An example would be www.yourdomain.com/robots.txt. This file tells search engines and other robots which areas of your site they are allowed to visit and index.

You can have ONLY one robots.txt on your site, and ONLY in the root directory (where your home page is):

OK: www.yourdomain.com/robots.txt
BAD - Won't work: www.yourdomain.com/subdirectory/robots.txt

All major search engine spiders respect this, and naturally most spambots (email collectors for spammers) do not. If you truly want security on your site, you will have to actually put the files in a protected directory, rather than trusting the robots.txt file to do the job. It's guidance for robots, not security from prying eyes.

What does a Robots.txt look like?

At its most simple, a robots.txt file looks like this:

User-agent: *
Disallow:

This one tells all robots (user agents) to go anywhere they want (disallow nothing). This one, on the other hand, keeps out all compliant robots:

User-agent: *
Disallow: /

As you can see, the only difference between them is a single slash ("/"). But if you accidentally use that slash when you didn't mean to, you could find your search engine rankings disappear. Be very careful.

One important thing to know if you are creating your own robots.txt file is that although the wildcard (*) is used in the user-agent line, it is not allowed in the disallow line. For example, you can't have something like:

# Broken robots.txt - can't use the * symbol in the disallow line,
# even if you really want to and it makes sense to have one
# (Google and MSN are an exception to this - more information below)
User-agent: *
Disallow: /*.gif

Here is the official information on the subject: RobotsTxt.org

You may also be interested in Robot Cop (a server module that enforces bot behaviour).

UPDATE: If you use Google Sitemaps (and you should), it now includes a robots.txt validator, which will make certain that your robots.txt file is understood properly by Google.

Major Known Spiders / Crawlers

Googlebot (Google), Googlebot-Image (Google Image Search), MSNBot (MSN), Slurp (Yahoo), Yahoo-Blogs, Mozilla/2.0 (compatible; Ask Jeeves/Teoma), Gigabot (Gigablast), Scrubby (Scrub The Web), Robozilla (DMOZ), Twiceler (Cuil)

Search Engine Crawler Specific Commands

Google

Google allows the use of asterisks. Disallow patterns may include "*" to match any sequence of characters, and patterns may end in "$" to indicate the end of a name. To remove all files of a specific file type (for example, to keep .jpg but exclude .gif images), you'd use the following robots.txt entry:

User-agent: Googlebot-Image
Disallow: /*.gif$

This applies to both the googlebot and googlebot-image spiders. Source: http://www.google.com/webmasters/remove.html

Google apparently does NOT support the crawl-delay command.

Yahoo

Yahoo also has a few specific commands, including the:

Crawl-delay: xx

instruction, where "xx" is the minimum delay in seconds between successive crawler accesses. Yahoo's default crawl-delay value is 1 second. If the crawler rate is a problem for your server, you can set the delay up to 5 or 20 seconds, or whatever value is comfortable for your server. Setting a crawl-delay of 20 seconds for Yahoo-Blogs/v3.9 would look something like:

User-agent: Yahoo-Blogs/v3.9
Crawl-delay: 20

Source: http://help.yahoo.com/help/us/ysearch/crawling/crawling-02.html

Ask / Teoma

Supports the crawl-delay command.
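Compliant crawlers read these rules, including the Crawl-delay extension, before fetching any pages. As a rough sketch of what that looks like from the crawler's side, Python's standard urllib.robotparser module can parse a live robots.txt; the domain below is this article's placeholder and the URL path is made up for illustration:

import urllib.robotparser

# Point the parser at a site's robots.txt (placeholder domain).
rp = urllib.robotparser.RobotFileParser("http://www.yourdomain.com/robots.txt")
rp.read()  # fetch and parse the file over the network

# May a crawler identifying itself as Yahoo's "Slurp" fetch this URL?
print(rp.can_fetch("Slurp", "http://www.yourdomain.com/cgi-bin/search"))

# Crawl-delay declared for that user agent (None if not specified).
print(rp.crawl_delay("Slurp"))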
MSN Search

Supports the crawl-delay command. Also allows wildcard behavior in disallow lines:

User-agent: msnbot
Disallow: /*.[file extension]$

(the "$" is required, in order to declare the end of the file)

Example (the blocked file type here is illustrative):

User-agent: msnbot
Disallow: /*.pdf$

Source: http://search.msn.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_REF_RestrictAccessToSite.htm

Cuil

Supports the crawl-delay command. Source: http://www.cuil.com/info/webmaster_info/
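As described above, Google and msnbot both extend the standard by letting "*" match any sequence of characters and a trailing "$" anchor a pattern to the end of the URL. The sketch below shows one way such a pattern can be evaluated; the helper name and sample paths are made up for illustration, and this is not the engines' actual matching code:

import re

def pattern_to_regex(disallow_pattern):
    """Translate a Googlebot/msnbot-style Disallow pattern into a regex.

    '*' matches any sequence of characters; a trailing '$' anchors the
    pattern to the end of the URL path. Illustrative sketch only.
    """
    anchored = disallow_pattern.endswith("$")
    core = disallow_pattern.rstrip("$")
    # Escape regex metacharacters, then restore '*' as a wildcard.
    regex = re.escape(core).replace(r"\*", ".*")
    return re.compile("^" + regex + ("$" if anchored else ""))

def is_blocked(path, disallow_pattern):
    return pattern_to_regex(disallow_pattern).match(path) is not None

# Sample paths are hypothetical.
print(is_blocked("/photos/header.gif", "/*.gif$"))       # True - blocked
print(is_blocked("/photos/header.jpg", "/*.gif$"))       # False - allowed
print(is_blocked("/photos/header.gif.html", "/*.gif$"))  # False - "$" anchors the match

Note that without the trailing "$", the same pattern would also block /photos/header.gif.html, which is why the anchor is required when blocking by file extension.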
Why do I want a Robots.txt?

There are several reasons you might want to control a robot's visits to your site.

Robots.txt FAQ - Issues, Facts and Fiction

By itself, a robots.txt file is harmless and actually beneficial. However, its job is to tell a search engine to keep away from parts of your website. If you misconfigure it, you can accidentally prevent your site from being spidered and indexed. This has happened to people both due to an error in the robots.txt file and after a site redesign where the directory structure of the site changed and the robots.txt was not updated. Always check the robots.txt after a major site redesign.

A robots.txt file and, for that matter, the robots metatag (related: free robots meta tag generator) have NO EFFECT on speeding up the spidering and indexing of a website, and no effect on the depth or breadth of the spidering of a site. You cannot issue a search engine spider a command to do something - you can only tell it not to do something.

Some people get confused between "crawler", "robot" and "spider". In practice, they are all the same thing: an automated program that visits and reads web pages. "Spider" and "crawler" emphasize the way it follows links from page to page, while "robot" (or "bot") is the general term.
Security Issue: A robots.txt is not intended to provide security for your website - humans ignore it. Additionally, there is actually a possible security issue with it. Let's say you have a secret directory on your site called "secretsauce". You don't want it spidered, so you add this directory to your robots.txt. The problem now is that anyone can look up your robots.txt file and see that you don't want people looking at that directory. Obviously, if you were a hacker, this would be your first stop. Additionally, if the path you were excluding was "/secretfiles/secretsauce/", the same hacker now knows that you have another directory called "secretfiles" as well. It's never a good idea to tell a hacker details about your site structure and design.

If you are trying to keep people away from information, you need to use real file and folder level security on your site, which will prevent robots from visiting just like people, even if the robots.txt file says it's OK. I recommend you set your robots.txt to only deal with non-critical and normal directories, such as images, cgi-bin, etc., and then use file security for the rest. That way, even though the robots are not specifically excluded from those folders and files, they are effectively excluded by the file permissions. Only use robots.txt (and robots metatags) to exclude files, pages and directories that are intended to be available to people but not to robots, such as duplicate pages, test pages and demos.

Rule of thumb: If you want to restrict robots from entire websites and directories, use the robots.txt file. If you want to restrict robots from a single page, use the robots metatag. If you are looking to restrict the spidering of a single link, use the link "nofollow" attribute.
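For reference, the page-level and link-level controls from that rule of thumb look like this; a minimal illustration, with a hypothetical filename:

<!-- In a page's <head>: keep this page out of the index and don't follow its links -->
<meta name="robots" content="noindex, nofollow">

<!-- On a single link: don't follow this particular link -->
<a href="demo-page.html" rel="nofollow">Demo page</a>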
Unless otherwise noted, all articles written by Ian McAnerin, BASc, LLB. Copyright © 2002-2008 All Rights Reserved. Permission must be specifically granted in writing for use or reprinting anywhere but on this site, but we do allow it and don't charge for it, other than requiring a backlink. Contact Us for more information.