The Difficult Case of Robots.txt

A good web administrator keeps tabs on Google Webmaster Tools. If I had checked it more regularly, I would have noticed that many of my web pages were not indexed. Before I can reach a prominent place in the search results, my pages must be indexed by Google or any other search engine. I had a problem with my website:

  Submitted: 714 pages
  Indexed: 200 pages

It never went above 200 pages. Why? It turns out that I didn't read Google's notes carefully; I read too much elsewhere on the web. Here are the requirements:

  1. Tell Google and the other search engines where your sitemap is. The sitemap lists all the web pages you would like indexed. This doesn't mean that Google will actually index them, only that you'd like them to be.
  2. Tell Google which directories to follow.
  3. Tell Google and the other search engines which directories NOT to go into.

This is done with the robots.txt file.

  • The robots.txt file only indicates your preferences. Google and the other search engines do NOT have to follow them.
  • The robots.txt file does not provide any security. The web spiders of the big search engines usually respect it, but those of less popular search engines may not. If you need security, the content must be protected by a username/password and/or a key/token file.
  • The web spiders of the big search engines, like Google and Microsoft, do follow the newer Allow/Disallow standard.
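
As a rough illustration, the three requirements above map onto three kinds of lines: Sitemap, Allow, and Disallow. The sketch below parses a small made-up file with Python's standard urllib.robotparser module. The host www.example.com, the spider name ExampleBot, and the paths are all assumptions chosen for the example, not my real setup. Keep in mind that Python's parser stops at the first rule that matches, which may differ from how a given search engine resolves conflicts.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical minimal robots.txt for www.example.com (illustration only).
ROBOTS_TXT = """\
Sitemap: http://www.example.com/sitemap.xml

User-agent: *
Disallow: /scripts/
Allow: /wiki.d/Main/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# 1. Where the sitemap is (site_maps() needs Python 3.8 or later).
sitemaps = parser.site_maps()

# 2. A directory the spider may follow.
main_allowed = parser.can_fetch(
    "ExampleBot", "http://www.example.com/wiki.d/Main/HomePage"
)

# 3. A directory the spider is asked to stay out of.
scripts_allowed = parser.can_fetch(
    "ExampleBot", "http://www.example.com/scripts/stdconfig.php"
)

print(sitemaps, main_allowed, scripts_allowed)
```

A well-behaved spider makes this kind of check before every request. Nothing forces a badly behaved one to do so, which is why the file offers no security.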

What's the robots.txt file?

  1. The robots.txt is a simple text file.
  2. The robots.txt must be placed in the root directory of your website, so mine is http://www.foto-biz.com/robots.txt. In any other place it will not be read. Everybody can view it, including the hackers… To see the robots.txt file of any website, just type website/robots.txt.
  3. The robots.txt is processed from top to bottom.
  4. If a web spider finds its own section, it will not read any other section, including the wildcard section.
  5. The robots.txt should have a Sitemap line pointing to your sitemap, which can be placed anywhere on your website.
  6. The robots.txt should first deal with the disallows, then with the allows.
  7. The robots.txt should repeat all of your allows and disallows in each section, for each web spider.
  8. If you are using images, the robots.txt should have a section for Googlebot-Image.
  9. If you have Google ads, the robots.txt should have a section for Mediapartners-Google and one for Adsbot-Google.
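
Point 4 is the one that bit me, so here is a small sketch of it using Python's standard urllib.robotparser module. The two-section file, the host www.example.com, and the spider name SomeOtherBot are made up for the example. It shows a spider with its own section ignoring the wildcard section entirely, at least as this particular parser implements it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one named section and one wildcard section.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /pub/

User-agent: *
Disallow: /docs/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Ask the parser whether `agent` may fetch `path` on the example site."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


# Googlebot has its own section, so only /pub/ is off limits for it;
# the /docs/ rule in the wildcard section is never consulted.
googlebot_docs = allowed("Googlebot", "/docs/index.html")
googlebot_pub = allowed("Googlebot", "/pub/skins/style.css")

# A spider without its own section falls back to the wildcard section.
other_docs = allowed("SomeOtherBot", "/docs/index.html")
other_pub = allowed("SomeOtherBot", "/pub/skins/style.css")

print(googlebot_docs, googlebot_pub, other_docs, other_pub)
```

In this sketch Googlebot may fetch /docs/ even though the wildcard section disallows it. That is why every rule you care about has to be repeated in each named section.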

For a full list of everything that visits your website, with its identification, see: [User Agents.org][4]. There are more than a thousand spiders, robots, crawlers, and browsers.

BTW, since February 2010, Yahoo is out of the picture as far as search is concerned. Yahoo made a deal with Microsoft, which now provides the search results for all of Yahoo's sites.

This doesn't mean that Google will index all of your pages tomorrow, but you will see the count creep up, little by little. Depending on the number of pages and how often you update your website, it could take from a couple of weeks to a couple of months.

Here's my robots.txt file

  Sitemap: http://www.foto-biz.com/sitemap.xml.gz

  User-agent: baiduspider     # ask them to go away, and don't bother indexing me
  Disallow: /
  User-agent: naverbot        # ask them to go away, and don't bother indexing me
  Disallow: /
  User-agent: yeti 
  Disallow: /
  User-agent: asterias
  Disallow: /

  User-agent: Googlebot
  Disallow:                   # allow google
  # Block access to all URLs that include a question mark (?):
  # that is, any URL that begins with your domain name, followed
  # by any string, followed by a question mark
  Disallow: /*?
  Disallow: *RecentChanges$      #anything that ends with RecentChanges
  Disallow: /cookbook/
  Disallow: /pub/
  Disallow: /local/
  Disallow: /docs/
  Disallow: /scripts/
  Disallow: /wikilib.d/
  Disallow: /wiki.d/Site/
  Disallow: /wiki.d/SiteAdmin/
  Disallow: /wiki.d/Site*
  Disallow: /PmWiki/
  Disallow: /Category.GroupFooter
  Allow: /wiki.d/Biz/
  Allow: /wiki.d/Canon/
  Allow: /wiki.d/Foto-biz/
  Allow: /wiki.d/Foto-Biz/
  Allow: /wiki.d/Lightroom/
  Allow: /wiki.d/Main/
  Allow: /wiki.d/Seo/

  User-agent: Googlebot-Image
  Disallow:
  Disallow: /*?
  Disallow: *RecentChanges$
  Disallow: /cookbook/
  Disallow: /pub/
  Disallow: /local/
  Disallow: /docs/
  Disallow: /scripts/
  Disallow: /wikilib.d/
  Disallow: /wiki.d/Site/
  Disallow: /wiki.d/SiteAdmin/
  Disallow: /wiki.d/Site*
  Disallow: /PmWiki/
  Disallow: /Category.GroupFooter
  Allow: /wiki.d/Biz/
  Allow: /wiki.d/Canon/
  Allow: /wiki.d/Foto-biz/
  Allow: /wiki.d/Foto-Biz/
  Allow: /wiki.d/Lightroom/
  Allow: /wiki.d/Main/
  Allow: /wiki.d/Seo/

  User-Agent: Mediapartners-Google
  Disallow:
  Disallow: /*?
  Disallow: *RecentChanges$
  Disallow: /cookbook/
  Disallow: /pub/
  Disallow: /local/
  Disallow: /docs/
  Disallow: /scripts/
  Disallow: /wikilib.d/
  Disallow: /wiki.d/Site/
  Disallow: /wiki.d/SiteAdmin/
  Disallow: /wiki.d/Site*
  Disallow: /PmWiki/
  Disallow: /Category.GroupFooter
  Allow: /wiki.d/Biz/
  Allow: /wiki.d/Canon/
  Allow: /wiki.d/Foto-biz/
  Allow: /wiki.d/Foto-Biz/
  Allow: /wiki.d/Lightroom/
  Allow: /wiki.d/Main/
  Allow: /wiki.d/Seo/

  User-Agent: Adsbot-Google
  Disallow:
  Disallow: /*?
  Disallow: *RecentChanges$
  Disallow: /cookbook/
  Disallow: /pub/
  Disallow: /local/
  Disallow: /docs/
  Disallow: /scripts/
  Disallow: /wikilib.d/
  Disallow: /wiki.d/Site/
  Disallow: /wiki.d/SiteAdmin/
  Disallow: /wiki.d/Site*
  Disallow: /PmWiki/
  Disallow: /Category.GroupFooter
  Allow: /wiki.d/Biz/
  Allow: /wiki.d/Canon/
  Allow: /wiki.d/Foto-biz/
  Allow: /wiki.d/Foto-Biz/
  Allow: /wiki.d/Lightroom/
  Allow: /wiki.d/Main/
  Allow: /wiki.d/Seo/

  User-agent: * 
  Disallow: 
  Disallow: /*?
  Disallow: *RecentChanges$
  Disallow: /wiki.d/Site*
  Disallow: /cookbook/
  Disallow: /pub/
  Disallow: /local/
  Disallow: /docs/
  Disallow: /scripts/
  Disallow: /wikilib.d/
  Disallow: /wiki.d/Site/
  Disallow: /wiki.d/SiteAdmin/
  Disallow: /PmWiki/

  # All of these robots are handled by the previous
  # entry, User-agent: *
  # User-agent: MSNBot       
  # User-agent: Slurp
  # User-agent: Teoma
  # User-agent: twiceler
  # User-agent: Gigabot
  # User-agent: Scrubby
  # User-agent: Robozilla
  # User-agent: Nutch
  # User-agent: ia_archiver
  # User-agent: yahoo-mmcrawler
  # User-agent: psbot
  # User-agent: yahoo-blogs/v3.9
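
The file above relies on two wildcard characters: * for "any string" and a trailing $ for "ends here". To make their meaning concrete, here is a minimal sketch that translates such a pattern into a regular expression. The function names are my own invention for this example. The matching rules follow my reading of how Google describes these wildcards, so treat it as an approximation rather than what any spider actually runs. As far as I can tell, Python's own urllib.robotparser does not interpret these wildcards at all.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern with * and $ into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with "any characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(pattern: str, path: str) -> bool:
    """True if `path` (query string included) matches `pattern` from its start."""
    return robots_pattern_to_regex(pattern).match(path) is not None


# Hypothetical paths, checked against patterns taken from the file above.
print(rule_matches("/*?", "/index.php?n=Main.HomePage"))
print(rule_matches("*RecentChanges$", "/wiki.d/Main/RecentChanges"))
print(rule_matches("/wiki.d/Site*", "/wiki.d/SiteAdmin/AuthUser"))
```

Under these assumptions, /*? catches any URL with a query string, and *RecentChanges$ catches only URLs that end exactly with RecentChanges.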

Please note the following about my robots.txt file:

  1. Each user agent has its own section.
  2. All of my rules are repeated for each user agent, including the wildcard * for everybody else. That section is for the spiders that haven't found a section of their own.
  3. You cannot have a specific section for a user agent and put the common rules in the User-agent: * section. What's in the User-agent: * section will NOT be read by a web spider that has its own section.
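
Repeating the same rules by hand in every section is tedious and easy to get wrong. One way around that is to generate the file from a single list of rules. The sketch below is only an illustration: the function name, the shortened rule list, the spider list, and the www.example.com sitemap URL are all made up for the example.

```python
# Hypothetical rule set and spider list, for illustration only.
SHARED_RULES = [
    ("Disallow", "/cookbook/"),
    ("Disallow", "/pub/"),
    ("Allow", "/wiki.d/Main/"),
]
SPIDERS = ["Googlebot", "Googlebot-Image", "*"]


def build_robots_txt(sitemap_url, spiders, rules):
    """Return robots.txt text with the same rules repeated in every section."""
    lines = [f"Sitemap: {sitemap_url}", ""]
    for spider in spiders:
        lines.append(f"User-agent: {spider}")
        lines.extend(f"{directive}: {path}" for directive, path in rules)
        lines.append("")  # a blank line ends each section
    return "\n".join(lines)


robots_txt = build_robots_txt(
    "http://www.example.com/sitemap.xml", SPIDERS, SHARED_RULES
)
print(robots_txt)
```

With this approach a new rule is added in one place and ends up in every section, so a spider with its own section never misses it.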

A week ago, Google had indexed 200 pages; by Sunday it had indexed 400 pages, and on Monday morning it was up to 422 pages.