PmWiki versions 0.6.0 and later already have built-in handling for most robots. This information is provided for earlier releases of PmWiki.
One day GoogleBot will visit your newly created PmWiki site and index the whole setup from A to Z, literally.
"That's fine", I hear you say, "let GoogleBot index the whole lot and make my site the number one site in the Universe." Well, think again.
What you do want is for GoogleBot to index the 'regular' pages, such as www.mysite.com/pmwiki/myfirstpage and, of course, www.mysite.com/pmwiki/mysecondpage. This way, when someone types a word into Google that appears on those pages, your PmWiki page will show up in the results.
What you don't want is a complete archive of every www.mysite.com/pmwiki/myfirstpage?action=edit, or any other PmWiki command, directly accessible through Google. These commands are triggered the moment a visitor clicks the search result, so instead of your page the visitor ends up in an edit screen or a change log.
There are two ways to prevent this scenario. The first is easy and consists of creating a robots.txt file in the root of your website (i.e. www.mysite.com/robots.txt). The other is to emit robots meta-tags from your local.php script.
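For the meta-tag approach, a minimal local.php sketch might look like the block below. It assumes the release you run exposes an $action global (the current ?action= value) and an $HTMLHeaderFmt array that feeds the HTML head, as later PmWiki versions do; treat both names as assumptions and check them against your own release.

<?php
  # Sketch only: $action and $HTMLHeaderFmt may not exist in very old releases.
  # $action is 'browse' for normal page views, 'edit', 'diff', etc. otherwise.
  if ($action != 'browse')
    $HTMLHeaderFmt['robots'] =
      "<meta name='robots' content='noindex,nofollow' />";
?>

The tag asks well-behaved robots not to index the page and not to follow any links from it, which is exactly what you want for edit screens and change logs.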
The robots.txt approach
In 1993 and 1994 there were occasions where robots visited WWW servers where they weren't welcome, for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting). The method used to exclude robots from a server is to create a file on the server which specifies an access policy for robots [http://www.robotstxt.org/wc/norobots.html]. This file must be accessible via HTTP at the local URL "/robots.txt".
To prevent GoogleBot from accessing some of the unwanted page links, you would put the following statements in the robots.txt file:
User-agent: Googlebot
Disallow: */main/allrecentchanges$
Disallow: */pmwiki*
Disallow: */search*
Disallow: *recentchanges*
Disallow: *action=*
However, PmWiki now by default includes special meta-information in the pages it returns for edit and diff actions, instructing search engines neither to index the page nor to follow any links in it. This removes the need for some of the lines in the robots.txt file above.
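The meta-information in question is a robots meta tag in the HTML head of the returned page, roughly along these lines (the exact markup may vary between versions):

<meta name="robots" content="noindex,nofollow" />

This matches what the local.php sketch near the top of this page adds by hand for older releases.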