PmWiki versions 0.6.0 and later already have built-in handling for most robots. This information is provided for earlier releases of PmWiki.
One day GoogleBot will visit your newly created PmWiki site and index the whole setup from A to Z, literally.
"That's fine", I hear you say, "let GoogleBot index the whole lot and make my site the number one site in the Universe." Well, think again.
What you do want is for GoogleBot to index the 'regular' pages, such as www.mysite.com/pmwiki/myfirstpage and, of course, www.mysite.com/pmwiki/mysecondpage. This way, when someone types a word into Google that appears on those pages, your PmWiki page will show up in the results.
What you don't want is a complete archive of every www.mysite.com/pmwiki/myfirstpage?action=edit, or any other PmWiki command, directly accessible through Google. These commands are triggered the moment a visitor clicks the search result, so instead of your page the visitor ends up in an edit screen or a change log.
There are two ways to prevent this scenario. The first is easy and consists of creating a robots.txt file in the root of your website (i.e. www.mysite.com/robots.txt). The other is to emit robots meta-tags from your local.php script.
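For the meta-tag approach, a minimal local.php sketch might look like the block below. It assumes the release you run exposes an $action global (the current ?action= value) and an $HTMLHeaderFmt array that feeds the HTML head, as later PmWiki versions do; treat both names as assumptions and check them against your own release.

<?php
  # Sketch only: $action and $HTMLHeaderFmt may not exist in very old releases.
  # $action is 'browse' for normal page views, 'edit', 'diff', etc. otherwise.
  if ($action != 'browse')
    $HTMLHeaderFmt['robots'] =
      "<meta name='robots' content='noindex,nofollow' />";
?>

The tag asks well-behaved robots not to index the page and not to follow any links from it, which is exactly what you want for edit screens and change logs.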
The robots.txt approach
In 1993 and 1994 there were occasions where robots visited WWW servers where they weren't welcome, for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting). The method used to exclude robots from a server is to create a file on the server which specifies an access policy for robots [http://www.robotstxt.org/wc/norobots.html]. This file must be accessible via HTTP at the local URL "/robots.txt".
To prevent GoogleBot from accessing some of the unwanted page links, you would put the following statements in the robots.txt file:
User-agent: Googlebot
Disallow: */main/allrecentchanges$
Disallow: */pmwiki*
Disallow: */search*
Disallow: *recentchanges*
Disallow: *action=*
However, PmWiki now by default includes special meta-information in the pages it returns for edit and diff actions, instructing search engines neither to index the page nor to follow any links in it. This removes the need for some of the lines in the robots.txt file above.
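The meta-information in question is a robots meta tag in the HTML head of the returned page, roughly along these lines (the exact markup may vary between versions):

<meta name="robots" content="noindex,nofollow" />

This matches what the local.php sketch near the top of this page adds by hand for older releases.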