About AWStats Robots

AWStats can detect common internet robots and exclude them from the traffic statistics for your website. It ships with an excellent collection of robot definitions, but if you have a custom robot doing something on your site, or a new robot crops up that you would like to detect, you can add it with a small change to the robots.pm file.

Getting the Robot Definition

In your statistics, robots usually show up under Unknown for stats such as OS and Web Browser. If you go to the detail page for those Unknown entries, you should see something like this:

User agent (4) Last Visit
Hobbit_bbtest-net/4.2.3 01 Sep 2009 - 04:01
Mozilla/4.0 01 Sep 2009 - 03:21
WinHttp 01 Sep 2009 - 02:56
BlackBerry8330/4.3.0_Profile/MIDP-2.0_Configuration/CLDC-1.1_VendorID/105 01 Sep 2009 - 01:21

Some of these may be valid user agents that awstats cannot yet detect, such as the BlackBerry browser. The robot I would like to filter out is Hobbit_bbtest-net/4.2.3; it belongs to the Hobbit (now Xymon) network monitoring package, which checks the website quite often to make sure it is still running. It is important to check this page first to find out exactly how the robot identifies itself before trying to block it.
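You can also confirm the exact user agent string by grepping the raw access log. This is just a quick sketch, assuming your Apache log lives at the path used later in this article; adjust it to your own setup:

[root@vps ~]# grep -i hobbit /var/log/httpd/www.cornempire.net-access_log | tail -1

In the common combined log format, the last quoted field of each line is the user agent exactly as the robot sent it.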

robots.pm

In order to detect our newly found robot, we must edit the robots.pm file. If everything was installed in the default locations, this file should be located at /usr/local/awstats/wwwroot/cgi-bin/lib/robots.pm, and it starts out something like this:

robots.pm
# AWSTATS ROBOTS DATABASE
#-------------------------------------------------------
# If you want to add robots to extend AWStats database detection capabilities,
# you must add an entry in RobotsSearchIDOrder_listx and RobotsHashIDLib.
#-------------------------------------------------------
# $Revision: 1.53 $ - $Author: eldy $ - $Date: 2008/11/15 14:58:01 $
 
# 2005-08-19 Sean Carlos http://www.antezeta.com/awstats.html
#              added dipsie (not tested with real data).
#              added DomainsDB.net http://domainsdb.net/
........

The first few lines tell you how to add robots. But since that is a little cryptic (it took me 4 tries to get it to work), I thought I'd outline it in a bit more detail here.

Adding the Robot

As the file above indicates, we only need to make two changes to add our robot. The first is to the @RobotsSearchIDOrder_list1 array, which looks something like this:

robots.pm
# RobotsSearchIDOrder
# It contains all matching criteria to search for in log fields. This list is
# used to know in which order to search Robot IDs.
# Most frequent ones are in list1, used when LevelForRobotsDetection is 1 or more
# Minor robots are in list2, used when LevelForRobotsDetection is 2 or more
# Note: Robots IDs are in lower case, '_', ' ' and '+' are changed into '[_+ ]' and are quoted.
#-------------------------------------------------------
@RobotsSearchIDOrder_list1 = (
# Common robots (In robot file)
'appie',
'architext',
'jeeves',
'bjaaland',
'contentmatch',
'ferret',
'googlebot',
'google\-sitemaps',
'gulliver',
'virus[_+ ]detector',           # Must be before harvest
'harvest',
'htdig',
'linkwalker',
'lilina',
'lycos[_+ ]',
'moget',
'muscatferret',
'myweb',
'nomad',
'scooter',
'slurp',
'^voyager\/',
'weblayers',
# Common robots (Not in robot file)
'antibot',
'bruinbot',
'digout4u',
'echo!',
'fast\-webcrawler',
'ia_archiver\-web\.archive\.org', # Must be before ia_archiver to avoid confusion with alexa
'ia_archiver',
'jennybot',
'mercator',
'netcraft',
'msnbot\-media',
'msnbot',
'petersnews',
'relevantnoise\.com',
'unlost_web_crawler',
'voila',
'webbase',
'webcollage',
'cfetch',
'zyborg',       # Must be before wisenut 
'wisenutbot'
);
@RobotsSearchIDOrder_list2 = (

I'm not so sure the order within the list matters, but the entry belongs in list 1: as the comments above explain, list 1 is always searched, while list 2 is only used when LevelForRobotsDetection is 2 or more. I figured I would put it at the end so that it doesn't interfere with any of the existing patterns. I simply added the entry like this:

robots.pm
'webbase',
'webcollage',
'cfetch',
'zyborg',       # Must be before wisenut 
'wisenutbot',
'Hobbit[_+ ]bbtest\-net\/4\.2\.3'   # Hobbit/Xymon monitoring check
);
@RobotsSearchIDOrder_list2 = (

Because the identifier has a space in it, between Hobbit and bbtest, I encoded the space as [_+ ] as indicated at the top of the file. Because these entries work like regular expressions, I also had to escape the special characters with a backslash: the -, the /, and the dots in the version number (an unescaped . matches any character, not just a literal dot). The file's comments also say robot IDs are in lower case; matching seems to be case-insensitive (my mixed-case entry worked), but lower case is probably the safer convention. To find out more about regular expressions, and which characters need escaping, the perlre documentation is a good place to start.
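If you want to check a pattern before committing it to the file, a Perl one-liner from the shell works well. This only sanity-checks the regular expression against the user agent string, not awstats' full matching logic:

[root@vps ~]# perl -e 'print "Hobbit bbtest-net/4.2.3" =~ /Hobbit[_+ ]bbtest\-net\/4\.2\.3/ ? "match\n" : "no match\n"'
match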

The second change is in the %RobotsHashIDLib hash. Look down the file for something like this:

robots.pm
# RobotsHashIDLib
# List of robots names ('robot id','robot clear text')
#-------------------------------------------------------
%RobotsHashIDLib   = (

And then scroll down a little farther to:

robots.pm
'wwwc','WWWC Ver 0.2.5',
'wz101','WebZinger',
'xget','XGET',
# Other robots reported by users

Add our new entry right after the # Other robots reported by users comment, so that it looks like this:

robots.pm
'wwwc','WWWC Ver 0.2.5',
'wz101','WebZinger',
'xget','XGET',
# Other robots reported by users
'Hobbit[_+ ]bbtest\-net\/4\.2\.3','xymon',

We use the same detection pattern as the key, and then add a plain-text descriptor of the robot, which in this case I just set to xymon, since that is the name of the network monitoring package.

Save the changes to the file, and when awstats next runs, it will detect your robot and stop placing it under Unknown.
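Since robots.pm is a Perl module, it is worth making sure it still compiles after editing; a stray quote or missing comma will break awstats entirely. perl -c compiles the file without running it:

[root@vps /usr/local/awstats/wwwroot/cgi-bin/lib]# perl -c robots.pm
robots.pm syntax OK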

Updating Stats

This change will only affect new runs of awstats, not the statistics you have already collected. It will also affect every awstats profile running on the server, since they all share the same robots.pm.

If you want to update your old stats to filter out this new robot, you need to dive into the DataDir directory, which is usually set to /var/lib/awstats/. Once there, you can delete1) the stats files for the months you want to rerun. The files are named like this:

awstats{month}{year}.{your profilename}.txt
or for a real example:
awstats052009.www.cornempire.net.txt

Deleting one of these files deletes the stats for that month, so you should only do this for months that you still have the original log files for, or whose stats you don't mind losing.
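As the footnote suggests, moving the files aside is safer than deleting them outright, since you can put them back if the rerun goes wrong. For example, assuming the default DataDir and my profile name:

[root@vps ~]# mkdir /var/lib/awstats/backup
[root@vps ~]# mv /var/lib/awstats/awstats082009.www.cornempire.net.txt /var/lib/awstats/backup/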

When I made this change on my web server, I had stats for July and August. However, because logrotate was configured to rotate weekly and keep only 4 weeks of logs, I only had enough logs to go back one month. So I deleted August's file and ran awstats again over those old logs. Keep in mind that the logs must be imported in order, from oldest to newest.

To run awstats against an old log file (without changing your configuration), you can run the command like this:

[root@vps /usr/local/awstats/wwwroot/cgi-bin/]# ./awstats.pl -config=www.cornempire.net -LogFile=/var/log/httpd/www.cornempire.net-access_log.2

The -LogFile= switch temporarily overrides the LogFile entry in the config file, which lets you import a batch of old log files from the command line.
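Because the imports have to happen oldest-first, a small shell loop keeps the order straight. This is a sketch assuming four uncompressed weekly rotations, numbered so that the highest suffix is the oldest; adjust the names and count to match what your logrotate setup produces:

[root@vps /usr/local/awstats/wwwroot/cgi-bin/]# for i in 4 3 2 1; do ./awstats.pl -config=www.cornempire.net -LogFile=/var/log/httpd/www.cornempire.net-access_log.$i; done

Once the rotated logs are in, a normal awstats run (without -LogFile) will pick up the current access_log.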

Update: Jan. 21, 2014

I recently modified my install to detect iOS, and thought I would reapply this patch. I did it a bit differently this time. The process is the same, but instead of using the detailed detection string, I used a more generic one, which should keep it working when the monitor's version number changes. Also, we now have a newer version of Xymon deployed, which identifies itself as xymon and not hobbit.

So instead of adding:

'Hobbit[_+ ]bbtest\-net\/4\.2\.3'

I added:

'xymon'

And in the second part, added:

'xymon','xymon'
1) or better yet, move them out of the way so you can restore them later if needed