Readme for analog2.0

Introduction

This Readme describes analog2.0. For the latest version of analog, see the analog home page.

This program analyses logfiles from WWW servers. It works on most Unix, DOS, Mac and VMS machines. It is designed to be fast and to produce attractive statistics. For more details, see the

For examples of the output see

Sorry about the length of this Readme. It includes documentation on everything the program can do. It's not as complicated as it looks, and you don't have to read all of it before using the program anyway!

This program is freeware, but its use is covered by a licence which is at the bottom of this file. You must agree to the terms of the licence before using the program.


What's new?

This section describes the main new features in each version of analog. If you are using analog for the first time, you can skip this section. If you are upgrading from an old version of analog, you might also like to look at the file Update (which should have come with the program) to see what options have been deleted or have changed in an incompatible way.
2.0 (10-Feb-97)
New native Win32 version.
Wildcards allowed in filenames on Mac.
Ignores browser "-".
1.93beta (18-Jan-97)
New commands BROWALIAS, CONFIGFILE and PROGRESSFREQ.
Form program can now call configuration files.
Form program now uses the default choices if none specified.
Domain report prints correctly in preformatted output.
Specifying +1 and +V2 doesn't crash the program.
+v reports dates correctly.
Trailing dots on hostnames removed.
Second argument to LOGFILE command can't be obliterated by /../
1.92beta (08-Oct-96)
DNS lookups added on Mac.
Netpresenz format understood on Mac.
New languages: Spanish, Italian and Danish.
Extra information when debugging turned on.
*.htm are now pages on all machines.
A few small bugs fixed.
1.91beta4 (13-Jul-96)
Cache file now includes page request information.
DNS bug fixed.
New command DNSHASHSIZE.
Bug in browser reports fixed.
1.91beta3 (09-Jul-96)
BSD/OS compilation bug believed fixed.
Fixed HOSTALIAS which I broke yesterday.
DNS bug (causing too many lookups) identified, although not yet fixed.
1.91beta2 (08-Jul-96)
Some bug fixes (including: HOSTEXCLUDE and CASE INSENSITIVE didn't work properly; selecting "no links" failed on the form; less fussy about what can appear on the form).
Mac version no longer includes source code, so is much shorter.
1.91beta1 (05-Jul-96)
Now DNS code doesn't look up a name twice, even if one is a failed request.
1.91beta (05-Jul-96)
Will now output in any of several languages.
Preformatted output introduced.
New File Type Report.
Can limit the number of rows in the time reports.
Number of requests for pages (as opposed to raw requests) now calculated throughout.
DNS lookup returns, with caching across runs.
Logfiles can include wildcards.
Wildcards can include multiple *'s.
Can process case insensitive logfiles.
OUTPUTALIAS commands introduced.
New commands to specify exactly what is included, and what linked, in the request report and referrer report.
FILEALIAS a a and FILEALIAS a b; FILEALIAS b c now work.
New ALLOW options to cancel INCLUDES.
REPSEPCHAR and DECPOINT introduced.
DIRSUFFIX introduced.
Debugging reports number of corrupt lines in other logs.
Hash sizes can now be allocated at run time.
stdin can now be used for any input file, but not for two.
Macintosh version now quits automatically if no warnings have been issued.
Form interface made more secure.
"Mozilla (compatible)" separated out in Browser Summary.
Major internal changes should improve speed.
Code for non-Unix platforms integrated into main code.
"Referrer" spelled correctly.
Licence introduced.
Update file introduced.
Readme updated to include non-Unix instructions.
(19-Apr-96)
First Mac version.
1.9beta6
Two bug fixes (number of bytes was incorrectly reported in some cases, and -v would overwrite the OUTFILE).
Documentation improved.
1.9beta5
More bug fixes...
1.9beta4
One important bug fix (I broke GRAPHICAL OFF in 1.9beta3).
New form cgi options: ch, gr and ou=3.
Code shortened.
(05-Mar-96)
First DOS version.
1.9beta3
Mainly bug fixes and improved documentation.
Browser and referer reports now include failed requests.
The WARNINGS option can now be specified on the form.
1.9beta2
Small bug fixes
1.9beta (06-Feb-96)
Lots of changes. As far as possible the options are backwards compatible with previous versions, but some changes have been necessary.

1.2.6
Minor bug fix; will only affect those with corrupt logfiles.
1.2.5
Minor bug fix for weekly report.
1.2.4
Patch for Spyglass server logfile format.
1.2.3
A couple of bug fixes (wild subdomains sometimes caused crashes).
-v option now gives the version number.
1.2.2
Patch for proxy servers: http:// not translated to http:/
1.2 (11-Nov-95)
Can configure columns in reports to give percentage requests and number of bytes.
Wild subdomains (e.g., *.com).
Nameless subdomains.
Subdomains now listed in alphabetical order.
Proper support for numerical hostnames in HOSTIGNORE, HOSTONLY, SUBDOMAIN and alphabetical sorting.
New BASEURL command allowing statistics to be displayed on other servers.
Output always says how things are sorted.
"Last 7 days" now behaves sensibly with TO.
Filenames containing /../, /./ and // translated.
Header and footer options removed from form (for security reasons).
1.1 (02-Oct-95)
Form interface introduced.
ASCII output now possible as well as HTML.
Output file can now be specified in the configuration file.
FROM and TO commands more powerful.
DEBUG and BACKGROUND introduced.
One bug fix: alphabetical sorting doesn't now swap some hostnames.
List of primes included in distribution.
1.0 (12-Sep-95)
Only minor changes since 0.94beta.
0.94beta (30-Aug-95)
New configuration variables SEPCHAR and REPORTORDER.
New configuration commands WITHARGS and WITHOUTARGS.
New commandline options +-A and +-x. (Config.: ALL and GENERAL).
Logfile entries with - as the return code are now regarded as successes, not corrupt entries.
Fixed bugs in host report when aliases or numerical hosts are present.
Documentation rewritten.
0.93beta (27-Jul-95)
Approximate hostname counting now possible in fixed memory.
New configuration commands ISPAGE and ISNOTPAGE.
New commandline option -v.
New configuration command WEEKBEGINSON.
Proper error message when memory exceeded.
Program split into several files.
0.92beta (11-Jul-95)
New reports introduced: hostname, full daily, and weekly.
FROM and TO commands introduced.
Header and footer files introduced.
More helpful warning messages.
Ability to read configuration instructions from stdin.
Subdomain commands moved from domains file to configuration file.
Makefile provided.
0.91beta (04-Jul-95)
Configuration file introduced, enabling many new options.
Some bug fixes and speed improvements.
Ability to print "top n" reports (rather than "everything higher than n").
Request report can print only pages.
Ability to try and resolve numerical addresses.
Now less fussy about the format of the domains file.
Logo added.
Readme converted to HTML.
0.9beta
More speed improvements, and some bug fixes.
Introduced -u option.
Introduced subdomain analysis.
Included "not modified" replies as successes, not redirects.
First public release at 0.9beta3. (29-Jun-95)
0.89beta (21-Jun-95)
Commandline arguments.
Efficiency improvements.
Host count and "last 7 day" statistics.
0.8beta (14-Jun-95)
Initial program, just default options.

Compiling and running the program

This section describes how to compile analog on Unix and VMS. If you've got the Mac or DOS version, the program comes already compiled, so you can skip to the next section.

If you want to get on with trying out the program straight away, you can leave most of this Readme until another time. The one thing you need to do is to look at the file analhead.h. Everything in it is a user-settable option, but most of the options you can leave alone for the moment. You will probably want to check the first few options in the file, but you can even leave most of them until later.

Next you must move the images that came with the analog program (in the directory images) into the IMAGEDIR specified in analhead.h.

When you have done that, compile the program by typing

make
under Unix, or
MMS
using MMS under VMS. If that doesn't work, and you're on Unix, have a look in the Makefile to see if there's anything that you need to change to suit your configuration, and try again. In particular, Solaris 2 users need to change the LIBS= line. If you haven't got gcc, you will need to change the compiler - try acc or cc instead. If it still doesn't compile, try DEFS=-DNODNS to ignore the DNS lookup code.

Then just type

analog
to run the program. To send the output to a particular file instead of to the screen, type, e.g.,
analog > outfile.html
(This assumes that . is in your $PATH, but it should be. Otherwise try ./analog instead of just analog).

Customising analog

Pretty soon you will want to customise the output of analog to your personal preferences. How to do that is explained in this section. There are lots of options, so this section is rather long. But you won't want all the options straight away, so don't panic!

Many options can be set in the file analhead.h. If you're on Unix or VMS (or compiling your own version on another platform), these can be changed before compiling the program. They are explained in that file, so they will not be documented again here.

Otherwise, analog takes its options from configuration files. Many of the configuration commands also have abbreviations as commandline arguments. Don't get configuration commands mixed up with #define statements in the header file! (There aren't any commandline arguments on the Mac; you have to use the configuration commands. Ignore all talk of commandline arguments below).

So, for example, the configuration command

DAILY OFF
tells analog not to include a daily summary in the output. But this can also be specified by the command
analog -d
because the -d option is an abbreviation for DAILY OFF.

In fact any configuration command can be specified on the commandline by means of the +C option; you could write

analog +C"DAILY OFF"
(This is most useful for running analog from a script or cron job).

Analog comes with a small configuration file to get you started. To specify a configuration file, you use the commandline argument +g followed by the name of the file. (Mac users have no commandline arguments, so can only use the default configuration file). For example,

analog +gextra.conf
tells analog to read configuration commands from the file extra.conf. (Note that there is no space between +g and the filename; this is true of all commandline arguments). (You can also specify standard input as the configuration file by the option +g-).

The configuration file can contain several commands on separate lines; any text after a hash (#) on a line is ignored as a comment. So the following is an example of a configuration file.

DAILY      OFF   # We don't want a daily summary
FULLDAILY  ON    # We want a full daily report instead 
An argument to a command can be placed in single or double quotes, and it must be if the argument contains a hash or a space. Note that configuration commands are not generally the same as those in analhead.h, although many have the same name.
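
The quoting and comment rules just described can be mimicked in a few lines of Python. This is only an illustrative sketch, not analog's own parser: the function name split_config_line is made up for the example, and it assumes that backslashes have no special meaning.

```python
import shlex

def split_config_line(line):
    """Split one configuration line into words.

    Assumptions (not taken from analog's source): a hash outside quotes
    starts a comment, single and double quotes group words, and
    backslashes are ordinary characters.
    """
    lexer = shlex.shlex(line, posix=True)
    lexer.whitespace_split = True   # split on whitespace only
    lexer.commenters = "#"          # text after an unquoted hash is dropped
    lexer.escape = ""               # treat backslashes literally
    return list(lexer)

print(split_config_line("DAILY      OFF   # We don't want a daily summary"))
print(split_config_line("MARKCHAR '#'  # put in quotes so that it isn't a comment"))
```

The second call shows why the quotes matter: the quoted hash survives as an argument, while the unquoted one starts the comment.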

Commandline arguments are read in the order in which they occur, and configuration files are read when the +g argument is reached. If commands conflict, later commands override earlier ones, so the order does matter.

There are also two special configuration files which can be specified in analhead.h. The default configuration file is run before all other configuration files. You can put in there configuration commands that you normally want to include but which you can override. You can stop analog running the default configuration file by the commandline option -G.

The mandatory configuration file is run after all other configuration commands have been read, and overrides them all. If the mandatory configuration file cannot be found, the program exits immediately. This can be used by system administrators to stop users analysing certain files or producing certain reports, for example. (Note, however, that the only way to stop it completely is to deny users read access to the logfile. Otherwise there is nothing to stop them analysing it by another copy of analog or another program).
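
The order of precedence described above (default file first, then your own commands in the order given, then the mandatory file) can be pictured with a small Python sketch. The function name and the (command, value) representation are invented for the illustration, and commands that accumulate rather than override are not modelled.

```python
def effective_settings(default_cmds, user_cmds, mandatory_cmds):
    """Apply (command, value) pairs in the order described above.

    Later commands override earlier ones, so the mandatory file,
    applied last, always wins.
    """
    settings = {}
    for command, value in [*default_cmds, *user_cmds, *mandatory_cmds]:
        settings[command] = value
    return settings

result = effective_settings(
    default_cmds=[("DAILY", "ON"), ("GENERAL", "ON")],
    user_cmds=[("DAILY", "OFF"), ("FULLDAILY", "ON")],
    mandatory_cmds=[("FULLDAILY", "OFF")],
)
print(result)
```

Here the user's DAILY OFF beats the default, but the user's FULLDAILY ON is overruled by the mandatory file.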

If this is all a bit confusing, just run

analog -v [other options]
That will tell you what the values of all the variables will be, based on analhead.h, the configuration options and the commandline options.

We shall now look at all the configuration commands and their commandline equivalents under the following headings. There is a summary list of all of them in the reference section.


General Summary

Program started at Mon-26-Jun-1995 17:09 local time.
Analysed requests from Thu-28-Jul-1994 20:31 to Mon-26-Jun-1995 17:09 (332.8 days).
Total successful requests: 368 063 (12 872)
Average successful requests per day: 1 219 (2 121)
Total successful requests for pages: 142 422 (4 971)
Total failed requests: 4 089 (139)
Total redirected requests: 35 277 (1 838)
Number of distinct files requested: 966 (336)
Number of distinct hosts served: 28 589 (1 589)
Number of new hosts served in last 7 days: 1 037
Corrupt logfile entries: 869
Total data transferred: 1 766 Mbytes (83 743 kbytes)
Average data transferred per day: 5 415 kbytes (11 963 kbytes)
(Figures in parentheses refer to the last 7 days).

See the Glossary for the meaning of these data.

The general summary can be turned off by the command

GENERAL OFF
(or the commandline argument -x) or on by GENERAL ON (or +x). If the general summary is off, all the `Go To' links in the output are also omitted.

The figures in parentheses refer to the last 7 days. They can be turned on and off with

LASTSEVEN ON    # or OFF
or with the commandline arguments +7 and -7. Note that the last 7 days refers to the last 7 days before the program is run, not before the last entry in the logfile. (If a TO command is specified, however, the last 7 days will be until that date).
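
To make the rule concrete, here is a rough Python sketch of the test for whether a request counts towards the last 7 days. The function name is hypothetical, and treating the window as exactly 7 x 24 hours is an assumption; analog's own boundary handling may differ.

```python
from datetime import datetime, timedelta

def in_last_seven_days(entry_time, run_time, to_time=None):
    """Return True if a log entry falls in the 'last 7 days' window.

    The window ends at the TO time if one is given, otherwise at the
    time the program is run.
    """
    end = to_time if to_time is not None else run_time
    return end - timedelta(days=7) < entry_time <= end

run = datetime(1995, 6, 26, 17, 9)
print(in_last_seven_days(datetime(1995, 6, 22, 12, 0), run))
print(in_last_seven_days(datetime(1995, 6, 1, 12, 0), run))
print(in_last_seven_days(datetime(1995, 6, 1, 12, 0), run, to_time=datetime(1995, 6, 5)))
```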

Counting hosts is something which can take a lot of memory (we have to remember the name of every host that has accessed our server). If memory is a problem, you can turn the host counting off with the commandline option -s or the configuration command

COUNTHOSTS OFF
Alternatively, you can do an approximate host count in a fixed (pre-specified) amount of memory. You do this by using +ss or
COUNTHOSTS APPROX
and you can specify the amount of memory to be used by
APPROXHOSTSIZE 100000  # or whatever number, in bytes
About 3 bytes per host seems to give a very good estimate. Even 1 byte per host will give a fair estimate. If statistics for the last 7 days are on, twice this amount of space will be used. COUNTHOSTS APPROX uses much less memory than an accurate count, but can be a bit slower. If the host report is on, COUNTHOSTS will always be turned on automatically, so to turn COUNTHOSTS to off or approximate, you also need to turn the host report off.
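
If you are curious how a count can be estimated in a fixed amount of memory at all, the following Python sketch shows one standard technique, linear counting with a bitmap. It is chosen purely for illustration; I am not claiming it is the method analog uses, and the function name and sample hostnames are made up.

```python
import hashlib
import math

def approx_host_count(hosts, table_bytes):
    """Estimate the number of distinct hosts using a fixed-size bitmap.

    Each host sets one bit chosen by a hash; the estimate is derived
    from the fraction of bits still unset (linear counting).
    """
    num_bits = table_bytes * 8
    bitmap = bytearray(table_bytes)
    for host in hosts:
        digest = hashlib.sha256(host.encode("utf-8")).digest()
        bit = int.from_bytes(digest[:8], "big") % num_bits
        bitmap[bit // 8] |= 1 << (bit % 8)
    zero_bits = num_bits - sum(bin(byte).count("1") for byte in bitmap)
    if zero_bits == 0:
        raise ValueError("table is full; use a larger table_bytes")
    return round(-num_bits * math.log(zero_bits / num_bits))

# Hypothetical data: 1000 distinct hosts, each seen three times.
hosts = [f"host{i}.example.com" for i in range(1000)] * 3
print(approx_host_count(hosts, table_bytes=3000))
```

The memory used depends only on table_bytes, not on how many hosts appear, which is the point of an approximate count.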

Time reports

Each unit (+) represents 4 000 requests, or part thereof.


   month:  #reqs: 
--------  ------  
Nov 1995: 119865: ++++++++++++++++++++++++++++++
Dec 1995: 121214: +++++++++++++++++++++++++++++++
Jan 1996: 144960: +++++++++++++++++++++++++++++++++++++

The above display is of a monthly report. In this category, we also have the weekly report (one line for each week), daily summary (one line for Sundays, one for Mondays etc.), daily report (one line for each day ever), hourly summary (one line for midnight, one for 1am etc.) and hourly report (one line for each hour ever).

The following configuration commands show how to turn these reports on and off.

MONTHLY ON
WEEKLY  ON
DAILY   ON
FULLDAILY OFF
HOURLY ON
FULLHOURLY OFF
You can also use the corresponding commandline arguments +m, +W, +d, -D, +h, -H (use + to turn the corresponding reports on, - to turn them off).

You can specify the maximum number of rows in one of these reports by a line like

FULLHOURROWS 72   # restrict the hourly report to the last 72 hours
MONTHROWS 0       # 0 means no restriction
The other commands are WEEKROWS and FULLDAYROWS.

You should use these reports sensitively. If your output is 500k long, people won't be able to download it. In particular, you probably don't want a daily report or hourly report unless you have restricted it to just a few rows.

The graphs above are designed to produce coloured bars on graphical browsers and ASCII graphs on non-graphical browsers. They don't use tables or image-stretching properties, so should work on any browser. However, you can produce plain ASCII graphs instead by the command

GRAPHICAL OFF    # or ON to turn it back on again
This has the advantage of producing smaller output which does not require any images to be downloaded.

The graphs rely on having the images distributed with analog available in the directory IMAGEDIR specified in analhead.h; or you can override that choice with a command like

IMAGEDIR /Images/

You can change the character used in the graphs on non-graphical terminals by means of a command such as

MARKCHAR '#'  # put in quotes so that it isn't a comment

The graphs can be plotted by bytes transferred or requests for pages instead of by raw requests. This can be done by means of commands like

MONTHGRAPH B    # by bytes
WEEKGRAPH  R    # by requests
DAYGRAPH   P    # by page requests
There are also commands FULLDAYGRAPH, HOURGRAPH and FULLHOURGRAPH. Alternatively, you can add the letter after the relevant commandline argument; for example, +hB to turn on the hourly summary with a graph plotted by bytes. To specify what counts as a page, see the ISPAGE command below. These commands do not change which columns are displayed on each line, so if you use these commands, you might also want to use the COLS commands explained below.

You can display the graphs backwards (with most recent requests at the top) by means of commands like

MONTHLYBACK ON  # or OFF
There are also the commands WEEKLYBACK, FULLDAILYBACK and FULLHOURLYBACK. The hourly summary and daily summary cannot be displayed backwards. I find it confusing to have some of the reports going backwards and some forwards, so you can also use
ALLBACK ON  # or OFF
to change all four of the reports to backwards or forwards together.

You can specify which columns appear in the various reports in which order. The above example showed the number of requests being given. You can also have the percentage of the requests, the number and percentage of bytes, and the number and percentage of requests for pages. For example, the command

MONTHCOLS RBbrpP
tells analog to include in the monthly report columns for number of requests (R), number of bytes (B), percentage of bytes (b), percentage of requests (r), percentage of page requests (p) and number of page requests (P) in that order. The other commands are WEEKCOLS, DAYCOLS, FULLDAYCOLS, HOURCOLS and FULLHOURCOLS. If you use these commands, you might also want to use the GRAPH commands explained above.
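
As a quick reference, the column letters can be written out as a lookup table. The Python below is just a sketch for reading a COLS argument; the names are invented, and raising an error on an unknown letter is my choice for the example, not a statement of what analog does.

```python
COLUMN_MEANINGS = {
    "R": "number of requests",
    "r": "percentage of requests",
    "B": "number of bytes",
    "b": "percentage of bytes",
    "P": "number of page requests",
    "p": "percentage of page requests",
}

def describe_columns(spec):
    """Expand a COLS argument such as 'RBbrpP' into column descriptions."""
    unknown = [letter for letter in spec if letter not in COLUMN_MEANINGS]
    if unknown:
        raise ValueError(f"unknown column letters: {unknown}")
    return [COLUMN_MEANINGS[letter] for letter in spec]

for description in describe_columns("RBbrpP"):
    print(description)
```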

For some reports, analog needs to know where weeks begin and end. You can specify

WEEKBEGINSON WEDNESDAY
to change it to Wednesday, for example. (I guess Sunday or Monday is more likely).

In the graphs, analog will choose the value of the unit (+) automatically based on the length of the largest bar and the width of the page. You can specify the page width with, for example,

PAGEWIDTH 70
or the commandline option +w70. (I find about 65 works well). (Note that the PAGEWIDTH may not be strictly obeyed with GRAPHICAL ON, as the graphics are measured in pixels not characters). Occasionally you may want to specify the value of + yourself (for example, to make it the same as on some other page). You can do this by a command like
MONTHLYUNIT 1000
Setting it to 0 makes analog choose it automatically again. Of course, the other reports have WEEKLYUNIT, DAILYUNIT, FULLDAILYUNIT, HOURLYUNIT and FULLHOURLYUNIT.
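
The relationship between the unit and the length of a bar is simple enough to sketch in Python. This assumes that "or part thereof" means rounding up; the function name is hypothetical, and the automatic choice of unit (the 0 setting) is not modelled.

```python
import math

def render_bar(count, unit, markchar="+"):
    """Draw one bar: one mark per unit 'or part thereof' (rounded up)."""
    if unit <= 0:
        raise ValueError("unit must be positive in this sketch")
    return markchar * math.ceil(count / unit)

for month, reqs in [("Nov 1995", 119865), ("Dec 1995", 121214), ("Jan 1996", 144960)]:
    print(f"{month}: {reqs:6d}: {render_bar(reqs, 4000)}")
```

With a unit of 4 000 the three months get bars of 30, 31 and 37 marks; halving the unit would double the bar lengths.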

Other reports

This section discusses the following reports.

Domain report

  #reqs :  %bytes : domain
--------  --------  ------
 103125 :  46.58% : .uk (United Kingdom)
( 64982):( 35.45%):     cam.ac.uk (University of Cambridge)
( 47138):( 20.55%):       statslab.cam.ac.uk
  49290 :  12.49% : .edu (USA Educational)

Host report

#reqs: %bytes: host
-----  ------  ----
   10:  0.03%:          zlsm03.arcs.ac.at
   11:  0.04%:           iki10.boku.ac.at
  158:  0.15%:       talus.maths.su.oz.au

Directory report

 #reqs: %bytes: directory
------  ------  ---------
237985: 35.40%: /~sret1/
 18596: 17.60%: /~rrw1/
  3574: 11.89%: /~richard/

Request report

#reqs: %bytes: filename
-----  ------  --------
33980: 23.66%: /~sret1/backgammon/main.html
21162:  2.69%: /~sret1/backgammon/bitmaps/board.xbm
12690:  0.86%: /

File Type report

 #reqs: %bytes: extension
------  ------  ---------
 25592: 35.68%: .html
 23311: 20.15%: (directories)
  1080: 17.13%: .ps
175575: 13.63%: .gif

Referrer report

#reqs: referring URL
-----  ------------
  260: http://webcrawler.com/cgi-bin/WebQuery
  239: http://www.yahoo.com/Computers_and_Internet/Internet/World_Wide_Web/Servers/Log_Analysis_Tools/
  185: http://guide-p.infoseek.com/WW/NS/Titles?qt=backgammon&col=WW
  149: http://www.yahoo.com/Recreation/Games/Board_Games/Backgammon/

Browser summary

#reqs: browser
-----  -------
16797: Mozilla
 1532: Mosaic
  693: IWENG
  492: Lynx

Browser report

#reqs: browser
-----  -------
 3105: Mozilla/1.22 (Windows; I; 16bit)
 2785: Mozilla/1.1N (Windows; I; 16bit)
  458: IWENG/1.2.003

These reports can be turned on and off with commands like

DOMAIN    ON
FULLHOSTS OFF
DIRECTORY ON
REQUEST   ON
FILETYPE  ON
REFERRER  OFF
BROWSER   ON
FULLBROWSER  OFF
or with the commandline arguments +o (domain report), -S (host report), +i (directory report), +r (request report; see below), +t (file type report), -f (referrer report), +b (browser summary) and -B (browser report). (As in the date reports, use + to turn the corresponding reports on, - to turn them off). Because of the widespread mis-spelling, REFERER is accepted as a synonym of REFERRER.

Another similarity with the date reports is that you can tell analog which columns to print on each report with the commands DOMCOLS, HOSTCOLS, DIRCOLS, REQCOLS, TYPECOLS, REFCOLS, BROWCOLS and FULLBROWCOLS. Again, each command is followed by letters indicating which columns are wanted and in which order. For example,

DOMCOLS RrBb  # no. of reqs, %age reqs, no. of bytes, %age bytes
DIRCOLS Pp    # no. of pages, %age pages

Each of these reports can be sorted in five different ways: by bytes, by requests, by requests for pages, alphabetically or randomly (i.e., unsorted). (The only advantage of the last is that no time is spent sorting very long reports). The commands to change this look like

DOMSORTBY BYTES  # or REQUESTS or PAGES or ALPHABETICAL or RANDOM
The commands for the other reports are HOSTSORTBY, DIRSORTBY, REQSORTBY, TYPESORTBY, REFSORTBY, BROWSORTBY and FULLBROWSORTBY. You can also add a letter b, p, r, a or x after the relevant commandline option; for example, +Sa for a host report sorted alphabetically.

It is important to be able to specify how many entries you want printed in each report. This is done by means of three variables for each report, one specifying the minimum number of bytes if the sorting is by bytes, one the minimum number of page requests if it is by pages, and the third specifying the minimum number of requests if the sorting is by any of the other three methods. The following configuration commands illustrate the possible usages.

DOMMINREQS 20      # all items with at least 20 requests
HOSTMINREQS -20    # the first 20 items
                   # NB: useless if alphabetical or random sort
REQMINREQS 0.01%   # all items with at least 0.01% of the requests
TYPEMINPAGES 20    # at least 20 page requests; -20 and 0.01% also possible
DIRMINBYTES 100000 # all items with at least 100000 bytes
REFMINBYTES 100k   # all items with at least 100 kbytes
                   # (10M etc. also work)
BROWMINBYTES -40   # Top 40 if sorting is by bytes
FULLBROWMINBYTES 0.005%   # all with at least 0.005% of the traffic
You can also specify the amount on the commandline by adding it after the sort method. For example, +Sr-50 turns on a host report, sorted by requests, with only the top 50 items included, and +ib20k gives a directory report, sorted by bytes, including all directories with at least 20 kilobytes transferred.
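
The three ways of writing a floor (a plain number, a negative number and a percentage) can be summarised by the following Python sketch. It is an illustration of the rules above, not analog's code: the function name is invented, the items are assumed to be sorted already, and the k and M suffixes are left out.

```python
def apply_floor(items, floor):
    """Filter (name, count) pairs already sorted in descending order.

    floor is written as in the examples above: '20' keeps items with at
    least 20, '-20' keeps the first 20, and '0.01%' keeps items with at
    least that share of the total.
    """
    if floor.endswith("%"):
        total = sum(count for _, count in items)
        threshold = total * float(floor[:-1]) / 100
        return [(name, count) for name, count in items if count >= threshold]
    number = int(floor)
    if number < 0:
        return items[:-number]
    return [(name, count) for name, count in items if count >= number]

# Hypothetical request counts, sorted by requests.
items = [("a", 500), ("b", 300), ("c", 150), ("d", 50)]
print(apply_floor(items, "100"))
print(apply_floor(items, "-2"))
print(apply_floor(items, "20%"))
```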

You can translate items in the reports for the benefit of your readers. For example, the command

REQOUTPUTALIAS /~sret1/analog/ "Analog home page"
would make Analog home page appear instead of /~sret1/analog/ in the request report. Wildcards can appear in the aliases: for example
REQOUTPUTALIAS /~sret1/* "Stephen's page (/*)"
would translate /~sret1/backgammon/ to Stephen's page (/backgammon/) etc. The commands for the other reports are DIROUTPUTALIAS, HOSTOUTPUTALIAS, REFOUTPUTALIAS, BROWOUTPUTALIAS and TYPEOUTPUTALIAS.
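
The wildcard substitution in an output alias can be sketched in Python as follows. The function name is hypothetical, and the sketch only copes with a single * in the pattern and in the replacement; how several *'s pair up is not something I want to guess at here.

```python
import re

def output_alias(name, pattern, replacement):
    """Apply one OUTPUTALIAS-style rule with at most one '*' wildcard.

    Returns the translated name, or the original name if the pattern
    does not match.
    """
    if pattern.count("*") > 1 or replacement.count("*") > 1:
        raise ValueError("this sketch handles at most one '*'")
    regex = "(.*)".join(re.escape(part) for part in pattern.split("*"))
    match = re.fullmatch(regex, name)
    if match is None:
        return name
    captured = match.group(1) if "*" in pattern else ""
    return replacement.replace("*", captured)

print(output_alias("/~sret1/backgammon/", "/~sret1/*", "Stephen's page (/*)"))
print(output_alias("/~sret1/analog/", "/~sret1/analog/", "Analog home page"))
```

Whatever the * matched in the original name is copied into the place of the * in the replacement text.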

Each of the reports has a hash size associated with it, which is the size of the table in which it stores the data internally. You don't need to worry about this usually; it doesn't affect the output, but if analog starts running slowly, you might find that making the hash sizes larger or smaller helps. The command to do this for the request report is

REQHASHSIZE 1009
The commands for the other reports are DIRHASHSIZE, TYPEHASHSIZE, HOSTHASHSIZE, REFHASHSIZE, BROWHASHSIZE, FULLBROWHASHSIZE and SUBDOMHASHSIZE (for subdomains; the top-level domains don't use this). On appropriate platforms, there is also DNSHASHSIZE for DNS lookups. You must choose a prime number for the hash size (there's a list of some primes distributed with the program). Maybe half the number of items of that type expected is a good number, but it shouldn't be critical.
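
If you don't have the list of primes to hand, a hash size following the rule of thumb above (a prime about half the expected number of items) can be found with a short Python sketch like this one. The function names are made up, and rounding up to the next prime is just one reasonable reading of "about half".

```python
def is_prime(n):
    """Trial division; adequate for table sizes of this magnitude."""
    if n < 2:
        return False
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True

def suggested_hash_size(expected_items):
    """Smallest prime at least half the expected number of items."""
    candidate = max(2, expected_items // 2)
    while not is_prime(candidate):
        candidate += 1
    return candidate

print(suggested_hash_size(2000))
```

For about 2000 expected items this gives 1009, the value used in the REQHASHSIZE example.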

We now describe features unique to a particular one of the reports. First the domain report.

Subdomains can be specified for each domain. The syntax of the command is

SUBDOMAIN subdomain subdomain_name
If the subdomain name has spaces in, it must be enclosed in quotes. The subdomain name can be omitted, indicating a nameless subdomain. For example, to produce the example above, I would include the following lines in the configuration file
SUBDOMAIN cam.ac.uk 'University of Cambridge'
SUBDOMAIN statslab.cam.ac.uk
Numerical subdomains (which have most significant part on the left) can also occur. They will look like
SUBDOMAIN 131 'The Ever-Popular 131 domain'
SUBDOMAIN 131.111   # Nameless
Also subdomains with wildcards in can occur; they can't have names. The following are examples:
SUBDOMAIN *.edu       # mit.edu, umn.edu  etc.
SUBDOMAIN 131.111.*   # 131.111.1, 131.111.2 etc.
SUBDOMAIN %           # all top-level numerical domains, from 1 to 255
The variables SUBDOMMINREQS and SUBDOMMINBYTES can be specified in the same way as above, except they can't be negative. If you ask for wild subdomains, you will probably want to set the minimum requests and minimum bytes quite high. However, you cannot alter the sort order; within a domain, subdomains will always be output in alphabetical order.

There is a command NOTSUBDOMAIN to erase a previously requested subdomain. For example, you can write

NOTSUBDOMAIN *.edu
NOTSUBDOMAIN cam.ac.uk
However, if you request, for example, *.edu, then NOTSUBDOMAIN mit.edu will not override it.

The domain report relies on having a domains file available, listing which geographical locations correspond to which domains. Which file is to be used as the domains file can be specified by the command

DOMAINSFILE domainsfile
The correct format of the domains file is explained in a separate section.

There is little to say about the host report, except to note that alphabetical sorting is by domain as most significant part. This report can be very long and so slow to sort, and should be used with a high floor if at all.

The directory report has one further variable, which is the level (or depth) of the directory report. The example above is a level 1 report; a level 3 report might look like

 #reqs: %bytes: directory
------  ------  ---------
 43772: 72.06%: /~sret1/backgammon/
173426: 19.93%: /~sret1/backgammon/bitmaps/
 11298:  4.14%: /~sret1/
This can be specified by the commandline option +l3 or the configuration command
DIRLEVEL 3
Note that the figures for each directory do not include those for the subdirectories of that directory, except where the directory is at the deepest level. So in the above example, /~sret1/backgammon/bitmaps/dice/d1.xbm would be reckoned in the directory /~sret1/backgammon/bitmaps/ (which is at the deepest level) but not in the other two directories.
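
The way a file is assigned to a directory at a given level can be sketched in Python. This follows the description above; the function name is hypothetical, and counting top-level files under / is an assumption of the sketch.

```python
def directory_at_level(filename, level):
    """Return the directory a requested file is counted under.

    The directory is cut off at 'level' components; files nested more
    deeply are credited to their ancestor at that level.
    """
    parts = filename.split("/")[1:-1]   # directory components only
    kept = parts[:level]
    return "/" + "".join(part + "/" for part in kept)

print(directory_at_level("/~sret1/backgammon/bitmaps/dice/d1.xbm", 3))
print(directory_at_level("/~sret1/backgammon/main.html", 3))
print(directory_at_level("/~sret1/backgammon/main.html", 1))
```

A file is counted in exactly one directory, so in a level 3 report main.html contributes to /~sret1/backgammon/ and not to /~sret1/.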

You can control which items get listed, and which linked to, in the request report with the commands REQINCLUDE and LINKINCLUDE. These are explained below.

There is a command BASEURL to specify a URL to prepend to the links. For example, if

BASEURL http://www.statslab.cam.ac.uk
were specified, then /~sret1/analog/ would be linked to http://www.statslab.cam.ac.uk/~sret1/analog/. This is useful if you want to display the statistics on a different server than the one they belong to. (See below for combining logfiles from two different servers).

There's nothing special to say about the file type report.

The referrer report only has one special command, REFLINKINCLUDE to say what should be linked to in the report. It is explained below. (However, it is important to note that many browsers do not pass this information to the server, and many pass it wrongly (sending the URL of the previous page even when your page was not reached by selecting a link from that page)).

For the referrer report and the browser reports the relevant logfiles must be present on the system (see below for how to specify where they are). Note that if you are using separate logfiles, rather than the NCSA combined log, you cannot sort these reports by bytes, or include bytes columns in the reports. Also the browser page requests will be inaccurate.

The browser summary and browser report have no special commands, but it is important to note the limitations of these reports. Some browsers even lie deliberately about what sort of browser they are, or let users configure the browser name. I have separated out those browsers that claim to be "Mozilla (compatible)" but that doesn't catch all of them. Furthermore, there is no fixed format for browser information. (NB: I have combined all Mosaics as a special case). In addition, graphical browsers automatically generate more requests than non-graphical browsers by loading the graphics, so it is not a very good guide to browser usage. For all these reasons many people would argue that the browser reports are so unhelpful as to be worse than useless. At best, interpret them with extreme caution.


Error and status code reports

The error report lists all the errors found in your error log:
#occs: error type
-----  ----------
19360: Send timed out
11286: Send aborted
 7962: File does not exist
The status code report lists how many of each type of status code occurred in your logfile:
#occs: no. description
-----  ---------------
35564: 200 OK
  173: 301 Document moved
    3: 302 Document found elsewhere
 5732: 304 Not modified since last retrieval
They are turned on and off by commands like
STATUS ON
ERROR  OFF
or by the commandline arguments +c and -e. (+ for on, - for off). There is a command ERRMINOCCS which says how many occurrences of an error there must be before it appears on the error report. For example
ERRMINOCCS 20

What to analyse

The first thing to know is how to specify a different logfile to analyse. A default one should have been specified in analhead.h, but you can also specify one by just putting its name on the commandline; so, for example, the command

analog logfile.log
will use that logfile for its report. Analog will read the common log format (which most servers write) as well as the old NCSA format and the NCSA combined log format (which includes referrer and agent information). Detection of which format each line of the logfile is in is automatic. You can also write
analog -
to use standard input as the logfile. (This is useful in constructing pipes). You can also specify which logfile to use in the configuration file by means of a command like
LOGFILE logfile.log   # or stdin for standard input
(This is the only method Mac users can use). You can specify several logfiles on one configuration line by separating their names with commas (no spaces). You can also use wildcards. For example
LOGFILE log_96*,/logs/*/access_log.*
If you make a very long list, you might need to split it in two (or increase MAXSTRINGLENGTH in analhead.h).

Sometimes it is necessary to combine logfiles from two different servers, without getting filenames that happen to be the same on both servers confused. (This is particularly useful if you're running several virtual hosts on the same machine). To do this you can use a second argument to the LOGFILE command, specifying a prefix for each filename. For example

LOGFILE log1,log2  http://www.a.com   # These logfiles from a.com
LOGFILE log3       http://www.b.com   # This one from b.com
If you use this, the directory report will need specifying to a deeper level.

Logfiles specified in the user's configuration files and commandline options replace any specified in the default configuration file, and are in turn overridden by any in the mandatory configuration file. In addition you can use none as the name of the logfile to overwrite the specification of all previous logfiles.

Analog can also read the NCSA/Apache referrer log, agent log and error log formats. Logfiles of these types can be specified by commands like

REFLOG   referer_log
BROWLOG  agent_log.old,agent_log
ERRLOG   error_log
The same comments about which logfiles replace which apply as in the last paragraph. Analog can also read the NCSA/Apache combined log. This is just specified as a LOGFILE as above; analog will automatically recognise and parse the extra fields. Do not specify a combined log as a REFLOG or BROWLOG.

You can decide whether the filenames in the logfile should be regarded as being case sensitive or case insensitive. The default is usually to be case sensitive on Unix and case insensitive on other machines, but you can override it if you want to read a logfile that was created on another machine. The commands to do this are

CASE INSENSITIVE
CASE SENSITIVE
A note about reading logfiles created on other machines. Different machines have different ways of storing files, in particular with regard to ends of lines. If you are moving logfiles from one machine to another, they should be transferred as text, not as binary data.

Analog on Unix can uncompress compressed logfiles. You need to tell it how to uncompress each type of file by supplying a command that sends the uncompressed file to standard output (rather than uncompressing it into a file). The file specification can be a list of file types, separated by commas. For example, depending what commands are on your system, you can use

UNCOMPRESS *.gz      "gunzip -c"  # or
UNCOMPRESS *.gz,*.Z  gzcat
This would be a suitable command to include in the default configuration file. Note that if you are using the form interface, the http server needs to have execute access to the decompression program.

Several times I've referred to `page requests'. You can specify in the configuration file what should be counted as a `page'. At the beginning, only *.html, *.htm and directories (*/) are pages. The command

ISPAGE filename
will specify that some other file is a `page'. You can give a list of filenames, separated by commas (without spaces). For example,
ISPAGE *.ps,*.ps.gz
would mean that Postscript files and gzipped Postscript files are to be regarded as pages. You can also use
ISNOTPAGE filename
to specify that something which would otherwise be a page is not to be regarded as a page.

There are various commands which instruct the program to analyse only part of the logfile. First, you can instruct the program only to take into account certain files. This is done by means of the FILEINCLUDE and FILEEXCLUDE commands. Each command can have a list of filenames, separated by commas (no spaces). One asterisk and any number of question marks can appear in each of the filenames specified, as wildcards. Each file is included and excluded as each new command is reached. Unspecified files are included if the first command found was an exclusion, and excluded if the first command found was an inclusion. For example, the configuration

FILEINCLUDE /~sret1/*
FILEEXCLUDE /~sret1/backgammon/*,/~sret1/analog/*
FILEINCLUDE /~sret1/backgammon/*.gif
would instruct the program to examine only my files, excluding my backgammon and analog files, but including gifs in my backgammon directory. On the other hand,
FILEEXCLUDE /~sret1/*/img/*
would analyse all files except images in my various directories. (Note that wildcards with two *'s in can be slow to process).
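The way successive include and exclude commands combine can be sketched in Python. This is only an illustration of the rules described above, not analog's own code: the function name is_wanted and the (kind, patterns) rule format are invented for the example, and Python's fnmatch stands in for analog's wildcard matching (fnmatch also gives [ a special meaning, which is not something this Readme describes). FILEALLOW is not modelled.

```python
from fnmatch import fnmatchcase

def is_wanted(filename, rules):
    """Apply FILEINCLUDE/FILEEXCLUDE-style rules in the order given.

    rules is a list of ("include" | "exclude", "pat1,pat2,...") pairs.
    """
    if not rules:
        return True  # no restrictions at all
    # Files matched by no rule are included if the first rule was an
    # exclusion, and excluded if the first rule was an inclusion.
    wanted = rules[0][0] == "exclude"
    for kind, patterns in rules:
        if any(fnmatchcase(filename, p) for p in patterns.split(",")):
            wanted = kind == "include"  # a later rule overrides an earlier one
    return wanted

rules = [
    ("include", "/~sret1/*"),
    ("exclude", "/~sret1/backgammon/*,/~sret1/analog/*"),
    ("include", "/~sret1/backgammon/*.gif"),
]
for name in ("/~sret1/index.html", "/~sret1/backgammon/rules.html",
             "/~sret1/backgammon/board.gif", "/elsewhere/index.html"):
    print(name, is_wanted(name, rules))
```

With these rules only the first and third names are wanted: the second is caught by the exclusion, and the fourth is matched by nothing and so takes the default for a configuration that starts with an inclusion.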

Remember you can always run analog -v to see what the options you have specified represent.

Included files can always be excluded later, but excluded files can't always be included easily. (For example, after FILEEXCLUDE /dir/* and FILEEXCLUDE *.gif,*.jpg, FILEINCLUDE *.gif would include all gifs, even those in /dir/, which is not what is wanted). For this reason, there is an extra command FILEALLOW which cancels an exclude. It must be exactly the same as a previous FILEEXCLUDE; in the above example FILEALLOW *.gif would work, but not FILEALLOW *if.

Note that although you can exclude all gifs with FILEEXCLUDE *.gif, this may not be what you want to do. This will then exclude them from all the reports, and not count the bytes transferred due to them. More likely, you just want to exclude them from the request report while still including them in the other reports, which you can do by means of REQEXCLUDE *.gif (or REQINCLUDE pages) which will be explained in a minute.

There are similar commands HOSTINCLUDE, HOSTEXCLUDE and HOSTALLOW to analyse only the requests from certain sites. For example,

HOSTEXCLUDE emu.pmms.cam.ac.uk
HOSTEXCLUDE *.statslab.cam.ac.uk
would ignore accesses from emu and from the whole of the statslab.

There are also commands REFINCLUDE, REFEXCLUDE and REFALLOW for referrers. You probably want to ignore referrers from your own site. For example, I use

REFEXCLUDE http://www.statslab.cam.ac.uk/*
This would be a suitable command to put in your default configuration file.

There are some other include commands that are specified the same way, but behave slightly differently because they do not actually exclude something from the analysis. So to specify what is included in the request report, there are commands REQINCLUDE, REQEXCLUDE and REQALLOW. You can use the special name `pages' to mean all pages. So for example,

REQINCLUDE pages
REQINCLUDE *.ps
will only include pages and Postscript files in the request report, although other files will still be counted for the other reports. There are also commands LINKINCLUDE, LINKEXCLUDE and LINKALLOW, and REFLINKINCLUDE, REFLINKEXCLUDE and REFLINKALLOW to say what should be linked to in the request report and referrer report.

Finally, there are commands to analyse only a subset of the dates in the logfile. The simplest usage is FROM yymmdd and TO yymmdd. So, for example, to analyse only requests in July 1995 I would use the configuration

FROM 950701
TO   950731
Also each of the pairs of digits can be preceded by - and the month and date can be preceded by + to represent time relative to the current date. This allows constructions like
FROM -01-00+01   # from tomorrow last year
TO -00-0131  # to the end of last month (OK even if last month
             # didn't have 31 days)
FROM -00-00-112
TO   -00-00-01  #statistics for the last 16 weeks
There are commandline abbreviations +F and +T for these commands; for example +T-00-00-01 looks at statistics until the end of yesterday. -F and -T turn off the from and to, as do FROM OFF and TO OFF.

If a TO command is given, the figures for the last 7 days refer to the time until then.
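To make the relative date syntax concrete, here is a rough Python sketch that resolves a FROM or TO argument against a given current date. It is a guess at the semantics from the examples above, not analog's actual algorithm. The function name resolve_date is invented, and three points are assumptions: two-digit years 70-99 are taken as 19xx and 00-69 as 20xx, an absolute day is clamped to the length of the month, and a relative day is counted from today's day of the month.

```python
import calendar
import re
from datetime import date, timedelta

# Optional -, two digits (year); optional sign, two digits (month);
# optional sign, one or more digits (day of month, or an offset in days).
_SPEC = re.compile(r"^(-?)(\d\d)([+-]?)(\d\d)([+-]?)(\d+)$")

def resolve_date(spec, today):
    """Turn a FROM/TO-style argument into a calendar date."""
    m = _SPEC.match(spec)
    if m is None:
        raise ValueError("unrecognised date: " + spec)
    ysign, yy, msign, mm, dsign, dd = m.groups()
    yy, mm, dd = int(yy), int(mm), int(dd)

    # Year: so many years ago, or a two-digit year.
    # ASSUMPTION: 70-99 mean 19xx and 00-69 mean 20xx.
    if ysign:
        year = today.year - yy
    else:
        year = 1900 + yy if yy >= 70 else 2000 + yy

    # Month: a relative month may carry over into a neighbouring year.
    if msign:
        index = (today.month - 1) + (mm if msign == "+" else -mm)
        year += index // 12
        month = index % 12 + 1
    else:
        month = mm

    days_in_month = calendar.monthrange(year, month)[1]
    if dsign:
        # ASSUMPTION: a relative day starts from today's day of the month.
        start = date(year, month, min(today.day, days_in_month))
        return start + timedelta(days=dd if dsign == "+" else -dd)
    # ASSUMPTION: an absolute day is clamped, so "31" works in short months.
    return date(year, month, min(dd, days_in_month))

today = date(1997, 3, 10)  # a fixed "current date" for the demonstration
for spec in ("950701", "-01-00+01", "-00-0131", "-00-00-112"):
    print(spec, "->", resolve_date(spec, today))
```

Passing the current date in as an argument, rather than reading the clock inside the function, keeps the sketch easy to check by hand.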


Aliases etc.

There are commands to give aliases for filenames, hostnames and referrers. The configuration line
FILEALIAS file1 file2
says that whenever file1 occurs in the logfile, it is to be replaced by file2. Analog already understands that /dir/index.html is the same as /dir/ and translates `escaped' entities (e.g., %7E is the same as ~) so these don't need to be specified separately. It also understands that .. means `parent directory,' . means `this directory' and // is the same as /, and translates those filenames to their canonical forms.

Actually, it's not quite true about index.html. You can make that into another file if you want, by use of the DIRSUFFIX command. For example, if all your directories return indexes called default.htm, you could write

DIRSUFFIX default.htm

Wildcards can occur in the aliases. For example, after

FILEALIAS   /~sret1/*.gif   /images/*g.gif
FILEALIAS   /~sret2/a?c*    /sa/*
/~sret1/a.gif would be translated to /images/ag.gif and /~sret2/abcd.txt would become /sa/d.txt.

If two aliases match one filename, only the first one is applied. So after FILEALIAS a b; FILEALIAS b c or FILEALIAS a b; FILEALIAS a c, a will be translated into b.
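The wildcard translation and the first-match rule can be illustrated with a short Python sketch. It is an approximation written for this Readme's examples rather than analog's own matching code; apply_aliases and the list-of-pairs format are invented names, and the sketch assumes at most one asterisk per alias.

```python
import re

def _compile(pattern):
    """Translate an alias pattern into a regular expression.

    '*' becomes a capturing group, '?' matches any one character,
    and everything else matches literally.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("(.*)")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def apply_aliases(filename, aliases):
    """Return filename rewritten by the first alias that matches it."""
    for old, new in aliases:
        m = _compile(old).fullmatch(filename)
        if m:
            if m.groups():
                # Put the text matched by '*' into the new name.
                new = new.replace("*", m.group(1), 1)
            return new  # only the first matching alias is applied
    return filename

aliases = [
    ("/~sret1/*.gif", "/images/*g.gif"),
    ("/~sret2/a?c*", "/sa/*"),
]
for name in ("/~sret1/a.gif", "/~sret2/abcd.txt", "/other.html"):
    print(name, "->", apply_aliases(name, aliases))
```

Returning as soon as one alias matches is what makes a become b, and not c, after FILEALIAS a b; FILEALIAS b c.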

There are also the commands HOSTALIAS, BROWALIAS (for browsers) and REFALIAS (for referrers) which work in the same way. HOSTALIAS is particularly useful if your server records local hostnames in the logfile instead of full internet names. Also, if a host has two names, they can be combined in this way. So, for example, I might find it convenient to use

HOSTALIAS lion lion.statslab.cam.ac.uk
HOSTALIAS bigcat lion.statslab.cam.ac.uk
HOSTALIAS bigcat.statslab.cam.ac.uk lion.statslab.cam.ac.uk
REFALIAS could be used to combine several referrers from one site. For example
REFALIAS http://www.webcrawler.com/* http://webcrawler.com/
REFALIAS http://webcrawler.com/* http://webcrawler.com/
BROWALIAS could be used to combine all similar versions of a browser. For example
BROWALIAS Mozilla/2* "Netscape 2"
But be careful! The same version number often doesn't mean the same thing on different platforms.

There are also the OUTPUTALIAS commands, but I described them above.

A pair of related commands is WITHARGS and WITHOUTARGS. Normally any arguments given as part of a URL (after a question mark) are ignored. However, if a configuration command like

WITHARGS /cgi-bin/prog.cgi
is given, then the arguments to that file will form part of the filename. So /cgi-bin/prog.cgi?a and /cgi-bin/prog.cgi?b will be regarded as separate files, whereas without that command they would both have been translated to /cgi-bin/prog.cgi. Note that the filename with the arguments still has to fit inside the maximum length of a filename. Asterisks and lists of files can again occur, and there is also a parallel command WITHOUTARGS; for example,
WITHARGS /cgi-bin/*
WITHOUTARGS /cgi-bin/spam.cgi
would read the arguments for all files in /cgi-bin/ except spam.cgi.
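As a small illustration of what these two commands do to a requested URL, the following Python sketch keeps or drops the part after the question mark. It is not analog's code: strip_args and its rule format are invented for the example, fnmatch stands in for analog's wildcard matching, and it assumes that when several commands match a file the last one wins.

```python
from fnmatch import fnmatchcase

def strip_args(url, rules):
    """Drop the arguments of a URL unless a WITHARGS-style rule keeps them.

    rules is a list of ("with" | "without", "pat1,pat2,...") pairs,
    in the order they appear in the configuration.
    """
    name, question_mark, _args = url.partition("?")
    keep = False  # the default is to ignore arguments
    for kind, patterns in rules:
        if any(fnmatchcase(name, p) for p in patterns.split(",")):
            keep = kind == "with"  # ASSUMPTION: the last matching rule wins
    return url if keep and question_mark else name

rules = [("with", "/cgi-bin/*"), ("without", "/cgi-bin/spam.cgi")]
for url in ("/cgi-bin/prog.cgi?a", "/cgi-bin/prog.cgi?b",
            "/cgi-bin/spam.cgi?x", "/index.html?q"):
    print(url, "->", strip_args(url, rules))
```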

Commands REFWITHARGS and REFWITHOUTARGS work in the same way for referrers, except that in this case the default is to include all the arguments (so that you can see what people are requesting from search engines).


Cache files

Analog has the ability to archive some of the data in your logfile in a cache file so that the logfile can be thrown away without losing the most important data. Important: The information that is recorded is only that which identifies how many successful requests there were at each time. No information on which files the requests were for, or where the requests were from, is kept. Neither is information on failed requests. So from the cache file you can reconstruct the time reports but not any other reports.

To produce a cache file instead of the normal output, use the command

OUTPUT CACHE
To read data from a cache file, use, e.g.,
CACHEFILE cache.out
(This will still read the ordinary logfile as well). You can also use the commandline argument +Ucache.out. You can specify several cache files by putting them in a comma-separated list, or using several +U commands. Note that this doesn't write to that file. You still write to the normal output file.

To use this feature and avoid losing entries or double counting them, I suggest the following procedure.

  1. Stop the server.
  2. Move the old logfile to a new location.
  3. Restart the server with a fresh logfile.
  4. Make a cache file from the old logfile.
  5. Make an ordinary report from the old logfile too.
  6. Make a report from the cache file and no other logfile to check it works.

Although it should now be safe to throw away the old logfile, I can take no responsibility if something goes wrong. See the licence. Also if you are going to use this feature please make sure you understand what information is and is not recorded in the cache file. You may find that the cache file is not the right feature for you. Compressing logfiles (with gzip -9) is very efficient owing to the large number of repeated strings. That in itself may solve your filespace problems.


DNS lookups

Sometimes servers do not record the name of the host that called you in the logfile, only its number. Analog can try to resolve these numerical addresses. This feature is not available on DOS at the moment.

To turn DNS resolution on, use the configuration command

NUMLOOKUP ON
or the commandline option +1. (Turn it off with NUMLOOKUP OFF or -1).

The first time you use lookups, analog may be very slow. But it will record which addresses it looked up so that you do not need to look them up again next time you run analog. You need to specify a file for this purpose. This is done by the command

DNSFILE filename
The program will first read any old lookups that are recorded there, then overwrite it with a new version at the end.

However, any lookups that are too old will not be trusted, and will be thrown away. You can specify how old a lookup in hours you trust by a command like

DNSFRESHHOURS 168  # check them once a week
Note also that not all numbers can be resolved. Sometimes the DNS lookup will time out, or sometimes that number just doesn't have a name corresponding to it. In that case, the host will still be listed as an unresolved address.
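The idea of trusting a recorded lookup only while it is fresh can be sketched in Python as follows. This shows the general technique only: the in-memory dictionary is a stand-in for the DNSFILE (whose format is not described here), the names lookup and _resolve are invented, and recording failed lookups in the cache is an assumption rather than documented behaviour.

```python
import socket

def _resolve(address):
    """Reverse DNS lookup; returns None if the address has no name."""
    try:
        return socket.gethostbyaddr(address)[0]
    except OSError:  # covers lookup failures and timeouts
        return None

def lookup(address, cache, now, fresh_hours, resolve=_resolve):
    """Return a host name for a numerical address, using a cache.

    cache maps address -> (time of lookup in seconds, name or None);
    now is the current time in seconds on the same clock.
    """
    entry = cache.get(address)
    if entry is not None and now - entry[0] <= fresh_hours * 3600:
        name = entry[1]          # fresh enough: trust the old lookup
    else:
        name = resolve(address)  # missing or too old: look it up again
        cache[address] = (now, name)
    # An address that cannot be resolved is reported as it stands.
    return name if name is not None else address
```

A caller would typically pass time.time() as now and 168 as fresh_hours to match the DNSFRESHHOURS example above. The resolve argument exists so that the lookup can be replaced by a stub when trying the function out without a network.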

There is also a variable DNSHASHSIZE; see above.


Miscellaneous options

The final group of options is those which affect the layout of the output and other miscellaneous options.

First, one very important option: the language for the output! You can specify any of the following

LANGUAGE ENGLISH
LANGUAGE US-ENGLISH
LANGUAGE FRENCH
LANGUAGE GERMAN
LANGUAGE SPANISH
LANGUAGE ITALIAN
LANGUAGE DANISH
(If you are using a language other than English, you might also want to produce a local version of the domains.tab file, and of the form interface). If anyone wants to translate the output into another language, I would be delighted! (But contact me first, so that I can make sure that two people aren't working on the same language).

Next, you can choose whether you want ASCII (plain text), HTML or preformatted output. (Preformatted output is a special machine-readable format, used for importing into spreadsheets or graphics-creation programs. It is described in a separate section below). The output format is chosen using the commandline option +a (ASCII) or -a (HTML), or the configuration command

OUTPUT ASCII  # or HTML, or PREFORMATTED
If you choose ASCII or PREFORMATTED output, some of the other options are ignored, but it should be obvious which ones they will be.

You can select the file for the output to be sent to in the configuration file or on the commandline. So instead of

analog > outfile.html
you can use the configuration command
OUTFILE outfile.html
or the commandline option +Ooutfile.html.

There is a configuration command REPORTORDER which specifies which order the reports should occur in. The usage is a line like

REPORTORDER hHDdWmoSirtfbBec
This says that the reports should occur in the order hourly summary (h), hourly report (H), daily report (D), daily summary (d), weekly report (W), monthly report (m), domain report (o), host report (S), directory report (i), request report (r), file type report (t), referrer report (f), browser summary (b), browser report (B), error report (e) and status code report (c). It is important to include all the above sixteen letters exactly once each.
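Since every one of the sixteen letters has to appear exactly once, it can be handy to check a REPORTORDER string before using it. The following Python sketch does that; the function name check_report_order is invented, and the messages it produces are not analog's own.

```python
REPORT_LETTERS = "hHDdWmoSirtfbBec"  # the sixteen letters listed above

def check_report_order(order):
    """Return a list of problems found in a REPORTORDER argument."""
    problems = []
    for letter in REPORT_LETTERS:
        count = order.count(letter)
        if count == 0:
            problems.append("missing " + letter)
        elif count > 1:
            problems.append("repeated " + letter)
    for letter in sorted(set(order) - set(REPORT_LETTERS)):
        problems.append("unknown " + letter)
    return problems

for order in ("hHDdWmoSirtfbBec", "cebBftriSomWdDH"):
    print(order, check_report_order(order) or "OK")
```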

There is a command

ALL ON
to include all reports except the hourly report (particular ones can then be omitted with -d or whatever); likewise ALL OFF omits them (and particular ones can then be included). The equivalent commandline arguments are +A and -A. The hourly report and general report are not turned on by ALL ON or +A; they must be turned on separately with +H and +x. Note also that order is important; for example, +i -A +r will include the request report but not the directory report.

The title line of the output page contains three adjustable variables. First, the logo in the top left hand corner can be turned on or off, or any other logo substituted (for example, your organisation's logo). This is accomplished by the command

LOGOURL url   # or
LOGOURL none  # for no logo
or by the commandline arguments -p (no logo: mnemonic, p for picture) and +pURL. The organisation name on the title line can be specified by means of the option -nname; the hostname of your server would also be an appropriate thing to put here. The name can have a link to your server's home page by use of the option -uURL; use -u- if you don't want any link. The equivalent configuration options are
HOSTNAME  name  # must be in quotes if it contains spaces
HOSTURL   URL
HOSTURL   -     # for no link
Analog will normally translate characters in the hostname to HTML if necessary. So to include literal HTML, such as accented characters, in the output you need to precede them by a backslash, like this:
HOSTNAME "M\&uuml;ller & S\&ouml;hne"

A header file and footer file can be inserted near the top and bottom of your output. These should be written in HTML or ASCII according to whether your output is HTML or ASCII, and can contain anything you want. Possible uses include providing information about your organisation or about the way the statistics were calculated, linking to related pages, and no doubt many other things. The commands to achieve this are

HEADERFILE filename
FOOTERFILE none      # if you don't want one
You can also use HEADERFILE stdin or HEADERFILE - to use standard input.

There is a command SEPCHAR to say which character should separate each group of three digits in long numbers. For example,

SEPCHAR ,
will give 123,456,789, whereas
SEPCHAR ' '
will give 123 456 789. If you want the numbers just to run together (123456789) use
SEPCHAR ''
You can also specify a character for use within the tables in the reports. This is done in the same way by means of the command REPSEPCHAR.
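For readers post-processing analog's numbers elsewhere, the same grouping of digits in threes is easy to reproduce. This Python sketch shows the effect of the three SEPCHAR settings above; format_number is an invented helper and has nothing to do with how analog itself formats its output.

```python
def format_number(n, sepchar=","):
    """Group the digits of an integer in threes, separated by sepchar."""
    # Python groups with commas; swap them for the requested separator.
    return f"{n:,}".replace(",", sepchar)

for sep in (",", " ", ""):
    print(repr(sep), format_number(123456789, sep))
```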

You can also choose a character for the decimal point. For example, some languages use a comma instead of a full stop; you would specify this by the command

DECPOINT ,

You can specify whether analog prints long numbers of bytes as exact numbers (e.g., 5,053,234) or as kilobytes, megabytes etc. (e.g., 4934k) by the command

RAWBYTES ON  # for exact, OFF for abbreviated

There is a debugging command, for printing (to stderr) problems with your logfile. There are currently three levels of debugging: 0 for no debugging; 1 for printing corrupt logfile lines (prepended by "C:"), information on files opened and closed (prepended by "F:"), and some summary data (prepended by "S:"); and 2 which also prints hosts for which the domain is unknown (prepended by "U:"), errors which cannot be classified (prepended by "E:") and any DNS lookups carried out (prepended by "D:"). The command for level n debugging is

DEBUG n
and the equivalent commandline argument is +Vn (V for verbose). You can also use commandline options +V for level 1 and -V for level 0.

You can also get a report on analog's progress. The command

PROGRESSFREQ n
prints a report after every n lines of input.

There is an option to turn off warnings. It is

WARNINGS OFF  # or ON
The equivalent commandline argument is -q to turn warnings off (q for quiet) and +q to turn them on again. This is useful in scripts or cron jobs if you really do want to give a configuration that you know will generate a warning.

Finally, configuration files can call other configuration files, by means of a command like

CONFIGFILE other.cfg

The domains file

The file domains.tab, to translate internet country codes to locations, should have come with the program. It is needed to construct the domain report. If you haven't got one, you can download one from http://www.statslab.cam.ac.uk/~sret1/analog/analog/domains.tab. It should be in the following format:
ad   Andorra
ae   United Arab Emirates
[...]
There can be arbitrary space between the code and the corresponding location. The codes are converted to lower case. Use ? (or anything starting with ?) for the name if you want the domain to be recognised, but don't want the name to be printed out. The domains do not need to be in alphabetical order, though humans may prefer it that way.

Comments can occur in the domains file. They are introduced by the character #. So you could write, for example,

uk  United Kingdom  # God save the Queen!

Preformatted output

This section describes a special output format called preformatted output. It is selected using the configuration command
OUTPUT PREFORMATTED
This type is designed to be easy to read into spreadsheets, or post-process with graphics creation tools. Each line is separated into columns with a special string of characters. You can specify this string with the PRESEP command; for example
PRESEP :::
if for some reason you wanted three colons between each column. Make sure not to use anything that might occur in the output: a space would not be suitable.

Each line in the preformatted output begins with a letter indicating what type of information the line contains. The possible letters are as follows:

x  general summary
h  hourly summary
H  hourly report
D  daily report
d  daily summary
W  weekly report
m  monthly report
o  domain report
O  subdomain in domain report
S  host report
i  directory report
r  request report
t  file type report
f  referrer report
b  browser summary
B  browser report
e  error report
c  status code report
Except for the general summary, there then follows which columns are included in the report (using the letters RrBbPp as usual). There then follows the numerical data, and finally the identification of the item. (In the case of the time reports, this can take several columns. They are listed with most significant time first (year - month - date - hour - minute)).

The general summary is a bit different. After the initial x, follows a two-character code saying what the line contains. The possible codes are

HN  HOSTNAME
HU  HOSTURL
PS  Program start
FR  Time of first request
LR  Time of last request
L7  Time last 7 days extends from
SR  Total successful requests
S7  Total successful requests in last 7 days
PR  Total successful requests for pages
P7  Total successful requests for pages in last 7 days
FL  Total failed requests
F7  Total failed requests in last 7 days
RR  Total redirected requests
R7  Total redirected requests in last 7 days
NF  Number of distinct files requested
N7  Number of distinct files requested in last 7 days
NH  Number of distinct hosts served
AH  Approximate number of distinct hosts served
H7  Number of distinct hosts served in last 7 days
A7  Approximate number of distinct hosts served in last 7 days
NV  Number of new hosts served in last 7 days
AV  Approximate number of new hosts served in last 7 days
CL  Number of corrupt lines in the logfile
UL  Number of unwanted lines in the logfile
BT  Total number of bytes transferred
B7  Total number of bytes transferred in last 7 days

If you do anything interesting with the preformatted output, I should like to hear about it.
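As a starting point for such post-processing, here is a minimal Python sketch that splits a preformatted line on the PRESEP string and names the report it belongs to. It relies only on the layout described in this section; split_preformatted is an invented name, and the two sample lines are made up to exercise the function and were not produced by analog.

```python
REPORT_NAMES = {
    "x": "general summary", "h": "hourly summary", "H": "hourly report",
    "D": "daily report", "d": "daily summary", "W": "weekly report",
    "m": "monthly report", "o": "domain report",
    "O": "subdomain in domain report", "S": "host report",
    "i": "directory report", "r": "request report",
    "t": "file type report", "f": "referrer report",
    "b": "browser summary", "B": "browser report",
    "e": "error report", "c": "status code report",
}

def split_preformatted(line, presep):
    """Split one line of preformatted output into its parts.

    Returns (report name, second column, remaining columns). The second
    column is the two-character code for the general summary, and the
    list of column letters (from RrBbPp) for every other report.
    """
    fields = line.rstrip("\n").split(presep)
    if len(fields) < 2 or fields[0] not in REPORT_NAMES:
        raise ValueError("not a preformatted line: " + line)
    return REPORT_NAMES[fields[0]], fields[1], fields[2:]

# Invented sample lines, NOT real analog output.
for sample in ("x:::SR:::35564", "r:::RB:::120:::4096:::/index.html"):
    print(split_preformatted(sample, ":::"))
```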


The form interface

Another way to run analog is via the form interface; this allows users to select which options they want via a Web page. Important: I have attempted to block any potential security holes in the form interface. However, cgi programs have been known to contain hidden loopholes, and I can take no responsibility if anything goes wrong. See the licence.

The form interface probably only works on Unix at the moment, though if anyone manages to make it work on other systems, I should be interested to hear about it.

To set up the form interface, go to the directory where the analog source code lives, and follow these steps.

  1. In analhead.h, make sure that the FORMPROG is set to be the URL of the form processing program, which will be wherever cgi-bin programs live on your server; normally in the cgi-bin directory.
  2. Edit the top of analform.c to indicate where the analog program lives (the program name within your computer's whole filespace, not a URL).
  3. Type make form.
  4. Move the program analform.cgi to the place you specified as the FORMPROG. Make sure it is executable by the server.
  5. Make sure analog itself is executable by the server too, and that domains.tab and any configuration files are readable.
  6. The file analogform.html is the actual form interface; move it to wherever you want people to get at it. Make sure it is world readable.

If the third step above fails to generate a form, you can generate one yourself by means of the command analog -form +Oanalogform.html. You might also want to run this command yourself if you want to supply different default options from normal for the form user: if you run the command with extra commandline or configuration file options, they will be respected in the construction of the form.

It is expected that system administrators may want to provide different options on the forms from the default ones. (For example, the form does not by default allow an hourly report). For this reason, the cgi program understands various other options that are not normally on the form. These can be added to the form by hand. For example, you may want to allow a choice of logfiles, perhaps via a <select>. Or you may want form users to use certain default options; these could be specified as <input type=hidden>. Or different users could have their own configuration files, containing all their options, which could be read in via a choice of cg's or cm's.

You might want to construct links directly to the analform.cgi program, without going via the form. Because the program uses GET not POST, this is no problem. You can use any of the options below, including cg and cm to read in other commands from a configuration file.
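Because the options travel in the query string, such a link can be assembled mechanically. The Python sketch below builds one with the standard urllib.parse.urlencode function; the server name www.example.com and the helper name form_link are placeholders invented for the example, and the option names are taken from the list of form options in this section.

```python
from urllib.parse import urlencode

def form_link(formprog_url, options):
    """Build a GET link to the form program from a dict of form options."""
    return formprog_url + "?" + urlencode(options)

link = form_link(
    "http://www.example.com/cgi-bin/analform.cgi",  # placeholder URL
    {"xq": 1, "rq": 1, "ou": 1, "fr": "950701", "to": "950731"},
)
print(link)
```

Using urlencode rather than joining the strings by hand means that any characters that are not safe in a URL, such as spaces in a HOSTNAME passed via the or option, are escaped properly.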

For experts, here follows a complete list of form options. [*] marks a default value, i.e., what is sent to analog if you don't send anything else to the cgi program. These defaults can override the analog program defaults. Items without a [*] use the program default if nothing else is specified.

bq  browser summary? 0 for off, 1 for on.
ba  +ve BROWMINREQS/PAGES
bb  -ve BROWMINREQS/PAGES
bc  +ve BROWMINBYTES
bd  -ve BROWMINBYTES
bs  BROWSORTBY (0 = REQUESTS, 1 = BYTES, 2 = ALPHABETICAL,
                3 = RANDOM, 4 = PAGES)
Other reports similarly with initial B, f, i, o, r, S, t in place of b.
cg  configuration file to read in before all other options
ch  COUNTHOSTS? 0 for off, 1 for on, 2 for approx
cm  configuration file to read in after all other options
cq  status code report?
dq  daily summary?
dg  DAYGRAPH (R, B or P)
Other time reports similarly with D, h, H, m, W in place of d.
eq  error report?
fi  FILEEXCLUDE; list, separated by commas
fr  FROM
fy  FILEINCLUDE; list, separated by commas
gr  GRAPHICAL? 0 for off, 1 for on
hi  HOSTEXCLUDE; list, separated by commas
ho  HOSTURL
hy  HOSTINCLUDE; list, separated by commas
ie  DIRLEVEL
lb  BROWLOG; list, separated by commas
lc  CACHEFILE; list, separated by commas
le  ERRLOG; list, separated by commas
lf  REFLOG; list, separated by commas
lo  LOGFILE; list, separated by commas
or  HOSTNAME
ou  OUTPUT -- 0 for HTML [*], 1 for ASCII, 2 for PREFORMATTED,
              3 for program default
rl  LINKINCLUDE options -- f for ALL (files), p for PAGES, n for NONE
rt  REQINCLUDE options  -- f for ALL, p for PAGES
to  TO
TZ  timezone
Vq  Show commands sent to program
wa  WARNINGS (to error_log) -- 0 for OFF, 1 for ON [*].
xq  general report?
Important note: Do not add options for HEADERFILE and FOOTERFILE to the form program. This would be a potentially serious security risk.

If the form doesn't seem to work, check the following:

  1. Look in the server's error_log for clues.
  2. If you get a long wait, then no data returned, the server is probably timing out the request before analog has finished. The remedy is to increase the timeout interval.
  3. Try including Vq=1 in the search arguments. This will show what's being passed to the program. (It can also be used to generate a configuration file representing your choices).
  4. Do other cgi-bin programs work on your server?
  5. Are all the files in the right places, with the right access permissions, as specified above?
  6. Try the following. setenv QUERY_STRING "xq=1" (C Shell) or export QUERY_STRING="xq=1" (other shells), then run analform from the shell.
  7. If the local time doesn't seem to be correct in the output, you may have to set the timezone yourself in the form. Four lines from the bottom, there is a line like <input type=hidden name="TZ" value=""> For the value you should insert your timezone, in standard format. Usually this looks like your winter timezone name, followed by hours west of Greenwich, followed by your summer timezone name. So the East Coast of the USA should have value="EST5EDT", and Germany value="MEZ-1MESZ".

It is better, although not essential, if when you change the default options for your analog, you remake the form.

Note that you probably want to restrict access to the form and form program to certain users; if it is world readable there could be considerable load on your server as well as potential confidentiality problems. Exactly how to do this depends on which server you are running.


Glossary and reference

Many people have asked exactly what counts as a request, and the meaning of other terms used in the analog output. Here is an explanation. Each time someone requests a file from your server, that is a request. It may be that the page contains some inline images; then they must be loaded separately by people with graphical browsers, which counts as further requests.

Unfortunately, you cannot tell how many times your file has been read from this. The user may in fact request the file from a proxy server which already has a copy of it, or retrieve it from a local cache. In these cases no connection is made to your server, and no request is scored.

There are three categories of request, which can be seen in the status code report. Completed (or successful) requests are those with codes in the 200s (where the document was returned) or with code 304 (where the document was not needed because it had not been recently modified and the user could use a cached copy). Redirected requests are those with other codes in the 300s. The most common cause of these requests is that the user has incorrectly requested a directory name without the trailing slash. The server replies with a redirection ("you probably mean the following") and the user then makes a second connection to get the correct document (although usually the browser does it automatically without the user's intervention or knowledge). Failed requests are those with codes in the 400s (error in request) or 500s (server error). They come about for a variety of reasons, but the most common are when the requested file is not found or is read-protected.
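The three categories can be written down as a small Python function. It is a direct transcription of the paragraph above and not analog's source; the name classify is invented, as is the label "other" for codes outside the ranges mentioned.

```python
def classify(status):
    """Place an HTTP status code in one of the categories described above."""
    if 200 <= status <= 299 or status == 304:
        return "successful"   # document returned, or cached copy still good
    if 300 <= status <= 399:
        return "redirected"   # any other code in the 300s
    if 400 <= status <= 599:
        return "failed"       # error in request, or server error
    return "other"            # not covered by the description above

for code in (200, 301, 304, 404, 500):
    print(code, classify(code))
```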

Note that 302 requests are not counted. Most of them come about by the user requesting a faulty URL, as explained above. However, some cgi scripts also return a 302 code; these are not included because there is no way to tell them apart from the first type.

The total data transferred refers only to successful requests, and does not include the message header, only the actual data. The detailed reports also only include successes, except for the referrer report and browser reports which include all request types.

Requests for pages are just those requests corresponding to filenames that analog thinks represent pages. You can tell analog what is to count as a page by means of the ISPAGE command.

Corrupt logfile lines are those we can't understand, and unwanted lines are those that refer to files, hosts or dates that we have specifically excluded. (See the DEBUG command for how to list all corrupt lines).

A host is a computer that has requested something from you. Analog gives the number of distinct (different) hosts that have made a successful request, and the number of distinct files they have requested.
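
To make the idea of distinct hosts concrete, here is a small Python sketch. It is not how analog itself does the counting; the function name, the tuple layout and the example hostnames are all assumptions made for illustration. It counts only successful requests, as described above.

```python
def distinct_hosts_and_files(requests):
    """requests: iterable of (host, filename, status) tuples (hypothetical layout).

    Returns (number of distinct hosts, number of distinct files),
    counting successful requests only.
    """
    hosts = set()
    files = set()
    for host, filename, status in requests:
        successful = 200 <= status <= 299 or status == 304
        if successful:
            hosts.add(host)
            files.add(filename)
    return len(hosts), len(files)


if __name__ == "__main__":
    sample = [
        ("a.example.com", "/index.html", 200),
        ("a.example.com", "/logo.gif", 200),
        ("b.example.com", "/index.html", 304),
        ("c.example.com", "/missing.html", 404),  # failed, so not counted
    ]
    print(distinct_hosts_and_files(sample))
```

Keeping every hostname in a set like this is what uses up memory on large logfiles.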

The common logfile format is written by most servers. Its lines look like

m45-6.gps.jussieu.fr - - [14/Mar/1996:17:45:35 +0000]
"GET /~sret1/analog/ HTTP/1.0" 200 12435
(except all on one line). Most of the fields are obvious -- the two numbers at the end are the status code and number of bytes transferred.
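
If you want to pick such a line apart yourself, the following Python sketch shows one way. It is only an illustration and not analog's own parser; the regular expression, the field names and the treatment of "-" as zero bytes are assumptions.

```python
import re

# host, ident, user, [date], "request", status, bytes
COMMON_LOG = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)$'
)


def parse_common(line):
    """Return a dict of fields, or None if the line does not match."""
    m = COMMON_LOG.match(line.strip())
    if m is None:
        return None
    fields = m.groupdict()
    fields["status"] = int(fields["status"])
    # Some servers write "-" when no bytes were sent; counting it as 0 is an assumption.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields


if __name__ == "__main__":
    line = ('m45-6.gps.jussieu.fr - - [14/Mar/1996:17:45:35 +0000] '
            '"GET /~sret1/analog/ HTTP/1.0" 200 12435')
    print(parse_common(line))
```

A line that does not match returns None, which roughly corresponds to what the glossary calls a corrupt line.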

The other logfiles are not the same on different servers. Analog understands the files written by the NCSA and Apache (and some other) servers. The browser (or agent) log looks like

[14/Mar/1996:17:45:08] Mozilla/2.0 (X11; I; HP-UX A.09.05 9000/735)
The referrer log looks like
[14/Mar/1996:17:48:10] http://guide-p.infoseek.com/Titles -> /~sret1/analog/
In both of these the date may be omitted. The error log looks like
[Thu Jul 28 20:43:10 1994] httpd: successful restart
Analog also understands the NCSA combined log. This looks like the common log, except that it has the referrer and browser on the end, like this:
lion.statslab.cam.ac.uk - - [18/Jan/1996:12:04:23 +0000]
"GET /~sret1/analog/ HTTP/1.0" 200 578 "http://www.statslab.cam.ac.uk/~sret1/"
"Mozilla/2.0 (X11; I; HP-UX A.09.05 9000/735)"
(except all on one line again). Note the quotes round the referrer and browser. In Apache, you can generate this with the mod_log_config module, using the command
LogFormat "%h %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-Agent}i\""
It is usually better to use the combined log than separate logs, because it stores more information in less space.
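
The extra two quoted fields are easy to pull out. Here is a Python sketch, under the same caveats as any hand-written parser: the regular expression and field names are assumptions for illustration, not analog's code.

```python
import re

# The common log fields, followed by "referrer" and "browser" in quotes.
COMBINED_LOG = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<browser>[^"]*)"$'
)


def parse_combined(line):
    """Return a dict of string fields, or None if the line does not match."""
    m = COMBINED_LOG.match(line.strip())
    return m.groupdict() if m else None


if __name__ == "__main__":
    line = ('lion.statslab.cam.ac.uk - - [18/Jan/1996:12:04:23 +0000] '
            '"GET /~sret1/analog/ HTTP/1.0" 200 578 '
            '"http://www.statslab.cam.ac.uk/~sret1/" '
            '"Mozilla/2.0 (X11; I; HP-UX A.09.05 9000/735)"')
    rec = parse_combined(line)
    print(rec["referrer"])
    print(rec["browser"])
```

Because the requested file, the referrer and the browser all sit on the same line, they can be related to each other, which is not possible with separate logs.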

The Mac version of analog also reads the WebSTAR and Netpresenz format log files. The WebSTAR file must start with a line like

!!LOG_FORMAT DATE TIME RESULT URL FROM TRANSFER_TIME BYTES_SENT USER HOSTNAME
to tell us what to expect on subsequent lines. (You should add it manually if it's got lost). We require the following fields as a minimum: HOSTNAME, DATE, TIME, RESULT, URL, BYTES_SENT. We will also read AGENT and REFERER if present. Each line of the logfile contains the fields given in the header line, separated by tabs: for example
12/14/95  19:00:04  OK  :pages:Downloads.html  72  3176  indy1.indy.net.
except all on one line.
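
The way the header line drives the parsing can be sketched in Python as follows. This is an illustration only, not the Mac version's code; the function name is invented, the sample header and data line are made up for the example, and skipping lines with the wrong number of fields is this sketch's own choice.

```python
REQUIRED = {"HOSTNAME", "DATE", "TIME", "RESULT", "URL", "BYTES_SENT"}


def parse_webstar(lines):
    """Yield one dict per data line, keyed by the names in the header line."""
    lines = iter(lines)
    header = next(lines, "")
    if not header.startswith("!!LOG_FORMAT"):
        raise ValueError("first line must be a !!LOG_FORMAT header")
    names = header.split()[1:]
    missing = REQUIRED - set(names)
    if missing:
        raise ValueError("missing required fields: " + ", ".join(sorted(missing)))
    for line in lines:
        values = line.rstrip("\n").split("\t")
        if len(values) != len(names):
            continue  # wrong number of fields: skipped in this sketch
        yield dict(zip(names, values))


if __name__ == "__main__":
    sample = [
        "!!LOG_FORMAT DATE TIME RESULT URL BYTES_SENT HOSTNAME",
        "12/14/95\t19:00:04\tOK\t:pages:Downloads.html\t3176\tindy1.indy.net.",
    ]
    for record in parse_webstar(sample):
        print(record["HOSTNAME"], record["URL"], record["BYTES_SENT"])
```

Since the field order comes from the header, the same code copes with logs whose fields are written in a different order.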

Here is a complete list of all 176 configuration commands. For their usage, see the documentation above.

Specifying files to analyse
BROWLOG, CACHEFILE, DOMAINSFILE, ERRLOG, LOGFILE, REFLOG
Turning reports on and off
ALL, BROWSER, COUNTHOSTS, DAILY, DIRECTORY, DOMAIN, ERROR, FILETYPE, FULLBROWSER, FULLDAILY, FULLHOSTS, FULLHOURLY, GENERAL, HOURLY, LASTSEVEN, MONTHLY, REFERRER, REQUEST, STATUS, WEEKLY
Columns in reports
BROWCOLS, DAYCOLS, DIRCOLS, DOMCOLS, FULLBROWCOLS, FULLDAYCOLS, FULLHOURCOLS, HOSTCOLS, HOURCOLS, MONTHCOLS, REFCOLS, REQCOLS, TYPECOLS, WEEKCOLS
Number of rows in graphs
FULLDAYROWS, FULLHOURROWS, MONTHROWS, WEEKROWS
Graphs by requests, bytes or pages
DAYGRAPH, FULLDAYGRAPH, FULLHOURGRAPH, HOURGRAPH, MONTHGRAPH, WEEKGRAPH
Graphs forwards or backwards in time
ALLBACK, FULLDAILYBACK, FULLHOURLYBACK, MONTHLYBACK, WEEKLYBACK
Value of unit in graphs
DAILYUNIT, FULLDAILYUNIT, FULLHOURLYUNIT, HOURLYUNIT, MONTHLYUNIT, WEEKLYUNIT
How to sort reports
BROWSORTBY, DIRSORTBY, DOMSORTBY, FULLBROWSORTBY, HOSTSORTBY, REFSORTBY, REQSORTBY, TYPESORTBY
Floors for reports
BROWMINBYTES, BROWMINPAGES, BROWMINREQS, DIRMINBYTES, DIRMINPAGES, DIRMINREQS, DOMMINBYTES, DOMMINPAGES, DOMMINREQS, ERRMINOCCS, FULLBROWMINBYTES, FULLBROWMINPAGES, FULLBROWMINREQS, HOSTMINBYTES, HOSTMINPAGES, HOSTMINREQS, REFMINBYTES, REFMINPAGES, REFMINREQS, REQMINBYTES, REQMINPAGES, REQMINREQS, SUBDOMMINBYTES, SUBDOMMINPAGES, SUBDOMMINREQS, TYPEMINBYTES, TYPEMINPAGES, TYPEMINREQS
Inclusions and exclusions
FILEALLOW, FILEEXCLUDE, FILEINCLUDE, FROM, LINKALLOW, LINKINCLUDE, LINKEXCLUDE, HOSTALLOW, HOSTEXCLUDE, HOSTINCLUDE, REFALLOW, REFEXCLUDE, REFINCLUDE, REFLINKALLOW, REFLINKEXCLUDE, REFLINKINCLUDE, REQALLOW, REQEXCLUDE, REQINCLUDE, TO
Aliases
BROWALIAS, BROWOUTPUTALIAS, DIROUTPUTALIAS, FILEALIAS, HOSTALIAS, HOSTOUTPUTALIAS, REFALIAS, REFOUTPUTALIAS, REQOUTPUTALIAS, TYPEOUTPUTALIAS
Other output configuration
BASEURL, DECPOINT, FOOTERFILE, GRAPHICAL, HEADERFILE, HOSTNAME, HOSTURL, IMAGEDIR, LANGUAGE, LOGOURL, MARKCHAR, OUTPUT, OUTFILE, PAGEWIDTH, PRESEP, RAWBYTES, REPORTORDER, REPSEPCHAR, SEPCHAR
DNS lookups
DNSFILE, DNSFRESHHOURS, NUMLOOKUP
Hash sizes
BROWHASHSIZE, DIRHASHSIZE, DNSHASHSIZE, FULLBROWHASHSIZE, HOSTHASHSIZE, REFHASHSIZE, REQHASHSIZE, SUBDOMHASHSIZE, TYPEHASHSIZE
Other commands
APPROXHOSTSIZE, CASE, CONFIGFILE, DEBUG, DIRLEVEL, DIRSUFFIX, ISPAGE, ISNOTPAGE, NOTSUBDOMAIN, PROGRESSFREQ, REFWITHARGS, REFWITHOUTARGS, SUBDOMAIN, UNCOMPRESS, WARNINGS, WEEKBEGINSON, WITHARGS, WITHOUTARGS

Here is a summary of all 39 commandline arguments. Again, for their usage, see the full documentation. Many of them can be given a - instead of a + to turn something off.

 +1  do DNS lookup
 +7  stats for last 7 days
 +a  ASCII output
 +A  all reports (except hourly report)
 +b  browser summary
 +B  browser report
 +c  status code report
 +C  configuration command
 +d  daily summary
 +D  daily report
 +e  error report
 +f  referrer report
 +form  do a form
 +F  from
 +g  configuration file
 -G  default config file off
 +help  help message
 +h  hourly summary
 +H  hourly report
 +i  directory report
 +l  dirlevel
 +m  monthly report
 +n  hostname
 +o  domain report
 +O  outfile
 +p  logo
 -q  no warnings
 +r  request report
 +s  host count
 +ss approximate host count
 +S  host report
 +t  file type report
 +T  to
 +u  host url
 +U  cache file
 +v  printvbles
 +V  debug level
 +w  pagewidth
 +W  weekly report
 +x  general summary

Frequently asked questions

See also the glossary above.

When I try and compile analog, it gives me an error.
Look in the Makefile to see if you need to include any extra options. This applies to SunOS 5 (Solaris 2) in particular.
Can analog understand my server's logfile?
The logfiles that analog understands are given in the glossary above. Unfortunately, more and more servers are using proprietary formats, and I just can't write code to parse them all. The best thing is to write a small program, perhaps in Perl, to convert the lines to common format. (The Microsoft Internet Server comes with a program called convlog to do this). Also, tell your server's authors that they should be supporting the common log format.
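Such a converter can be very short. Here is a sketch in Python rather than Perl. The input format (host|date|method|file|status|bytes, separated by vertical bars) is entirely hypothetical, and the sketch assumes the times are already in GMT and that every request was HTTP/1.0; you would have to adapt all of that to your own server.

```python
from datetime import datetime

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def to_common(line):
    """Convert one line of the hypothetical format into common log format."""
    host, stamp, method, path, status, size = line.rstrip("\n").split("|")
    t = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    # Month names are written out by hand so the result does not depend on locale.
    date = "%02d/%s/%04d:%02d:%02d:%02d +0000" % (
        t.day, MONTHS[t.month - 1], t.year, t.hour, t.minute, t.second)
    return '%s - - [%s] "%s %s HTTP/1.0" %s %s' % (
        host, date, method, path, status, size)


if __name__ == "__main__":
    print(to_common(
        "m45-6.gps.jussieu.fr|1996-03-14 17:45:35|GET|/~sret1/analog/|200|12435"))
```
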
Is there a version of analog for ftp logfiles?
The same thing applies. There is no standard format. The best thing to do is to convert the logfiles to HTTP common format first.
How can I do such-and-such with a commandline option?
Use the +C option to put any configuration command on the commandline.
How do I get information on just my pages, not everybody's?
Use the FILEINCLUDE command.
Why don't I get such-and-such a report in the output even though I asked for it? (or why don't I get the subdomains I requested in the domain report?)
Maybe the MINREQS or MINBYTES for the report is set too high. For example, if you ask for a host report for all hosts with at least 50 accesses and no host has that many, no report will be produced. See also the next question.
Can I get data on individual visitors to my site?
No. This information is usually not recorded in the logfile. The number of hosts people came from is the best estimate. Even where it is available, there are too many legal as well as ethical issues for me to get involved in it.
Well, can I generate the number of visits then?
Again, no, and any program that claims to do it is lying. Everyone wants to know these two things, but (in the absence of cookies or user identification) that information just isn't available in HTTP. (Some programs count all requests from the same machine as one visit until there is a gap of, say, 30 minutes. This is a very unsound method of counting visits. It goes wrong if someone leaves and then comes back within 30 minutes. It also goes wrong if several of your readers are from the same site. Both of these happen a lot).
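To see why, here is a Python sketch of the 30-minute method. Analog does not do this; the function name, the (host, time) layout and the example hostnames are assumptions made up to demonstrate the two failures just described.

```python
GAP = 30 * 60  # thirty minutes, in seconds


def count_visits(requests, gap=GAP):
    """requests: iterable of (host, time_in_seconds) pairs, in any order.

    A host's request starts a new "visit" if the host has not been seen
    before, or was last seen at least `gap` seconds earlier.
    """
    last_seen = {}
    visits = 0
    for host, t in sorted(requests, key=lambda r: r[1]):
        if host not in last_seen or t - last_seen[host] >= gap:
            visits += 1
        last_seen[host] = t
    return visits


if __name__ == "__main__":
    # One reader leaves and comes back ten minutes later: really two visits.
    print(count_visits([("a.example.com", 0), ("a.example.com", 600)]))
    # Two different readers behind the same machine: really two visitors.
    print(count_visits([("proxy.example.com", 0), ("proxy.example.com", 60)]))
```

Both examples are reported as a single visit, because the method can only see hostnames and times.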
What can I and can't I find out from the logfiles then?
I can recommend some excellent articles on this subject: Getting Real about Usage Statistics by Tim Stehle; Making Sense of Web Usage Statistics by Dana Noonan; and Interpreting WWW Statistics by Doug Linder. For an even more negative view, see Why Web Usage Statistics are (Worse Than) Meaningless by Jeff Goldberg.
Why does the form interface give "Document Returned no Data"?
Probably the server is giving up before the analog process has finished running. Increase the timeout interval on the server.
Can I find out the number of hosts that have accessed each file?
No; it would require storing too much data (all host/file pairs). If there's a particular file you're interested in, use FILEINCLUDE to restrict the analysis to only that file.
How can I run analog every day?
This depends on your particular machine. On Unix, you need to run analog as a cron job (see "man cron"). This is my cron command:
20 1 * * * $HOME/misc/analog
Do I have to save all my old logfiles?
This is answered in the section about cache files above.
What does "Browser logs contain no file information" mean?
The browser log doesn't say which files were requested, so analog cannot restrict that part of the analysis to certain files only. It will still produce a report though. (This is why it is better to use the combined log than separate logs; then the information can be extracted).
How can I specify different logfiles from the form interface?
Just add a new field to the form with name=lo
Can I sort the subdomains in the same way as the domains?
Unfortunately not. This would be the right thing to do if we only allowed one level of subdomains, but when there are several levels without all the intermediate levels being present, it's not clear what the correct order is.
Why are directories listed in the request report?
They are not directories, they are pages with the same name as the directory. For example, I have a directory called /analog/ and a page called /analog/ (which is the same as /analog/index.html).
Why do I only get "unresolved numerical addresses" in the domain report?
Your server only records the numerical IP address of the hosts that contact you, not their names. Read about DNS lookups above.
Can I list browsers by version number, e.g. Mozilla/2.0?
Not really. Different browsers don't report this information in the same way, so it can't be extracted automatically. You can use BROWALIAS to do it for specific browsers, but it's probably not wise because the same version number doesn't mean the same thing on different platforms.
How do I make a link on my page that runs analog?
Link to the analform program, with the desired options.
My server lists local names in the logfile. Can you put a common suffix on them automatically?
This wouldn't be a good idea, because things like "unknown" would get the suffix. You can always add them using HOSTALIAS.
Why don't you make proper graphs or tables?
Because lots of people then couldn't read them. Analog produces HTML output so that people with any browser can read it. Also, I don't want to assume that people have any particular graphics creation tools.
Can I change the background colour of my output?
No: that's not part of HTML either. Analog only writes true HTML.
Can you extrapolate from the current month's partial data to produce a prediction for the whole month, based on the rate so far?
No. There are too many problems in trying to produce anything sensible, especially near the beginning of the month. Different days of the week and different times of day cause lots of problems. I would rather produce accurate raw data than suspect derived data.
Can I make multiple reports with one pass through the logfile?
Not at the moment. I want to do this in a future version, but it will require some considerable work.
I ran out of memory when trying to run analog. What can I do?
Try using approximate (instead of exact) hostname counting with the +ss option, or turning hostname counting off altogether with -s. (Note that you also need to have the host report off for this to take effect).
You're processing 1,000,000 requests in under 2 minutes. Why is mine much slower?
If you have DNS lookups on, they are very slow. Otherwise, it probably depends on the speed of your computer and disks. But you could try changing the hash sizes.
Why don't you sell analog?
I didn't write analog for the money, and I'm happy just to see people use it. Besides, I haven't got time to support it commercially, and I can't use my academic account for commercial purposes. (Of course, if you really want to send me money, or even postcards...).

Warnings

Lines with filenames longer than a certain limit (which can be specified in analhead.h) are regarded as corrupt lines and discarded.

If we are doing a `top n' report and two entries tie for nth place, only one will be printed.

The reported `running time' is elapsed real time, not CPU time.

You can sort a report by requests even when you have turned off the request columns. This may confuse your readers.

The bytes sometimes aren't reported correctly. This is really a server bug. Servers don't actually measure the number of bytes transferred, they measure the size of the file they are about to transfer, so if a connection is interrupted, they may write down more bytes than were actually transferred. (Actually, they sometimes do a bit better than that, but it's still likely to be an overestimate). Of course any inaccuracy in the logfile will make analog not tell the truth.


Known bugs

Average requests per day don't take account of daylight saving time changes.

If you find any more bugs, please tell me! Thanks.


Mailing list

I always welcome mail on analog (my e-mail address is sret1@cam.ac.uk); whether it works on your system (yes, even if it does!), any bug reports, patches or requests for new features. It's helpful to me if you include the word "analog" in the subject of your message.

I can no longer keep everyone who writes to me informed about updates. But if you want to be informed about updates, send me mail with "subscribe analog" in the subject line, and I'll make sure to put you on the mailing list. (You can write to me in the body of the message as well; I'll still read it).

I am happy to help people who have trouble with analog, but please read the FAQ first. Also, you might be able to diagnose the problem yourself if you run

analog -v [your usual options]
which lists the value of all variables. But if you still can't get it to work, ask me. It helps me find bugs, and to know where the documentation is unclear. When submitting bug reports, please include the version number (which you can find out by the command analog -v) and what type of computer you're using.

Acknowledgements

Thanks are due to the author of getstats, Kevin Hughes. We (and other people) have found that getstats gets buggy and very slow when the logfile gets big, but you may notice that my output (although not my program) is based on his to some extent.

Thanks are also due to all those who helped in the early stages of writing this program. Those who made helpful suggestions during beta testing are numerous, but I must mention particularly Dan Anderson, Martyn Johnson, Joe Ramey, Chris Ritson, Quentin Stafford-Fraser and Dave Stanworth; and above all Gareth McCaughan for lots of programming advice, particularly in making the code faster.

For recent developments I have to thank Dave Stanworth (again!) and all the other people who have provided mirror sites for analog. I have to thank Mark Roedel for the DOS version, Jason Linhart for the Mac version (including lots of new code), Dave Jones for the VMS version, and Dan Anderson for the Win32 version. Stephan Somogyi and Nigel Perry have also contributed code for the Mac version. For the translations into other languages, I have to thank Patrice Lafont (French), Mario Ellebrecht (German), Furio Ercolessi (Italian), Javier Solis (Spanish) and Adrian Price (Danish).


Licence

Analog is copyright (C) Stephen R. E. Turner 1995 - 1997, except those parts written by other people.

This licence describes the conditions under which you may use, modify and distribute version 2.0 of analog ("the program"). Except where stated, the conditions of this licence apply equally to the source code for the program and to any compiled version. If you are unable or unwilling to accept these conditions in full, then, notwithstanding the conditions in the remainder of this licence, you may not use, modify or distribute the program at all. Text in square brackets is intended for guidance only and does not form part of the licence in any way.
[Analog is free software. This licence is designed not to restrict your freedom except insofar as is necessary to ensure that the program remains free for all. Of course, I don't refuse donations!]

  1. Any action which is illegal under international or local law is forbidden by this licence. Any such action is the sole responsibility of the person committing the action.

  2. The program may be used free of charge by any person or organisation to whom it is made available, provided that that person accepts the conditions of this licence.

  3. The program may be copied and distributed by any person or organisation in any way whatsoever, provided that any distribution is accompanied by a copy of the Readme file pertaining to the program. You may not charge for distributing the program without first informing the person to whom it is distributed that analog is free software. Furthermore, you may not charge for distributing a modified version of the program unless the source code for the modified version, or a list of differences between the modified version and the original version, is publicly and freely available in machine readable form.
    [A copy of the Readme may be downloaded from the analog home page, currently at http://www.statslab.cam.ac.uk/~sret1/analog/ and mirrored at http://www.gamesdomain.com/analog/ and other sites. You are strongly encouraged to distribute only complete distributions. If you distribute analog with a book or something like that, I'd be pleased if you wanted to send me a copy].

  4. You may make a reasonable charge for either of the following services, provided in each case that the third party is made aware that analog is free software and that the charge is therefore for your labour, expertise and costs.
    1. Installing the program on a computer on behalf of a third party;
    2. Running the program and providing output from it to a third party.
    You may not charge for these services in connection with a modified version of the program unless the source code for the modified version, or a list of differences between the modified version and the original version, is publicly and freely available in machine readable form.

  5. You may modify the program in any way you wish provided that all of the following conditions are met:
    1. Any modification in the source code is clearly marked as such;
    2. No modification is made to the Readme file;
      [Any documentation needed on your changes must therefore be made in a separate file].
    3. The VNUMBER string is changed to "2.0(modified)";
      [This string can be found near the top of analhea2.h].
    4. You ensure that the HTML2.0 logo is never produced except when the output is HTML2.0 conformant;
      [This logo is produced near the bottom of output.c.]
    5. All of the conditions of this licence, and no other conditions, apply to your modified version.
    You may claim copyright for the parts of the program you have written.
    [You are encouraged to submit your changes to me for inclusion in subsequent versions of analog].

  6. No warranty of any sort, expressed or implied, is provided in connection with the program, including, but not limited to, implied warranties of merchantability or fitness for a particular purpose. Any cost, loss or damage of any sort incurred owing to the malfunction or misuse of the program or the inaccuracy of the documentation or connected with the program in any other way whatsoever is the responsibility of the person who incurred the cost, loss or damage. By using this program you give up any right to seek any damages against me in connection with this program.
    [I don't guarantee that the program doesn't contain bugs].

  7. I, Stephen Turner, reserve the right to make exceptions to any of these conditions, or alter these conditions, at any time. However, you may always use these conditions instead of any altered version if you prefer.
    [Note that this licence explicitly applies only to one version of analog. Therefore, if I make new conditions in connection with a future version, you do not then have the right to apply these conditions to that version instead].

Stephen R. E. Turner
Statistical Laboratory,
University of Cambridge
sret1@cam.ac.uk
5th February 1997

Page last modified: 05-Feb-97