Projections

Today I’d like to present a little Python script that extrapolates Webalizer reports. Webalizer is a “fast, free web server log file analysis program” that my webhost uses to produce HTML traffic reports. These reports include month-by-month totals for the last 12 months. Unfortunately, the numbers for the current (partial) month aren’t directly comparable to those for previous months. This script grabs the summary report and does a linear extrapolation on the most recent month. It’s a bit silly, but I like the apples-to-apples comparison.

Fetch and Parse

The first thing to do is to fetch the summary report and extract the relevant data from it. The parsing code you see here is ridiculously specific to the version and configuration of Webalizer that my host runs. It’d be no big deal to change it if necessary, and writing it this way was much faster than trying to be general. Here’s the code:

def get_report(url):
    fp = urllib2.urlopen(url)
    s = fp.read()
    fp.close()
    return s

def get_gen_date(s):
    t = re.search(r'Generated\s+(.+)<BR>', s).group(1)
    return time.strptime(t, '%d-%B-%Y %H:%M %Z')

def get_table_row(s, i):
    cells = re.findall('<TD.+<FONT.+>(.+)</FONT>.*</TD>', s)
    return cells[i*13+0:i*13+1] + cells[i*13+5:i*13+6] + cells[i*13+8:i*13+13]

I use regexes to search for certain “magic” HTML patterns. I assume that the results table is 13 columns wide, and that the data I want are in these columns (counting from 0):

  • Column 0: Month
  • Column 5: Uniques
  • Column 8: KB Out
  • Column 9: Visits
  • Column 10: Pages
  • Column 11: Files
  • Column 12: Hits
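
To make the slicing concrete, here is a small sketch that applies the same index arithmetic to a made-up flat list of cells. It is written for Python 3 (unlike the Python 2 script in this post), and the pick_columns name and the 'c0', 'c1', ... placeholder cells are purely illustrative, not part of the original script.

```python
COLS_PER_ROW = 13  # assumed table width, as in get_table_row()

def pick_columns(cells, i):
    """Return columns 0, 5 and 8..12 of row i from a flat list of cells."""
    base = i * COLS_PER_ROW
    return (cells[base:base + 1]
            + cells[base + 5:base + 6]
            + cells[base + 8:base + 13])

# Two rows of placeholder cells: 'c0' .. 'c25'
cells = ['c%d' % n for n in range(2 * COLS_PER_ROW)]
print(pick_columns(cells, 0))
print(pick_columns(cells, 1))
```

Because these are slices rather than plain index lookups, a list that is too short produces a shorter result instead of an IndexError, so a layout change in the report can fail quietly.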

Extrapolation

To do my linear extrapolation, I need to compare the time between report generation and the start of the month against the length of the entire month. This is a little fiddly, but not too bad:

def get_factor(t):
    # Start of the month containing t, and start of the following month
    som = time.mktime((t[0], t[1], 1, 0, 0, 0, 0, 0, -1))
    if t[1] == 12:
        eom = time.mktime((t[0]+1, 1, 1, 0, 0, 0, 0, 0, -1))
    else:
        eom = time.mktime((t[0], t[1]+1, 1, 0, 0, 0, 0, 0, -1))
    # Length of the whole month divided by the time elapsed so far
    return (eom-som)/(time.mktime(t)-som)

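Here, t is a Python time.struct_time sequence. Its first three elements are year, month (1 to 12), and day. Its last element is a DST indicator; a value of -1 means “do the right thing”. DST might not be handled 100% correctly in some cases. Boo hoo. Also note that a report generated at exactly midnight on the first of the month makes the denominator zero, so get_factor() would raise a ZeroDivisionError.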
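
As a sanity check on the arithmetic, here is a rough Python 3 sketch of the same ratio using naive datetime objects instead of time.mktime(). This is a swapped-in technique, not the script's method: naive datetimes ignore DST entirely, and the month_factor name is hypothetical.

```python
from datetime import datetime

def month_factor(t):
    """Length of t's month divided by the time elapsed since the month began."""
    som = datetime(t.year, t.month, 1)
    if t.month == 12:
        eom = datetime(t.year + 1, 1, 1)   # December rolls over to next January
    else:
        eom = datetime(t.year, t.month + 1, 1)
    return (eom - som) / (t - som)         # timedelta / timedelta gives a float

# Halfway through a 30-day month and halfway through a 31-day month
print(month_factor(datetime(2010, 6, 16)))       # 30 days / 15 days
print(month_factor(datetime(2010, 12, 16, 12)))  # 31 days / 15.5 days
```

Both calls land exactly halfway through their month, so each total would simply be doubled.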

Output

Okay, now we can dump some output to the console.

def output(t, r, f):
    print 'Built on', time.strftime('%d-%B-%Y %H:%M %Z', t), 'for', r[0]
    print

    d = [int(f*int(i)) for i in r[1:]]
    print 'Projected Uniques/Visits:   %d / %d' % (d[0],d[2])
    print 'Projected Pages/Files/Hits: %d / %d / %d' % tuple(d[-3:])
    print 'Projected GB Served:        %.3f' % float(d[1]/1000000.0)

Note that:

  • t is a time.struct_time returned by get_gen_date()
  • r is a list returned by get_table_row()
  • f is a float returned by get_factor(), or 1 when no extrapolation is wanted
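
Here is a small Python 3 sketch of just the scaling step, separated from the printing so the numbers are easy to check. The project function is a hypothetical helper and the row values are invented for illustration; they are not real traffic figures.

```python
def project(r, f):
    """Scale the numeric cells of r by f, truncating to ints as output() does."""
    d = [int(f * int(i)) for i in r[1:]]
    return {
        'uniques': d[0],
        'gb_served': d[1] / 1000000.0,  # KB -> GB, using decimal units
        'visits': d[2],
        'pages': d[3],
        'files': d[4],
        'hits': d[5],
    }

# Invented row: month, uniques, KB out, visits, pages, files, hits
row = ['Jun 2010', '100', '2500000', '150', '300', '400', '500']
print(project(row, 2.0))
```

With a factor of 2.0 every count doubles, and 2,500,000 KB becomes a projected 5.0 GB.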

Make It Go

Here’s the main function that co-ordinates all the parts:

def report():
    s = get_report(PUT_AN_APPROPRIATE_URL_HERE)
    t = get_gen_date(s)
    f = get_factor(t)

    if (f < 60):
        output(t, get_table_row(s, 0), f)
    else:
        output(t, get_table_row(s, 1), 1)

Note that when I don’t have enough data for the current month, I print non-extrapolated data from the previous month instead. I set the threshold at a factor of 60, which is about 12 hours of data in a 30-day month. (This code assumes that there is a previous month of data.)
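
Since the factor is the month length divided by the elapsed time, the elapsed time at the threshold is the month length divided by 60. A quick Python 3 sketch of that arithmetic (hours_of_data is a hypothetical helper, not part of the script):

```python
def hours_of_data(days_in_month, factor):
    """Hours elapsed in the month when the extrapolation factor equals `factor`."""
    return days_in_month * 24.0 / factor

# The cutoff in hours for each possible month length
for days in (28, 30, 31):
    print(days, hours_of_data(days, 60))
```

So the cutoff drifts a little with the length of the month, from roughly 11 hours in a 28-day February to a bit over 12 hours in a 31-day month.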

Code

Here’s the complete script:

#!/usr/bin/python

import  urllib2
import  time
import  re


def get_report(url):
    fp = urllib2.urlopen(url)
    s = fp.read()
    fp.close()
    return s


def get_gen_date(s):
    t = re.search(r'Generated\s+(.+)<BR>', s).group(1)
    return time.strptime(t, '%d-%B-%Y %H:%M %Z')

def get_table_row(s, i):
    cells = re.findall('<TD.+<FONT.+>(.+)</FONT>.*</TD>', s)
    return cells[i*13+0:i*13+1] + cells[i*13+5:i*13+6] + cells[i*13+8:i*13+13]

def get_factor(t):
    # Start of the month containing t, and start of the following month
    som = time.mktime((t[0], t[1], 1, 0, 0, 0, 0, 0, -1))
    if t[1] == 12:
        eom = time.mktime((t[0]+1, 1, 1, 0, 0, 0, 0, 0, -1))
    else:
        eom = time.mktime((t[0], t[1]+1, 1, 0, 0, 0, 0, 0, -1))
    # Length of the whole month divided by the time elapsed so far
    return (eom-som)/(time.mktime(t)-som)


def output(t, r, f):
    print 'Built on', time.strftime('%d-%B-%Y %H:%M %Z', t), 'for', r[0]
    print

    d = [int(f*int(i)) for i in r[1:]]
    print 'Projected Uniques/Visits:   %d / %d' % (d[0],d[2])
    print 'Projected Pages/Files/Hits: %d / %d / %d' % tuple(d[-3:])
    print 'Projected GB Served:        %.3f' % float(d[1]/1000000.0)

def report():
    s = get_report(PUT_AN_APPROPRIATE_URL_HERE)
    t = get_gen_date(s)
    f = get_factor(t)

    if (f < 60):
        output(t, get_table_row(s, 0), f)
    else:
        output(t, get_table_row(s, 1), 1)


if __name__ == '__main__':
    report()