mod_perl and dbm files





The Writing Apache Modules with Perl and C book can be purchased online from O'Reilly and Amazon.com.
Your corrections of either technical or grammatical errors are very welcome. You are encouraged to help me to improve this guide. If you have something to contribute please send it directly to me.



Where and Why to use dbm files

dbm files were among the first database implementations. They originated on Unix systems and are still used in many Unix applications where simple key-value pairs must be stored and manipulated. As of this writing Berkeley DB is the most powerful dbm implementation. If you need a light database with an easy API, this is the first solution to consider, but only if you are sure the database will stay small: I would say under 5000-10000 records, although the exact limit depends on your hardware. It is a much better solution than flat file databases, which become pretty slow on insert, update and delete operations when the number of records grows beyond 1000. The situation is even worse when we need to sort this kind of database.

dbm files are manipulated much faster than their flat file brothers, because the whole database is almost never read into memory and because of the smart storage techniques they use. You can use the HASH algorithm, which gives O(1) average complexity for search and update and fast insert and delete, but slow sorting, since you have to do it yourself. BTREE stores arbitrary key/value pairs in a sorted, balanced tree, so you can walk the pairs in key order without sorting them yourself, at the price of somewhat slower (O(log n)) lookup, insert, update and delete operations. RECNO is more complicated: it lets both fixed-length and variable-length flat text files be manipulated through the same key/value pair interface as HASH and BTREE, with the record (line) number serving as the key. Most likely you will want the HASH format, but the choice depends very much on your application.
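
The difference between HASH and BTREE key ordering can be seen in a minimal sketch like the one below. It assumes DB_File is installed and uses in-memory databases (an undef filename) only to keep the example free of temporary files; the fruit names are made up for illustration.

```perl
use strict;
use warnings;
use DB_File;

# In-memory databases (filename undef), one per storage format.
tie my %hash,  'DB_File', undef, O_RDWR|O_CREAT, 0666, $DB_HASH
    or die "Can't tie hash: $!";
tie my %btree, 'DB_File', undef, O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Can't tie btree: $!";

# Insert the same keys, deliberately out of order.
for my $fruit (qw(cherry apple banana)) {
    $hash{$fruit}  = 1;
    $btree{$fruit} = 1;
}

my @hash_keys  = sort keys %hash;   # HASH: we have to sort ourselves
my @btree_keys = keys %btree;       # BTREE: keys come back in lexical order

print "hash:  @hash_keys\n";
print "btree: @btree_keys\n";

untie %hash;
untie %btree;
```

Both lines list the keys alphabetically, but only the HASH version needed an explicit sort.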

dbm databases are not limited to storing simple key/value pairs: with the help of the MLDBM module they can store more complicated structures. MLDBM serializes nested data structures, such as arrays of hashes and hashes of arrays, and stores them as the values of the dbm file.
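
As a rough illustration of the MLDBM idea, the sketch below stores a nested record under one key. It assumes the MLDBM, DB_File and Storable modules are installed; the file location and the record layout are hypothetical and exist only for this example.

```perl
use strict;
use warnings;
use Fcntl;                        # O_CREAT, O_RDWR
use File::Temp qw(tempdir);
use MLDBM qw(DB_File Storable);   # dbm backend, serializer

# Hypothetical scratch location, used only for this example.
my $file = tempdir(CLEANUP => 1) . "/users.db";

tie my %users, 'MLDBM', $file, O_CREAT|O_RDWR, 0640
    or die "Can't tie $file: $!";

# Store a nested structure as a single value.
$users{stas} = { groups => ['admin', 'dev'], logins => 1 };

# Nested elements cannot be modified in place through the tie:
# fetch the whole value, change the copy, then store it back.
my $record = $users{stas};
$record->{logins}++;
$users{stas} = $record;

my $logins = $users{stas}{logins};
my $groups = join ",", @{ $users{stas}{groups} };
print "logins=$logins groups=$groups\n";

untie %users;
```

The fetch, modify and store sequence is the part to remember: assigning directly to `$users{stas}{logins}` would change only a temporary copy.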

Another important point is that you cannot convert a dbm file from one storage algorithm to another simply by tying it with the format you want. The only way is to dump the records and load them into a new file created with the new format. You can use a script like:

  #!/usr/bin/perl -w
  
  #
  # This script takes one or more Berkeley DB files stored with the
  # DB_BTREE algorithm, renames each to filename.btree as a backup, and
  # creates in its place a db with the same records stored with the
  # DB_HASH algorithm
  #
  # Usage: btree2hash.pl filename(s)
  
  use strict;
  use DB_File;
  use File::Copy;
  
    # Do checks 
  die "Usage: btree2hash.pl filename(s)\n" unless @ARGV;
  
  foreach my $filename (@ARGV) {
  
    die "Can't find $filename: $!\n" unless -e $filename and -r $filename;
  
      # First backup the filename
    move("$filename","$filename.btree") 
      or die "can't move $filename $filename.btree:$!\n";
  
    my %hash;
    my %btree;
  
      # tie both dbs (db_hash is a fresh one!)
    tie %btree , 'DB_File',"$filename.btree", O_RDWR|O_CREAT, 
        0660, $DB_BTREE or die "Can't tie %btree: $!\n";
    tie %hash ,  'DB_File',"$filename" , O_RDWR|O_CREAT, 
        0660, $DB_HASH  or die "Can't tie %hash: $!\n";
  
      # copy DB
    %hash = %btree;
  
      # untie
    untie %btree ;
    untie %hash ;
  }

Note that some dbm implementations come with other conversion utilities as well.



mod_perl and dbm

Where does mod_perl enter the picture? If you are using a read-only dbm file, you can make it work faster by keeping it open (tied) all the time, so that when your CGI script wants to access the database it is already tied and ready to be used. This works with dynamic (read/write) dbm databases as well, but then you need locking to avoid data corruption. This feature can give your CGIs a huge speedup, but you should be very careful. You have to take into account db locking, possible die() cases and child quits. A stale lock can deactivate your whole site if your locking mechanism cannot handle dropped locks. You can also enter a deadlock if 2 processes each try to acquire locks on 2 databases: each holds one of the databases and, before releasing it, waits for the other one, which will never be freed (this is possible only if the processes do not all ask for their DB files in the same order). If you modify the DB you should be very careful to flush and synchronize it, especially when your CGI dies unexpectedly. In general, an application that handles important data should be tested very thoroughly before you put it into production.
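
Keeping a read-only dbm file tied between requests could look roughly like the sketch below. The package name My::DBMCache and its handle() function are hypothetical, invented for this illustration; the only assumption about mod_perl is that package-level variables survive from one request to the next inside the same child process.

```perl
package My::DBMCache;    # hypothetical package name
use strict;
use warnings;
use DB_File;

# Lives as long as the process does, so under mod_perl each child
# ties a given file only once and reuses it on later requests.
my %tied;

# Return a reference to a hash tied read-only to $file,
# tying it on the first call and reusing it afterwards.
sub handle {
    my ($file) = @_;
    unless ($tied{$file}) {
        my %db;
        tie %db, 'DB_File', $file, O_RDONLY, 0660, $DB_HASH
            or die "Can't tie $file: $!";
        $tied{$file} = \%db;
    }
    return $tied{$file};
}

1;
```

A script would then call something like `my $users = My::DBMCache::handle("/tmp/test");` and read `$users->{stas}` without paying for a tie on every request. This is suitable only for files that never change while the server runs.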



Locking dbm handlers

Let's keep the lock status in a global variable, so that it persists from request to request. When we request a lock, READ (shared) or WRITE (exclusive), the current lock status is obtained first.

A READ lock request is granted as soon as the file becomes or is unlocked, or if it is already locked for READ. The lock status is then READ.

A WRITE lock request is granted as soon as the file becomes or is unlocked. The lock status is then WRITE.

What happens to the WRITE lock request is the most important part. If the DB is READ locked, a process that requests to write will poll until there is no reading or writing process left. Lots of processes can successfully read the file, since they do not block each other from doing so. This means that a process that wants to write to the file (first obtaining an exclusive lock) may never get a chance to squeeze in. The following diagram represents a possible scenario where everybody reads but no one can write:

  [-p1-]                 [--p1--]
     [--p2--]
   [---------p3---------]
                 [------p4-----]
     [--p5--]   [----p5----]

So you get a starving process, which will most certainly time out the request, and the DB will not be updated.
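
The polling writer described above can be sketched with a bounded number of non-blocking flock attempts. The function name lock_write_or_give_up and its parameters are hypothetical. Note that this does not cure the starvation; it only lets the writer notice it and report failure instead of hanging until the request times out.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Try to get an exclusive lock without blocking; retry up to $tries
# times, sleeping $delay seconds between attempts. Returns 1 on
# success and 0 if the lock could not be obtained in time.
sub lock_write_or_give_up {
    my ($fh, $tries, $delay) = @_;
    for my $attempt (1 .. $tries) {
        return 1 if flock($fh, LOCK_EX | LOCK_NB);
        select(undef, undef, undef, $delay);   # sub-second sleep
    }
    return 0;
}
```

A caller would pass an already opened filehandle, for example `lock_write_or_give_up($fh, 10, 0.2)`, and skip the update (or log an error) when 0 comes back.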

So you have another reason not to cache the dbm handle with dynamic dbm files. But it works perfectly with static dbm files, without any need to lock the files at all. Ken Williams solved the above problem in his Tie::DB_Lock module, which I present in the next section.



Tie::DB_Lock

Tie::DB_Lock, a module by Ken Williams, ties hashes to databases using shared and exclusive locks. It solves the problem raised in the previous section.

The main difference from what I have described before is that Tie::DB_Lock copies a dbm file on read so that reader processes do not have to keep the file locked while they read it, and writers can still access it while others are reading. It works best when you have lots of long-duration reading, and a few short bursts of writing.

The drawback of this module is the heavy I/O performed when every reader makes a fresh copy of the DB. With big dbm files this can be quite a disadvantage and a slowdown. An improvement that would cut the number of files being copied is to keep only one copy of the dbm image, shared by all the reader processes. This would put the responsibility for copying the read-only file on the writer, not the reader. It would take some care to make sure that putting a new read-only copy into place does not disturb the readers.
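
One possible way to put a new read-only copy into place without disturbing readers is to copy to a temporary name and then rename it over the shared snapshot. This is only a sketch of the suggested improvement, not something Tie::DB_Lock does; publish_snapshot and the file naming scheme are hypothetical, and the writer is assumed to have flushed the master file and to hold its exclusive lock while copying.

```perl
use strict;
use warnings;
use File::Copy qw(copy);

# Copy the master dbm file to a temporary name, then rename it over
# the shared read-only snapshot. rename() replaces the target
# atomically when both names are on the same filesystem, so readers
# see either the old snapshot or the new one, never a partial copy.
sub publish_snapshot {
    my ($master, $snapshot) = @_;
    my $tmp = "$snapshot.tmp.$$";    # per-process temporary name
    copy($master, $tmp)     or die "Can't copy $master to $tmp: $!";
    rename($tmp, $snapshot) or die "Can't rename $tmp to $snapshot: $!";
    return 1;
}
```

Readers that already have the old snapshot open keep reading the old file until they tie again, so they would need to re-tie from time to time to pick up new data.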



Code snippets

I have discussed what can be achieved with mod_perl and dbm files, and the pros and cons. Now it is time to show some code. I wrote a simple wrapper for the DB_File module and extended it to handle locking and proper exits. Note that this code still needs some testing, so be careful if you use it on your production machine as is.

So here is DB_File::Wrap (note that you will not find it on CPAN):

  package DB_File::Wrap;
  require 5.004;
  
  use strict;
  
  BEGIN {
      # RCS/CVS compliant:  must be all one line, for MakeMaker
    $DB_File::Wrap::VERSION = do { my @r = (q$Revision: 1.1.1.1 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r };
  
  }
  
  use DB_File;
  use Fcntl qw(:flock O_RDWR O_CREAT);
  use Carp qw(croak carp verbose);
  use IO::File;
  
  use vars qw($debug);
  
  #$debug = 1;
  $debug = 0;
  
  # my $db = new DB_File::Wrap \%hash, $filename, [lockmode];
  # from now on we can work with both %hash (tie API) and $db (direct API)
  #########
  sub new{
    my $class     = shift;
    my $hr_hash   = shift;
    my $file      = shift;
    my $lock_mode = shift || '';
    my $db_type   = shift || 'HASH';
  
    my $self;
    $self = bless {
                 db_type => 'DB_File',
                 flags   => O_RDWR|O_CREAT,
                 mode    => 0660,
                 hash    => $hr_hash,
                 how     => $DB_HASH,
                }, $class ;
  
      # by default we tie with HASH alg and if requested with BTREE
    $self->{'how'} = ($db_type eq 'BTREE') ? $DB_BTREE : $DB_HASH;
  
      # tie the object
    $self->{'db_obj'} = tie %{$hr_hash},
      $self->{'db_type'},$file, $self->{'flags'},$self->{'mode'}, $self->{'how'}
        or croak "Can't tie $file:$!\n";
  
    my $fd = $self->{'db_obj'}->fd;
    croak "Can't get fd :$!" unless defined $fd and $fd;
    $self->{'fh'}= new IO::File "+<&=$fd" or croak "[".__PACKAGE__."] Can't dup: $!";
  
      # set the lock status to unlocked
    $self->{'lock'} = 0;
  
      # do the lock here if requested
    $self->lock($lock_mode) if $lock_mode;
  
    return $self;
  
  } # end of sub new
  
  
  # lock the fd either exclusive or shared lock (write/read)
  # default is read (shared)
  ###########
  sub lock{
    my $self      = shift;
    my $lock_mode = shift || 'read';
  
  # lock codes:
  # 0 == not   locked
  # 1 == read  locked
  # 2 == write locked
  
    if ($lock_mode eq 'write') {
        # Get the exclusive write lock
      unless (flock ($self->{'fh'}, LOCK_EX | LOCK_NB)) {
        unless (flock ($self->{'fh'}, LOCK_EX)) { croak "exclusive flock: $!" }
      }
        # save the status of lock
      $self->{'lock'} = 2;
  
    } elsif ($lock_mode eq 'read'){
        # Get the shared read lock
      unless (flock ($self->{'fh'}, LOCK_SH | LOCK_NB)) {
        unless (flock ($self->{'fh'}, LOCK_SH)) { croak "shared flock: $!" }
      }
        # save the status of lock
      $self->{'lock'} = 1;
    } else {
        # incorrect mode
      carp "Can't lock. Unknown mode: $lock_mode\n";
    }
  
  } # end of sub lock
  
  # unlock 
  ###########
  sub unlock{
    my $self = shift;
  
    $self->{'db_obj'}->sync() if defined $self->{'db_obj'};   # flush
    flock($self->{'fh'}, LOCK_UN);
    $self->{'lock'} = 0;
  }
  
  # untie the hash
  # and close all the handlers
  # if wasn't unlocked, end() will unlock as well
  ###########
  sub end{
    my $self = shift;
  
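      # unlock if still locked
    $self->unlock() if $self->{'lock'};
  
      # drop the object reference first, so that untie can really close the db
    delete $self->{'db_obj'} if $self->{'db_obj'};
      # the hash reference is stored under the 'hash' key by new()
    untie %{$self->{'hash'}} if $self->{'hash'};
    $self->{'fh'}->close     if $self->{'fh'};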
  
  }
  
  
  # DESTROY does all kinds of cleanup if the functions were interrupted
  # before their completion and haven't had a chance to clean up.
  ###########
  sub DESTROY{
    my $self = shift;
  
      # just to be sure that we properly closed everything
    $self->end();
  
    print "Destroying ".__PACKAGE__."\n" if $debug;
    undef $self if $self;
  
  }
  
  ####
  END {
  
    print "Calling the END from ".__PACKAGE__."\n" if $debug;
  
  }
  
  1;

And this is how you use it:

  use DB_File::Wrap ();

A simple tie, READ lock and untie:

  my $dbfile = "/tmp/test";
  my %mydb = ();
  my $db = new DB_File::Wrap \%mydb, $dbfile, 'read';
  print $mydb{'stas'} if exists $mydb{'stas'};
    # sync and untie
  $db->end();

You can even skip the end() call if you leave the scope in which $db is defined:

  sub user_exists{
    my $user = shift;
    my $result = 0;
  
    my %mydb = ();
    my $db = new DB_File::Wrap \%mydb, $dbfile, 'read';
  
    # if we match the username return 1
    $result = 1 if $mydb{$user};
  
    $result;
  } # end of sub user_exists

Perform both read and write operations:

  my $dbfile = "/tmp/test";
  my %mydb = ();
  my $db = new DB_File::Wrap \%mydb, $dbfile;
  print $mydb{'stas'} if exists $mydb{'stas'};
  
    # lock the db, we are going to change it!
  $db->lock('write');
  $mydb{'stas'} = 1;
    # unlock the db for write
  $db->unlock();
  
    # sync and untie
  $db->end();

If your CGI is interrupted in the middle, the DESTROY block takes care of unlocking the dbm file and flushing the changes. Note that I have seen db corruption even with this code on huge dbm files (10000+ records), so be careful when you use it. I thought that I had covered all the possible failures, but it seems that I had not. In the end I moved everything to MySQL. So if you figure out where the problem is, you are very welcome to tell me about it.




Written by Stas Bekman.
Last Modified at 09/25/1999
Use of the Camel for Perl is
a trademark of O'Reilly & Associates,
and is used by permission.