dbm files were the first implementations of databases. They originated on Unix systems and are still used in many Unix applications that need to store and manipulate simple key-value pairs. As of this writing, Berkeley DB is the most powerful dbm implementation. If you need a light database with an easy API to work with, this is the first solution you should consider, but only if you are sure the DB is going to stay small, I would say under 5,000-10,000 records (your hardware may raise or lower these numbers). It is a much better solution than flat file databases, which become quite slow on insert, update and delete operations once the number of records grows beyond about 1,000. The situation is even worse when we need to sort such a DB.
dbm files are manipulated much faster than their flat file brothers, since the whole DB almost never has to be read into memory, and because of their smart storage techniques. The HASH algorithm gives you O(1) search and update on average, and fast insert and delete, but slow sorting, since you have to do the sorting yourself. BTREE stores arbitrary key/value pairs in a sorted, balanced binary tree, which lets us retrieve a sorted sequence of data pairs without any extra sorting work (individual lookups cost O(log n)), at the price of slower insert, update and delete operations. The RECNO algorithm is a more complicated one: it enables both fixed-length and variable-length flat text files to be manipulated using the same key/value pair interface as HASH and BTREE, with the key consisting of a record (line) number. Most chances are that you will want to use the HASH format, but your choice depends very much on the kind of application you have.
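For instance, here is a minimal sketch of tying a dbm file with the BTREE algorithm and getting the keys back in sorted order, with no explicit sorting (the file name is made up for the example):

  use strict;
  use DB_File;
  use Fcntl qw(O_RDWR O_CREAT);

  my %db;
  tie %db, 'DB_File', "/tmp/fruit.db", O_RDWR|O_CREAT, 0660, $DB_BTREE
      or die "Can't tie /tmp/fruit.db: $!";

  $db{$_} = 1 for qw(walnut apple mango);

  # BTREE keeps the keys sorted, so no explicit sort is needed
  print "$_\n" for keys %db;    # apple, mango, walnut

  untie %db;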
dbm databases are not limited to storing simple key and value pairs: with the help of the MLDBM module they can store more complicated structures. MLDBM serializes nested data structures, such as arrays and hashes, into strings before writing them to the dbm file, and restores them transparently when they are read back.
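Here is a minimal MLDBM sketch, assuming we serialize with Storable on top of DB_File (the file name is made up for the example; MLDBM defaults to Data::Dumper if you do not pick a serializer):

  use strict;
  use MLDBM qw(DB_File Storable);
  use Fcntl qw(O_RDWR O_CREAT);

  my %db;
  tie %db, 'MLDBM', "/tmp/complex.db", O_RDWR|O_CREAT, 0660
      or die "Can't tie /tmp/complex.db: $!";

  # a nested structure is transparently serialized on assignment
  $db{stas} = { langs => ['Perl', 'C'], admin => 1 };

  # caveat: you must assign whole values; modifying a nested element
  # in place ($db{stas}{admin} = 0) is NOT written back to the file
  my $entry = $db{stas};
  $entry->{admin} = 0;
  $db{stas} = $entry;

  untie %db;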
Another important thing to note is that you cannot convert a dbm file from one storage algorithm to another simply by tying it with the desired format. The only way is to dump the data into a flat file and then restore it using the new format. You can use a script like this:
  #!/usr/bin/perl -w

  # This script gets as a parameter a Berkeley DB file(s) which is
  # stored with the DB_BTREE algorithm. It will back each file up
  # with the .btree extension and create instead a db with the same
  # records but stored with the DB_HASH algorithm
  #
  # Usage: btree2hash.pl filename(s)

  use strict;
  use DB_File;
  use File::Copy;

  # Do checks
  die "Usage: btree2hash.pl filename(s)\n" unless @ARGV;

  foreach my $filename (@ARGV) {

    die "Can't find $filename: $!\n"
      unless -e $filename and -r $filename;

    # First back up the file
    move("$filename", "$filename.btree")
      or die "can't move $filename $filename.btree: $!\n";

    my %hash;
    my %btree;

    # tie both dbs (db_hash is a fresh one!)
    tie %btree, 'DB_File', "$filename.btree", O_RDWR|O_CREAT, 0660, $DB_BTREE
      or die "Can't tie %btree";
    tie %hash,  'DB_File', "$filename",       O_RDWR|O_CREAT, 0660, $DB_HASH
      or die "Can't tie %hash";

    # copy the DB
    %hash = %btree;

    # untie
    untie %btree;
    untie %hash;
  }
Note that some dbm implementations come with other conversion utilities as well.
Where does mod_perl enter the picture? If you are using a read-only dbm file you can make it work faster by keeping it open (tied) all the time, so that when your CGI script wants to access the database it is already tied and ready to be used. This works with dynamic dbm databases as well, but then you need to use locking to avoid data corruption. Of course this technique can speed up your CGIs enormously, but you should be very careful. What must be taken into account is db locking, handling possible die() cases and child exits. A stale lock can deactivate your whole site if your locking mechanism cannot handle dropped locks. You can also enter a deadlock situation, where two processes each try to acquire locks on two databases but get stuck because each has got hold of one of the two, and each needs the lock the other one holds, which will never be freed. Such a deadlock is possible only if the processes do not ask for their DB files in the same order, so always acquire your locks in a consistent order. If you modify the DB you should be very careful to flush and synchronize it, especially when your CGI dies unexpectedly. In general your application should be tested very thoroughly before you trust it with important data in production.
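For the read-only case, here is a minimal sketch of the idea (the package name and file path are made up for the example): tie the dbm file once, when the module is first loaded by the mod_perl child, and reuse the tied hash on every subsequent request.

  package My::ReadOnlyDB;
  use strict;
  use DB_File;
  use Fcntl qw(O_RDONLY);

  use vars qw(%DB $TIED);

  # executed once per child, when the module is first loaded
  unless ($TIED) {
      tie %DB, 'DB_File', "/home/httpd/data/config.db",
          O_RDONLY, 0660, $DB_HASH
          or die "Can't tie config.db: $!";
      $TIED = 1;
  }

  # every request just reads from the already-tied hash
  sub lookup {
      my $key = shift;
      return $DB{$key};
  }

  1;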
Let's keep the lock status in a global variable, so that it persists from request to request. When a lock is requested, READ (shared) or WRITE (exclusive), the current lock status is obtained first.
If we get a READ lock request, it is granted as soon as the file becomes unlocked, or if it is already locked for READ. The lock status is then READ.
If we get a WRITE lock request, it is granted as soon as the file becomes unlocked. The lock status is then WRITE.
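A minimal sketch of this bookkeeping follows; the $lock_status variable and the acquire() routine are illustrative names, not part of any real module:

  use Fcntl qw(:flock);

  use vars qw($lock_status);    # persists between requests under mod_perl
  $lock_status ||= 0;           # 0 == unlocked, 1 == READ, 2 == WRITE

  sub acquire {
      my ($fh, $mode) = @_;
      if ($mode eq 'read') {
          # shared: granted while the file is unlocked or READ locked
          flock($fh, LOCK_SH) or die "shared flock: $!";
          $lock_status = 1;
      }
      else {
          # exclusive: granted only once the file is fully unlocked
          flock($fh, LOCK_EX) or die "exclusive flock: $!";
          $lock_status = 2;
      }
  }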
What happens to the WRITE lock request is the most important part. While the DB is READ locked, a process that asks to write will poll until no reading or writing processes are left. Lots of processes can successfully read the file at the same time, since they do not block each other from doing so. This means that a process that wants to write to the file (and must first obtain an exclusive lock) may never get a chance to squeeze in. The following diagram represents a possible scenario where everybody reads but no one can ever write:
  [-p1-]                    [--p1--]
     [--p2--]
        [---------p3---------]
                  [------p4-----]
                     [--p5--]   [----p5----]
So you get a starving process, which will almost certainly time out the request, and the DB will not be updated.
So you have another reason not to cache the dbm handle with dynamic dbm files. With static dbm files, on the other hand, it works perfectly without any need to lock the files at all. Ken Williams solved the above problem in his Tie::DB_Lock module, which I present in the next section.
Tie::DB_Lock ties hashes to databases using shared and exclusive locks. It is a module by Ken Williams which solves the problem raised in the previous section.
The main difference from what I have described before is that
Tie::DB_Lock
copies a dbm file on read so that reader processes do not have to keep the
file locked while they read it, and writers can still access it while
others are reading. It works best when you have lots of long-duration
reading, and a few short bursts of writing.
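Usage follows the standard tie interface. As I recall the module's synopsis (I am quoting from memory, so check its documentation for the exact arguments), it looks something like this:

  use Tie::DB_Lock;

  tie %hash, 'Tie::DB_Lock', $filename;         # read-only
  tie %hash, 'Tie::DB_Lock', $filename, 'rw';   # read-write

  # ... work with %hash as with any tied dbm hash ...

  untie %hash;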
The drawback of this module is the heavy IO performed when every reader makes a fresh copy of the DB. With big dbm files this can be quite a disadvantage and a slowdown. An improvement that would cut the number of copies being made would be to keep only one copy of the dbm image, shared by all the reader processes. This would put the responsibility of copying the read-only file on the writer, not the reader. It would take some care to make sure that readers are not disturbed when a new read-only copy is put into place.
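One way to achieve that last point, sketched below under the assumption that a single writer maintains the shared read-only image (all paths and names here are illustrative), is to publish each new image with an atomic rename(), so readers always see either the old complete file or the new complete file, never a half-written one:

  use strict;
  use File::Copy;

  my $live = "/data/db/users.db";        # writers lock and update this
  my $ro   = "/data/db/users.db.ro";     # readers tie this, no locks needed

  sub publish_read_only_copy {
      # copy to a temp file first, then rename() it into place:
      # rename() is atomic on the same filesystem, and readers that
      # already have the old file open keep reading it undisturbed
      copy($live, "$ro.tmp") or die "copy: $!";
      rename("$ro.tmp", $ro) or die "rename: $!";
  }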
I have discussed the pros and cons of what can be achieved with mod_perl and dbm files. Now it is time to show some code. I wrote a simple wrapper around the DB_File module and extended it to handle locking and proper exits. Note that this code still demands some testing, so be careful if you use it on your production machine as is.
So here is DB_File::Wrap (note that you will not find it on CPAN):
  package DB_File::Wrap;

  require 5.004;
  use strict;

  BEGIN {
    # RCS/CVS compliant: must be all one line, for MakeMaker
    $DB_File::Wrap::VERSION = do { my @r = (q$Revision: 1.1.1.1 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r };
  }

  use DB_File;
  use Fcntl qw(:flock O_RDWR O_CREAT);
  use Carp qw(croak carp verbose);
  use IO::File;

  use vars qw($debug);
  #$debug = 1;
  $debug = 0;

  # my $db = new DB_File::Wrap \%hash, $filename, [$lock_mode, [$db_type]];
  # from now on we can work with both %hash (tie API) and $db (direct API)
  ###########
  sub new {
    my $class     = shift;
    my $hr_hash   = shift;
    my $file      = shift;
    my $lock_mode = shift || '';
    my $db_type   = shift || 'HASH';

    my $self = bless
      {
       db_type => 'DB_File',
       flags   => O_RDWR|O_CREAT,
       mode    => 0660,
       hash    => $hr_hash,
       how     => $DB_HASH,
      }, $class;

    # by default we tie with the HASH algorithm, with BTREE if requested
    $self->{'how'} = ($db_type eq 'BTREE') ? $DB_BTREE : $DB_HASH;

    # tie the object
    $self->{'db_obj'} =
      tie %{$hr_hash}, $self->{'db_type'}, $file,
          $self->{'flags'}, $self->{'mode'}, $self->{'how'}
        or croak "Can't tie $file: $!\n";

    # dup the underlying file descriptor so we can flock() it
    my $fd = $self->{'db_obj'}->fd;
    croak "Can't get fd: $!" unless defined $fd and $fd;
    $self->{'fh'} = new IO::File "+<&=$fd"
      or croak "[".__PACKAGE__."] Can't dup: $!";

    # set the lock status to unlocked
    $self->{'lock'} = 0;

    # do the lock here if requested
    $self->lock($lock_mode) if $lock_mode;

    return $self;
  } # end of sub new

  # lock the fd with either an exclusive or a shared lock (write/read),
  # the default is read (shared)
  ###########
  sub lock {
    my $self      = shift;
    my $lock_mode = shift || 'read';

    # lock codes:
    # 0 == not locked
    # 1 == read locked
    # 2 == write locked

    if ($lock_mode eq 'write') {
      # get the exclusive write lock
      unless (flock($self->{'fh'}, LOCK_EX | LOCK_NB)) {
        unless (flock($self->{'fh'}, LOCK_EX)) {
          croak "exclusive flock: $!";
        }
      }
      # save the lock status
      $self->{'lock'} = 2;
    } elsif ($lock_mode eq 'read') {
      # get the shared read lock
      unless (flock($self->{'fh'}, LOCK_SH | LOCK_NB)) {
        unless (flock($self->{'fh'}, LOCK_SH)) {
          croak "shared flock: $!";
        }
      }
      # save the lock status
      $self->{'lock'} = 1;
    } else {
      # incorrect mode
      carp "Can't lock. Unknown mode: $lock_mode\n";
    }
  } # end of sub lock

  # flush the buffers and unlock
  ###########
  sub unlock {
    my $self = shift;

    # flush
    $self->{'db_obj'}->sync() if defined $self->{'db_obj'};

    flock($self->{'fh'}, LOCK_UN);
    $self->{'lock'} = 0;
  }

  # untie the hash and close all the handles;
  # if the db wasn't unlocked, end() will unlock it as well
  ###########
  sub end {
    my $self = shift;

    # unlock if still locked
    $self->unlock() if $self->{'lock'};

    delete $self->{'db_obj'}  if $self->{'db_obj'};
    untie %{$self->{'hash'}}  if $self->{'hash'};
    $self->{'fh'}->close      if $self->{'fh'};
  }

  # DESTROY makes all kinds of cleanups if the functions were
  # interrupted before their completion and haven't had a chance
  # to clean up
  ###########
  sub DESTROY {
    my $self = shift;

    # just to be sure that we closed everything properly
    $self->end();

    print "Destroying ".__PACKAGE__."\n" if $debug;
  }

  END {
    print "Calling the END from ".__PACKAGE__."\n" if $debug;
  }

  1;
And this is how you use it:
use DB_File::Wrap ();
A simple tie, READ lock and untie:
  my $dbfile = "/tmp/test";
  my %mydb   = ();
  my $db = new DB_File::Wrap \%mydb, $dbfile, 'read';

  print $mydb{'stas'} if exists $mydb{'stas'};

  # sync and untie
  $db->end();
You can even skip the end() call if you leave the scope that $db is defined in:
  sub user_exists {
    my $user   = shift;
    my $result = 0;

    my %mydb = ();
    my $db = new DB_File::Wrap \%mydb, $dbfile, 'read';

    # if we match the username, return 1
    $result = 1 if $mydb{$user};

    return $result;
  } # end of sub user_exists
Performing both read and write operations:
  my $dbfile = "/tmp/test";
  my %mydb   = ();
  my $db = new DB_File::Wrap \%mydb, $dbfile;

  print $mydb{'stas'} if exists $mydb{'stas'};

  # lock the db, since we are going to change it!
  $db->lock('write');
  $mydb{'stas'} = 1;

  # unlock after the write, sync and untie
  $db->end();
If your CGI is interrupted in the middle, the DESTROY block will take care of unlocking the dbm file and flushing the changes. Note that I have seen db corruption even with this code on huge dbm files (10,000+ records), so be careful when you use it. I thought that I had covered all the possible failures, but it seems that not all of them. In the end I moved everything to work with MySQL. So if you figure out where the problem is, you are very welcome to tell me about it.
Written by Stas Bekman.
Last Modified at 09/25/1999