Things obvious to others, but not to you





Coverage

This document describes ``special'' traps you may encounter when running your plain CGIs under Apache::Registry and Apache::PerlRun.



my() scoped variable in nested subroutines



The poison

In a non-mod_perl script (standalone or CGI), there is no problem writing code like this:

    use CGI qw/param/;
    my $x = param('x');
    sub printit {
       print "$x\n";
    }

However, when the script is run under Apache::Registry, it is in fact repackaged into something like this:

  package $mangled_package_name;
  sub handler {
    #line 1 $original_filename
    use CGI qw/param/;
    my $x = param('x');
    sub printit {
       print "$x\n";
    }
  }

Now printit() is an inner named subroutine. Because it is referencing a lexical variable from an enclosing scope, a closure is created.

The first time the script is run, the correct value of $x will be printed. However, on subsequent runs printit() will retain the initial value of $x -- not what you want.



The diagnosis

Always use -w (and/or PerlWarn On)! Perl will then emit a warning like:

  Value of $x will not stay shared at - line 5.

NOTE: Subroutines defined inside BEGIN{} and END{} cannot trigger this message, since each BEGIN{} and END{} is defined to be called exactly once. (To understand why, read about closures in perlref or perlfaq 13.12.)

The perldiag manpage says:

  An inner (nested) named subroutine is referencing a lexical variable
  defined in an outer subroutine.

  When the inner subroutine is called, it will probably see the value of
  the outer subroutine's variable as it was before and during the *first*
  call to the outer subroutine; in this case, after the first call to the
  outer subroutine is complete, the inner and outer subroutines will no
  longer share a common value for the variable.  In other words, the
  variable will no longer be shared.

Check your code by running Apache in single-child mode (httpd -X). Since a my() scoped variable retains its initial value per child process, the closure problem can be difficult to track down in normal multi-child mode: the script will appear to work fine until you have cycled through all the httpd children.



The remedy

If a variable needs file scope, use a global variable:

    use vars qw/$x/;
    use CGI qw/param/;
    $x = param('x');
    sub printit {
       print "$x\n";
    }

You can safely use a my() scoped variable if its value is constant:

    use vars qw/$x/;
    use CGI qw/param/;
    $x = param('x');
    my $y = 5;
    sub printit {
       print "$x, $y\n";
    }
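
Another common workaround, not shown in the original text, is to avoid the closure altogether by passing the value to the subroutine as an argument. A minimal sketch:

    use CGI qw/param/;
    my $x = param('x');
    sub printit {
       # no lexical from the enclosing scope is referenced here,
       # so no closure is created and nothing gets "frozen"
       my $val = shift;
       print "$val\n";
    }
    printit($x);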

Also see the clarification of my() vs. use vars - Ken Williams writes:

  Yes, there is quite a bit of difference!  With use vars(), you are
  making an entry in the symbol table, and you are telling the
  compiler that you are going to be referencing that entry without an
  explicit package name.
  
  With my(), NO ENTRY IS PUT IN THE SYMBOL TABLE.  The compiler
  figures out _at_ _compile_time_ which my() variables (i.e. lexical
  variables) are the same as each other, and once you hit execute time
  you can not go looking those variables up in the symbol table.
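
Here is a minimal sketch (my illustration, not Ken's) that makes the difference visible: the use vars variable gets a symbol table entry and can be reached through its fully qualified name, while the my() variable cannot be found in the symbol table at all:

  use vars qw($global);
  $global = "in the symbol table";
  my $lexical = "only in a lexical pad";

  # the package variable is reachable through its fully qualified name:
  print "$main::global\n";       # prints "in the symbol table"

  # there is no $main::lexical -- the my() variable never entered the
  # symbol table, so this prints an empty line (undef):
  print "$main::lexical\n";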

And my() vs. local() - Randal Schwartz writes:

  local() creates a temporal-limited package-based scalar, array,
  hash, or glob -- when the scope of definition is exited at runtime,
  the previous value (if any) is restored.  References to such a
  variable are *also* global... only the value changes.  (Aside: that
  is what causes variable suicide. :)
  
  my() creates a lexically-limited non-package-based scalar, array, or
  hash -- when the scope of definition is exited at compile-time, the
  variable ceases to be accessible.  Any references to such a variable
  at runtime turn into unique anonymous variables on each scope exit.
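
To see the dynamic scoping Randal describes in action, here is a minimal sketch (my illustration): the local() value is visible to subroutines called from within the scope, and the previous value comes back when the scope is exited:

  use vars qw($name);
  $name = "original";

  sub show { print "$name\n" }      # reads the *package* variable $name

  sub with_local {
      local $name = "localized";    # temporary value for this scope
      show();                       # prints "localized"
  }

  with_local();
  show();                           # prints "original" -- old value restored

  # a  my $name = "lexical"  inside with_local() would not be seen by
  # show() at all, since my() creates a new, non-package variable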



Additional reading references

For more information see: Using global variables and sharing them between modules/packages, and Mark-Jason Dominus's article about how Perl handles variables and namespaces, and the difference between use vars() and my(): http://www.plover.com/~mjd/perl/FAQs/Namespaces.html



Compiled Regular Expressions

When using a regular expression that contains an interpolated Perl variable, if it is known that the variable (or variables) will not vary during the execution of the program, a standard optimization technique consists of adding the /o modifier to the regexp pattern. This directs the compiler to build the internal table once, for the entire lifetime of the script, rather than every time the pattern is executed. Consider:

  my $pat = '^foo$'; # likely to be input from an HTML form field
  foreach( @list ) {
    print if /$pat/o;
  }

This is usually a big win in loops over lists, or when using grep() or map() operators.

In long-lived mod_perl scripts, however, this can pose a problem if the variable changes according to the invocation. The first invocation of a fresh httpd child will compile the regex and perform the search correctly. However, all subsequent uses by the httpd child will continue to match the original pattern, regardless of the current contents of the Perl variables the pattern is dependent on. Your script will appear broken.

There are two solutions to this problem:

The first is to use eval q//, to force the code to be recompiled and evaluated each time the enclosing block runs. Just make sure that the eval block covers the entire processing loop, and not just the pattern match itself.

The above code fragment would be rewritten as:

  my $pat = '^foo$';
  eval q{
    foreach( @list ) {
      print if /$pat/o;
    }
  };

Just saying:

  foreach( @list ) {
    eval q{ print if /$pat/o; };
  }

is going to be a horribly expensive proposition.
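
If you want to see the difference for yourself, a rough sketch with the standard Benchmark module (the list, pattern and iteration counts here are made up; only the relative numbers matter) could look like this:

  use Benchmark qw(timethese);
  use vars qw($pat @list);     # package globals, so the eval'd strings see them

  @list = map { "foo$_" } 1 .. 500;
  $pat  = '^foo1';

  timethese(200, {
      eval_per_item => sub {
          my $count = 0;
          foreach (@list) {
              eval q{ $count++ if /$pat/o; };   # recompiled on every item
          }
      },
      eval_whole_loop => sub {
          my $count = 0;
          eval q{                               # the whole loop compiled in one go
              foreach (@list) {
                  $count++ if /$pat/o;
              }
          };
      },
  });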

You can use this approach if you require more than one pattern match operator in a given section of code. If the section contains only one operator (be it an m// or s///), you can rely on the property of the null pattern //, which reuses the last successfully matched pattern. This leads to the second solution, which also eliminates the use of eval.

The above code fragment becomes:

  my $pat = '^foo$';
  "something" =~ /$pat/; # dummy match (MUST NOT FAIL!)
  foreach( @list ) {
    print if //;
  }

The only gotcha is that the dummy match that boots the regular expression engine must absolutely, positively succeed, otherwise the pattern will not be cached, and the // will match everything. If you can't count on fixed text to ensure the match succeeds, you have two possibilities.

If you can guarantee that the pattern variable contains no meta-characters (things like *, +, ^, $...), you can use the dummy match:

  "$pat" =~ /\Q$pat\E/; # guaranteed if no meta-characters present

If there is a possibility that the pattern can contain meta-characters, you should search for the pattern or the unsearchable \377 character as follows:

  "\377" =~ /$pat|^[\377]$/; # guaranteed if meta-characters present

Another approach:

Whether it pays off depends on the complexity of the regexp you apply this technique to. One common case where a compiled regexp is usually more efficient is when you need to ``match any one of a group of patterns'' over and over again.

With some helper routine it may be easier to remember. Here is one slightly modified from Jeffrey Friedl's example in his book ``Mastering Regular Expressions''.

  #####################################################
  # Build_MatchMany_Function
  # -- Input:  list of patterns
  # -- Output: A code ref which matches its $_[0]
  #            against ANY of the patterns given in the
  #            "Input", efficiently.
  #
  sub Build_MatchMany_Function {
    my @R = @_;
    my $expr = join '||', map { "\$_[0] =~ m/\$R[$_]/o" } ( 0..$#R );
    my $matchsub = eval "sub { $expr }";
    die "Failed in building regex @R: $@" if $@;
    $matchsub;
  }

Example usage:

  @some_browsers = qw(Mozilla Lynx MSIE AmigaVoyager lwp libwww);
  $Known_Browser=Build_MatchMany_Function(@some_browsers);

  while (<ACCESS_LOG>) {
    # ...
    $browser = get_browser_field($_);
    if ( ! &$Known_Browser($browser) ) {
      print STDERR "Unknown Browser: $browser\n";
    }
    # ...
  }



Debugging your code in Single Server Mode

Run the server in httpd -X mode (this is good only for testing during the development phase).

You want to test that your application correctly handles global variables (if you have any -- the fewer of them the better, but sometimes you just can't do without them). It's hard to test with multiple servers serving your CGI, since each child has different values for its global variables. Imagine that you have a random() sub that returns a random number, and the following script:

  use vars qw($num);
  $num ||= random();
  print ++$num;

This script initializes the variable $num with a random value on its first invocation, then increments and prints it on each request. Running this script in a multiple server environment will produce something like 1, 9, 4, 19 (a different number on each reload), since each time your script is served by a different child. (On some OSes, e.g. AIX, the parent httpd process will assign all of the requests to the same child process if all of the children are idle.) But if you run in httpd -X single server mode you will get 2, 3, 4, 5... (assuming that random() returned 1 on the first call).

But do not get too attached to this mode: working only in single server mode sometimes hides problems that show up when you switch to normal (multi-server) mode. Consider an application that allows you to change the configuration at run time.

Let's say the script produces a form to change the background color of the page. It's not a good design, but for the sake of demonstrating the potential problem, we will assume that our script doesn't write the changed background color to the disk, but simply changes it in memory, like:

  use CGI;
  use vars qw($bgcolor);
  my $q = new CGI;
    # assign the default value at the first invocation
  $bgcolor ||= "white";
    # modify the color if requested to
  $bgcolor = $q->param('bgcolor') || $bgcolor;

So you type in a new color, and in response your script prints back the HTML with the new color -- you think that's it! It was so simple. And if you keep running in single server mode, you will never notice that you have a problem...

If you run the same code in normal server mode, after you submit the color change you will get the result as expected, but when you call the same URL again (not reload!), chances are that you will get back the original default color (white in our case), since no child except the one that processed the color change request knows about the change to its global variable. Just remember that children can't share information, other than what they inherited from their parent when they were spawned. Of course you should pass the color in a hidden form variable so it is remembered, or store it on the server side (database, shared memory, etc.).
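
A minimal sketch of the hidden field approach, assuming CGI.pm's form methods (the field and button names are made up):

    # the current color travels with the form, so whichever child
    # serves the next request can recover it from the query
  my $bgcolor = $q->param('bgcolor') || "white";
  print $q->start_html(-bgcolor => $bgcolor),
        $q->start_form,
        $q->hidden(-name => 'bgcolor', -value => $bgcolor, -override => 1),
        $q->submit(-value => 'Next page'),
        $q->end_form;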


Also note that since the server is running in single-process mode, if the returned HTML contains <IMG> tags, loading the images will take a lot of time. If you use Netscape while your server is running in single-process mode, HTTP's KeepAlive feature gets in the way: Netscape tries to open multiple connections and keep them open, and since there is only one server process listening, each connection has to time out before the next one succeeds. Turn off KeepAlive in httpd.conf to avoid this effect while developing, or press STOP after a few seconds (assuming you use the image size parameters, so that Netscape is able to render the rest of the page).

In addition you should know that when running with -X you will not see any of the control messages that the parent server normally writes to the error_log (like ``server started'', ``server stopped'', etc.). Since httpd -X causes the server to handle all requests itself, without forking any children, there is no controlling parent to write the status messages.



-M and other time() file tests under mod_perl

Under mod_perl, files that have been created after the server's startup are reported as having a negative age by the -M (and -C, -A) file tests. This makes sense if you remember that these tests return a negative result whenever the process was started before the file was created, which is normal behavior for any long-running Perl program.

If you want the -M test to report the age relative to the time of the current request, you should reset the $^T variable, just as you might in any other long-running Perl script. Just add $^T = time; at the beginning of your script.
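
For example (a sketch; the file name is made up):

  $^T = time;   # file age tests are now relative to the current request

  my $age_in_days = -M "/home/httpd/docs/index.html";
  print "The file was modified $age_in_days days ago\n";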



Handling the 'User pressed Stop button' case

When a user presses the STOP button, Apache detects it via SIGPIPE and ceases the script's execution. With mod_cgi there is generally no problem, since all opened files will be closed and all resources will be freed (well, almost all: if you happen to use external lock files, the resources locked by them will most likely be left locked and unusable by anyone else who uses the same advisory locking scheme).

It's important to notice that when the user hits the browser's STOP button, the mod_perl script is blissfully unaware until it tries to send some data to the browser. At that point, Apache realizes that the browser is gone, and all the good cleanup stuff happens.

Starting from Apache 1.3.6, SIGPIPE is no longer caught by Apache itself, and mod_perl handles the situation much better. Here is an excerpt from the Apache 1.3.6 CHANGES file:

  *) SIGPIPE is now ignored by the server core.  The request write
     routines (ap_rputc, ap_rputs, ap_rvputs, ap_rwrite, ap_rprintf,
     ap_rflush) now correctly check for output errors and mark the
     connection as aborted.  Replaced many direct (unchecked) calls to
     ap_b* routines with the analogous ap_r* calls.  [Roy Fielding]

What happens if your mod_perl script has some global variables that are used for resource locking?

It's possible not to notice the pitfall if the critical section between the lock and the unlock is very short and finishes fast, so you never see it happen (you aren't fast enough to stop the code in the middle). But look at the following scenario:

  1. lock resource
     <critical section starts>
  2. sleep 20 (== do some time consuming processing)
     <critical section ends>
  3. unlock resource

If the user presses STOP and Apache sends SIGPIPE before step 3 is reached, the lock variable -- which is cached because we are running under mod_perl -- will not be unlocked. A kind of deadlock exists.

Here is a working example. Run the server with -X and press STOP before the count-up to 10 has finished. Then rerun the script: it will hang in while(1)! The resource is no longer available to this child.

  use vars qw(%CACHE);
  use CGI;
  $|=1;
  my $q = new CGI;
  print $q->header,$q->start_html;
  
  print $q->p("$$ Going to lock!\n");
  
   # actually the while loop below is not needed, since it's an
   # internal lock, accessible only by the same process, and if
   # it's locked... it's locked for the whole child's life
  while (1) {
    unless (defined $CACHE{LOCK} and $CACHE{LOCK} == 1) {
      $CACHE{LOCK} = 1;
      print $q->p("Got the lock!\n");
      last;
    }
  }
  print $q->p("Going to sleep (I mean working)!");
  my $c=0;
  foreach (1..10) {
    sleep 1;
    print $c++,"\n<BR>";
  }
  
  print $q->p("Going to unlock!");
  $CACHE{LOCK} = 0;
  print $q->p("Unlock!\n");

You may ask, what is the solution to this problem? As noted in the section on END blocks, any END blocks that are encountered during compilation of Apache::Registry scripts are called after the script has finished running, including on subsequent invocations when the script is cached in memory. So if you are running in Apache::Registry mode, the following is your remedy:

  END {
    $CACHE{LOCK} = 0;
  }

Notice that the END block will be run after Apache::Registry::handler has finished (not during the cleanup phase, though).

If you use the Perl API, you can use the register_cleanup() method of the Apache request object:

  $r->register_cleanup(sub {$CACHE{LOCK} = 0;});

With the Apache API, the Apache->request->connection->aborted() construct can be used to test whether the connection has been aborted.
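
Here is a combined sketch (my illustration, assuming the request object is available via Apache->request and that %CACHE is the lock from the example above):

  my $r = Apache->request;

    # release the lock no matter how the request ends
  $r->register_cleanup(sub { $CACHE{LOCK} = 0 });

  foreach (1..10) {
    sleep 1;
    $r->print(".");                     # trying to write lets Apache notice
                                        # that the browser has gone away
    last if $r->connection->aborted;    # stop the long job early
  }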

I hope you noticed that this example is somewhat misleading, since there is a separate instance of %CACHE in every child: if you modify it, the change is known only inside that same child, and none of the %CACHE variables in the other children are affected. But if you work with data that is visible to every child (via some external shared memory or another approach), the hazard this example demonstrates still applies. Make sure you unlock the resources either when you stop using them or when the script is aborted in the middle, before the normal unlocking code is reached.



Handling the server timeout cases and working with $SIG{ALRM}

A situation similar to the Pressed Stop button disease happens when the client (browser) times out the connection (is it about 2 minutes?). There are cases when your script is about to perform a very long operation and there is a chance that its duration will be longer than the client's timeout. One case I can think of is database interaction, where the DB engine hangs or needs a long time to return the results. If this is the case, use $SIG{ALRM} to prevent the timeouts:

  my $timeout = 10; # seconds
  eval {
    local $SIG{ALRM} =
        sub { die "Sorry, timed out. Please try again\n" };
    alarm $timeout;
    ... db stuff ...
    alarm 0;
  };
  
  die $@ if $@;

But, as was discovered recently, local $SIG{'ALRM'} does not restore the original underlying C handler. This was fixed in mod_perl 1.19_01 (the CVS version). As a matter of fact, none of the local $SIG{FOO} assignments restore the original C handler -- read Debugging Signal Handlers ($SIG{FOO}) for a debugging technique and a possible workaround.



Where do the warnings/errors go?

Your CGI does not work and you want to see what the problem is. The best idea is to check out any errors that the server may be reporting. Where can you find these errors?

Generally all errors are logged into the error_log file. The exact file location and name are defined in the httpd.conf file. Look for the ErrorLog parameter. My httpd.conf says:

  ErrorLog var/logs/error_log

Hey, where is the beginning of the path? There is another Apache parameter called ServerRoot. Every time Apache sees a parameter value that is a relative path (e.g. my.txt) rather than an absolute path (e.g. /tmp/my.txt), it prepends the value of ServerRoot to it. I have:

  ServerRoot /usr/local/apache

So I will look for the error_log file at /usr/local/apache/var/logs/error_log. Of course you can also use an absolute path to specify the file's location on the filesystem.
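
For example, with an absolute path nothing is prepended:

  ErrorLog /usr/local/apache/var/logs/error_log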

<META>: is this 100% correct?

But there are cases when errors don't go to the error_log file. For example, some errors are printed to the console (tty) from which you executed httpd (unless you redirected httpd's STDERR stream). This happens when the server has not yet opened the error_log file for writing.

For example, if you have mistakenly entered a non-existent directory path in your ErrorLog directive, the error message will be printed to the controlling tty. Likewise, if an error happens while the server executes a PerlRequire or PerlModule directive, you might see the errors there as well.

You are probably wondering where all the errors go when you are running the server in single-process mode (httpd -X). They go to the console. That is because when running in single-process mode there is no parent httpd process to perform the logging, and that includes all the status messages that generally show up in the error_log file.

</META>



Setting environment variables for scripts called from CGI

Perl uses sh for its interactions with the shell in system() and piped open() calls. So when you want to set a temporary environment variable for a script you call from your CGI, you can do:

  open UTIL, "USER=stas script.pl |" or die "...: $!\n";

or

  system "USER=stas script.pl";

This is useful, for example, if you need to invoke a script that uses CGI.pm from within a mod_perl script. We are tricking the Perl script into thinking it's a plain CGI script that is not running under mod_perl.

  open(PUBLISH, "GATEWAY_INTERFACE=CGI/1.1 script.cgi
       \"param1=value1&param2=value2\" |") or die "...: $!\n";

Make sure that the parameters you pass are shell-safe: all ``unsafe'' characters like the single quote should be properly escaped.
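
A minimal sketch of one way to do that escaping (the helper name is made up): wrap each value in single quotes and escape any embedded single quotes before handing the string to the shell:

  sub shell_quote {
    my $s = shift;
    $s =~ s/'/'\\''/g;       # close the quote, add an escaped quote, reopen
    return "'$s'";
  }

  my $args = shell_quote("param1=value1&param2=it's here");
  open(PUBLISH, "GATEWAY_INTERFACE=CGI/1.1 script.cgi $args |")
    or die "...: $!\n";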

However, you are forking a new process to run a Perl script, so you have thrown the performance you worked so hard to gain out the window. Whatever script.cgi does now, it should be moved into a module with a subroutine that you can call directly from your script, so that the fork is avoided.
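
A sketch of that refactoring (the module and subroutine names are made up):

  # MyPublish.pm -- the code that used to live in script.cgi
  package MyPublish;
  use strict;

  sub publish {
    my %args = @_;
    # ... do the real work here ...
    return "published $args{param1}";
  }
  1;

  # and in the mod_perl script -- no shell, no fork:
  use MyPublish ();
  my $result = MyPublish::publish(param1 => 'value1', param2 => 'value2');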



The Writing Apache Modules with Perl and C book can be purchased online from O'Reilly and Amazon.com.
Your corrections of either technical or grammatical errors are very welcome. You are encouraged to help me to improve this guide. If you have something to contribute please send it directly to me.

Written by Stas Bekman.
Last Modified at 09/26/1999
Use of the Camel for Perl is
a trademark of O'Reilly & Associates,
and is used by permission.