Painfully regular expressions

Thursday, December 6, 2007

Updated: Chuck provided me with a much more elegant solution than described below. It's explained fully in Over-Thought Solutions.

I originally used the Apache rewrite engine to parse and rewrite the URLs to call my handler.php file with appropriate variables, e.g., translating:

  • http://sewcrates.com/tags/Programming/

to

  • http://sewcrates.com/handler.php?type=tags&tag=programming

While it worked reasonably well, I decided to scrap it and move the parsing code to the PHP handler to make it easier to manage and change.

It ended up being more difficult to code a generic rewrite condition than the individual rewrites that converted the URLs to their specific types. The regular expression used by Apache was what required the most work. I’ve used regular expressions rarely, and when I do, I try to find an example of what I want to do and copy it. This time, I decided to go to the source and actually learn what I was doing (mostly because I couldn’t find a good example). My sources:

My goal was to capture all URLs with 0-5 parameters, e.g., rewriting

  1. http://sewcrates.com/
  2. http://sewcrates.com/first/
    . . .
  3. http://sewcrates.com/first/second/third/fourth/fifth

to

  1. http://sewcrates.com/handler.php
  2. http://sewcrates.com/handler.php?a=first
    . . .
  3. http://sewcrates.com/handler.php?a=first&b=second&c=third&d=fourth&e=fifth

I used a-e because 1-5 didn't work in the Apache rewrite statement. In the handler.php code, I would take $_GET['a'] through $_GET['e'] and use them to identify the type and content of the URL. Here’s what I (finally) came up with for my .htaccess file:

I’m not sure if it’s the best way, but after much testing, it does work. Beyond my usual technique of “adding one when in doubt” (in this case, adding a question mark), there is some logic behind the expression. I’ll try to break it down. The important piece of code is the RewriteRule line, which is made up of two parts: the match and the result.

Match: ^([^/]*)/?([^/]*)/?([^/]*)/?([^/]*)/?([^/]*)/?$

Result: /handler.php?a=$1&b=$2&c=$3&d=$4&e=$5

If the URL fits the Match, Apache rewrites it with the Result. The Result includes a bunch of variables: for each (…) statement in the Match, Apache substitutes the captured text into a URL variable ($1 through $5 hold the captured text, in order). I found regular expressions that captured a fixed number of parameters, but not one that handled a variable number of them.
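To see the Match and the Result working together outside Apache, here is a minimal sketch in Python. It is only an approximation: it uses Python's re module rather than Apache's regex engine, writes the back-references as \1 to \5 instead of $1 to $5, and the function name rewrite is made up for the example. The paths are given without a leading slash, since that is how mod_rewrite presents them to a rule in an .htaccess file.

```python
import re

# The Match from the RewriteRule, unchanged.
MATCH = re.compile(r"^([^/]*)/?([^/]*)/?([^/]*)/?([^/]*)/?([^/]*)/?$")

# The Result, with Apache's $1..$5 written as Python's \1..\5.
RESULT = r"/handler.php?a=\1&b=\2&c=\3&d=\4&e=\5"


def rewrite(path):
    """Return the rewritten URL, or None if the path does not fit the Match.

    Hypothetical helper for illustration; not part of Apache or PHP.
    """
    m = MATCH.match(path)
    if m is None:
        return None
    # expand() fills each back-reference with the text its group captured.
    return m.expand(RESULT)


if __name__ == "__main__":
    for path in ["", "first/", "first/second/third/fourth/fifth", "a/b/c/d/e/f"]:
        print(repr(path), "->", rewrite(path))
```

In this sketch, a group that captured nothing leaves an empty value behind (for example b= with nothing after it), so the handler still has to cope with blank parameters. A path with six segments does not fit the Match at all and comes back as None.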

The Match was the difficult part. To start with, the expression is anchored by two characters: it starts with ^ and ends with $, so it has to match the whole path rather than just a piece of it. Removing those from the expression leaves us the meat:

([^/]*)/?([^/]*)/?([^/]*)/?([^/]*)/?([^/]*)/?

Each (…) statement corresponds to a variable in the Result code, ranging from $1 through $n, where n is the number of (…) statements. All of the (…) statements look like this:

([^/]*)

If you look at the references above, you’ll see that the [^…] code means match any single character except those listed after the ^. In this case, I excluded the forward slash, which is the separator between parameters. The * after the [^/] tells the expression to repeat this match zero or more times. In other words, match all characters until you find a forward slash.
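A quick way to convince yourself of what a single ([^/]*) does is to try it on its own. The sketch below uses Python's re module as a stand-in for Apache's engine, and first_segment is a hypothetical helper name.

```python
import re

# One capture group: zero or more characters that are not a forward slash.
SEGMENT = re.compile(r"([^/]*)")


def first_segment(text):
    """Return what ([^/]*) captures when applied at the start of text."""
    # This pattern can always match (possibly the empty string),
    # so match() never returns None here.
    return SEGMENT.match(text).group(1)


if __name__ == "__main__":
    print(repr(first_segment("first/second/")))  # stops at the first slash
    print(repr(first_segment("/second")))        # nothing before the slash
    print(repr(first_segment("no-slash-here")))  # runs to the end of the text
```

Because * allows zero repetitions, the group still matches when the text starts with a slash; it just captures an empty string.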

Between each (…) statement, I included a /?. The slash is the character I expect to find between each expression (e.g., http://sewcrates.com/first/second/). The question mark means match 0 or 1 of the slashes. The advantage of using the question mark is that it allows for there to be zero slashes, or, in other words, it allows for fewer than five parameters: the unused (…) statements simply match nothing.
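The effect of the question mark is easiest to see by comparing the Match against a stricter variant where every slash is mandatory. This is a sketch in Python's re module, not Apache itself; the STRICT pattern and the captures helper are made up for the comparison.

```python
import re

# The Match from the post: every slash is optional.
OPTIONAL = re.compile(r"^([^/]*)/?([^/]*)/?([^/]*)/?([^/]*)/?([^/]*)/?$")

# Hypothetical stricter variant: every slash is required.
STRICT = re.compile(r"^([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]*)/$")


def captures(pattern, path):
    """Return the five captured strings as a tuple, or None if there is no match."""
    m = pattern.match(path)
    return m.groups() if m else None


if __name__ == "__main__":
    for path in ["", "first/", "first/second/third", "a/b/c/d/e/"]:
        print(repr(path))
        print("  optional slashes:", captures(OPTIONAL, path))
        print("  required slashes:", captures(STRICT, path))
```

With required slashes, only a path containing exactly five slashes fits. With optional slashes, anything from zero to five parameters fits, and the leftover groups capture empty strings.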

That’s it. I use the RewriteCond to exclude the files and directories I want the server to find, and everything else gets rewritten by this rule. My handler.php file captures the results and parses them using the $_GET['a']…$_GET['e'] variables. From there, I figure out what page should be sent to the client.
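The handler side might look roughly like the sketch below. It is written in Python as a stand-in for the PHP in handler.php, and everything in it is an assumption made for illustration: the function names, the idea that the first non-empty parameter names the page type, and the "home" fallback are not taken from the real handler.

```python
from urllib.parse import parse_qs


def read_params(query_string):
    """Collect the values of a..e in order, skipping the blank ones."""
    fields = parse_qs(query_string)  # blank values are dropped by default
    values = [fields.get(key, [""])[0] for key in "abcde"]
    return [value for value in values if value]


def identify(query_string):
    """Guess the page type and content from the parameters.

    Assumption: the first parameter is the type, the rest are its content,
    and no parameters at all means the home page.
    """
    params = read_params(query_string)
    if not params:
        return {"type": "home", "content": []}
    return {"type": params[0].lower(), "content": params[1:]}


if __name__ == "__main__":
    print(identify("a=tags&b=Programming&c=&d=&e="))
    print(identify("a=&b=&c=&d=&e="))
```

Filtering out the blank values early means the rest of the handler only ever sees the parameters that were actually present in the URL.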

After learning regular expressions, I dived into the Perl documentation, and was amazed at how good Perl was compared to PHP (or even Python, which I fell in love with earlier in the week). It’s amazing how powerful and easy to use Perl is. I see where PHP steals many of its ideas (and then goes about badly implementing them). I spent many hours poring over the documentation, thinking I was going to change NAIS to use Perl. Like my earlier decision with Python, I decided to stick with PHP. It seems I fall out of love as fast as I fall in love with languages.

Seattle, WA