A guide to URL rewriting

From Cosmin's Wiki




A lot of today's content on the web is dynamic - that is, the page you see does not physically exist on the server but instead is put together on the fly by a server-side script. Commonly, the method for passing data to the script is via the query string. The resulting links to your pages can therefore end up complicated and unfriendly. For example:

 
http://yourdomain.com/articles/show.php?category=8&article=145&page=3
 

The concept of "pretty URLs" involves converting these complex dynamic URLs into easier to read, static URLs:

 
http://yourdomain.com/articles/8/145-3.html
 

For the sake of clarity and to save unnecessary repetition, I will refer to the first format as "ugly" and the second as "pretty" for the rest of the article. These terms are commonly used with regard to this technique.


Why use pretty URLs?

Pretty URLs have been a common SEO technique for many years now but many of the early benefits are no longer applicable. The bad news is if you are hoping to massively boost your search engine rankings by using pretty URLs, you're about ten years too late. You would be much better off spending the time writing better content.

The good news is you can still benefit from it - keywords in the URL often help with SEO, and there are claims that a static extension such as .html is preferred over the dynamic (.php, .cgi and so on). However, the main reason I will always incorporate URL rewriting into my own projects is it makes the whole application look a lot neater and more professional.

Whatever your reasoning is, pretty URLs are so simple to set up - what have you got to lose?

Note: if your website is already launched, indexed in search engines and linked to from other websites, do not change the URL format unless you are certain you can preserve these inbound links with the appropriate 301 redirects.
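
As a rough illustration of such a redirect, the PHP sketch below answers requests made in the old ugly format with a 301 pointing at the pretty format shown at the top of this article. The function names prettyArticleUrl() and redirectTargetFor() are hypothetical, and it assumes the old script lives at /articles/show.php and that all three parameters are plain numbers. It checks REQUEST_URI, which holds the address the visitor actually asked for, so a request that arrived in the pretty format is not redirected again.

```php
<?php
// Hypothetical helper: builds the pretty URL for one page of an article.
function prettyArticleUrl(int $category, int $article, int $page): string
{
    return sprintf('/articles/%d/%d-%d.html', $category, $article, $page);
}

// Returns the pretty URL to redirect to, or null when the request
// was not made in the old ugly format.
function redirectTargetFor(string $requestUri): ?string
{
    $path = parse_url($requestUri, PHP_URL_PATH);
    if ($path !== '/articles/show.php') {
        return null; // already pretty, or not an article request at all
    }
    parse_str((string) parse_url($requestUri, PHP_URL_QUERY), $query);
    foreach (['category', 'article', 'page'] as $key) {
        // Every parameter must be present and purely numeric.
        if (!isset($query[$key]) || !is_string($query[$key]) || !ctype_digit($query[$key])) {
            return null;
        }
    }
    return prettyArticleUrl((int) $query['category'], (int) $query['article'], (int) $query['page']);
}

// Assumed placement: at the very top of show.php, before any output.
$target = redirectTargetFor($_SERVER['REQUEST_URI'] ?? '');
if ($target !== null) {
    header('Location: ' . $target, true, 301);
    exit;
}
```

The same effect can be achieved inside mod_rewrite itself, but doing it in the script keeps the validation in a language you may already be comfortable with.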


How to use pretty URLs?

You have two options here. The first is fairly obvious and almost equally useless - that is, create your chosen file structure either manually every time you update your content, or more sensibly, automatically using server side scripting. As I'm sure you realize, if your pages update very frequently (for example, if you had an online users display), this method is not a viable option. Even without rapidly changing content, you can usually save yourself a lot of work by using URL rewriting.

URL rewriting is a very powerful process that allows your server to receive requests in a certain format, such as our new pretty URL, and transparently convert them into our old ugly format in order to locate and retrieve the appropriate resource. I use the term "transparently" because anyone browsing your site will have no knowledge of this process going on - and neither will any search engine bots.

The process of URL rewriting is carried out by a rewrite engine. The specifics of this are dependent on the server you are using. We will use Apache as an example here. The rewrite engine for Apache is contained within the mod_rewrite module. On a shared hosting account, this should already be enabled - if not, contact your host. If you run your own server, you may need to enable this module by uncommenting the relevant LoadModule command in your httpd.conf. There are plenty of resources readily available to help you with this, and as this is not an article on server administration, we will move on to the more interesting aspects.

Note: Rewriting can only change where the server looks for the requested resource. We cannot automatically change our links using a rewrite engine; you must manually change the output of your pages or scripts to generate the href link in the new pretty format.


mod_rewrite

You have probably already heard of mod_rewrite. The massive power of this module has led to the common misconception that it is "voodoo" or "black magic". Please do not be discouraged by these claims - it really is very easy to use and for most uses, there are probably only two directives you need to know. The full documentation is available from the Apache docs website - apart from the trivial On/Off command, we will only use RewriteRule.

If you are comfortable with a server side scripting language (such as PHP or Perl) but even after reading this tutorial do not understand mod_rewrite, there is an easy way out. I will come back to this at the end.

Let's briefly go over the syntax. We need some way of telling Apache we want to use URL rewriting. As you may already know, we can do this either directly in the Apache configuration file, httpd.conf, or, much more easily, in a per-directory context using .htaccess files. A .htaccess file is merely a method of passing instructions to Apache. These are plain text files, named ".htaccess" (without quotes), and when one is placed in a directory, its instructions are interpreted by the server whenever a resource (file, image, web page, etc.) is requested from within that directory. You may already be using .htaccess files - if so, simply add your new rewrite code at the end.

The first two lines required whenever you use mod_rewrite will always be the same:

 
Options +FollowSymLinks
RewriteEngine On
 

FollowSymLinks is a security requirement of the rewrite engine: in a per-directory context, the engine will not run unless FollowSymLinks (or SymLinksIfOwnerMatch) is enabled. In most cases it will already be set in httpd.conf. If you know you do not need this line, you can leave it out. Otherwise, it does no harm to restate it.

RewriteEngine On is fairly self-explanatory. We are telling Apache to activate the rewrite engine because we want to use it. You can now set up your rewrites. These are done by a series of rules, the syntax for which is:

 
RewriteRule PATTERN DESTINATION FLAGS
 

PATTERN is a regex pattern against which the incoming request is matched. If the request matches the pattern, it is rewritten to the DESTINATION, which is the relative path to the resource to load instead.

FLAGS is an optional parameter which allows you to alter the behaviour of your rewrite rule. Flags are written in square brackets and separated by commas, for example [NC,L]. For the full list, see the documentation. The most common flags are:

- NC (no case): makes the pattern case-insensitive
- L (last): stops processing the rewrites if this one has been applied
- R (redirect): changes the transparent rewrite to a redirect; a redirect response header is sent, along with the new location

Certain flags, such as R, can take a value. In this case, you can use R=301 to send a 301 permanently moved response code (used for SEO). By default, it will send a 302 found (temporarily moved) response.

If the above does not make much sense, do not worry. We will now look at an example but you should first familiarise yourself with regex if needed. Regex, or to give the full name, regular expressions are a way of matching a series of strings with unknown values. Going back to our original articles example, we do not want a pattern that matches each article individually, we want one pattern that can take any possible number and rewrite as appropriate. We would actually use [0-9]+ or \d+ to match the number. If you do not understand how I arrived at that very basic pattern, I strongly recommend you read up a bit on regex first. There are plenty of resources available but I have found regular-expressions.info very useful if you want somewhere to start. Regex in itself is another 5 tutorials so I cannot explain it all here.
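
If you would like to see that pattern in action before going any further, here is a minimal PHP sketch. The function names isNumericId() and firstNumber() are made up for this illustration; preg_match() is PHP's standard regex matching function.

```php
<?php
// True when $value consists of one or more digits and nothing else.
function isNumericId(string $value): bool
{
    return preg_match('#^[0-9]+$#', $value) === 1;
}

// Returns the first run of digits found in $text, or null if there is none.
// The parentheses capture the digits so they can be read back from $m[1].
function firstNumber(string $text): ?string
{
    return preg_match('#([0-9]+)#', $text, $m) === 1 ? $m[1] : null;
}

var_dump(isNumericId('145'));      // bool(true)
var_dump(isNumericId('14a'));      // bool(false)
var_dump(firstNumber('145.html')); // string(3) "145"
```

One pattern covers article 8, article 145 and every other number, which is exactly what we need from a rewrite rule.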

Let's step back and look at a simplified version of the very first URL I showed you to demonstrate an ugly URL:

 
/articles/show.php?article=145
 

And we want to change that to:

 
/articles/the-article-title/145.html
 

Note: I purposefully avoided saying "we want to rewrite that as" to avoid confusion. The term rewrite technically refers to the process going on behind the scenes in the server and as such goes the other way (pretty -> ugly). When talking about URL rewriting, it's commonly confused with "I want to rewrite my links to look like X", as in the conversion of ugly -> pretty.

We are staying in the /articles/ directory so we can simplify things by creating our .htaccess file there. Now when we make a request in our pretty format, the string that is applied to the pattern is the-article-title/145.html. (If we were using a .htaccess file in the root of the domain (i.e. the web root, www or public_html folder), the string that would be compared to the pattern would be articles/the-article-title/145.html.) "the-article-title" can be made up of any alphanumeric characters, dashes or underscores, and 145 can be any number. Therefore the pattern you should come up with is:

 
^[a-z0-9_-]+/([0-9]+)\.html$
 

We use the parentheses (brackets) around the number because we need to capture it as a back reference for use in our destination. Back references are in the form $N when using mod_rewrite, where N is the position of the bracketed group in the pattern. Our destination needs to be:

 
show.php?article=$1
 

And after we've completed that rewrite, we do not wish to do any more rewriting so we can use the L flag. Putting that all together, we can come up with our working RewriteRule:

 
RewriteRule ^[a-z0-9_-]+/([0-9]+)\.html$ show.php?article=$1 [L,NC] 
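
If you want to experiment with the pattern without touching your server, the rule above can be imitated in PHP. This is only a sketch: rewriteArticlePath() is a hypothetical name, PHP's regex engine stands in for Apache's, and the "i" modifier plays the part of the NC flag.

```php
<?php
// Imitates: RewriteRule ^[a-z0-9_-]+/([0-9]+)\.html$ show.php?article=$1 [L,NC]
// Returns the rewritten destination, or null when the rule would not apply.
function rewriteArticlePath(string $path): ?string
{
    if (preg_match('#^[a-z0-9_-]+/([0-9]+)\.html$#i', $path, $m) !== 1) {
        return null;
    }
    // $m[1] holds what mod_rewrite would substitute for $1.
    return 'show.php?article=' . $m[1];
}

$samples = [
    'the-article-title/145.html',          // matches
    'The_Title/7.html',                    // matches thanks to the "i" modifier
    'title/abc.html',                      // no match: the ID is not a number
    'articles/the-article-title/145.html', // no match: extra leading directory
];
foreach ($samples as $path) {
    printf("%-38s => %s\n", $path, rewriteArticlePath($path) ?? '(no match)');
}
```

The last sample is the string the pattern would see from a .htaccess file in the web root, which is why that location would need "articles/" added to the front of the pattern.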
 

The best way for you to learn how to get the best out of mod_rewrite is trial and error. If you get completely stuck, try asking at a webmaster forum. Hopefully you can see how you could expand this example further to incorporate much more complex uses. If this were a full mod_rewrite tutorial, I could show you the abilities of the module when combined with the only other directive you're likely to need, RewriteCond. Unfortunately that falls outside the scope of a pretty URLs article but if you are at all interested, I highly recommend you read up on it. It will allow you to do practically anything you might want to do with mod_rewrite, including:

- blocking users based on referrer, IP or user agent
- forcing either the www or non-www domain to prevent pagerank being split between the two
- redirecting to your secure https connection
- preventing hot-linking to your files

But I digress. Back on the subject of pretty URLs, there is one last point to explain. I mentioned earlier that the rewrite engine will not change your links; you have to manually alter your pages or scripts to generate the new pretty URLs. In the example above, we included the article title in the URL to help with our SEO efforts. You may have noticed I only included alphanumeric characters, the dash and the underscore in the pattern - this is because there are plenty of other special characters that could legitimately be used in an article title but cannot be used in a URL. Therefore it is quickest, easiest and safest to simply strip out these characters. I'll leave the exact implementation to you as every script is different, but you would probably use something along the lines of (in PHP):

 
$safeTitle = preg_replace('#[^A-Za-z0-9_-]#','',$articleTitle);
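
Note that the line above also strips spaces, so "The Article Title" becomes "TheArticleTitle". If you would rather keep the words separated by dashes, one possible variation is sketched below. The helper names slugify() and articleUrl() are hypothetical, and lowercasing the result is simply a matter of taste.

```php
<?php
// Hypothetical helper: turns an article title into a URL-safe segment.
function slugify(string $title): string
{
    // Turn runs of whitespace into single dashes so the words stay readable.
    $slug = preg_replace('#\s+#', '-', trim($title));
    // Strip everything outside the same whitelist the rewrite pattern accepts.
    $slug = preg_replace('#[^A-Za-z0-9_-]#', '', $slug);
    // Collapse runs of dashes left behind by stripped characters.
    $slug = preg_replace('#-{2,}#', '-', $slug);
    return strtolower(trim($slug, '-'));
}

// Hypothetical helper: builds the pretty link to print in your href.
function articleUrl(string $title, int $id): string
{
    return '/articles/' . slugify($title) . '/' . $id . '.html';
}

echo articleUrl("What's new in PHP?", 145), "\n"; // /articles/whats-new-in-php/145.html
```

A title made up entirely of stripped characters would produce an empty segment, which the rewrite pattern would not match, so a real script should substitute some fallback text in that case.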
 

I have also included the actual ID in all my examples. In theory, you could alter your script to rely not on the ID but on the name/title as the unique identifier and remove the ID altogether from the URL. This requires additional editing of your scripts but will result in a "prettier" URL.
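
One thing to watch if you go down that route: two different titles can strip down to the same string, so the stripped title is not automatically unique. The sketch below shows one way of handling that, under the assumption that you can obtain the list of stripped titles already in use (here a plain array standing in for whatever storage your script uses). Both function names are hypothetical.

```php
<?php
// Same character whitelist as the preg_replace() example above.
function safeTitle(string $title): string
{
    return preg_replace('#[^A-Za-z0-9_-]#', '', $title);
}

// Returns an identifier not yet present in $taken,
// appending -2, -3, ... when the plain stripped title is already in use.
function uniqueSlug(string $title, array $taken): string
{
    $base = safeTitle($title);
    $slug = $base;
    for ($n = 2; in_array($slug, $taken, true); $n++) {
        $slug = $base . '-' . $n;
    }
    return $slug;
}

var_dump(uniqueSlug('Hello, World', []));             // string(10) "HelloWorld"
var_dump(uniqueSlug('Hello World', ['HelloWorld']));  // string(12) "HelloWorld-2"
```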


Earlier I mentioned an easier alternative in case you simply cannot understand the mod_rewrite topic. This method will not always work and is best suited to your own custom built applications. If you are using a particular script and want to quickly convert your ugly URLs to pretty URLs, it is best to use a detailed ruleset for mod_rewrite to process. The alternative is to say "if the requested file does not exist, rewrite to a particular script". This means that requests for images, CSS files, etc. will still return the correct file. Everything else will be sent to the script to process where the requested URL will be available under the server REQUEST_URI variable. To use this method, you would need to create a .htaccess file in your domain root (or wherever you want the rewrite to apply) with the following inside:

 
Options +FollowSymLinks
RewriteEngine On
RewriteCond %{SCRIPT_FILENAME} !-d
RewriteCond %{SCRIPT_FILENAME} !-f
RewriteRule .* index.php [L] 
 

Replace index.php with the name of whatever script you wish to use. Sticking with the PHP example, I could then create the index.php file as follows:

 
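<?php
// REQUEST_URI holds the address the visitor asked for, or is empty if unavailable
$requested = empty($_SERVER['REQUEST_URI']) ? false : $_SERVER['REQUEST_URI'];
 
switch ( $requested ) {
 
	case '/dogs':
		include 'animals/dogs.php';
		break;
 
	case '/apple':
		// include cannot pass a query string to a local file, so set the parameter first
		$_GET['type'] = 'apple';
		include 'fruit.php';
		break;
 
	default:
		// anything we do not recognise gets the error page
		include '404.php';
}
 
?>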
 

Obviously that would not be a very practical solution but it illustrates the point. Remember, the REQUEST_URI variable will contain everything after the hostname and should start with a leading slash. You will need to process the request a bit better than I have, but the reason you are using this method is because you are comfortable with your server side scripting!
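
To give an idea of what processing the request "a bit better" might look like, here is a sketch that separates the path from the query string and matches it against a small table of patterns instead of fixed strings. The route table, the script paths in it and the function name resolveRoute() are all assumptions made for this illustration.

```php
<?php
// Hypothetical route table: regex on the request path => script to include.
const ROUTES = [
    '#^/dogs$#'                                  => 'animals/dogs.php',
    '#^/articles/[a-z0-9_-]+/([0-9]+)\.html$#i'  => 'articles/show.php',
];

// Returns [script, captured parameters] for a REQUEST_URI, or null for a 404.
function resolveRoute(string $requestUri): ?array
{
    // REQUEST_URI includes the query string, so isolate the path first.
    $path = parse_url($requestUri, PHP_URL_PATH);
    if (!is_string($path)) {
        return null;
    }
    foreach (ROUTES as $pattern => $script) {
        if (preg_match($pattern, $path, $m) === 1) {
            // Drop the full match and keep only the bracketed captures.
            return [$script, array_slice($m, 1)];
        }
    }
    return null;
}

var_dump(resolveRoute('/articles/the-article-title/145.html'));
```

In index.php you would then include the returned script, or 404.php when the result is null, and hand the captured values (such as the article ID) to it in whatever way suits your application.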

Finally, going back to the proper method of rewriting, let's finish up with a few more examples of pretty URLs, the old ugly URL and the rewrite code needed to convert them. Again, you will need to manually change your script to output the new pretty URLs.

Pretty URL: /browse/animals-24/cats-76.html
Ugly URL: /browse.php?category=24&subcategory=76
.htaccess:

 
Options +FollowSymLinks
RewriteEngine On
RewriteRule ^browse/[A-Z0-9_-]+-([0-9]+)/[A-Z0-9_-]+-([0-9]+)\.html$ browse.php?category=$1&subcategory=$2 [NC,L]
 

Pretty URL: /download/5g2cg4re9wqcxpo4fawsw45xx2mbu4tb/334/myFile.zip
Ugly URL: /download.php?session_id=5g2cg4re9wqcxpo4fawsw45xx2mbu4tb&file=334
.htaccess:

 
Options +FollowSymLinks
RewriteEngine On
RewriteRule ^download/([A-Z0-9]{32})/([0-9]+)/ download.php?session_id=$1&file=$2 [NC,L]