Sanitizing data is a very important part of working in any server-side scripting language, and PHP is no different. So here is one simple way of sanitizing data using the dreaded REGEX.

Throughout this tutorial I will be using preg_replace, as mentioned in the title. I have chosen the PCRE functions over the POSIX ones (ereg and friends) because the POSIX functions have been deprecated since PHP 5.3.0 and were removed entirely in PHP 7.0.0.

First let’s look at what exactly data sanitization is & why we need it.

What Is Data Sanitization?

Data sanitization is the act of making user input data safe. That is, taking any data provided by the user of the application and removing or escaping anything that could be harmful to the running of the application.

There are two methods I use to sanitize data.

Escaping

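Escaping makes data safe by neutralizing any special meaning it has to the program, so that it is treated as plain text rather than as code or markup. For example, converting < to &lt; makes a browser display the character instead of starting a tag.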
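As a quick illustration of escaping for HTML output, here is a minimal sketch using PHP's built-in htmlspecialchars(). The $input value is a made-up stand-in for form data, and the flags and charset are just one reasonable choice.

```php
<?php
// Hypothetical input, standing in for a value from $_POST.
$input = 'Data <strong>Sanitization</strong>';

// Convert HTML special characters to entities so the browser shows
// the markup as text instead of interpreting it.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo $escaped;
// Data &lt;strong&gt;Sanitization&lt;/strong&gt;
```

Nothing is removed here. The data is still all there, it just can no longer act as markup.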

Purging or Removal

Purging or removal is the act of removing data that could be damaging to the application. In most cases you only need to remove the characters that make the data dangerous. For example, removing the angle brackets (< >) from a strong tag will nullify its ability to make text bold & just leave innocent text in its place.
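The angle bracket example can be sketched with a plain str_replace(). The sample string is made up for illustration.

```php
<?php
// Hypothetical input containing a strong tag.
$input = 'Some <strong>bold</strong> text';

// Remove only the characters that make the tag a tag.
$purged = str_replace(array('<', '>'), '', $input);

echo $purged;
// Some strongbold/strong text
```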

Data Sanitization Using REGEX

In this tutorial we are going to take a quick look at using REGEX along with preg_replace() to sanitize using the purge/removal method. We are going to remove all characters that we do not expect to be included in our input data. This is useful for scripts such as simple searches where we expect our input to only contain text. It can also be useful for simple text fields in forms such as name fields where we expect the value to only contain simple text.

Let’s assume we have some data coming in via POST from a form. Let’s also assume that our form submitted this: Data <strong>Sanitization</strong>

//Data in $_POST['data']
$data = $_POST['data'];
$data = preg_replace('/[^a-z0-9 ]/i', '', $data);

This will result in:

Data strongSanitizationstrong

Why? Because of our REGEX pattern, the strange /[^a-z0-9 ]/i part. The forward slashes mark the start and end of the pattern. The square brackets define a set of characters (a character class) & the caret (^) at the start of the set tells it that we want to find characters that are not in that set. The a-z and 0-9 are ranges meaning a to z & 0 to 9, and the space is there so it doesn’t remove spaces too. The i at the end tells it to be case insensitive. The second parameter is the replacement, here an empty string, and the last parameter is the variable that contains the data. You may notice the word strong is still there. We didn’t remove the entire tag because PHP already has a function for removing unwanted HTML tags, strip_tags(). This was just an example to explain how preg_replace works.
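To try the pattern on a few inputs, it can be wrapped in a small function. This is only a sketch: keep_alphanumeric() is a hypothetical name, and the second sample string is made up.

```php
<?php
// Hypothetical helper: keep only letters, digits and spaces.
function keep_alphanumeric($value)
{
    // [^a-z0-9 ] matches any single character NOT in the set;
    // the i modifier makes a-z cover A-Z as well.
    return preg_replace('/[^a-z0-9 ]/i', '', (string) $value);
}

echo keep_alphanumeric('Data <strong>Sanitization</strong>'), "\n";
// Data strongSanitizationstrong

echo keep_alphanumeric("O'Reilly & Sons, 2nd ed."), "\n";
// OReilly  Sons 2nd ed   (two spaces remain where the & was)
```

The second call shows the trade-off: legitimate characters such as apostrophes are stripped as well, so the set should be widened to fit what each field genuinely needs.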

So you may be wondering why this helps stop malicious attacks. One example is data sent through $_GET. You may have noticed a post by Lisa Marie yesterday highlighting a problem with the Zazzle Store builder. The problem was that JavaScript could be passed through one of the variables that is output into the HTML, because the variable wasn’t checked for characters that weren’t necessary for the job it performed. Adding REGEX similar to the one above, removing anything the variable didn’t require (it only needed letters, numbers, commas, and underscores), means we could nullify any JavaScript injection by removing the special characters needed for it to be parsed. Anyone who attempted it would find their JavaScript reduced to harmless text.
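A minimal sketch of that idea is below. The $raw value is a made-up stand-in for a query-string variable and is not taken from the real store builder.

```php
<?php
// Hypothetical query-string value, standing in for something from $_GET.
$raw = 'shirts,mugs_2"><script>alert(1)</script>';

// Allow only letters, numbers, commas and underscores.
$clean = preg_replace('/[^a-z0-9,_]/i', '', $raw);

echo $clean;
// shirts,mugs_2scriptalert1script
```

The quotes, angle brackets and parentheses are gone, so what is left cannot close the attribute or open a script tag.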

Matching Harmful Attributes

So what if you want to allow HTML but not JavaScript event attributes such as onclick & onmouseover? Well you could try this:

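//Data in $_POST['data']
$data = $_POST['data'];
$data = preg_replace('/\bon[a-z]+="[^"]*"/i', '', $data);

I’m no expert on REGEX, but this seems to work well for me. The \b makes sure ‘on’ starts a word, so attributes that merely contain ‘on’, such as content, are left alone. We then match the word ‘on’ followed by one or more letters (that is what + means) until it hits an equals symbol, then a double quote, then any number of characters that are not a double quote ([^"]*), then the closing quote. Stopping at the first closing quote matters: a greedy .* would run on to the last quote on the line and swallow any attributes in between. Case insensitive as before. This takes something like <a href="#" onclick="alert('hi');">text</a> and removes the event attribute, turning it into <a href="#" >text</a>. Note that it only catches handlers written with double quotes.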
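The pattern above only handles double-quoted values. Below is a sketch of a broader variant that also matches single-quoted and unquoted values. The pattern and the sample strings are my own assumptions for illustration, and this is not a complete defense against script injection.

```php
<?php
// Assumed variant: whitespace, an on* attribute name, then a
// double-quoted, single-quoted or unquoted value.
$pattern = '/\s+on[a-z]+\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)/i';

// Hypothetical inputs covering the three quoting styles.
$samples = array(
    '<a href="#" onclick="a()" title="Home">text</a>',
    "<a href='#' onmouseover='a()'>text</a>",
    '<img src=x onerror=alert(1)>',
);

$cleaned = array();
foreach ($samples as $html) {
    $cleaned[] = preg_replace($pattern, '', $html);
}

print_r($cleaned);
// [0] => <a href="#" title="Home">text</a>
// [1] => <a href='#'>text</a>
// [2] => <img src=x>
```

Because the leading whitespace is part of the match, no stray space is left behind, and the title attribute in the first sample survives.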

You may be thinking, what about harmful HTML tags? Well your complete code could be this:

//Data in $_POST['data']
$data = $_POST['data'];
$data = strip_tags($data, '<a><strong><em>');
$data = preg_replace('/\bon[a-z]+="[^"]*"/i', '', $data);

That uses strip_tags() to remove all HTML tags except the ones in your list. The REGEX then removes any JavaScript event attributes from the HTML tags that remain.
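To see what each step contributes, here is a sketch that wraps the two calls in a function and traces a sample through it. sanitize_html() is a hypothetical name and the input is made up. The attribute pattern in this sketch uses [^"]* between the quotes so that a match stops at the first closing quote.

```php
<?php
// Hypothetical helper combining the two steps.
function sanitize_html($data)
{
    // Step 1: drop every tag except the allowed ones. The text inside
    // a removed tag is kept; only the tag itself goes.
    $data = strip_tags($data, '<a><strong><em>');

    // Step 2: remove double-quoted on* event attributes from what is left.
    return preg_replace('/\bon[a-z]+="[^"]*"/i', '', $data);
}

$input = '<p>Hi <strong onclick="x()">there</strong><script>alert(1)</script></p>';

echo strip_tags($input, '<a><strong><em>'), "\n";
// Hi <strong onclick="x()">there</strong>alert(1)

echo sanitize_html($input), "\n";
// Hi <strong >there</strong>alert(1)
```

The first line of output shows why the second step is needed: strip_tags() leaves attributes on the allowed tags untouched.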

Conclusions

Data sanitization is an extremely important part of coding that should be observed when dealing with any amount of data entered by a user. I was taught to observe one rule: never trust the user. It sounds horrible, but it’s better not to trust the data entered by users & be wrong than to trust it and end up the victim of an exploit.

I hope this post has been helpful. If you have any useful pieces of REGEX that can be used for data sanitization then please feel free to share them in the comments. With your permission I might even add them to this post.