Simple Data Sanitization with preg_replace

PHP / by Paul Robinson
This post was published back on June 28, 2010 and may be outdated. Please use caution when following older tutorials or using older code. After reading be sure to check for newer procedures or updates to code.

Throughout this tutorial I will be using preg_replace, as mentioned in the title. I have chosen the PCRE functions over the POSIX ones (ereg and friends) because the POSIX functions have been marked deprecated since PHP version 5.3.0 & were removed completely in PHP 7.0.

First let’s look at what exactly data sanitization is & why we need it.

What Is Data Sanitization?

Data sanitization is the act of making user input data safe. That is, taking any data that is provided by the user of the application and removing or escaping anything that could possibly be harmful to the running of the application.

There are two methods I use to sanitize data.

Escaping

Escaping makes data safe by neutralising any special meaning it has to the program that receives it, so that it is treated as a plain string or plain text rather than as code or markup.
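As a rough illustration of escaping, here is a minimal sketch using PHP’s built-in htmlspecialchars(). The helper name escape_for_html() is hypothetical and only there to keep the example tidy.

```php
<?php
// Hypothetical helper: escape user input before echoing it into an HTML page.
function escape_for_html(string $input): string
{
    // ENT_QUOTES converts double and single quotes as well as <, > and &.
    return htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
}

echo escape_for_html('Data <strong>Sanitization</strong>');
// Prints: Data &lt;strong&gt;Sanitization&lt;/strong&gt;
```

The browser then shows the tag as literal text instead of making anything bold.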

Purging or Removal

Purging or removal is the act of removing data that could be damaging to the application. In most cases, though, you only need to remove the parts that make the data dangerous. For example, removing the angle brackets (< >) from a strong tag will nullify its ability to make text bold & just leave innocent text in its place.

Data Sanitization Using REGEX

In this tutorial we are going to take a quick look at using REGEX along with preg_replace() to sanitize using the purge/removal method. We are going to remove all characters that we do not expect to be included in our input data. This is useful for scripts such as simple searches where we expect our input to only contain text. It can also be useful for simple text fields in forms such as name fields where we expect the value to only contain simple text.

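Let’s assume we have some data coming in via POST from a form. Let’s also assume that our form submitted this: Data <strong>Sanitization</strong>, and that we pass it through preg_replace() with the pattern /[^a-z0-9 ]/i and an empty replacement.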
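A minimal sketch of that call, using the pattern discussed in this section. The helper name purge_to_alphanumeric() and the form field name mentioned in the comment are assumptions made for the example.

```php
<?php
// Hypothetical helper wrapping the preg_replace() call from this tutorial.
function purge_to_alphanumeric(string $input): string
{
    // Remove every character that is NOT a letter, a digit or a space.
    // preg_replace() returns null on a regex error, so fall back to ''.
    return preg_replace('/[^a-z0-9 ]/i', '', $input) ?? '';
}

// In a real script the value would come from the form,
// e.g. $_POST['data'] (the field name is an assumption).
$data = 'Data <strong>Sanitization</strong>';
echo purge_to_alphanumeric($data);
```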

This will result in:

Data strongSanitizationstrong

Why? Well, because of our REGEX pattern. That’s the strange /[^a-z0-9 ]/i part. The square brackets define a set of characters & the caret (^) at the start tells it that we want to find characters that are not in that set. The characters themselves are written like that because they mean match from a to z & 0 to 9; the space is there so it doesn’t remove spaces too. The i at the end tells it to be case insensitive. The second parameter is the replacement, here an empty string, and the last parameter is the variable that contains the data. You may notice the strong text is still there. We didn’t remove the entire tag as there is a PHP function already available to remove unwanted HTML tags. This was just an example to explain how preg_replace works.
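For comparison, the built-in function for removing tags is strip_tags(). A minimal sketch on the same input:

```php
<?php
$data = 'Data <strong>Sanitization</strong>';

// strip_tags() removes the tags themselves but keeps the text between them.
$plain = strip_tags($data);

echo $plain; // Data Sanitization
```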

So you may be wondering why this helps stop malicious attacks. Well, one example is data sent through $_GET. You may have noticed a post by Lisa Marie yesterday highlighting a problem with the Zazzle Store builder. The problem was that JavaScript could be passed through to one of the variables that output into the HTML, because the variable wasn’t checked for characters that weren’t necessary for the job it performed. Adding REGEX similar to the one above, stripping out anything the variable didn’t require (it only needed letters, numbers, commas, and underscores), means we could nullify any JavaScript injection by removing the special characters needed for it to be parsed. Anyone that did attempt it would find their JavaScript reduced to harmless text.
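A sketch of that kind of check, allowing only letters, numbers, commas and underscores. The helper name and the sample value are made up for illustration; this is not the store builder’s actual code.

```php
<?php
// Hypothetical helper: keep only letters, digits, commas and underscores.
function keep_list_characters(string $input): string
{
    return preg_replace('/[^a-z0-9,_]/i', '', $input) ?? '';
}

// In a real script the value would come from $_GET (parameter name assumed).
$value = 'shirts,mugs<script>alert(1)</script>';

echo keep_list_characters($value);
// Prints: shirts,mugsscriptalert1script
```

With the angle brackets, slash and parentheses gone, the browser has nothing left to parse as a script.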

Matching Harmful Attributes

So what if you want to allow HTML but not JavaScript-based attributes such as onclick & onmouseover? Well, you could try this:
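The sketch below reconstructs the pattern from the description that follows, so treat the exact expression as an assumption rather than the original listing. Both function names are hypothetical, and the second function is a tighter variant added for comparison.

```php
<?php
// Pattern reconstructed from the tutorial's description (an assumption).
function remove_event_attributes(string $html): string
{
    return preg_replace("/on[a-z]+=\".*\"/i", '', $html) ?? '';
}

// Hypothetical tighter variant: \b stops it matching inside words such as
// "content", and [^"]* stops at the attribute's own closing quote instead
// of the last quote on the line. It only handles double-quoted values.
function remove_event_attributes_strict(string $html): string
{
    return preg_replace('/\bon[a-z]+\s*=\s*"[^"]*"/i', '', $html) ?? '';
}

echo remove_event_attributes('<a href="#" onclick="alert(\'hi\');">text</a>'), "\n";
echo remove_event_attributes_strict('<a onclick="go()" href="page.html">text</a>'), "\n";
```

The difference shows up when another attribute follows the event handler: the greedy version would also swallow the href in the second example, while the stricter one leaves it alone.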

I’m no expert on REGEX, but this seems to work well for me. We match the word ‘on’ followed by any number of letters (+ means one or more) until it hits an equals symbol, then a quote, which must be backslashed inside the PHP string. Then comes any type and number of characters (.*), then a quote again. It is case insensitive, as before. This takes something like <a href="#" onclick="alert('hi');">text</a> and removes the event attribute, turning it into <a href="#" >text</a>. Bear in mind that .* is greedy, so the match runs to the last quote on the line, not necessarily the attribute’s own closing quote.

You may be thinking what about harmful HTML tags? Well your complete code could be this:
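A sketch of how the two steps might be combined. The function name sanitize_html(), the list of allowed tags and the exact pattern are all assumptions made for the example.

```php
<?php
// Hypothetical helper combining strip_tags() with the event-attribute REGEX.
function sanitize_html(string $input): string
{
    // Tags we are willing to keep (an assumed list; adjust to your needs).
    $allowed = '<a><strong><em>';

    // Step 1: remove every tag that is not in the allowed list.
    $clean = strip_tags($input, $allowed);

    // Step 2: remove on* event attributes from the tags that remain.
    return preg_replace("/on[a-z]+=\".*\"/i", '', $clean) ?? '';
}

echo sanitize_html('<div><a href="#" onclick="alert(\'hi\');">text</a></div>');
// Prints: <a href="#" >text</a>
```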

That would use strip_tags() to remove all HTML tags except the ones in your list. Then the REGEX would remove any Javascript event attributes in the HTML tags that remain.

Conclusions

Data sanitization is an extremely important part of coding that should be observed when dealing with any amount of data entered by a user. I was taught to observe one rule: never trust the user. It sounds horrible, but it’s better not to trust the data entered by users & be wrong than to trust it and end up falling victim to an exploit.

I hope this post has been helpful. If you have any useful pieces of REGEX that can be used for data sanitization then please feel free to share them in the comments. With your permission I might even add them to this post.

6 Comments

hi paul, i keep finding you in the google results for random things. i like the graphic of soap/sanitizer. 🙂 i’m trying to get my head around data sanitation. regex is still kind of voodoo in my book.

Reply from the post author:

Hi Kathy.

To be honest I’m just glad Google is keeping me in their results. I haven’t had a lot of time to do any new posts recently. 🙁

REGEX is, and always will be, in my mind voodoo. It is some sort of magic. I half expect Harry Potter (although I’d be happier if it was Hermione) to pop up behind me casting spells. Haha.

In your first example above, the result is wrong. The output is:

Data strongSanitizationstrong

The angled brackets and forward slash are stripped out and replaced by no character.

Reply from the post author:

Hi Kerry,

Indeed you are correct. It should have said that in the first place. Sorry about that; I’ve corrected the post.

Removing on* attributes is really handy, but what happens if you need to submit code and display it in pre tags?

Reply from the post author:

Well that would be a whole different problem.

If the code you want to show is submitted separately, you could just convert the characters that make it dangerous to harmless entities using htmlentities().

If the code is all together then I’m not sure. You could use preg_match() to find any code inside a pre tag. Convert to entities then put it back. I’m not sure of the full code, and there may be a much easier way of doing it, but that’s the best I can think of at the moment.
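One way that idea might look, as a rough sketch. It uses preg_replace_callback() rather than preg_match() so that finding, converting and putting back happen in one pass. The function name is hypothetical, and it assumes plain <pre> tags without attributes.

```php
<?php
// Hypothetical helper: entity-encode whatever sits inside <pre>...</pre>.
function escape_pre_blocks(string $html): string
{
    return preg_replace_callback(
        // s lets . match newlines, i ignores case, .*? stops at the first </pre>.
        '#<pre>(.*?)</pre>#is',
        function (array $m): string {
            // $m[1] is the text between the tags; convert it and put it back.
            return '<pre>' . htmlentities($m[1], ENT_QUOTES, 'UTF-8') . '</pre>';
        },
        $html
    ) ?? '';
}

echo escape_pre_blocks('<p>Demo</p><pre><b onclick="x()">hi</b></pre>');
// Prints: <p>Demo</p><pre>&lt;b onclick=&quot;x()&quot;&gt;hi&lt;/b&gt;</pre>
```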
