I know there has been a lot of discussion over the years about the best methods of filtering data with PHP, but I would like to take the whitelist approach in my current project.
I only want a user to be able to use the following HTML:
<b>bold</b>
<i>italics</i>
<u>underline</u>
<s>strikethrough</s>
<big>Big size</big>
<small>Small size</small>
Hyperlink <a href="http://www.site.com">website</a>
A Bulleted List:
<ul>
<li>One Item</li>
<li>Another Item</li>
</ul>
An Ordered List:
<ol>
<li> First Item</li>
<li> Second Item</li>
</ol>
<blockquote>Because it is indented</blockquote>
<h1>Heading 1</h1>
<h2>Heading 2</h2>
<h3>Heading 3</h3>
Can anyone show me the best-performing way to do this in PHP? In the past I have only ever allowed all HTML minus certain tags.
The simplest solution would be strip_tags(),
which accepts a second argument containing allowable tags:
$clean = strip_tags($string, "<b><i><u><a><s><big><small><ul><li><ol><blockquote><h1><h2><h3>");
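One caveat worth seeing in action: strip_tags() filters tag names only, so attributes on the allowed tags pass through untouched. A minimal sketch follows; the $input string is made up for illustration and is not from the question.

```php
<?php
// Allowed tags from the question, in the string form strip_tags() accepts.
$allowed = '<b><i><u><s><big><small><a><ul><ol><li><blockquote><h1><h2><h3>';

// Hypothetical user input, for illustration only.
$input = '<b onclick="alert(1)">bold</b> <script>alert(2)</script> '
       . '<a href="javascript:alert(3)">link</a>';

$output = strip_tags($input, $allowed);
echo $output, PHP_EOL;

// The <script> tags are removed (their text content stays behind), but the
// onclick attribute and the javascript: href on the allowed tags survive.
```

So strip_tags() alone gives a tag whitelist but not an attribute whitelist; if the output is shown to other users, the attributes still need separate handling.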
I believe the HTML Purifier Library will work nicely:
HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications. Tired of using BBCode due to the current landscape of deficient or insecure HTML filters? Have a WYSIWYG editor but never been able to use it? Looking for high-quality, standards-compliant, open-source components for that application you're building? HTML Purifier is for you!
Another route is using strip_tags() with its second argument, which lists the tags to keep.
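For the HTML Purifier route, a whitelist like the one in the question could be configured roughly as below. This is a sketch under assumptions: it supposes the library was installed with Composer (package ezyang/htmlpurifier) into ./vendor, the function name purify_user_html is made up here, and the element list assumes the library's default doctype accepts all of these tags.

```php
<?php
// Assumption: HTML Purifier was installed with Composer, so its classes
// are reachable through vendor/autoload.php next to this script.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require_once $autoload;
}

// Hypothetical helper name; not part of HTML Purifier itself.
function purify_user_html(string $dirty): string
{
    $config = HTMLPurifier_Config::createDefault();

    // Whitelist of elements. "a[href]" keeps only the href attribute on
    // links; attributes that are not listed are dropped.
    $config->set(
        'HTML.Allowed',
        'b,i,u,s,big,small,a[href],ul,ol,li,blockquote,h1,h2,h3'
    );

    $purifier = new HTMLPurifier($config);

    return $purifier->purify($dirty);
}
```

Since performance was part of the question: building the config and purifier on every call is the expensive part, so in a real application it is probably worth creating the HTMLPurifier object once and reusing it.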
I would run the submitted code through tidy to normalize it first, and then use XPath or apply XSLT to select only the allowed elements. This way, nothing can leak. Bear in mind, too, that a typical website serves thousands, if not hundreds of thousands, of read requests for every write request (the only kind that runs tidy and XPath/XSLT), so on average the performance impact is negligible. If you are doing batch processing, on the other hand, the cost may matter.
Edit: oh, and DON'T do this with regular expressions. HTML is not a regular language, so it is mathematically impossible to do it correctly.
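The parse-then-select idea can be sketched with PHP's bundled DOM extension. Note the substitution: this uses DOMDocument::loadHTML() in place of tidy for the normalization step, and a DOMXPath walk in place of XSLT. The function name filter_allowed_html and the choice of permitted URL schemes are assumptions made for this example.

```php
<?php
// Hypothetical helper: keep whitelisted elements, unwrap everything else.
function filter_allowed_html(string $html): string
{
    $allowedTags = ['b', 'i', 'u', 's', 'big', 'small', 'a', 'ul', 'ol',
                    'li', 'blockquote', 'h1', 'h2', 'h3'];
    // Assumption: only these URL schemes are acceptable in links.
    $allowedSchemes = ['http', 'https', 'mailto'];

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    // The meta tag tells the parser the input is UTF-8; the <div> is a
    // known wrapper so that only the fragment is serialized at the end.
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '<div>' . $html . '</div>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root  = $doc->getElementsByTagName('div')->item(0);
    $xpath = new DOMXPath($doc);

    // Copy the node list into an array first, because the tree is modified.
    foreach (iterator_to_array($xpath->query('.//*', $root), false) as $node) {
        $name = strtolower($node->nodeName);

        if ($name === 'script' || $name === 'style') {
            $node->parentNode->removeChild($node); // drop element and content
            continue;
        }

        if (!in_array($name, $allowedTags, true)) {
            // Unwrap: keep the children, remove the element itself.
            while ($node->firstChild) {
                $node->parentNode->insertBefore($node->firstChild, $node);
            }
            $node->parentNode->removeChild($node);
            continue;
        }

        // Allowed element: strip every attribute except a safe href on <a>.
        // Relative URLs have no scheme and are dropped too in this sketch.
        foreach (iterator_to_array($node->attributes, false) as $attr) {
            $scheme = strtolower((string) parse_url(trim($attr->nodeValue), PHP_URL_SCHEME));
            $keep = $name === 'a'
                && strtolower($attr->nodeName) === 'href'
                && in_array($scheme, $allowedSchemes, true);
            if (!$keep) {
                $node->removeAttributeNode($attr);
            }
        }
    }

    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo filter_allowed_html('<b onclick="x()">hi</b><span>ok</span>'), PHP_EOL;
```

Because the filter works on a parsed tree rather than on the raw string, malformed or oddly nested markup is normalized by the parser before the whitelist is applied, which is the property the answer is after.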