I have been at this on and off for a few days, but my RegEx mastery is not great. Yes, I understand that RegEx is not for parsing HTML. I am doing server-side "cleaning" of CKEditor input; CKEditor already does this, but only client-side.
After stripping non-whitelisted tags...
First: $html = preg_replace('/ on\w+=(["\'])[^\1]*?\1/', '', $html);
This removes all event attributes properly quoted with either ' or " quotes.
Second: $html = preg_replace('/ on\w+=\S+/', '', $html);
This removes the ones that have no quotes but can still fire, e.g. onclick=blowUpTheBase()
What I would like to do is ensure the onEvent is between < and >, but I can only get it to work if the onEvent attribute is the first one after a tag. Everything I try ends up capturing most of the code. I just can't get it lazy enough.
e.g. $html = preg_replace('#<([\s\S]?)( on\w+=\S+) ([\s\S]*?)>#', '<$1 $3>', $html);
EDIT: I am going to select @colburton's answer because RegEx is what I asked for. I will also use it for my particular situation because it will do the trick (it is an internal application anyhow).
BUT
I want to thank @Casimir et Hippolyte for his answer because it gives a great example and explanation of how to do this the "right way". I will shortly write up a function using DOMDocument, and it will become my go-to way of handling RTE/WYSIWYG/HTML input.
Maybe I should have mentioned this from the start: this is not how you should try to filter XSS. This is purely academic, within the parameters you proposed (i.e. "use RegEx").
This gets you pretty close:
$string = preg_replace('/(<.+?)(?<=\s)on[a-z]+\s*=\s*(?:([\'"])(?!\2).+?\2|(?:\S+?\(.*?\)(?=[\s>])))(.*?>)/i', '$1 $3', $string);
(PHP has no g modifier; preg_replace already replaces every match it finds.)
Tested on
<a href="something" onclick="bad()">text</a> onclick not in tags
<a href="something" onclick=bad()>text</a>
<a href="something" onclick="bad()" >text</a>
<meta name="keywords" content="keyword1, keyword2, keyword3">
<a href="something" onclick= "bad()">text</a> onclick not in tags
<a href="something" onclick =bad()>text</a>
<a href="something" onclick=bad('test')>text</a>
<a href="something" onclick=bad("test")>text</a>
<a href="something" onclick="bad()" >text</a>
What if I write john+onelia=love forever?
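To try the pattern locally rather than on regex101, here is a minimal sketch. The function name stripEventAttributes is made up for illustration, and the loop is my own addition: as far as I can tell, a single pass only removes the first handler in each tag, because the trailing group swallows the rest of the tag, so the sketch repeats the replacement until the string stops changing.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the answer).
// Applies the pattern from the answer with a PHP-valid modifier list.
function stripEventAttributes(string $html): string
{
    $pattern = '/(<.+?)(?<=\s)on[a-z]+\s*=\s*(?:([\'"])(?!\2).+?\2|(?:\S+?\(.*?\)(?=[\s>])))(.*?>)/i';

    // Repeat until nothing changes, so that several on* attributes
    // inside the same tag are all removed.
    do {
        $before = $html;
        // preg_replace returns null on a PCRE error; keep the input in that case.
        $html = preg_replace($pattern, '$1 $3', $before) ?? $before;
    } while ($html !== $before);

    return $html;
}

// Usage: run it over a few of the test lines from above.
$samples = [
    '<a href="something" onclick="bad()">text</a> onclick not in tags',
    '<a href="something" onclick =bad()>text</a>',
    'What if I write john+onelia=love forever?',
];
foreach ($samples as $sample) {
    echo stripEventAttributes($sample), PHP_EOL;
}
```

Note that the unquoted branch only matches values that look like a call with parentheses, so this is still "pretty close" rather than complete.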
Play around here: https://regex101.com/r/GMBaQs/9
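For comparison, the DOMDocument route mentioned in the question's edit could look roughly like the sketch below. This is not the code from the other answer; the function name removeEventAttributesWithDom and the wrapper div are my own assumptions, and it ignores character-encoding handling and everything else a real XSS filter needs (javascript: URLs, style attributes, and so on).

```php
<?php
// Hypothetical helper (name and approach are assumptions for illustration).
// Parses the fragment with DOMDocument and drops every attribute whose
// name starts with "on", instead of pattern-matching the raw markup.
function removeEventAttributesWithDom(string $html): string
{
    $doc = new DOMDocument();

    // Suppress libxml warnings about imperfect markup while parsing.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a div so it can be found again afterwards.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Collect all attributes first, then remove the on* ones.
    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query('//@*')) as $attr) {
        if (stripos($attr->nodeName, 'on') === 0) {
            $attr->ownerElement->removeAttributeNode($attr);
        }
    }

    // Serialize only the children of the wrapper div.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo removeEventAttributesWithDom('<a href="something" onclick="bad()">text</a> onclick not in tags'), PHP_EOL;
```

Because the parser decides what is an attribute, text such as "onclick not in tags" outside a tag is left alone without any lookbehind tricks.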