I've been investigating this problem for several hours now and narrowed it down to these few lines of code. I know the code isn't perfect, but it's what I've got to work with from the developer. The script is supposed to filter out potentially malicious code. But the problem is that the string seems to become empty whenever someone uses a special character, such as á, ñ, ö, etc.
For example, if someone writes "viva españa", the string goes empty.
If someone writes "viva espana" (without the ñ), it's all good.
The same goes for other special characters. What could be causing this? I have virtually zero knowledge about regular expressions, so it's a bit like garbage to me, but what I do know is that when I comment out these lines, the script works both with and without the special characters in the string and the moment I uncomment them, it only works without special characters in the string.
Any ideas?
These are the code lines:
$string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#u', "$1;", $string);
$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string);
$string = preg_replace('#(<[^>]+[\x00-\x20\"\'\/])(on|xmlns)[^>]*>#iUu', "$1>", $string);
$string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iUu', '$1=$2nojavascript...', $string);
$string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iUu', '$1=$2novbscript...', $string);
$string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*-moz-binding[\x00-\x20]*:#Uu', '$1=$2nomozbinding...', $string);
$string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*data[\x00-\x20]*:#Uu', '$1=$2nodata...', $string);
$string = preg_replace('#(<[^>]+[\x00-\x20\"\'\/])style[^>]*>#iUu', "$1>", $string);
I would suggest not using u
. That flag specifies that the string is in Unicode, but you're only working with strings in the ASCII range.