Based on the HTML syntax docs and trial and error in the validator, I believe that the allowed characters in HTML attribute names are:
For example these validate:
<p data-éxample>
<p data-1.5>
I want to write a function for sanitizing attribute names:
<?php
function sanitize_attr_name( $name ) {
    return is_string( $name ) ? preg_replace( '/[^\w\-\.]/', '', $name ) : '';
}
That works except for the special alpha characters:
sanitize_attr_name( 'data-éxample' ); // 'data-xample'
It may seem crazy for someone to use characters like that, but it does in fact work, although the CSS doesn't seem to validate, whether the character is escaped or not.
How do you pull that off in PHP? How could the sanitizer be written to allow for the special alpha characters? Is that possible via regexp? And why is ctype_graph('é') false?
PHP's regex engine, PCRE, supports Unicode character properties with \p{property}. One of these properties is L, which is the property of any letter. So you could just replace \w by \p{L}0-9_:
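'/[^\p{L}0-9_.-]/u'
Note the u modifier: without it, PCRE treats the subject as single bytes rather than as UTF-8, so a multibyte character such as é is not matched as one letter.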
There is also no need to escape periods in character classes, and hyphens can be put at the end to avoid escaping.
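Putting that together, a minimal sketch of the sanitizer might look like the following. It assumes the input is UTF-8 encoded and uses the u modifier so that \p{L} is matched against whole characters; the null check on the result is an addition of this sketch, not part of the original function.

```php
<?php
// Sketch only: assumes $name is a UTF-8 encoded string.
function sanitize_attr_name( $name ) {
    if ( ! is_string( $name ) ) {
        return '';
    }
    // \p{L} matches any Unicode letter. The u modifier makes PCRE read the
    // pattern and the subject as UTF-8 instead of as single bytes.
    $clean = preg_replace( '/[^\p{L}0-9_.-]/u', '', $name );
    // preg_replace() returns null on error, e.g. for malformed UTF-8 input.
    return is_string( $clean ) ? $clean : '';
}

echo sanitize_attr_name( 'data-éxample' ), "\n"; // data-éxample
echo sanitize_attr_name( 'data-1.5' ), "\n";     // data-1.5
```

Anything outside letters, digits, underscore, period, and hyphen is still stripped, so an input such as 'on click="x"' comes back as 'onclickx'.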