I'm trying to escape <br /> and the like in my Magento meta description.
So I've come up with this:
$characters = array("<br />", "<br>", "<br/>");
$badDesc = htmlspecialchars($this->getDescription());
$goodDesc = preg_replace($characters, ' ', $badDesc);
but the only part that gets replaced is "br /"; the "< >" remains.
What should I do?
Perhaps this is worth a shot (note: untested)
$desc = preg_replace('/\<br\b[^>]*>/i', ' ', $this->getDescription());
The expression explained:
\<br is a literal match for the string <br (the backslash is harmless but not required).
\b is a word boundary, a zero-width position where a word character meets a non-word character or the start or end of the string. preg_match('/\bbar\b/', 'foobar') will not match, but preg_match('/\bbar\b/', 'foo bar') will. Here it ensures that <br is not followed by another letter, so something like <break> is left alone.
[^>]* matches all characters except a literal >. The asterisk states that this character class may occur zero or more times: with <br />, for example, it will match " /" (the space and the forward slash). Given <br>, this part is skipped (it occurs zero times).
> is a literal match for the closing > character.
If your markup is valid (i.e. not malformed), this expression will remove nothing you don't want to remove. But given a string like this:
<br data-string="<b>Don't include markup here</b>"/>
this expression will fail: there is an attribute that contains markup, and the match ends at the first > inside it. That is something I, personally, find revolting. You don't include markup in an attribute of a tag, IMO.
Another case where regex lets its guard down is when encountering malformed markup:
<br/The closing &gt; was omitted</p>
The regex will match the opening <br, then the [^>]* will match:
/The closing &gt; was omitted</p
only to match the > of </p> as the end of the br tag. But that's just the "fault" of whoever wrote the markup...
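The three cases described above can be checked with a short script. This is only a sketch: the pattern is the one from this answer, while the sample strings and variable names are made up for illustration.

```php
<?php
// Same pattern as in the answer above; the sample strings are hypothetical.
$pattern = '/\<br\b[^>]*>/i';

// Well-formed variants are each replaced by a single space.
// The \b keeps the made-up <break> tag intact.
$valid = 'one<br>two<br/>three<BR />four<break>';
$cleaned = preg_replace($pattern, ' ', $valid);
// $cleaned is 'one two three four<break>'

// Markup inside an attribute ends the match at the first >.
$nested = '<br data-string="<b>x</b>"/>';
$leftover = preg_replace($pattern, ' ', $nested);
// $leftover is ' x</b>"/>'

// A <br/ that is never closed swallows text up to the next >.
$malformed = '<br/The closing &gt; was omitted</p>';
$swallowed = preg_replace($pattern, ' ', $malformed);
// $swallowed is ' '
```

In the last two cases the expression still runs without error; it just removes more or less than intended.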
It is a little-known fact that preg_* functions can use matching brackets (parentheses, square brackets, braces or angle brackets) as delimiters. This is especially helpful as it means you don't have to escape those brackets inside the regex itself. Personally I like to use parentheses, as that helps me remember that "index 0" of the match array represents the entire match.
Anyway, in this case your angle brackets are being used as delimiters, making the expression search for br /, br and br/.
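To see the delimiter effect in isolation, here is a small sketch using the first pattern from the question. The input string is invented for the example.

```php
<?php
// Hypothetical input, escaped first as in the question.
$escaped = htmlspecialchars('Line one<br />Line two');
// $escaped is 'Line one&lt;br /&gt;Line two'

// < and > are taken as the delimiters, so the regex that
// actually runs is just "br /".
$result = preg_replace('<br />', ' ', $escaped);
// $result is 'Line one&lt; &gt;Line two'

// The same regex written with conventional slash delimiters.
$same = preg_replace('/br \//', ' ', $escaped);
```

That leftover "&lt; &gt;" is the "< >" the question reports seeing.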
Use str_replace instead. You don't need preg_* for constant strings.
EDIT: That said, you're using htmlspecialchars first. In addition to using str_replace, make sure you are doing the replacement before you mangle the HTML ;)
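A minimal sketch of that order of operations, assuming a made-up description string in place of $this->getDescription():

```php
<?php
// Hypothetical raw description; in Magento it would come from
// $this->getDescription().
$raw = 'Soft cotton<br />Machine washable<br>Imported & new';

// Replace the constant strings while the tags are still intact...
$characters = array('<br />', '<br>', '<br/>');
$spaced = str_replace($characters, ' ', $raw);

// ...and only then escape what is left.
$goodDesc = htmlspecialchars($spaced);
// $goodDesc is 'Soft cotton Machine washable Imported &amp; new'
```

Note that str_replace is case-sensitive, so an uppercase <BR> would slip through; str_ireplace is the case-insensitive variant.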
Try this regexp:
$desc = preg_replace('/\<br(\s.*)?\/?\>/i', " ", $this->getDescription());
Adapted from a comment in the php docs.
Since you are preparing a string to use as a meta description, you should consider using strip_tags to remove all HTML tags.
$description = strip_tags($this->getDescription());
The function also takes a second argument listing the tags to keep:
// strips every tag except <a> and <p>
$description = strip_tags($this->getDescription(), "<a><p>");
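One thing to watch for: strip_tags removes a tag without putting anything in its place, so words on either side of a <br /> end up joined. A possible workaround, sketched here with an invented input string, is to turn line breaks into spaces before stripping the rest.

```php
<?php
// Hypothetical description markup.
$raw = '<p>Soft cotton<br />Machine <b>washable</b></p>';

// Stripping directly fuses the words around the <br />.
$fused = strip_tags($raw);
// $fused is 'Soft cottonMachine washable'

// Replacing the line breaks with a space first keeps them apart.
$spaced = strip_tags(preg_replace('/<br\b[^>]*>/i', ' ', $raw));
// $spaced is 'Soft cotton Machine washable'
```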