Regular expression to find HTML comments (<!-- some string -->)

I have traditionally used this regexp to find and remove HTML comments:

//remove HTML comments
$HTML = preg_replace('/<!--(.|\s)+?-->/','',$HTML);

However, on one server it is apparently crashing (it works fine on my VM, but that machine is pretty high-powered).

The logic is: start the comment, then match any character or whitespace (at least one, hence the +), where the ? means "don't be greedy; stop at the first --> you find".

Is there a better way to write this, esp. the (.|\s)+? part?

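Without a crash log, it's impossible to know for sure whether your expression is the culprit. Assuming it is, though, the likely cause is excessive backtracking (or deep recursion inside PCRE) from the (.|\s)+? group: the two alternatives overlap on spaces and tabs, and the group is re-entered once for every character of the comment. Greediness itself is not the problem, since +? is already lazy.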
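If the failure is a PCRE limit rather than a hard crash, PHP can report it. The following is a minimal sketch, assuming the original pattern is the one being run; remove_comments_checked is a hypothetical helper name, not something from the question. It will not catch a segfault, only the errors that preg_last_error() exposes.

```php
<?php
// Hypothetical helper: runs the original pattern and reports PCRE errors.
// Returns [resulting HTML, error description or null].
function remove_comments_checked(string $html): array
{
    $result = preg_replace('/<!--(.|\s)+?-->/', '', $html);

    if ($result !== null) {
        return [$result, null];
    }

    // preg_replace() returns null on failure; preg_last_error() says why.
    $names = [
        PREG_INTERNAL_ERROR        => 'internal error',
        PREG_BACKTRACK_LIMIT_ERROR => 'backtrack limit exhausted',
        PREG_RECURSION_LIMIT_ERROR => 'recursion limit exhausted',
        PREG_JIT_STACKLIMIT_ERROR  => 'JIT stack limit exhausted',
    ];
    $code = preg_last_error();

    // Hand back the untouched input together with a readable reason.
    return [$html, $names[$code] ?? "PCRE error code $code"];
}

[$clean, $error] = remove_comments_checked("<p>a</p><!-- note --><p>b</p>");
echo $error === null ? $clean : "regex failed: $error", PHP_EOL;
```

Logging the error string on the problem server would at least show whether the regex is hitting a limit there.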

Not that I advocate using regular expressions to parse HTML (you'd be better off using DOMDocument), but if you continue down the regex path, use:

$HTML = preg_replace('/<!--([\s\S]+?)-->/','',$HTML);

instead. [\s\S] matches any character, whitespace or not, including newlines, and because it is a single character class rather than an alternation, it shouldn't blow up due to backtracking.

Example: https://regex101.com/r/qR1xW1/1
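For the DOMDocument route mentioned above, here is a rough sketch of what it could look like. It assumes the dom extension is installed; strip_comments_dom is a hypothetical name chosen for illustration. Keep in mind that loadHTML() normalizes the markup (it may add a doctype and wrapper elements), so the output can differ from the input in more than just the removed comments.

```php
<?php
// Hypothetical helper; assumes ext-dom is available.
function strip_comments_dom(string $html): string
{
    $doc = new DOMDocument();

    // Real-world markup is rarely strictly valid; collect parser
    // warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Snapshot the node list first so removing nodes does not
    // disturb the iteration.
    foreach (iterator_to_array($xpath->query('//comment()')) as $comment) {
        $comment->parentNode->removeChild($comment);
    }

    return $doc->saveHTML();
}

echo strip_comments_dom('<html><body><p>one</p><!-- gone --><p>two</p></body></html>');
```

This avoids regex limits entirely, at the cost of a full parse and a re-serialized document.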

If the file is particularly large, that might be what is causing the crashes on the other machine. The way that I would've written this is as follows:

<!--(.+?)-->

There's probably not much of a performance improvement, if any at all.

Regex101
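One caveat with <!--(.+?)-->: in PCRE, . does not match a newline by default, so a comment that spans several lines is left in place unless the s (DOTALL) modifier is added. A small sketch to illustrate, with a made-up sample string:

```php
<?php
// Sample input (invented for illustration) with a multi-line comment.
$html = "<p>a</p><!--\nmulti-line\ncomment\n--><p>b</p>";

// Without the s modifier, . stops at newlines, so nothing is removed.
$withoutS = preg_replace('/<!--(.+?)-->/', '', $html);

// With the s modifier (PCRE_DOTALL), . also matches newlines.
$withS = preg_replace('/<!--(.+?)-->/s', '', $html);

var_dump($withoutS === $html); // bool(true)
echo $withS, PHP_EOL;          // <p>a</p><p>b</p>
```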

You can try this:

/<!\-\-[\w\s\S]+?\-\->/
  • <! matches the characters <! literally
  • \- matches the character - literally
  • \- matches the character - literally
  • [\w\s\S]+? matches a single character from the list below, one or more times, as few times as possible (lazy)
      • \w matches any word character [a-zA-Z0-9_]
      • \s matches any whitespace character [\r\n\t\f ]
      • \S matches any non-whitespace character [^\r\n\t\f ]
  • \- matches the character - literally
  • \- matches the character - literally
  • > matches the character > literally
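Two things in that pattern are redundant: \w is already covered by \S, and - does not need escaping outside a character class. The sketch below, using an invented sample string, checks that the shorter form gives the same result:

```php
<?php
// Sample input (invented for illustration).
$html = "<div><!-- one --><p>kept</p><!--\ntwo\n--></div>";

// Pattern exactly as given in the answer.
$verbose = preg_replace('/<!\-\-[\w\s\S]+?\-\->/', '', $html);

// Shorter form: no escaped hyphens, no \w inside the class.
$short = preg_replace('/<!--[\s\S]+?-->/', '', $html);

echo $verbose, PHP_EOL;        // <div><p>kept</p></div>
var_dump($verbose === $short); // bool(true)
```

Because +? requires at least one character, neither form removes an empty comment such as <!---->; using *? instead would cover that case.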