Remove all empty HTML tag pairs in PHP

I'm looking for a way to strip out all empty HTML tag pairs, such as <strong></strong> and <p class="bold"></p> from a string. While it is relatively easy to find a regular expression for this purpose, I can't find one that would reliably work with PHP's preg_replace(). Here's one of the functions that I have tried (taken from https://stackoverflow.com/a/5573115/1784564):

function strip_empty_tags($text) {
    // Match empty elements (attribute values may have angle brackets).
    $re = '%
        # Regex to match an empty HTML 4.01 Transitional element.
        <                    # Opening tag opening "<" delimiter.
        ((?!iframe)\w+)\b    # $1 Tag name.
        (?:                  # Non-capture group for optional attribute(s).
          \s+                # Attributes must be separated by whitespace.
          [\w\-.:]+          # Attribute name is required for attr=value pair.
          (?:                # Non-capture group for optional attribute value.
            \s*=\s*          # Name and value separated by "=" and optional ws.
            (?:              # Non-capture group for attrib value alternatives.
              "[^"]*"        # Double quoted string.
            | \'[^\']*\'     # Single quoted string.
            | [\w\-.:]+      # Non-quoted attrib value can be A-Z0-9-._:
            )                # End of attribute value alternatives.
          )?                 # Attribute value is optional.
        )*                   # Allow zero or more attribute=value pairs
        \s*                  # Whitespace is allowed before closing delimiter.
        >                    # Opening tag closing ">" delimiter.
        \s*                  # Content is zero or more whitespace.
        </\1\s*>             # Element closing tag.
        %x';
    while (preg_match($re, $text)) {
        // Recursively remove innermost empty elements.
        $text = preg_replace($re, '', $text);
    }

    return $text;
}

And this is HTML I've been testing against:

<strong class="a b">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.l<br class="a  b" />fd<br class="a  b" /><br class="a  b" /></strong><strong class="a b"></strong><strong class="a b"><br class="a  b" /></strong><strong class="a b"></strong><br class="a  b" /><strong class="a b"><br class="a  b" /><br class="a  b" /></strong>

So far, every method I have tried (I've been at it for 4+ hours) seems to strip some, but not all, of the empty tags, and this is driving me insane. Any help would be greatly appreciated.

You need a Unicode regex, since the sample "empty" tags are not actually empty:

$re = '~<(\w+)[^>]*>[\p{Z}\p{C}]*</\1>~u';

\p{Z} ... any kind of whitespace or invisible separator
\p{C} ... invisible control characters and unused code points

This uses the u (PCRE_UTF8) modifier; test at regex101
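As a rough sketch of how that pattern could be dropped into a loop like the one in the question (assuming PHP 7+; the function name `strip_invisible_empty_tags` and the sample string are made up for illustration, only the pattern comes from above):

```php
<?php
// Hypothetical helper; only the pattern is taken from the answer.
function strip_invisible_empty_tags(string $html): string
{
    $re = '~<(\w+)[^>]*>[\p{Z}\p{C}]*</\1>~u';
    do {
        // $count receives the number of replacements made in this pass.
        $result = preg_replace($re, '', $html, -1, $count);
        if ($result === null) {
            // preg_replace() returns null on error, e.g. invalid UTF-8 input.
            return $html;
        }
        $html = $result;
    } while ($count > 0);

    return $html;
}

// Made-up sample: U+200B (zero width space) and U+00A0 (no-break space).
$sample = "<p>keep</p><strong class=\"a b\">\u{200B}</strong><em><b>\u{00A0}</b></em>";
echo strip_invisible_empty_tags($sample), "\n"; // <p>keep</p>
```

The loop repeats until a pass removes nothing, so an element that only becomes empty once its children are gone (the `<em>` here) is caught on the next pass.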


To also include <br>, <br /> as an empty element:

$re = '~<(\w+)[^>]*>(?>[\p{Z}\p{C}]|<br\b[^>]*>)*</\1>~ui';

test at regex101
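For illustration, here is that pattern run over a shortened, made-up variant of the markup from the question, using the same while loop as in the question:

```php
<?php
$re = '~<(\w+)[^>]*>(?>[\p{Z}\p{C}]|<br\b[^>]*>)*</\1>~ui';

// Shortened, made-up variant of the markup in the question.
$html = '<strong class="a b">text<br class="a  b" /></strong>'
      . '<strong class="a b"></strong>'
      . '<strong class="a b"><br class="a  b" /></strong>'
      . '<br class="a  b" />';

while (preg_match($re, $html)) {
    $html = preg_replace($re, '', $html);
}

echo $html, "\n";
// <strong class="a b">text<br class="a  b" /></strong><br class="a  b" />
```

A `<br>` is only removed together with the otherwise empty element around it; the one inside the non-empty `<strong>` and the standalone one are left alone.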


To also match tags that contain only space entities:

$re = '~<(\w+)[^>]*>(?>[\p{Z}\p{C}]|<br\b[^>]*>|&(?:(?:nb|thin|zwnb|e[nm])sp|zwnj|#xfeff|#xa0|#160|#65279);)*</\1>~iu';

test at regex101; modify to your needs.
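A quick check of the entity variant, with a made-up input string:

```php
<?php
$re = '~<(\w+)[^>]*>(?>[\p{Z}\p{C}]|<br\b[^>]*>|&(?:(?:nb|thin|zwnb|e[nm])sp|zwnj|#xfeff|#xa0|#160|#65279);)*</\1>~iu';

// Made-up input: the first <p> and the <span> hold only entities,
// whitespace and a <br/>; the second <p> has real text around the entity.
$html = '<p>&nbsp;</p><p>a&nbsp;b</p><span>&#160; <br/>&ensp;</span>';

$clean = preg_replace($re, '', $html);
echo $clean, "\n"; // <p>a&nbsp;b</p>
```

An entity next to real text is not touched, because the whole element content has to consist of the listed alternatives for the pattern to match.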


To use a recursive regex (without the while loop):

$re = '~<(\w+)[^>]*>(?>[\p{Z}\p{C}]|<br\b[^>]*>|&(?:(?:nb|thin|zwnb|e[nm])sp|zwnj|#xfeff|#xa0|#160|#65279);|(?R))*</\1>~iu';

test at regex101
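A small sketch of the recursive version, again with made-up input. Because `(?R)` lets an element count as empty when it contains only other empty elements, a single `preg_replace()` call should be enough here:

```php
<?php
$re = '~<(\w+)[^>]*>(?>[\p{Z}\p{C}]|<br\b[^>]*>|&(?:(?:nb|thin|zwnb|e[nm])sp|zwnj|#xfeff|#xa0|#160|#65279);|(?R))*</\1>~iu';

// Made-up input: the <div> contains nothing but nested empty elements.
$html = '<div><p><span>&nbsp;</span></p><br></div><p>text</p>';

// One call, no loop; $count is the number of top-level matches removed.
$clean = preg_replace($re, '', $html, -1, $count);
echo $clean, "\n"; // <p>text</p>
```

The whole `<div>...</div>` is removed as one match, so `$count` is 1 in this example.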

Following up on my comment on Jonny 5's answer: I've excluded a couple of tags from the recursive regex, since iframe and canvas are typically fine to leave empty.

$re = '~<((?!iframe|canvas)\w+)[^>]*>(?>[\p{Z}\p{C}]|<br\b[^>]*>|&(?:(?:nb|thin|zwnb|e[nm])sp|zwnj|#xfeff|#xa0|#160|#65279);|(?R))*</\1>~iu';
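To illustrate the effect of the negative lookahead, a minimal sketch with made-up markup:

```php
<?php
$re = '~<((?!iframe|canvas)\w+)[^>]*>(?>[\p{Z}\p{C}]|<br\b[^>]*>|&(?:(?:nb|thin|zwnb|e[nm])sp|zwnj|#xfeff|#xa0|#160|#65279);|(?R))*</\1>~iu';

// Made-up input: an empty iframe inside a div, an empty canvas,
// and a genuinely empty nested pair.
$html = '<div><iframe src="x.html"></iframe></div>'
      . '<canvas id="c"></canvas>'
      . '<p><b></b></p>';

$clean = preg_replace($re, '', $html);
echo $clean, "\n";
// <div><iframe src="x.html"></iframe></div><canvas id="c"></canvas>
```

The `<div>` survives as well: the recursion cannot match the `<iframe>` inside it, so the `<div>` no longer counts as empty.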