How can I use regular expressions to find the last HTML tag that has not been closed?

Let's say I have this string:

      $string = "<html>
                 <body>
                 <h1>
                 <b>aaa</b> bbbb
                 ";

I want the result to be "h1" because it is the last unclosed tag.

Another example: if the string is

     $string = "<body>
                <img src='' alt=
               ";

the result should be "img" because it is the last unclosed tag.

I know it can be done with regular expressions, but I am not good at using them.

I doubt that it's possible to do this with just a few regular expressions, since what you are looking for is nested structure rather than a fixed pattern.

I'd go through the string using a stack: every time you see an opening tag, you push it onto the stack, and every time you find the matching closing tag, you remove it from the stack.

So after going through the first part of the first example:

<html>
  <body>
    <h1>
      <b>

Your stack should be:

html, body, h1, b

Next, b closes and you remove it from the stack, so your stack looks like this:

html, body, h1

Now the tag that's on top of your stack (h1) is always the one you're looking for.
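
The stack approach described above might be sketched in PHP as follows. This is a minimal illustration under stated assumptions, not a robust parser: the function name `lastUnclosedTag`, the list of void elements, and the extra check for a tag that is cut off at the end of the input (as in the question's `<img` example) are all additions made for illustration. It still relies on one simple regex to pick out tags, and it does not handle comments or a `>` inside attribute values.

```php
<?php
// Hypothetical helper (not from the original post): returns the name of the
// innermost tag that is still open, or null if nothing is left open.
function lastUnclosedTag(string $html): ?string
{
    // Assumption: void elements never take a closing tag, so they are not pushed.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
             'input', 'link', 'meta', 'source', 'track', 'wbr'];
    $stack = [];

    // Collect complete tags in document order; group 1 is "/" for closing tags.
    preg_match_all('~<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^<>]*>~', $html, $tags, PREG_SET_ORDER);

    foreach ($tags as $tag) {
        $name = strtolower($tag[2]);
        if ($tag[1] === '/') {
            // Closing tag: pop only when it matches the top of the stack.
            if ($stack && end($stack) === $name) {
                array_pop($stack);
            }
        } elseif (!in_array($name, $void, true) && substr($tag[0], -2) !== '/>') {
            // Opening tag that is neither void nor self-closing: push it.
            $stack[] = $name;
        }
    }

    // Assumption: a tag whose ">" is missing at the end of the input
    // (for example "<img src='' alt=") counts as the last unclosed tag.
    if (preg_match('~<([a-zA-Z][a-zA-Z0-9]*)\b[^<>]*$~', $html, $m)) {
        return strtolower($m[1]);
    }

    return $stack ? end($stack) : null;
}
```

Called with the first string from the question, this sketch should return `h1`; with the second, `img`.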

I hope you get what I mean; if not, let me know.

I almost started to write a regular expression, but I gave up after realizing that I would also have to ignore comments and strings (such as attribute values) containing text that could be mistaken for a closing tag:

      $string = "<html>
                 <body>
                 <h1>
                 <b>aaa</b> bbbb
                 <!--</h1> maybe it's silly to have such a comment but who knows-->
                 ";

My advice is to use a real parser, not a regex.
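
A full HTML parser remains the robust choice. Purely as an illustration of the two pitfalls named above, here is a minimal hand-written scanner in PHP that skips comments and ignores a `>` inside quoted attribute values. The function name `lastUnclosedTagScan` and the void-element list are assumptions made for this sketch; it does not cover `<script>` content, CDATA, or the other corner cases a real parser handles.

```php
<?php
// Hypothetical scanner (not from the original post). It walks the string once,
// skips comments and quoted attribute values, and tracks open tags on a stack.
function lastUnclosedTagScan(string $html): ?string
{
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
             'input', 'link', 'meta', 'source', 'track', 'wbr'];
    $nameChars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
    $stack = [];
    $len = strlen($html);
    $i = 0;

    while (($i = strpos($html, '<', $i)) !== false) {
        // Skip a comment as a whole so tags inside it are never seen.
        if (substr($html, $i, 4) === '<!--') {
            $end = strpos($html, '-->', $i + 4);
            if ($end === false) {
                break; // unterminated comment: nothing after it is markup
            }
            $i = $end + 3;
            continue;
        }

        $closing = ($html[$i + 1] ?? '') === '/';
        $start = $i + ($closing ? 2 : 1);
        $nameLen = strspn($html, $nameChars, $start);
        if ($nameLen === 0 || !ctype_alpha($html[$start])) {
            $i++; // a "<" that does not start a tag (e.g. "<!DOCTYPE" or "< ")
            continue;
        }
        $name = strtolower(substr($html, $start, $nameLen));

        // Find the ">" that ends this tag, ignoring any ">" inside quotes.
        $j = $start + $nameLen;
        $quote = null;
        while ($j < $len && ($quote !== null || $html[$j] !== '>')) {
            $c = $html[$j];
            if ($quote !== null) {
                if ($c === $quote) {
                    $quote = null; // closing quote of the attribute value
                }
            } elseif ($c === '"' || $c === "'") {
                $quote = $c; // opening quote of an attribute value
            }
            $j++;
        }
        if ($j >= $len) {
            // The input ended inside this tag.
            if (!$closing) {
                return $name;
            }
            break;
        }

        if ($closing) {
            if ($stack && end($stack) === $name) {
                array_pop($stack);
            }
        } elseif (!in_array($name, $void, true) && $html[$j - 1] !== '/') {
            $stack[] = $name;
        }
        $i = $j + 1;
    }

    return $stack ? end($stack) : null;
}
```

With the commented example above, this sketch should still report `h1`, because the `</h1>` inside the comment is never treated as a closing tag.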

The code below uses a couple of regexes to do the parsing. Beware, though, that real-world HTML might easily break it, for example with stray spaces or tabs inside tags. It includes an array of test cases to run through the code.

The idea here is to first clean up the HTML, then repeatedly remove every tag that has a matching closing tag, and finally return the last tag that remains.

<html>

<head><title>Last Open HTML Tag</title></head>

<body>

<h1>Last Open HTML Tag</h1>
<?php

$htmlstrings[] ="<html>
                 <body>
                 <h1>
                 <b>aaa</b> bbbb
                 ";

$htmlstrings[] ="<html>
                 <body>
                 <h3>test</h3>
                 <h1>
                 <b>aaa <i>test2</i></b> <i>test</i> bbbb
                 ";

$htmlstrings[] = "<body>
                <img src='' alt=
               ";

$htmlstrings[] = "<body>
                < img src='' alt=
               ";

$num = 1;              
foreach( $htmlstrings as $rawstring){
    // First remove whitespace in tags
    $string = preg_replace ( "/<\s*(\w)/", "<$1", $rawstring);
    // Likewise for closing tags ("< / b>" becomes "</b>"); "~" delimiters avoid escaping "/"
    $string = preg_replace ( "~<\s*/\s*(\w)~", "</$1", $string);

    $real_matches = array();

    // Find open html tag (<a ...)
    if( preg_match( "/<(\w*)\W[^><]*$/", $string, $matches) > 0){
        $real_matches = $matches;
    // Find html tag with no end tag (<h1>...)
    } else {
        $newstring = $string;
        while( true){
            $newstring = preg_replace( "/<(\\w*)>[^<>]*<\\/\\1>/s", "", $string);
            if( $newstring == $string){
                break;
            }
            $string = $newstring;
        }
        preg_match( "/<(\\w*)>[^<>]*$/", $newstring, $matches);
        $real_matches = $matches;
    }

    echo "<p>Parse $num
";
    $rawstring = preg_replace ( "/</is", "&lt;", $rawstring);
    $rawstring = preg_replace ( "/>/is", "&gt;", $rawstring);
    echo "<br>$rawstring
";
    foreach( $real_matches as $match){
        $result = preg_replace ( "/</is", "&lt;", $match);
        $result = preg_replace ( "/>/is", "&gt;", $result);
         echo "<br>" . $result . "
";
    }
    $num++;

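    // Use $real_matches so a failed match prints a placeholder instead of raising a warning
    echo "<br>LAST OPEN TAG: " . ($real_matches[1] ?? "(none)") . "
";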
} 

?>
</body>
</html>