I want to remove all elements that are not closed properly at the end of the content, e.g. in the text below:
commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea
voluptate velit esse quam nihil molestiae consequatur,
vel illum qui dolorem eum fugiat quo voluptas nulla
pariatur? <a rel="nofollow" class="underline"
I want to remove:
<a rel="nofollow" class="underline"
or elements without closing tags, e.g.
<h2>sample text
or any other HTML element that is not closed properly at the end.
I have written a function that should do what you want. The idea is to first replace all valid tag sequences with a #### pattern (one # per character, so offsets stay the same). A regular expression then removes everything from the first remaining < to the end of the string. After that, the valid tag sequences are put back into the buffer (unless that part was removed because an invalid tag came before it).
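To make the three steps concrete, here is a minimal sketch on a tiny input. It is not the full function: it assumes a simplified, non-recursive pattern that only matches flat <tag>text</tag> pairs, and the name maskTruncateRestore is made up for illustration.

```php
<?php
// Hypothetical helper for illustration; returns the buffer after each step.
function maskTruncateRestore($input) {
    // Simplifying assumption: flat <tag>text</tag> pairs only, no nesting.
    $pattern = '/<(\w+)[^>]*>[^<]*<\/\1\s*>/';
    preg_match_all($pattern, $input, $matches, PREG_OFFSET_CAPTURE);

    // Step 1: overwrite every valid tag sequence with '#' characters.
    $masked = $input;
    foreach ($matches[0] as $match) {
        $len = strlen($match[0]);
        $masked = substr_replace($masked, str_repeat('#', $len), $match[1], $len);
    }

    // Step 2: any '<' still visible belongs to an unclosed tag; cut from there.
    $truncated = preg_replace('/<.*$/s', '', $masked);

    // Step 3: put back the sequences that lie inside the surviving part.
    $restored = $truncated;
    foreach ($matches[0] as $match) {
        $len = strlen($match[0]);
        if ($match[1] + $len <= strlen($restored)) {
            $restored = substr_replace($restored, $match[0], $match[1], $len);
        }
    }

    return array('masked' => $masked, 'truncated' => $truncated, 'restored' => $restored);
}

$stages = maskTruncateRestore('ab <b>ok</b> cd <i>broken');
// masked:    'ab ######### cd <i>broken'
// truncated: 'ab ######### cd '
// restored:  'ab <b>ok</b> cd '
print_r($stages);
```

Because the mask has exactly the same length as the text it hides, the offsets recorded by PREG_OFFSET_CAPTURE remain valid when the sequences are written back.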
Unfortunately, I can't add a codepad demo because recursive regular expressions do not seem to be supported by the PHP version codepad uses. I've tested this with PHP 5.3.5.
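If you are unsure whether your own PHP build handles recursion, a quick check along these lines may help. It is only a sketch: the function name supportsRegexRecursion is hypothetical, and it uses a balanced-parentheses pattern rather than the HTML pattern, simply to show what (?R) does (it re-enters the whole pattern at that point).

```php
<?php
// Hypothetical helper: returns true if this PHP build's PCRE handles (?R).
function supportsRegexRecursion() {
    // Matches one balanced group of parentheses; (?R) re-enters the whole pattern.
    $pattern = '/\((?:[^()]|(?R))*\)/';
    // '@' silences the compile warning a build without recursion support would emit.
    $result = @preg_match($pattern, 'x (a(b)c) y', $m);
    return $result === 1 && $m[0] === '(a(b)c)';
}

var_dump(supportsRegexRecursion());
```

The tag-matching pattern relies on the same mechanism: a tag's content may itself contain a complete open/close pair, which (?R) matches recursively.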
PHP
    function StripUnclosedTags($input) {
        // Close <br> tags so they count as self-closing elements
        $buffer = str_ireplace("<br>", "<br/>", $input);

        // Find all matching open/close HTML tags (using recursion via (?R))
        $pattern = "/<([\w]+)([^>]*?) (([\s]*\/>)| (>((([^<]*?|<\!\-\-.*?\-\->)| (?R))*)<\/\\1[\s]*>))/ixsm";
        preg_match_all($pattern, $buffer, $matches, PREG_OFFSET_CAPTURE);

        // Mask matching open/close tag sequences in the buffer
        foreach ($matches[0] as $match) {
            $ofs = $match[1];
            for ($i = 0; $i < strlen($match[0]); $i++, $ofs++) {
                $buffer[$ofs] = "#";
            }
        }

        // Remove unclosed tags: everything from the first remaining "<" to the end.
        // The "s" modifier lets "." match newlines, so multi-line input is cut as well.
        $buffer = preg_replace("/<.*$/s", "", $buffer);

        // Put back content of matching open/close tag sequences to the buffer
        foreach ($matches[0] as $match) {
            $ofs = $match[1];
            for ($i = 0; $i < strlen($match[0]) && $ofs < strlen($buffer); $i++, $ofs++) {
                $buffer[$ofs] = $match[0][$i];
            }
        }

        return $buffer;
    }

    $str = 'commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate '
          .'velit esse<br> quam nihil molestiae consequatur, vel illum qui dolorem eum '
          .'fugiat quo voluptas nulla pariatur? '
          .'<a href="test">test<p></p></a><span>test<p></p>bla';

    var_dump(StripUnclosedTags($str));
Output
string 'commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea
voluptate velit esse<br/> quam nihil molestiae consequatur,
vel illum qui dolorem eum fugiat quo voluptas nulla
pariatur? <a href="test">test<p></p></a>' (length=226)