Regex to match text inside HTML tags

I'm trying to write a regex that will remove HTML tags around a placeholder text, so that this:

<p>
    Blah</p>
<p>
    {{{body}}}</p>
<p>
    Blah</p>

Becomes this:

<p>
    Blah</p>
{{{body}}}
<p>
    Blah</p>

My current regex is /<.+>.*\{\{\{body\}\}\}<\/.+>/msU. However, it will also remove the contents of the tag preceding the placeholder, resulting in:

{{{body}}}
<p>
    Blah</p>

I can't assume the users will always place the placeholder inside <p>, so I would like it to remove any pair of tags immediately around the placeholder. I would appreciate some help with correcting my regex.

[EDIT]

I think it's important to note that the input may or may not be processed by CKEditor. It adds newlines and tabs after the opening tags, so the regex needs the /sm (dotall + multiline) modifiers.

Try this:

<[^>]+>\s*\{{3}body\}{3}\s*<\/[^>]+>

See it here in action: http://regexr.com?30s4o

Here's the breakdown:

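  • <[^>]+> matches a single HTML tag: a <, one or more characters that are not >, then the closing >. It cannot run past the end of that tag.
  • \s* matches any run of whitespace, including the newlines and tabs CKEditor adds. Since the pattern uses neither . nor ^/$, the s and m modifiers are not needed.
  • \{{3} matches a { exactly 3 times
  • body matches the string literally
  • \}{3} matches a } exactly 3 times
  • \s* again matches any whitespace
  • <\/[^>]+> matches a closing HTML tag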
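
As a minimal sketch of how the pattern might be applied in PHP, assuming preg_replace is what you are using: the helper name unwrapPlaceholder, the ~ delimiter, and the capture group around the placeholder are my own additions for illustration, not something from the question.

```php
<?php
// Hypothetical helper: strips the one pair of tags wrapped directly
// around {{{body}}}. The name and the "~" delimiter are illustrative.
function unwrapPlaceholder(string $html): string
{
    // Same pattern as above, with a capture group around the placeholder
    // so the replacement can keep it and drop the surrounding tags.
    $pattern = '~<[^>]+>\s*(\{{3}body\}{3})\s*<\/[^>]+>~';

    $result = preg_replace($pattern, '$1', $html);

    // preg_replace() returns null on error; fall back to the input.
    return $result ?? $html;
}

$input = "<p>\n    Blah</p>\n<p>\n    {{{body}}}</p>\n<p>\n    Blah</p>";
echo unwrapPlaceholder($input);
```

With the sample input from the question, only the <p> pair around the placeholder should be removed and the first paragraph is left alone. Note that the pattern does not check that the opening and closing tags have the same name.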

Doesn't PHP's strip_tags work for your case?

http://php.net/manual/en/function.strip-tags.php

<?php
$text = '<p>Test paragraph.</p><!-- Comment --> <a href="#fragment">Other text</a>';
echo strip_tags($text);
echo "\n";

// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>