Extract text outside of HTML tags [closed]

I am trying to use preg_match() to extract text that is not contained in tags such as <p> or <img>. The text is retrieved from a database, and I am working in PHP.

This should be extracted <p>I do not want this</p> This should be extracted <a>This may appear after other tags and I do not want this</a>

I have tried (.*)(<p>|<a>|<\/p>|<\/a>)(.*), but the greedy first group captures everything up to the last tag, so the earlier tags end up in the match together with the text outside the tags.

I have searched Stack Overflow and found the question "Match text outside of html tags", but the regex given there produced a pattern error when I pasted it into regex101.com.

I would appreciate any help with this, thanks.

You can use PHP's DOMDocument and DOMXPath to get the values that you want. The trick is to wrap the HTML from your database in a container element (a <div>, for example), load it into a DOMDocument, and then use DOMXPath to select the children of that <div> that are pure text nodes, using the text() node test:

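$html = 'This should be extracted <p>I do not want this</p> This should also be extracted <a>This may appear after other tags and I do not want this</a>';
$doc = new DOMDocument();
// Wrap the fragment in a <div>; the flags stop libxml from adding a doctype
// or implied <html>/<body> elements, so the <div> becomes the root element.
$doc->loadHTML("<div>$html</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($doc);
$texts = array();
// text() selects only the text nodes that are direct children of the <div>,
// which skips anything inside <p>, <a>, or other tags.
foreach ($xpath->query('/div/text()') as $text) {
    $texts[] = $text->nodeValue;
}
print_r($texts);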

Output:

Array
(
    [0] => This should be extracted 
    [1] =>  This should also be extracted 
)

Demo on 3v4l.org
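
The extracted values keep the whitespace that surrounded the tags, and a fragment with only whitespace between two tags would yield an empty-looking entry. As a minimal sketch of one way to tidy that up, the same approach can be wrapped in a function that trims each value and drops empty ones. The name extractTopLevelText is hypothetical and not part of the answer above, and the libxml_use_internal_errors() calls are an added assumption that the stored HTML may contain markup the parser warns about:

```php
<?php
// Hypothetical helper (illustration only): returns the trimmed, non-empty
// pieces of text that sit outside any tag in an HTML fragment.
function extractTopLevelText(string $html): array
{
    // Assumption: the fragment may be sloppy, so collect parser warnings
    // internally instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML("<div>$html</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $texts = [];
    // Only direct text children of the wrapper <div> are visited.
    foreach ($xpath->query('/div/text()') as $node) {
        $value = trim($node->nodeValue);
        if ($value !== '') {
            $texts[] = $value;
        }
    }
    return $texts;
}

$html = 'This should be extracted <p>I do not want this</p> This should also be extracted <a>I do not want this either</a>';
print_r(extractTopLevelText($html));
```

Because only direct children of the wrapper are selected, text nested inside any element is ignored, which matches the behaviour asked for in the question.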