I am trying to use preg_match() to extract text that is not contained in tags like <p> or <img>. This text is retrieved from a database and I am working in PHP.
This should be extracted <p>I do not want this</p> This should be extracted <a>This may appear after other tags and I do not want this</a>
I have tried (.*)(<p>|<a>|<\/p>|<\/a>)(.*), but this captures everything up to the last tag, so the earlier tags are captured together with the text outside of them.
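For reference, the behaviour described above can be reproduced with a minimal sketch like the following. It assumes `~` as the pattern delimiter (so the slashes need no escaping) and uses the sample string from this question; the variable names are illustrative only. Because the first `(.*)` is greedy, it consumes as much as possible and only backtracks far enough to match the last tag in the string.

```php
<?php
// Sample input from the question.
$html = 'This should be extracted <p>I do not want this</p> This should be extracted '
      . '<a>This may appear after other tags and I do not want this</a>';

// Same pattern as in the question, written with "~" delimiters.
preg_match('~(.*)(<p>|<a>|</p>|</a>)(.*)~', $html, $m);

// Group 2 is the LAST tag in the string, because the greedy (.*) in group 1
// swallowed everything before it, including the earlier <p>...</p> element.
var_dump($m[2]);                           // the closing </a> tag
var_dump(strpos($m[1], '<p>') !== false);  // true: group 1 still contains <p>
var_dump($m[3]);                           // empty: nothing follows the last tag
```

This is why a single greedy match cannot separate the individual text runs here.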
I have searched Stack Overflow and found "Match text outside of html tags", but the regex provided there gives a pattern error when I paste it into regex101.com.
Would appreciate any help on this, thanks.
You can use PHP's DOMDocument and DOMXPath to get the values that you want. The trick is to wrap the HTML from your database in a container tag (for example, <div>). You can then load it into a DOMDocument and use DOMXPath to search for children of the <div> tag which are purely text, using the text() node test:
$html = 'This should be extracted <p>I do not want this</p> This should also be extracted <a>This may appear after other tags and I do not want this</a>';
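$doc = new DOMDocument();
// NOIMPLIED and NODEFDTD stop libxml from adding <html>/<body> wrappers and a doctype,
// so the wrapping <div> becomes the document root.
$doc->loadHTML("<div>$html</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($doc);
$texts = array();
// Select only the text nodes that are direct children of the wrapper <div>.
foreach ($xpath->query('/div/text()') as $text) {
    $texts[] = $text->nodeValue;
}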
print_r($texts);
Output:
Array
(
    [0] => This should be extracted
    [1] => This should also be extracted
)
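Note that the text nodes keep the whitespace that surrounded the tags. If that matters, the same approach can be wrapped in a small helper that trims each value and skips whitespace-only nodes. This is only a sketch: the function name textOutsideTags is hypothetical, and the use of libxml_use_internal_errors() to silence parser warnings on imperfect markup is an assumption about what database content may look like.

```php
<?php
// Hypothetical helper (name is illustrative, not from the answer above):
// returns the trimmed top-level text that sits outside of any tag.
function textOutsideTags(string $html): array
{
    $doc = new DOMDocument();

    // Assumption: stored HTML may be imperfect, so collect libxml warnings
    // internally instead of emitting them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML("<div>$html</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $texts = [];
    // Only text nodes that are direct children of the wrapper <div>.
    foreach ($xpath->query('/div/text()') as $node) {
        $value = trim($node->nodeValue);
        if ($value !== '') {
            $texts[] = $value; // skip whitespace-only nodes between tags
        }
    }
    return $texts;
}

$html = 'This should be extracted <p>I do not want this</p> This should also be extracted '
      . '<a>This may appear after other tags and I do not want this</a>';
print_r(textOutsideTags($html));
```

For the sample string this yields the same two entries as above, without the leading and trailing spaces.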