I get this HTML from a response, and I need to remove the extra text.
The string has the following content:
<?php
$str = <<<HTML
AAAA <span>span txt</span>
<div class='unique_div' id='xrz' data-id='1'>
div text
<span>span text</span>
<p class='unique_p'>
<span>p span text</span>
<p>p p text</p>
</p>
div text
</div>
BBBB <span>span txt</span>
HTML;
How can I replace the div with the p that is inside it?
I need to write a regular expression to get the following result:
<?php
$str = <<<HTML
AAAA <span>span txt</span>
<p class='unique_p'>
<span>p span text</span>
<p>p p text</p>
</p>
BBBB <span>span txt</span>
HTML;
There is only one div and one p with these attributes.
Since you're looking at what appears to be HTML, and given that your requirements entail some form of modification to the Document Object Model (DOM), I would suggest using a DOM parser like DOMDocument.
If I understood your question correctly, you're looking to replace the <div> node, which appears to have an id attribute of xrz, with the p node that has a class attribute of unique_p and is a child of the div.
Getting the div is easy, because it has an id, and ids are unique. So we can use a method like DOMDocument::getElementById to get that div.
Getting the p is a little trickier, since we want to make sure it's both a child of the div and has the specified class. So we'll rely on an XPath query for that, using DOMXPath.
Then we replace the div with its captured child p by using DOMNode::replaceChild.
Here's a simple example.
$str = <<<HTML
AAAA <span>span txt</span>
<div class='unique_div' id='xrz' data-id='1'>
div text
<span>span text</span>
<p class='unique_p'>
<span>p span text</span>
<p>p p text</p>
</p>
div text
</div>
BBBB <span>span txt</span>
HTML;
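// Collect parser warnings internally instead of emitting them; the markup is not valid HTML.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
// Parse as a fragment: no implied <html>/<body> wrappers and no default doctype.
$dom->loadHTML($str, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// Find the p with class "unique_p" that is a direct child of a div.
$xpath = new DOMXPath($dom);
$children = $xpath->query('//div/p[@class="unique_p"]');
$p = $children->item(0);
// Look up the div by its unique id.
$div = $dom->getElementById('xrz');
// Swap the div for its child p, but only if both nodes were found.
if ($p !== null && $div !== null) {
    $div->parentNode->replaceChild($p, $div);
}
echo $dom->saveHTML();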
The output should look something like this.
<p>AAAA <span>span txt</span> <p class="unique_p"> <span>p span text</span> </p><p> BBBB <span>span txt</span></p></p>
In case you're wondering why the output may appear slightly different than what you might expect, it's important to note that your initial HTML, provided in your question, is actually malformed.
See section 9.3.1 of the HTML 4.01 specification:
"The P element represents a paragraph. It cannot contain block-level elements (including P itself)."
So each time a DOM parser finds an opening p tag inside of another p tag, it will just implicitly close the previous one first.
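To see the implicit closing for yourself, here is a minimal sketch. The sample markup and the variable names are made up for illustration and are not taken from the question. It opens a p inside another p, then asks the parser which element each p ended up attached to.

```php
<?php
// Hypothetical sample markup: a <p> opened inside another <p>, both inside a <div>.
$html = "<div id='box'><p class='outer'>outer text<p>inner text</p></div>";

// Keep any parser warnings about the invalid nesting out of the output.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$paragraphs = $xpath->query('//p');

$parents = [];
foreach ($paragraphs as $p) {
    // Record the name of the element the parser attached this <p> to.
    $parents[] = $p->parentNode->nodeName;
}

// If the parser closed the outer <p> implicitly, both paragraphs
// are children of the <div> rather than one being inside the other.
echo implode(', ', $parents), PHP_EOL;
```

If both entries come back as div, the second p was not nested in the first. That is the same effect that moves the inner p out of unique_p in the output above.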