I am trying to extract the text from the HTML structure below using XPath. The XPath expression I am using is
'//div[@class="descr_id"]/descendant-or-self::*/text()'
But the array I get back does not preserve the order of the text: it gives me all the descendant text first and then the element's own text. I want to get all the text in this kind of HTML structure in document order, like "This text 1 This text 2 This text 3 ...".
<div class="descr_id">
This text 1
<a href="www.example.com">This text 2</a>
This text 3
<a href="www.example2.com">This text 4</a>
This text main 5
<ul>
<li>
This text 6</li>
<li>
This text 7</li>
</ul>
</div>
Try http://sandbox.onlinephpfunctions.com/code/99f45357f08f3833773ba7ada0f5fbf6a4b7180c which does
$html = <<<EOD
<div class="descr_id">
This text 1
<a href="www.example.com">This text 2</a>
This text 3
<a href="www.example2.com">This text 4</a>
This text main 5
<ul>
<li>
This text 6</li>
<li>
This text 7</li>
</ul>
</div>
EOD;
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$textNodes = $xpath->query('//div[@class="descr_id"]//text()[normalize-space()]');
foreach ($textNodes as $text)
{
    echo $text->nodeValue . "\n";
}
and outputs the descendant text nodes in document order. You might want to trim the values, however, if you want e.g. "This text 1" without the leading and/or trailing white space.
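The trimming mentioned above could look roughly like the following. This is only a sketch built on the same markup and query; the variable names `$parts` and `$joined` are made up for illustration, and joining with a single space is an assumption about the output format you want.

```php
<?php
// Sample markup from the question.
$html = <<<EOD
<div class="descr_id">
This text 1
<a href="www.example.com">This text 2</a>
This text 3
<a href="www.example2.com">This text 4</a>
This text main 5
<ul>
<li>
This text 6</li>
<li>
This text 7</li>
</ul>
</div>
EOD;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Collect every non-blank descendant text node and strip the
// surrounding white space from each value.
$parts = [];
foreach ($xpath->query('//div[@class="descr_id"]//text()[normalize-space()]') as $text) {
    $parts[] = trim($text->nodeValue);
}

// Join the trimmed pieces with a single space (assumed separator).
$joined = implode(' ', $parts);
echo $joined . "\n";
```

If you would rather keep the pieces separate, just use the `$parts` array and skip the `implode()` call.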
You haven't explained clearly what output you are actually getting.
Technically XPath 1.0 is defined to return a node-set - that is, a set of nodes in no particular order. In practice, all XPath 1.0 processors that I have come across return a sequence of nodes in document order (probably because this is what XSLT 1.0 requires).
You've tagged the question XPath 2.0, which is defined to return a sequence of nodes in document order for this expression. But since you are using PHP, I strongly suspect you are using XPath 1.0 and the tag is a red herring.
If your XPath processor isn't returning results in document order, then it may be worth rewriting the expression to
//div[@class="descr_id"]/descendant::text()
to see if that makes any difference. It's shorter anyway.
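One way to check this, assuming you are on PHP's DOMXPath as guessed above, is to run both expressions against the same document and compare what comes back. The sketch below uses a shortened version of your markup; the helper `textValues()` is a hypothetical name of my own, and the `[normalize-space()]` predicate is added only to skip white-space-only nodes.

```php
<?php
// Shortened sample markup, loosely based on the question.
$html = '<div class="descr_id">This text 1 <a href="www.example.com">This text 2</a> '
      . 'This text 3 <ul><li>This text 6</li></ul></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Hypothetical helper: evaluate an expression and return the trimmed
// node values in the order the processor delivers them.
function textValues(DOMXPath $xpath, string $expr): array
{
    $values = [];
    foreach ($xpath->query($expr) as $node) {
        $values[] = trim($node->nodeValue);
    }
    return $values;
}

$original  = textValues($xpath, '//div[@class="descr_id"]/descendant-or-self::*/text()[normalize-space()]');
$rewritten = textValues($xpath, '//div[@class="descr_id"]/descendant::text()[normalize-space()]');

// Show whether the two expressions deliver the same sequence.
var_dump($original === $rewritten);
print_r($rewritten);
```

If the comparison reports a difference on your real input, that would be useful detail to add to the question.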