XML or PHP or someone is eating my characters

I'm one bad question away from being banned from the site, but this one is worth it to me. I've spent hours on end trying to find the issue and debug it, and I simply can't. I've searched high and low for answers and I'm clueless.

I'm using the PHP DOMDocument parser, and I'm importing a Wikipedia XML template. For hours I was using substr() and my results were coming back off by 14 or so characters. To cut a long story short, it turns out the discrepancy is coming from the &gt; and the &lt; that I have in some of the elements.

I've tried everything I can think of: everything is UTF-8, I've tried type casting to strings, my headers are not being sent as XML (it is normal HTML output), I've tried both mb_substr() and substr(), and I've tried:

str_replace('<', '&lt;', $string);

It's like no matter what I do I can't stop those characters from disappearing into the abyss, but I don't know where they're going

Hope someone can shed some light on it

Edit: To clear it up a bit, I've downloaded an XML file straight from Wikipedia. One line in it, for example, is:

&lt;small&gt;(1, 2, 3, 4, 33, 34, 64, 65, 66)&lt;big&gt;&lt;br/&gt;

Now if I use:

dd(mb_substr($str, 1, 2))

I'd expect "lt", but in reality what I'm getting is "sm". It's treating the "&lt;" as a single character, but if I open the file in Sublime, Notepad++, EmEditor, etc., it is 4 characters.

I don't understand how PHP is treating the string. Even if I use str_replace(), it refuses to become an HTML entity.

Edit2:

If you go to this address:

https://en.wikipedia.org/wiki/Special:Export

and type "London" in the box, it will download an XML file

In a class or wherever, use this code:

    $this->file = new \DOMDocument;
    $this->file->load('C:\path-to-your-xml-file.xml');
    $pages = $this->file->getElementsByTagName('page');

    foreach($pages as $page)
    {
        die(mb_substr($page->getElementsByTagName('text')->item(0)->nodeValue, 343, 1));
    }

Now the 344th character should be an ampersand, but instead it gives the entire "<"

As I understand it, this is about how XML parsers work. By the XML standard, three characters must be encoded in the document and decoded back when it is read:

< to &lt;

> to &gt;

& to &amp;

Any (and all) parsers must then do the following.

Let's say you need to set a text node (or attribute value) to the string < my text & some more >, and assume it is a text node in the XML tag <TextValue>.

According to the XML standard, such text can be represented in an XML document in two forms:

<TextValue>&lt; my text &amp; some more &gt;</TextValue>

<TextValue><![CDATA[< my text & some more >]]></TextValue>

Any parser that returns the value of the text node from either representation must return the actual string value, not the XML-encoded representation.
Because the actual string is < my text & some more >, the parser performs XML decoding and returns that actual string value.
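A minimal sketch of this with PHP's DOMDocument, the same parser used in the question. The <TextValue> element and the sample string come from the example above; the variable names are only for illustration.

```php
<?php
// Two XML representations of the same text: "< my text & some more >"
$escaped = '<TextValue>&lt; my text &amp; some more &gt;</TextValue>';
$cdata   = '<TextValue><![CDATA[< my text & some more >]]></TextValue>';

$docA = new DOMDocument();
$docA->loadXML($escaped);

$docB = new DOMDocument();
$docB->loadXML($cdata);

// nodeValue holds the decoded string, not the escaped source text
$valueA = $docA->documentElement->nodeValue;
$valueB = $docB->documentElement->nodeValue;

var_dump($valueA === $valueB);       // both forms yield the same string
var_dump(mb_substr($valueA, 0, 1));  // first character is "<", not "&"
```

Both documents should give back the same decoded string, which is why character offsets counted in a text editor do not line up with offsets in nodeValue.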

It is not related to the actual parser implementation (PHP, Java, DOM in browsers, or anything else). It is a standard.

PS. If you have any XML tool with XPath capability at hand, you can play with that example and see exactly this behavior.

UPD: So your XML representation is: &lt;small&gt;(1, 2, 3, 4, 33, 34, 64, 65, 66)&lt;big&gt;&lt;br/&gt;

Then the actual string is <small>(1, 2, 3, 4, 33, 34, 64, 65, 66)<big><br/>, and of course the string length is 49, not 67, and mb_substr($str, 1, 2) returns exactly sm from the actual string value and not lt from the XML-encoded representation.
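To check those numbers, and to get the escaped form back for HTML output, something like the following should work. Here $encoded stands in for the raw line from the file, and htmlspecialchars_decode() is used only to mimic what the XML parser does for these three entities; it is not what DOMDocument calls internally.

```php
<?php
// The line as it appears in the XML file (escaped form)
$encoded = '&lt;small&gt;(1, 2, 3, 4, 33, 34, 64, 65, 66)&lt;big&gt;&lt;br/&gt;';

// Roughly what the parser hands back as nodeValue (decoded form)
$decoded = htmlspecialchars_decode($encoded, ENT_NOQUOTES);

echo mb_strlen($encoded), "\n";        // 67
echo mb_strlen($decoded), "\n";        // 49
echo mb_substr($decoded, 1, 2), "\n";  // sm

// Escape again only at the point of printing into an HTML page
$forHtml = htmlspecialchars($decoded, ENT_NOQUOTES, 'UTF-8');
var_dump($forHtml === $encoded);       // bool(true)
```

Note that htmlspecialchars() and str_replace() return a new string rather than modifying their argument, so the result has to be assigned or echoed to have any effect.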