PHP: DOMDocument: removing unwanted text from nested elements

I have the following XML document:

<?xml version="1.0" encoding="UTF-8"?>
<header level="2">My Header</header>
<ul>
    <li>Bulleted style text
        <ul>
            <li>
                <paragraph>1.Sub Bulleted style text</paragraph>
            </li>
        </ul>
    </li>
</ul>
<ul>
    <li>Bulleted style text <strong>bold</strong>
        <ul>
            <li>
                <paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>
            </li>
        </ul>
    </li>
</ul>

I need to remove the numbers preceding the sub-bulleted text ("1." and "2." in the given example).

This is the code I have so far:

<?php
class MyDocumentImporter
{
    const AWKWARD_BULLET_REGEX = '/(^[\s]?[\d]+[\.]{1})/i';

    protected $xml_string = '<some_tag><header level="2">My Header</header><ul><li>Bulleted style text<ul><li><paragraph>1.Sub Bulleted style text</paragraph></li></ul></li></ul><ul><li>Bulleted style text <strong>bold</strong><ul><li><paragraph>2.Sub Bulleted <strong>bold</strong></paragraph></li></ul></li></ul></some_tag>';

    protected $dom;

    public function processListsText( $loop = null ){

        $this->dom = new DomDocument('1.0', 'UTF-8');

        $this->dom->loadXML($this->xml_string);

        if(!$loop){
            //get all the li tags
            $li_set = $this->dom->getElementsByTagName('li');
        }
        else{
            $li_set = $loop;
        }

        foreach($li_set as $li){

            //check for child nodes
            if(! $li->hasChildNodes() ){
                continue;
            }

            foreach($li->childNodes as $child){
                if( $child->hasChildNodes() ){
                    //this li has children, maybe a <strong> tag
                    $this->processListsText( $child->childNodes );
                }
                if( ! ( $child instanceof DOMElement ) ){
                    continue;
                }
                if( ( $child->localName != 'paragraph') ||  ( $child instanceof DOMText )){
                    continue;
                }
                if( preg_match(self::AWKWARD_BULLET_REGEX, $child->textContent) == 0 ){
                    continue;
                }

                $clean_content = preg_replace(self::AWKWARD_BULLET_REGEX, '', $child->textContent);

                //set node to empty
                $child->nodeValue = '';

                //add updated content to node
                $child->appendChild($child->ownerDocument->createTextNode($clean_content));

                //$xml_output = $child->parentNode->ownerDocument->saveXML($child);
                //var_dump($xml_output);

            }
        }
    }
}

$importer = new MyDocumentImporter();
$importer->processListsText();

The issue I can see is that $child->textContent returns only the plain text content of the node, so writing it back drops any child tags. So:

<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>

becomes

<paragraph>Sub Bulleted bold</paragraph>

The <strong> tag is no more.

I'm a little stumped... Can anyone see a way to strip the unwanted characters, and retain the "inner child" <strong> tag?

The tag may not always be <strong>, it could also be a hyperlink <a href="#">, or <emphasize>.

Assuming your XML actually parses, you could use XPath to make your queries a lot easier:

$xp = new DOMXPath($this->dom);

foreach ($xp->query('//li/paragraph') as $para) {
        // The dot is escaped so it matches a literal "." rather than any character
        $para->firstChild->nodeValue = preg_replace('/^\s*\d+\.\s*/', '', $para->firstChild->nodeValue);
}

This performs the text replacement on the first child node only (the leading text node in your example), instead of on the whole tag contents, so the nested tags are left untouched.
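
One caveat: firstChild is not guaranteed to be a text node (a paragraph could start with a <strong> tag, for instance). Below is a minimal sketch of a guarded version. The helper name stripLeadingNumber and the sample XML are assumptions made for illustration, not part of the original code.

```php
<?php
// Hypothetical helper: strips a leading "1." style prefix from the first
// child of a <paragraph>, but only when that child is a text node.
function stripLeadingNumber(DOMElement $para): void
{
    $first = $para->firstChild;
    if (!($first instanceof DOMText)) {
        return; // first child is an element (or missing): leave it alone
    }
    $first->nodeValue = preg_replace('/^\s*\d+\.\s*/', '', $first->nodeValue);
}

// Sample input, assumed for illustration.
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadXML('<ul><li><paragraph>2.Sub Bulleted <strong>bold</strong></paragraph></li></ul>');

$xp = new DOMXPath($dom);
foreach ($xp->query('//li/paragraph') as $para) {
    stripLeadingNumber($para);
}

// Serialize only the root element to skip the XML declaration.
$result = $dom->saveXML($dom->documentElement);
echo $result, PHP_EOL;
```

With this input, the <strong> element should survive because only the leading text node is rewritten.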

You are resetting the paragraph's whole content, but what you want is to alter only the first text node (keep in mind that text nodes are nodes too). You might want to look for the XPath //li/paragraph/text()[position()=1], and work on or replace that DOMText node instead of the whole paragraph content.

$d = new DOMDocument();
$d->loadXML($xml);
$p = new DOMXPath($d);
foreach($p->query('//li/paragraph/text()[position()=1]') as $text){
        // Swap the first text node for a cleaned copy (self:: assumes this runs inside the class)
        $text->parentNode->replaceChild(new DOMText(preg_replace(self::AWKWARD_BULLET_REGEX, '', $text->textContent)), $text);
}
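
For reference, here is a self-contained sketch of the same idea that can be run outside the class. The function name removeBulletNumbers is hypothetical, the regex is copied from the question's AWKWARD_BULLET_REGEX constant, and the input is the question's sample string without the header. It assigns nodeValue on the text node rather than calling replaceChild, which should be equivalent here.

```php
<?php
// Regex taken from the question's AWKWARD_BULLET_REGEX constant.
const AWKWARD_BULLET_REGEX = '/(^[\s]?[\d]+[\.]{1})/i';

// Hypothetical helper: returns the XML with the leading number removed from
// the first text node of every <paragraph> that sits directly inside an <li>.
function removeBulletNumbers(string $xml): string
{
    $dom = new DOMDocument('1.0', 'UTF-8');
    $dom->loadXML($xml);

    $xpath = new DOMXPath($dom);
    // text()[1] is shorthand for text()[position()=1]
    foreach ($xpath->query('//li/paragraph/text()[1]') as $text) {
        $text->nodeValue = preg_replace(AWKWARD_BULLET_REGEX, '', $text->nodeValue);
    }

    // Serialize only the root element to skip the XML declaration.
    return $dom->saveXML($dom->documentElement);
}

$xml = '<some_tag><ul><li>Bulleted style text<ul><li><paragraph>1.Sub Bulleted style text</paragraph></li></ul></li></ul>'
     . '<ul><li>Bulleted style text <strong>bold</strong><ul><li><paragraph>2.Sub Bulleted <strong>bold</strong></paragraph></li></ul></li></ul></some_tag>';

$cleaned = removeBulletNumbers($xml);
echo $cleaned, PHP_EOL;
```

Note that text()[1] selects the first text-node child of each paragraph, which is not necessarily the paragraph's first child. If a paragraph starts with an element, the text that follows that element is what gets checked against the regex.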