PHP and XML: How to remove all whitespace outside "terminal elements"

First let's define "terminal element" (for the particular purpose of this question).

By "terminal element" I mean the elements that contain no other elements inside.

Element reference: http://www.w3schools.com/xml/xml_elements.asp

How can I remove from an XML document/node all whitespace (line feeds, carriage returns, tabs and spaces) that is outside "terminal elements" with PHP?

Rules: Only PHP native XML parsers (no regex).

All whitespace outside "terminal elements" (leaf element nodes) sits in text-nodes, since all text in a DOM tree lives in text-nodes. So if you select every text-node whose parent is not a terminal element, you can remove all whitespace characters from those nodes. That is already the whole answer.

Let's start lightly by just removing whitespace from one text-node in an XML Document.

As PHP uses UTF-8 as character encoding for the XML parsers (I use DOMDocument in this example), preg_replace is handy here as it knows both UTF-8 and what whitespace characters are:

/** @var DOMText $text */
$text->nodeValue = preg_replace('~\s+~u', '', $text->textContent);

This removes all whitespace from a text-node. Here is a demonstration of that:

$doc = new DOMDocument();
$doc->loadXML('<root> Very Simple Demo </root>');

$text = $doc->documentElement->childNodes->item(0);

/** @var DOMText $text */
$text->nodeValue = preg_replace('~\s+~u', '', $text->textContent);

$doc->save('php://output');

Output:

<?xml version="1.0"?>
<root>VerySimpleDemo</root>

As you can see, the space characters are removed from the one and only text-node in that document.

With a larger document and your "terminal elements", this is naturally more interesting, but it works pretty much the same. The only difference is that you need to get all text-nodes that are not children of leaf element nodes. This is best done with an XPath query:

//*[*]/text()

This reads: all text-nodes that are children of elements that contain other elements. Let's use the following XML (file content.xml) as an example:

<?xml version="1.0"?>
<content>
    <parent>
        <child id="1">
            <title>child 1</title>

            <child id="1">
                <title>
                    child 1.1 with whitespace
                </title>
            </child>
        </child>
    </parent>
</content>

It contains both leaf elements and elements that have child elements. It also shows the whitespace well, since whitespace is used here to indent the elements.

After loading it:

$file = __DIR__ . '/content.xml';

$doc = new DOMDocument();
$doc->load($file);

A DOMXPath is necessary to execute the xpath-query:

$xp    = new DOMXPath($doc);
$texts = $xp->query('//*[*]/text()');
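To see which nodes that query actually picks up, here is a minimal sketch. The inline sample document is hypothetical (smaller than content.xml), and the $matches array is only there for inspection:

```php
<?php
// Hypothetical sample: <a> has element children, <b> and <c> are leaf elements.
$doc = new DOMDocument();
$doc->loadXML('<a> <b> x </b> <c>y</c> </a>');

$xp    = new DOMXPath($doc);
$texts = $xp->query('//*[*]/text()');

// Record the parent element name and the raw content of every matched text-node.
$matches = [];
foreach ($texts as $text) {
    $matches[] = $text->parentNode->nodeName . ': ' . var_export($text->textContent, true);
}

print_r($matches);
```

Only the three whitespace text-nodes directly inside <a> are matched. The text inside the leaf elements <b> and <c> is not part of the result.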

What's left is to iterate over all those text-nodes and apply the whitespace removal as above:

foreach ($texts as $text) {
    /** @var DOMText $text */
    $text->nodeValue = preg_replace('~\s+~u', '', $text->textContent);
}

The result then is:

<?xml version="1.0"?>
<content><parent><child id="1"><title>child 1</title><child id="1"><title>
                    child 1.1 with whitespace
                </title></child></child></parent></content>

This should answer the question. But it wouldn't be XML if there weren't a little more verbosity, or a little "but...".

Note that "text()" in XPath matches all kinds of text-nodes, including CDATA sections. If a CDATA section consists of whitespace only, the code above renders an empty CDATA section ("<![CDATA[]]>") into the output. One way to deal with that is to remove the empty nodes from the document:

/** @var DOMText $text */
$text->nodeValue = preg_replace('~\s+~u', '', $text->textContent);
if (!$text->length) {
    $text->parentNode->removeChild($text);
}

This removes all emptied text-nodes from the document and keeps the document tree tidy. Hope this helps.
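Putting the pieces together, a minimal sketch could look like the following. The function name stripWhitespaceOutsideLeaves and the inline sample are my own for illustration. It works on a string instead of a file, and it takes a snapshot of the query result with iterator_to_array() before removing nodes, which is a cautious choice rather than a requirement:

```php
<?php
// Hypothetical helper: strips whitespace from every text-node (incl. CDATA)
// whose parent element has element children, then removes emptied nodes.
function stripWhitespaceOutsideLeaves(string $xml): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $xp = new DOMXPath($doc);

    // Snapshot the result list so nodes can be removed while looping.
    foreach (iterator_to_array($xp->query('//*[*]/text()')) as $text) {
        /** @var DOMText $text */
        $text->nodeValue = preg_replace('~\s+~u', '', $text->textContent);
        if ($text->length === 0) {
            $text->parentNode->removeChild($text);
        }
    }

    // Serialize only the root element, i.e. without the XML declaration.
    return $doc->saveXML($doc->documentElement);
}

// Sample with a leaf element containing whitespace and a whitespace-only CDATA section.
$xml = '<a> <b> keep  this </b> <![CDATA[  ]]> <c>y</c> </a>';
echo stripWhitespaceOutsideLeaves($xml), "\n";
// prints: <a><b> keep  this </b><c>y</c></a>
```

The whitespace inside the leaf element <b> is left alone, while the indentation-style whitespace and the whitespace-only CDATA section under <a> are gone.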

DOMDocument::normalizeDocument may do what you're looking for.

If you want to normalize individual elements, you can call DOMNode::normalize().
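Before relying on this, it is worth checking what normalization does to your document. As far as the PHP manual describes it, normalize() removes empty text-nodes and merges adjacent ones. A small sketch to try it out (both sample documents are hypothetical):

```php
<?php
// 1) A parsed document with whitespace-only text-nodes between elements.
$doc = new DOMDocument();
$doc->loadXML('<a> <b>x</b> </a>');
$doc->normalizeDocument();
$parsed = $doc->saveXML($doc->documentElement);

// 2) Adjacent text-nodes and an empty text-node, built by hand.
$doc2 = new DOMDocument();
$root = $doc2->appendChild($doc2->createElement('root'));
$root->appendChild($doc2->createTextNode('foo '));
$root->appendChild($doc2->createTextNode(''));
$root->appendChild($doc2->createTextNode(' bar'));
$root->normalize();
$remaining = $root->childNodes->length;

echo $parsed, "\n";     // whitespace-only text-nodes are not empty, so they stay
echo $remaining, "\n";  // child nodes left under <root> after normalize()
```

In this sketch the whitespace between the elements survives normalizeDocument(), so normalization on its own may not be enough for the case in the question. It can still be useful for tidying up nodes that were emptied beforehand.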