Walking the DOM in PHP to replace a list of strings found in HTML text

I would like to replace each word in a list (an array) with a link (the hrefs are in a second array) inside an HTML page.

I think I have two main options:

  1. Using regular expressions (strongly discouraged for parsing and modifying HTML).

  2. Using an HTML parser and walking the DOM, replacing each word with its link.

The second option raises two problems:

  1. I don't want to touch links that already exist in the HTML page, so for each word I find I have to know which tag it sits in.

  2. I don't want to replace words in every node of the DOM, only in nodes that have no children, i.e. only in the leaves.
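Both constraints can be expressed as a single XPath query instead of hand-written recursion. The following is a minimal sketch, assuming the DOMXPath class from PHP's DOM extension; the sample markup and the $found variable are made up for illustration.

```php
<?php
// Hypothetical sample markup: one word is already linked, one is plain text.
$html = '<html><body><div><a href="https://example.com">Google</a> and Google Inc.</div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
// text() matches only text nodes (the leaves of the tree); the predicate
// drops every text node that has an <a> element among its ancestors.
$textNodes = $xpath->query('//body//text()[not(ancestor::a)]');

$found = array();
foreach ($textNodes as $textNode) {
    $found[] = $textNode->nodeValue;
}
print_r($found);
```

Only the text outside the existing link is returned, so the word inside the anchor is never a candidate for replacement.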

Simple example:

$aURLlist = array('www.google.com','www.facebook.com');
$aWordList = array('Google', 'Facebook');
$htmlContent='<html><body><div>Google Inc. is an American multinational corporation specializing in Internet-related services and products.</div><div>Facebook is an online social networking service, whose name stems from the colloquial name for the book given to students at the start of the academic year by some university administrations in the United States to help students get to know each other.</div></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($htmlContent);
$htmlContent = walkingDom($dom, $aURLlist, $aWordList); // turn every word of $aWordList found in a text node of $dom into a link whose href is the matching URL in $aURLlist

Result:

$htmlContent = "<html><body><div><a href='www.google.com'>Google</a> Inc. is an American multinational corporation specializing in Internet-related services and products.</div><div><a href='www.facebook.com'>Facebook</a> is an online social networking service, whose name stems from the colloquial name for the book given to students at the start of the academic year by some university administrations in the United States to help students get to know each other.</div></body></html>";

I have a recursive function that walks the DOM with the DOMDocument library, but I can't work out how to insert an anchor node in place of a word found in a leaf text node.

function walkDom($dom, $node, $element, $sRel, $sTarget, $iSearchLinks, $iQuantityTopics, $level = 0, $bLink = false) {
    $indent = '';
    if ($node->nodeName == 'a') {
        $bLink = true;
    }
    for ($i = 0; $i < $level; $i++)
        $indent .= '&nbsp;&nbsp;';
    if ($node->nodeType != XML_TEXT_NODE) {
        //echo $indent . '<b>' . $node->nodeName . '</b>';
        //echo $indent . '<b>' . $node->nodeValue . '</b>';

        if ($node->nodeType == XML_ELEMENT_NODE) {
            $attributes = $node->attributes;
            foreach ($attributes as $attribute) {
                //echo ', ' . $attribute->name . '=' . $attribute->value;
            }
            //echo '<br>';
        }
    } else {
        if ($bLink || $node->nodeName == 'img' || $node->nodeName == '#cdata-section' || $node->nodeName == '#comment' || trim($node->nodeValue) == '') {
            // Skip: inside a link, or nothing to replace.
            //echo $indent;
            //echo 'NO replace: ';
            //var_dump($node->nodeValue);
            //echo '<br><br>';
        } elseif (!$bLink && $node->nodeName != 'img' && trim($node->nodeValue) != '') {
            //echo $indent;
            //echo "TEXT TO REPLACE: $element, $replace, $node->nodeValue, $iSearchLinks  <br>";
            $i = 0;
            $n = 1;
            while ($i != $iSearchLinks && $n > 0 ) {
                //echo "Create link? <br>";

                $node->nodeValue = preg_replace('/' . preg_quote($element->name, '/') . '/', '', $node->nodeValue, 1, $n);
                if ($n > 0) {
                    //echo "Creating link with $element->name <br>";
                    $link = $dom->createElement("a", $element->name);
                    $link->setAttribute("class", "nl_tag");
                    $link->setAttribute("id", "@@ID@@");
                    $link->setAttribute("hreflang", $element->type);
                    $link->setAttribute("title", $element->altname);
                    $link->setAttribute("href", $element->resource);
                    if ($sRel == "nofollow") $link->setAttribute("rel", $sRel);
                    if ($sTarget == "_blank") $link->setAttribute("target", $sTarget);
                    $node->parentNode->appendChild($link);
                    //var_dump($node->parentNode);
                    $dom->encoding = 'UTF-8';
                    $dom->saveHTML();
                    $iQuantityTopics++;
                }
                $i++;
                //saveHTML?
                //echo '<br><br>';
            }
        }
    }
    // ... (the recursive call over $node->childNodes is not shown in the post)
}

This solution doesn't work because appendChild() only adds the new child at the end of the parent's children, whereas I want to insert it where the matched word is located.

I've also tried inserting the link directly into the leaf text node with preg_replace(), but the anchor is then added as plain text inside the text node, whereas I need a real link node that replaces the word at its position in the leaf text node.

My question is: is it possible to do this with an HTML parser in PHP, or do I have to resort to regular expressions? Thanks in advance!
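One way to place the anchor at the word's position, sketched here under assumptions rather than as a drop-in fix, is to cut the text node around the word with DOMText::splitText() and then swap the middle piece for an a element with replaceChild(). The helper name linkFirstOccurrence() and the shortened sample markup are hypothetical; the match is case-sensitive and only the first occurrence in each text node is linked.

```php
<?php
// Hypothetical helper: wraps the first occurrence of $word inside $textNode
// in an <a href="$url"> element, leaving the surrounding text in place.
function linkFirstOccurrence(DOMDocument $dom, DOMText $textNode, $word, $url)
{
    $text = $textNode->nodeValue;
    $bytePos = strpos($text, $word);
    if ($bytePos === false) {
        return false;
    }
    // splitText() counts characters, not bytes, so convert the offsets.
    $charPos = preg_match_all('/./su', substr($text, 0, $bytePos));
    $wordLen = preg_match_all('/./su', $word);

    $wordNode = $textNode->splitText($charPos); // $textNode keeps the text before the word
    $wordNode->splitText($wordLen);             // $wordNode now holds only the word

    $link = $dom->createElement('a');
    $link->setAttribute('href', $url);
    $wordNode->parentNode->replaceChild($link, $wordNode);
    $link->appendChild($wordNode);              // move the word inside the new link
    return true;
}

$aURLlist = array('www.google.com', 'www.facebook.com');
$aWordList = array('Google', 'Facebook');
$html = '<html><body><div>Google Inc. is an American corporation.</div>'
      . '<div>Join Facebook today.</div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

foreach ($aWordList as $i => $word) {
    // Re-run the query for every word so freshly created links are skipped.
    foreach ($xpath->query('//body//text()[not(ancestor::a)]') as $textNode) {
        linkFirstOccurrence($dom, $textNode, $word, $aURLlist[$i]);
    }
}

$result = $dom->saveHTML($dom->documentElement);
echo $result, "\n";
```

Because the word stays a text node and is merely moved inside the new element, no markup is ever written into a text node as a string.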

@Suamere:

"I'm not sure what the PHP engine doesn't support: (?i)(?<!<[^>]*|>)(strWord)(?!<|[^<]*>)"
(?i) - Yes, although it would be easier to just put i at the end:

/(someregex)/i

(?<!<[^>]*|>)

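Here you're using a lookbehind to look for a leading tag. PCRE rejects a lookbehind that contains an unbounded repeat such as [^>]*, so I got this to work (sort of) by deleting the first <, which turns the lookbehind (?<! into a lookahead (?!.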

So here's what the final regex looked like that was as close as possible to what you're trying to do:

/(?!<[^>]*>).*(strWord).*(?!<\/[^<]*>)/i
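The reason the original pattern needs changing at all can be checked directly. The sketch below assumes nothing beyond the stock preg_match() function and feeds it the pattern quoted at the top of this answer (with (?i) written as a trailing i); the subject string is arbitrary. PCRE cannot compile a lookbehind containing an unbounded repeat such as [^>]*, so the call should fail rather than return a match count.

```php
<?php
$pattern = '/(?<!<[^>]*|>)(strWord)(?!<|[^<]*>)/i';

// @ hides the "Compilation failed" warning so only the return value is shown.
$result = @preg_match($pattern, 'some strWord here');

// preg_match() returns false when the pattern cannot be compiled.
var_dump($result);
```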

However, a much simpler approach would be something like:

$text = "...";
$words = array('him', 'her', ...);
$links = array('<a href="...">$0</a>', ...);

$regexes = array();
foreach ($words as $word) {
    // preg_quote() keeps regex metacharacters in a word from being interpreted
    $regexes[] = '/\b' . preg_quote($word, '/') . '\b/i';
}
$modified_array = preg_replace($regexes, $links, $text);

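It's important that $words and $links have the exact same number of elements; no error is thrown if they don't, but any pattern without a corresponding replacement is replaced by an empty string, which silently deletes the word.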
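A short sketch of what a length mismatch does in practice, using made-up words and a placeholder href: the second pattern has no replacement, so its match is removed from the text.

```php
<?php
$patterns = array('/\bhim\b/i', '/\bher\b/i');
// Only one replacement on purpose; "#him" is a placeholder href.
$replacements = array('<a href="#him">$0</a>');

$result = preg_replace($patterns, $replacements, 'Tell him and her.');
echo $result, "\n"; // Tell <a href="#him">him</a> and .
```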

$0 references the entire match of the corresponding regex; in this case, only the specific word you're looking for itself.
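As an illustration of why $0 is handy with a case-insensitive pattern, the sketch below (sample text invented for the purpose) shows that each link keeps the casing of the word as it appeared in the text.

```php
<?php
$result = preg_replace(
    '/\bgoogle\b/i',
    '<a href="www.google.com">$0</a>', // $0 is whatever the pattern matched
    'GOOGLE and google'
);
echo $result, "\n";
// <a href="www.google.com">GOOGLE</a> and <a href="www.google.com">google</a>
```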

Also, preg_replace() replaces every match by default, so no /g-style modifier is needed on each regex (PHP's PCRE functions don't have one). :-)
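If only some of the matches should be replaced, the optional fourth and fifth parameters of preg_replace() ($limit and $count) control that. A small sketch with an invented sentence:

```php
<?php
$text = 'her book, her pen, her bag';

// Default: every match is replaced.
$all = preg_replace('/\bher\b/i', 'his', $text);

// $limit = 1 replaces only the first match; $count receives the number done.
$firstOnly = preg_replace('/\bher\b/i', 'his', $text, 1, $count);

echo $all, "\n";       // his book, his pen, his bag
echo $firstOnly, "\n"; // his book, her pen, her bag
echo $count, "\n";     // 1
```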