Can the meta tag content value returned by an XPath query be trusted?

I have a PHP function that extracts meta tags from a URL using XPath queries.

e.g. $xpath->query('/html/head/meta[@name="my_target"]/@content')

My question:

Can I trust the returned value, or should I verify it?

=> Is there any possible XSS exploit?

=> Should the HTML content be purified before loading it into the DOMDocument?

Another way to put it, with some code:

    $doc = new DOMDocument;
    $doc->preserveWhiteSpace = false;
    libxml_use_internal_errors(true);

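    // Option 1: load the remote page directly. Can the result be trusted?
    $doc->loadHTMLFile($url);

    // Option 2: fetch the page, purify it, then load it. Is this better practice?
    $html  = file_get_contents($url);
    $trust = $purifier->purify($html);
    $doc->loadHTML($trust);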

    libxml_use_internal_errors(false);
    $xpath = new DOMXPath($doc);

    $trustable = $xpath->query('/html/head/meta[@name="my_target"]/@content')->item(0); // can this be trusted?

===== UPDATE =========================================

Yes, never trust external sources.

Use $be_sure = htmlspecialchars($trustable->textContent) or strip_tags($trustable->textContent) before outputting the value.
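
As a minimal sketch of the point above, the snippet below extracts a meta content value and escapes it before output. The inline $html string is a hypothetical stand-in for the remote page, and the variable names $raw and $safe are made up for illustration. It is meant to show that the parser decodes entities in the attribute, so the extracted value can contain real markup even when the page source looked escaped.

```php
<?php
// Hypothetical page source standing in for the document fetched from $url.
$html = '<html><head>'
      . '<meta name="my_target" content="&lt;script&gt;alert(1)&lt;/script&gt;">'
      . '</head><body></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // silence warnings from sloppy markup
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($doc);
$node  = $xpath->query('/html/head/meta[@name="my_target"]/@content')->item(0);

// The attribute value comes back entity-decoded, so it holds real angle brackets.
$raw = $node !== null ? $node->nodeValue : '';

// Escape at the point where the value is written into an HTML context.
$safe = htmlspecialchars($raw, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

echo $safe, PHP_EOL;
```

Escaping at output time rather than at extraction time keeps the stored value intact for non-HTML uses; which approach fits depends on where the value ends up.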

If you pull in HTML content from a source that you don't control, then yes, I would consider that piece of code potentially troublesome!

You could use htmlspecialchars() to convert any special characters to HTML entities. Or, if you want to keep parts of the mark-up, you could use strip_tags(). Another option is filter_var(), which gives you more control over the filtering.
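
To make the differences between these options concrete, here is a small comparison run on one hypothetical input string. The variable names and the choice of FILTER_SANITIZE_FULL_SPECIAL_CHARS are assumptions for illustration. Note that strip_tags() with an allow-list keeps the attributes of the allowed tags, so it is not a complete defence on its own.

```php
<?php
// Hypothetical untrusted input with an inline event handler and a script tag.
$input = 'Some <strong onclick="evil()">text</strong> <script>alert(1)</script>';

// Escape everything: the markup is displayed as text, nothing is interpreted.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// Remove every tag; the text between the tags is kept.
$stripped = strip_tags($input);

// Keep <strong> only; its onclick attribute survives untouched.
$partly = strip_tags($input, '<strong>');

// filter_var() with an escaping filter, applied here to the same input.
$filtered = filter_var($input, FILTER_SANITIZE_FULL_SPECIAL_CHARS);

foreach (compact('escaped', 'stripped', 'partly', 'filtered') as $name => $value) {
    echo $name, ': ', $value, PHP_EOL;
}
```

If some tags must be kept and the input is hostile, an allow-list sanitizer that also filters attributes is the safer route.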

Or you could use a library like HTML Purifier, but that might be more than you need. It all depends on the type of content you are working with.

Now, to sanitise the element, you will need to get the string representation of your XPath result first. Apply your filtering and then put it back in. The following example should do what you want:

<?php
// The following HTML is what you fetch from your remote source:
$html = <<<EOL
<html>
 <body>
    <h1>Foo, bar!</h1>
    <div id="my-target">
        Here is some <strong>text</strong> <script>javascript:alert('some malicious script!');</script> that we want to sanitize.
    </div>
 </body>
</html>
EOL;

// We instantiate a DOMDocument so we can work with it:
$original = new DOMDocument("1.0", 'UTF-8');
$original->formatOutput = true;
$original->loadHTML($html);

$body = $original->getElementsByTagName('body')->item(0);

// Find the element we need using Xpath:
$xpath = new DOMXPath($original);
$divs  = $xpath->query("//body/div[@id='my-target']");

// The XPath query will return DOMElement objects, so create a string that we can manipulate out of it:
$innerHTML  = '';
if ($divs->length > 0)
{
    $div = $divs->item(0);

    // Now get the innerHTML for this element
    foreach ($div->childNodes as $child) {
        $innerHTML .= $original->saveXML($child);
    }

    // Remove it from the original document because we want to replace it anyway
    $div->parentNode->removeChild($div);
}

// Sanitize our string by removing all tags except <strong> and the container <div>.
// Note: strip_tags() removes the <script> tags but keeps the text between them.
$innerHTML = strip_tags($innerHTML, '<strong>');
// or htmlspecialchars(), filter_var() or HTML Purifier ...

// Now re-import the sanitized string into a blank DOMDocument
$sanitized = new DOMDocument("1.0", 'UTF-8');
$sanitized->formatOutput = true;
$sanitized->loadXML('<div id="my-target">' . $innerHTML . '</div>');

// Now add the sanitized DOMElement back into the original document as a child of <body>
$body->appendChild($original->importNode($sanitized->documentElement, true));

echo $original->saveHTML();

Hope that helps.