Scraping unique image URLs from HTML

I'm using PHP to curl a web page (a URL entered by the user; let's assume it's valid). Example: http://www.youtube.com/watch?v=Hovbx6rvBaA

I need to parse the HTML and extract all de-duplicated URLs that look like an image. Not just the ones in img src="" but any URL ending in jpe?g|bmp|gif|png, etc. on that page. (In other words, I don't want to parse the DOM; I want to use a regex.)

I plan to then curl the URLs for their width and height information and ensure that they are indeed images, so don't worry about security related stuff.
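
For the later verification step, here is a minimal sketch of checking that downloaded bytes really are an image and reading their dimensions. The helper name image_dimensions() is hypothetical, and the embedded 1x1 GIF only stands in for the body of a curl response so the example needs no network access.

```php
<?php
// Hypothetical helper: returns [width, height] when $bytes decodes as an
// image, or null otherwise. In practice $bytes would be the curl response body.
function image_dimensions(string $bytes): ?array
{
    if ($bytes === '') {
        return null; // nothing was downloaded
    }
    // Silence warnings that malformed data can trigger; false means "not an image".
    $info = @getimagesizefromstring($bytes);
    if ($info === false) {
        return null;
    }
    return [$info[0], $info[1]]; // index 0 is the width, index 1 the height
}

// A 1x1 GIF embedded as base64, used here instead of a real download.
$gif = base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
var_dump(image_dimensions($gif));
var_dump(image_dimensions('<html>not an image</html>'));
```

getimagesizefromstring() only inspects the file header, so a URL that ends in .jpg but returns an HTML error page is rejected here.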

Collect all image URLs into an array, then use array_unique() to remove duplicates.

$my_image_links = array_unique( $my_image_links );
// No more duplicates
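
A small sketch of what that does to the array keys, using made-up sample values. array_unique() keeps the first occurrence of each value together with its original key, so the result can have gaps in its numbering; array_values() renumbers it if a plain list is needed.

```php
<?php
// Hypothetical list of scraped URLs containing one duplicate.
$my_image_links = ['a.jpg', 'b.png', 'a.jpg', 'c.gif'];

// The duplicate at index 2 is dropped, leaving keys 0, 1 and 3.
$unique = array_unique($my_image_links);

// Renumber the keys from 0 so the result is a plain list again.
$unique = array_values($unique);

print_r($unique); // a.jpg, b.png, c.gif under keys 0, 1, 2
```

The comparison is string-based and case-sensitive, so a.jpg and A.JPG count as different entries.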

If you really want to do this with a regex, we can assume each image URL is delimited by ', ", whitespace (spaces, tabs, line breaks), the beginning of a line, >, <, and whatever else you can think of. So we can do:

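// Delimiters: quotes, whitespace, > or < around the name, or start/end of a line.
// Inside a character class ^ is a literal caret, so the line anchors are written as alternatives.
$pattern = '/(?:^|[\'"\s>])([^\'"\s<>]+\.(?:jpe?g|bmp|gif|png))(?=[\'"\s<]|$)/im';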
preg_match_all($pattern, html_entity_decode($resultFromCurl), $matches);
$imgs = array_unique($matches[1]);

The above will capture the image link in stuff like:

<p>Hai guys look at this ==> http://blah.com/lolcats.JPEG</p>
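
To make that concrete, here is a self-contained sketch that runs a delimiter-based pattern over a small sample. The sample HTML is made up, and the pattern is one possible variant of the idea (a lookahead for the trailing delimiter, line anchors as alternatives), not the only way to write it.

```php
<?php
// Hypothetical sample input standing in for $resultFromCurl.
$html = '<p>Hai guys look at this ==> http://blah.com/lolcats.JPEG</p>'
      . '<img src="a.png"><img src=\'a.png\'> see b.gif and notanimage.txt';

// Group 1 is the candidate URL. The trailing delimiter is only peeked at
// with a lookahead, so it can also act as the leading delimiter of the next match.
$pattern = '/(?:^|[\'"\s>])([^\'"\s<>]+\.(?:jpe?g|bmp|gif|png))(?=[\'"\s<]|$)/im';

preg_match_all($pattern, html_entity_decode($html), $matches);

// Drop duplicates and renumber the keys from 0.
$imgs = array_values(array_unique($matches[1]));

print_r($imgs);
```

Here $imgs ends up holding http://blah.com/lolcats.JPEG, a.png and b.gif; the second a.png is removed as a duplicate and notanimage.txt is never matched.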

What's wrong with using the DOM? It gives you much better control over the context of the information and a much higher likelihood that the things you pull out are actually URLs.

<?php
$resultFromCurl = '
    <html>
    <body>
    <img src="hello.jpg" />
    <a href="yep.jpg">Yep</a>
    <table background="yep.jpg">
    </table>
    <p>
        Perhaps you should check out foo.jpg! I promise it 
        is safe for work.
    </p>
    </body>
    </html>
';

// These are all the attributes I could think of that
// can contain URLs.
$queries = array(
    '//table/@background',
    '//img/@src',
    '//input/@src',
    '//a/@href',
    '//area/@href',
    '//img/@longdesc',
);

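// loadHTML() is an instance method; calling it statically fails on PHP 8.
// The @ suppresses warnings about malformed real-world markup.
$dom = new DOMDocument();
@$dom->loadHTML($resultFromCurl);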
$xpath = new DOMXPath($dom);

$urls = array();
foreach ($queries as $query) {
    foreach ($xpath->query($query) as $link) {
        if (preg_match('@\.(gif|jpe?g|png)$@', $link->textContent))
            $urls[$link->textContent] = true;
    }
}

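// Also scan the visible text for bare image names. With the default
// PREG_PATTERN_ORDER, $matches[0] holds every full match.
if (preg_match_all('@\b[^\s]+\.(?:gif|jpe?g|png)\b@', $dom->textContent, $matches)) {
    foreach ($matches[0] as $m) {
        $urls[$m] = true;
    }
}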

$urls = array_keys($urls);
var_dump($urls);