I'm trying to get image URLs from HTML source code using the regex below, but it fails when the image URL contains spaces. For example, with this tag:
<img src="http://a57.foxnews.com/global.fncstatic.com/static/managed/img/Entertainment/876/493/kazantsev pink bikini reuters.jpg?ve=1&tl=1" alt="kazantsev pink bikini reuters.jpg" itemprop="image">
$image_regex_src_url = '/<img[^>]*'.'src=[\"|\'](.*)[\"|\']/Ui';
preg_match_all($image_regex_src_url, $string, $out, PREG_PATTERN_ORDER);
This gives me back the following:
http://a57.foxnews.com/global.fncstatic.com/static/managed/img/Entertainment/876/493/kazantsev
Is there a way to match any character including whitespace? Or is it something I have to set in the PHP configuration?
You have several issues with your regular expression.
First, you are using the concatenation operator (.) to join the two parts of your expression, which is not necessary; a single string literal works. Second, you don't need the alternation operator | inside your character classes. Within [...] the | is a literal pipe, so ["|'] also matches a | character.
The dot . matches any character except a newline. Because these tags sit in HTML source, they may contain line breaks. You could either use the s (dotall) modifier, which makes the dot match any character including line breaks, or use a negated character class, which matches any character except the ones listed.
Using the s (dotall) modifier:
$image_regex_src_url = '/<img[^>]*src=(["\'])(.*?)\1/si';
Using a negated character class [^ ]:
$image_regex_src_url = '/<img[^>]*src=(["\'])([^"\']*)\1/i';
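Note that both patterns now have two capture groups: group 1 is the quote character and group 2 is the URL, so the URLs end up in $out[2] rather than $out[1]. Here is a minimal sketch of the negated character class version, assuming the tag from the question plus a second, made-up tag (the example.com URL is hypothetical) that contains a line break:

```php
<?php
// Sample markup: the first tag is the one from the question; the second
// (hypothetical) tag spans two lines and uses single quotes.
$string = '<img src="http://a57.foxnews.com/global.fncstatic.com/static/managed/img/Entertainment/876/493/kazantsev pink bikini reuters.jpg?ve=1&tl=1" alt="kazantsev pink bikini reuters.jpg" itemprop="image">'
        . "\n<img\n  alt='logo' src='http://example.com/my logo.png'>";

// Group 1 captures the opening quote, group 2 the URL, and \1 requires
// the same quote character to close the attribute value.
$image_regex_src_url = '/<img[^>]*src=(["\'])([^"\']*)\1/i';

preg_match_all($image_regex_src_url, $string, $out, PREG_PATTERN_ORDER);

// With PREG_PATTERN_ORDER, $out[2] holds every match of group 2.
$urls = $out[2];
print_r($urls);
```

Both URLs should come back with their spaces intact. One caveat of this pattern: because the class excludes both quote types, a double-quoted URL that contains an apostrophe will not match.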
That said, it is much easier to use a parser such as DOM to grab the results:
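$doc = new DOMDocument;
$urls = array(); // initialize so print_r() works even when no <img> is found
@$doc->loadHTML($html); // load the HTML; @ suppresses warnings about malformed markup
foreach ($doc->getElementsByTagName('img') as $node) {
    $urls[] = $node->getAttribute('src'); // collect each image's src attribute
}
print_r($urls);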
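If you would rather not silence errors with @, one option is to let libxml collect parse warnings internally. The following is only a sketch under a few assumptions: the DOM extension is available, and extract_image_urls is a hypothetical helper name introduced here for illustration. It runs against the tag from the question:

```php
<?php
// Hypothetical helper for illustration; the name is not from any library.
function extract_image_urls(string $html): array
{
    $urls = [];
    if ($html === '') {
        return $urls; // loadHTML() rejects an empty string
    }

    // Collect libxml parse warnings internally instead of using @.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument;
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('img') as $node) {
        $urls[] = $node->getAttribute('src');
    }
    return $urls;
}

$html = '<img src="http://a57.foxnews.com/global.fncstatic.com/static/managed/img/Entertainment/876/493/kazantsev pink bikini reuters.jpg?ve=1&tl=1" alt="kazantsev pink bikini reuters.jpg" itemprop="image">';
print_r(extract_image_urls($html));
```

The parser reads the whole attribute value, so the spaces in the URL need no special handling.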