$fileSource = "http://google.com";
$ch = curl_init($fileSource);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_exec($ch);
$retcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
if ($retcode != 200) {
$error .= "The source specified is not a valid URL.";
}
curl_close($ch);
Here's my issue. When I use the above and set $fileSource = "http://google.com";
it does not work, whereas if I set it to $fileSource = "http://www.google.com/";
it works.
What's the issue?
One permanently redirects (301) to the www.
domain, while the other one just replies OK (200).
Why are you only considering only the 200 status code as valid? Let CURL handle that for you:
curl_setopt($ch, CURLOPT_FAILONERROR, true);
From the manual:
TRUE to fail silently if the HTTP code returned is greater than or equal to 400. The default behavior is to return the page normally, ignoring the code.
Try explicitly telling curl to follow redirects
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
If that doesn't work you may need to spoof a user agent on some sites.
Also, if they are using JS redirects your out of luck.
What you're seeing is actually a result of a 301 redirect. Here's what I got back using a verbose curl from the command line
curl -vvvvvv http://google.com
* About to connect() to google.com port 80 (#0)
* Trying 173.194.43.34...
* connected
* Connected to google.com (173.194.43.34) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.25.0 (x86_64-apple-darwin11.3.0) libcurl/7.25.0 OpenSSL/1.0.1b zlib/1.2.6 libidn/1.22
> Host: google.com
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Location: http://www.google.com/
< Content-Type: text/html; charset=UTF-8
< Date: Fri, 04 May 2012 04:03:59 GMT
< Expires: Sun, 03 Jun 2012 04:03:59 GMT
< Cache-Control: public, max-age=2592000
< Server: gws
< Content-Length: 219
< X-XSS-Protection: 1; mode=block
< X-Frame-Options: SAMEORIGIN
<
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
* Connection #0 to host google.com left intact
* Closing connection #0
However, if you do a curl on the actual www.google.com suggested in the 301 redirect, you'll get the following.
curl -vvvvvv http://www.google.com
* About to connect() to www.google.com port 80 (#0)
* Trying 74.125.228.19...
* connected
* Connected to www.google.com (74.125.228.19) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.25.0 (x86_64-apple-darwin11.3.0) libcurl/7.25.0 OpenSSL/1.0.1b zlib/1.2.6 libidn/1.22
> Host: www.google.com
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Fri, 04 May 2012 04:05:25 GMT
< Expires: -1
< Cache-Control: private, max-age=0
< Content-Type: text/html; charset=ISO-8859-1
I've truncated the remainder of google's response just to show the primary difference in the 200 OK vs 301 REDIRECT