I have just started learning to use regular expressions to extract data from websites. My first goal is to extract the title of a website. Here is my code:
<?php
$data = file_get_contents('http://bctia.org');
$regex = '/<title>(.+?)<\/title>/';
preg_match($regex,$data,$match);
var_dump($match);
?>
The result of var_dump is empty:
array(0) { }
At first I thought, "maybe bctia.org does not have a title?" However, this is not the case: I have checked the source of bctia.org, and it does have content between <title> and </title>.
Then I thought, maybe my code does not work? However, this is not the case either: I have substituted bctia.org with other websites, say bing.com or apple.com, and they both returned correct results. For example, with apple.com I get the correct result:
array(2) { [0]=> string(20) "<title>Apple</title>" [1]=> string(5) "Apple" }
So I have come to the conclusion that bctia.org is a very special website that prevents me from extracting its title...
I am wondering if that is actually the case? Or maybe my code has some problems that I have not identified?
Thank you in advance!
This specific website's server-side code assumes that the client sends a User-Agent header, and apparently your PHP installation is not configured to send one. So a 500 Internal Server Error is returned, causing file_get_contents to return false.
Source Error:
Line 66: //LOAD: Compatibility Mode
Line 67: //<meta http-equiv="X-UA-Compatible" content="IE=7,IE=9" />
Line 68: string BrowserOS = Request.ServerVariables["HTTP_USER_AGENT"].ToString();
Line 69: HtmlMeta compMode = new HtmlMeta();
Line 70: compMode.Content = "IE=7,IE=9";
Source File: c:\inetpub\wwwroot\BCTIA\Website\bctia\layouts\Main Layout.aspx.cs
Line: 68
Stack Trace:
[NullReferenceException: Object reference not set to an instance of an object.]
Layouts.Main_Layout.Page_Load(Object sender, EventArgs e) in c:\inetpub\wwwroot\BCTIA\Website\bctia\layouts\Main Layout.aspx.cs:68
System.Web.Util.CalliHelper.EventArgFunctionCaller(IntPtr fp, Object o, Object t, EventArgs e) +24
System.Web.UI.Control.LoadRecursive() +70
System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint) +3063
To work around this issue, you can just set a user-agent string before making the request:
ini_set('user_agent', 'Mozilla/5.0 (compatible; Examplebot/0.1; +http://www.example.com/bot.html)');
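If you would rather not change the setting globally, the same header can be sent per request through a stream context. The following is only a minimal sketch: fetch_html is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption. It also checks for the false return value, so a failed request is not silently passed on to preg_match.

```php
<?php
// Hypothetical helper (not part of PHP): fetch a URL while sending an
// explicit User-Agent header, and report failure as null.
function fetch_html(string $url, string $userAgent): ?string
{
    // Options under the 'http' key apply to http:// and https:// URLs.
    $context = stream_context_create([
        'http' => [
            'user_agent' => $userAgent,
            'timeout'    => 10, // seconds; an arbitrary choice for this sketch
        ],
    ]);

    // file_get_contents returns false on failure (for example on a 500
    // response); @ suppresses the accompanying warning.
    $data = @file_get_contents($url, false, $context);

    return $data === false ? null : $data;
}
```

You could then call fetch_html('http://bctia.org', 'Mozilla/5.0 (compatible; Examplebot/0.1; +http://www.example.com/bot.html)') and run preg_match only when the result is not null.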
Do not use regular expressions for this.
Use XPath instead.
Regular expressions will not work well on HTML.
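As a rough illustration of the XPath approach, here is a sketch using PHP's bundled DOM extension (DOMDocument and DOMXPath). The function name title_via_xpath is made up for this example, and it assumes you already have the page's HTML in a string.

```php
<?php
// Hypothetical helper: return the text of the first <title> element, or null.
function title_via_xpath(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML rejects an empty string
    }

    $doc = new DOMDocument();

    // Real-world HTML is often malformed; collect libxml warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//title');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim($nodes->item(0)->textContent);
}
```

Unlike the regex, this does not care whether the title spans several lines or whether the tag carries attributes.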
Parsing HTML with a regex isn't a good approach, because HTML's permissive structure may surprise you.
A likely reason your pattern doesn't work is that the dot doesn't match newlines by default.
If you want the dot to match newlines, add the s modifier at the end of the pattern, or don't use the dot:
$regex = '/<title>(.+?)<\/title>/s';
or
$regex = '/<title>([^<]+)<\/title>/';
[^<] is a negated character class that matches any character except <. As you can see, with it you don't need a lazy quantifier: + instead of +?.
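To see the difference, here is a small self-contained comparison of the three patterns. The $html string is made-up sample input with the title spread over several lines, not the actual markup of any site.

```php
<?php
// Sample input (an assumption for illustration): the title spans several lines.
$html = "<title>\n  Example\n</title>";

// Original pattern: the dot does not match "\n", so there is no match.
$plain = preg_match('/<title>(.+?)<\/title>/', $html, $m1);

// With the s modifier the dot also matches newlines.
$dotall = preg_match('/<title>(.+?)<\/title>/s', $html, $m2);

// The negated character class matches newlines as well, without any modifier.
$negated = preg_match('/<title>([^<]+)<\/title>/', $html, $m3);

var_dump($plain, $dotall, $negated);
// The captured text keeps the surrounding whitespace, so trim it.
var_dump(trim($m2[1]), trim($m3[1]));
```

The first call returns 0 and the other two return 1; after trim, both captures contain "Example".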