I'm trying to parse some HTML content. Here's the HTML:
<font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5</font><p>
<font color="green"> *TITLE* </font> Some: Event Name 1:30pm-5:00pm <font color="gold">Stream 4</font><p>
<font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
<font color="green"> *TITLE* </font> Event two 2.45pm-4.45pm <font color="gold">Stream 16</font><p>
<font color="green"> *TITLE* </font> Event THREE summary 2.45pm-4.45pm <font color="gold">Stream 2</font><p>
<font color="green"> *TITLE* </font> Event with a lot of summary 4:00pm-6:00pm <font color="gold">CHANNEL THREE 3 STREAM http://domain.com/path/to/page-3-online.html</font><p>
So to parse this and get the "Event Name", "Event Time" and "Stream Number", I'm doing this:
preg_match_all('/<\/font>\s*([^<]+)\s+(\d+.\d+\s*\w{2}\s*-\s*\d+.\d+\s*\w{2}).*?tream\s*(.*?)\s*<\/font><p>/', $data, $matches);
It returns everything correctly, except that the http link is also returned with the stream number, which I don't want. I just want the name (for some rows) and the number only.
Data Needed:
5
4
CHANNEL TWO 2 STREAM
16
2
CHANNEL THREE 3 STREAM
Currently it returns:
5
4
-online.html
16
2
-online.html
Can anyone please help? Not a pro in regex, been trying for last 2 days. Thanks in advance!!!
This expression will:
remove the leading word Stream, if it exists
<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?green['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>\s*(?:Stream\s*)?((?:(?!<\/font>).)*)<\/font>\s*[^<]*?([^<]+)\s+(\d+.\d+\s*\w{2}\s*-\s*\d+.\d+\s*\w{2})[^<]*?<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?gold['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>(?:Stream\s*)?((?:(?!\s*https?:|<\/font>).)*)
Group 0 gets the entire match
Group 1 gets the title
Group 2 gets the event name
Group 3 gets the event time
Group 4 gets the stream number
Sample Text
<font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5</font><p>
<font color="green"> *TITLE* </font> Some: Event Name 1:30pm-5:00pm <font color="gold">Stream 4</font><p>
<font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
<font color="green"> *TITLE* </font> Event two 2.45pm-4.45pm <font color="gold">Stream 16</font><p>
<font color="green"> *TITLE* </font> Event THREE summary 2.45pm-4.45pm <font color="gold">Stream 2</font><p>
<font color="green"> *TITLE* </font> Event with a lot of summary 4:00pm-6:00pm <font color="gold">CHANNEL THREE 3 STREAM http://domain.com/path/to/page-3-online.html</font><p>
PHP Code Example
<?php
$sourcestring="your source string";
// Inside a single-quoted PHP string, a regex-level \\ has to be written as \\\\ and a quote as \'.
// PREG_SET_ORDER indexes the result as $matches[match][group], as in the listing below.
preg_match_all('/<font(?=\s|>)(?=(?:[^>=|&)]*|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?\scolor=[\'"]?green[\'"]?)(?:[^>=|&)]|=\'(?:[^\']|\\\\\')*\'|="(?:[^"]|\\\\")*"|=[^\'"][^\s>]*)*>\s*(?:Stream\s*)?((?:(?!<\/font>).)*)<\/font>\s*[^<]*?([^<]+)\s+(\d+.\d+\s*\w{2}\s*-\s*\d+.\d+\s*\w{2})[^<]*?<font(?=\s|>)(?=(?:[^>=|&)]*|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?\scolor=[\'"]?gold[\'"]?)(?:[^>=|&)]|=\'(?:[^\']|\\\\\')*\'|="(?:[^"]|\\\\")*"|=[^\'"][^\s>]*)*>(?:Stream\s*)?((?:(?!\s*https?:|<\/font>).)*)/imsx',$sourcestring,$matches,PREG_SET_ORDER);
echo "<pre>".print_r($matches,true);
?>
Matches
[0][0] = <font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5
[0][1] = *TITLE*
[0][2] = Some Event Name
[0][3] = 1:15pm-5:00pm
[0][4] = 5
[1][0] = <font color="green"> *TITLE* </font> Some: Event Name 1:30pm-5:00pm <font color="gold">Stream 4
[1][1] = *TITLE*
[1][2] = Some: Event Name
[1][3] = 1:30pm-5:00pm
[1][4] = 4
[2][0] = <font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM
[2][1] = *TITLE*
[2][2] = Some, Event Name 1 with num
[2][3] = 1:30pm-7:30pm
[2][4] = CHANNEL TWO 2 STREAM
[3][0] = <font color="green"> *TITLE* </font> Event two 2.45pm-4.45pm <font color="gold">Stream 16
[3][1] = *TITLE*
[3][2] = Event two
[3][3] = 2.45pm-4.45pm
[3][4] = 16
[4][0] = <font color="green"> *TITLE* </font> Event THREE summary 2.45pm-4.45pm <font color="gold">Stream 2
[4][1] = *TITLE*
[4][2] = Event THREE summary
[4][3] = 2.45pm-4.45pm
[4][4] = 2
[5][0] = <font color="green"> *TITLE* </font> Event with a lot of summary 4:00pm-6:00pm <font color="gold">CHANNEL THREE 3 STREAM
[5][1] = *TITLE*
[5][2] = Event with a lot of summary
[5][3] = 4:00pm-6:00pm
[5][4] = CHANNEL THREE 3 STREAM
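To see how the last group stops before the link, here is a minimal sketch that isolates just that piece of the expression. It assumes you already have the text from inside the gold font tags; the $cells and $results names are made up for illustration and are not part of the expression above.

```php
<?php
// Hypothetical input: the text found inside two of the gold <font> tags.
$cells = [
    'Stream 5',
    'CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html',
];

$results = [];
foreach ($cells as $cell) {
    // (?:Stream\s*)?        skips a leading "Stream" when it is present.
    // (?:(?!\s*https?:).)*  consumes one character at a time, but only while
    //                       the text ahead is not optional whitespace followed
    //                       by "http:" or "https:".
    preg_match('/^(?:Stream\s*)?((?:(?!\s*https?:).)*)/', $cell, $m);
    $results[] = $m[1];
}

print_r($results);
```

The lookahead includes \s* so that the space in front of the link is left out of the capture as well, which is why no trim is needed here.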
But if you want to do it with a regex, then based on your data you need this:
preg_match_all('/(?:<\/font> )((?:[^0-9]+(?:[0-9](?!\.|:|[0-9]))?(?:[0-9]{2}(?!\.|:))?)*)([^<]+) <[^>]+>(?:Stream )?([^h<]+)/', $data, $matches);
This will put the names in $matches[1], the times in $matches[2] and the channels in $matches[3].
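A minimal usage sketch for this pattern, assuming $data holds the HTML from the question (only three of the sample rows are reproduced here). The trim calls and the $names, $times and $channels variables are additions for illustration: group 1 keeps the space in front of the time, and group 3 keeps the space in front of the link.

```php
<?php
// Assumption: $data holds the HTML from the question; three sample rows are used here.
$data = <<<'HTML'
<font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5</font><p>
<font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
<font color="green"> *TITLE* </font> Event two 2.45pm-4.45pm <font color="gold">Stream 16</font><p>
HTML;

preg_match_all('/(?:<\/font> )((?:[^0-9]+(?:[0-9](?!\.|:|[0-9]))?(?:[0-9]{2}(?!\.|:))?)*)([^<]+) <[^>]+>(?:Stream )?([^h<]+)/', $data, $matches);

// Strip the trailing spaces that groups 1 and 3 pick up.
$names    = array_map('trim', $matches[1]);
$times    = $matches[2];
$channels = array_map('trim', $matches[3]);

foreach ($names as $i => $name) {
    echo $name, ' | ', $times[$i], ' | ', $channels[$i], "\n";
}
```

Keep in mind that ([^h<]+) stops at the first lower-case "h", so this only works while the channel names themselves contain no lower-case "h".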
Explanation of the regex:
(?:<\/font> )
search for (and ignore) the first closing font tag on each line, including the space after it
((?:[^0-9]+(?:[0-9](?!\.|:|[0-9]))?(?:[0-9]{2}(?!\.|:))?)*)
grab runs of non-digits, plus one or two digits as long as they are not followed by a dot or colon (negative lookahead); repeat as needed and capture it all as one group
([^<]+)
grab everything up to the next "<", but not the trailing space
<[^>]+>
ignore everything until the next ">" and ignore the ">" as well
(?:Stream )?
if the first word is "Stream ", ignore it
([^h<]+)
grab everything until either a lower-case "h" or a "<"
This expression will:
skip the word Stream, if it is the first word
<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?gold['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>(?:Stream\s*)?\K(?:(?!\s*https?:|<\/font>).)*
Sample Text
<font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5</font><p>
<font color="green"> *TITLE* </font> Some: Event Name 1:30pm-5:00pm <font color="gold">Stream 4</font><p>
<font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
<font color="green"> *TITLE* </font> Event two 2.45pm-4.45pm <font color="gold">Stream 16</font><p>
<font color="green"> *TITLE* </font> Event THREE summary 2.45pm-4.45pm <font color="gold">Stream 2</font><p>
<font color="green"> *TITLE* </font> Event with a lot of summary 4:00pm-6:00pm <font color="gold">CHANNEL THREE 3 STREAM http://domain.com/path/to/page-3-online.html</font><p>
Matches
[0] => 5
[1] => 4
[2] => CHANNEL TWO 2 STREAM
[3] => 16
[4] => 2
[5] => CHANNEL THREE 3 STREAM
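For completeness, a sketch of how this expression might be called from PHP. It assumes the HTML is in a variable called $data (two of the sample rows are shown), and it keeps the pattern in a nowdoc so the quotes and backslashes need no extra escaping; both choices are mine, not part of the expression itself. Because of the \K, the stream values end up in $matches[0].

```php
<?php
// Assumption: $data holds the HTML from the question; two sample rows are used here.
$data = <<<'HTML'
<font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5</font><p>
<font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
HTML;

// A nowdoc keeps the pattern exactly as written above, wrapped in / delimiters.
$pattern = <<<'REGEX'
/<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?gold['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>(?:Stream\s*)?\K(?:(?!\s*https?:|<\/font>).)*/
REGEX;

// \K discards everything matched so far, so only the stream part is returned.
preg_match_all($pattern, $data, $matches);

print_r($matches[0]);
```

For these two rows, $matches[0] should hold 5 and CHANNEL TWO 2 STREAM.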