How can I ignore an http link in a string and return everything else?

I'm trying to parse some HTML content. Here it is:

<font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5</font><p>
<font color="green"> *TITLE* </font> Some: Event Name 1:30pm-5:00pm <font color="gold">Stream 4</font><p>
<font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
<font color="green"> *TITLE* </font> Event two 2.45pm-4.45pm <font color="gold">Stream 16</font><p>
<font color="green"> *TITLE* </font> Event THREE summary 2.45pm-4.45pm <font color="gold">Stream 2</font><p>
<font color="green"> *TITLE* </font> Event with a lot of summary 4:00pm-6:00pm <font color="gold">CHANNEL THREE 3 STREAM http://domain.com/path/to/page-3-online.html</font><p>

So to parse this and get the "Event Name", "Event Time" and "Stream Number", I'm doing this:

preg_match_all('/<\/font>\s*([^<]+)\s+(\d+.\d+\s*\w{2}\s*-\s*\d+.\d+\s*\w{2}).*?tream\s*(.*?)\s*<\/font><p>/', $data, $matches);

It returns everything correctly, except that when the stream contains an http link, part of the link ends up in the result, which I don't want. I just want the name (where there is one) and the number.

Data Needed:

5
4
CHANNEL TWO 2 STREAM
16
2
CHANNEL THREE 3 STREAM

Currently it returns:

5
4
-online.html
16
2
-online.html

Can anyone please help? I'm not a regex pro and have been trying for the last 2 days. Thanks in advance!

Description

This expression will:

  • Capture the title
  • Capture the Event Name
  • Capture the event time
  • find all the font tags which have color=gold
  • skip over the word Stream if it exists
  • capture the interesting text
  • stop capturing when it reaches an http:// link
  • Trim pesky white space from around the matches
  • Overall, the expression allows the font tag attributes to appear anywhere inside the font tag, and it avoids some really difficult edge cases

<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?green['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>\s*(?:Stream\s*)?((?:(?!<\/font>).)*)<\/font>\s*[^<]*?([^<]+)\s+(\d+.\d+\s*\w{2}\s*-\s*\d+.\d+\s*\w{2})[^<]*?<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?gold['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>(?:Stream\s*)?((?:(?!\s*https?:|<\/font>).)*)
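The piece of this expression that stops the capture at the link is `(?:(?!\s*https?:|<\/font>).)*`: it consumes one character at a time, but only while the text ahead is not a link (or the closing tag). Here is a minimal sketch of that idea on its own, assuming the gold font tag's inner text has already been isolated; `textBeforeLink` is a hypothetical helper name, not part of the answer above.

```php
<?php
// Hypothetical helper (not from the original answer): returns the text that
// precedes the first http:// or https:// link, without trailing whitespace.
function textBeforeLink(string $text): string
{
    // (?:(?!\s*https?:).)* takes one character at a time, but only while the
    // upcoming text is not optional whitespace followed by "http:" or "https:".
    preg_match('/^(?:(?!\s*https?:).)*/is', $text, $m);
    return $m[0];
}

echo textBeforeLink('CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html'), "\n"; // CHANNEL TWO 2 STREAM
echo textBeforeLink('16'), "\n"; // 16
```

Because the lookahead begins with `\s*`, the whitespace in front of the link is left out of the match, so no separate trim is needed.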

Examples

Sample Text

Group 0 gets the entire match
Group 1 gets the title
Group 2 gets the event name
Group 3 gets the event time
Group 4 gets the stream number

<font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5</font><p>
<font color="green"> *TITLE* </font> Some: Event Name 1:30pm-5:00pm <font color="gold">Stream 4</font><p>
<font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
<font color="green"> *TITLE* </font> Event two 2.45pm-4.45pm <font color="gold">Stream 16</font><p>
<font color="green"> *TITLE* </font> Event THREE summary 2.45pm-4.45pm <font color="gold">Stream 2</font><p>
<font color="green"> *TITLE* </font> Event with a lot of summary 4:00pm-6:00pm <font color="gold">CHANNEL THREE 3 STREAM http://domain.com/path/to/page-3-online.html</font><p>

PHP Code Example

<?php
$sourcestring = "your source string";
// A nowdoc holds the pattern verbatim, so its quotes and backslashes need no PHP-level escaping.
$pattern = <<<'REGEX'
/<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?green['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>\s*(?:Stream\s*)?((?:(?!<\/font>).)*)<\/font>\s*[^<]*?([^<]+)\s+(\d+.\d+\s*\w{2}\s*-\s*\d+.\d+\s*\w{2})[^<]*?<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?gold['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>(?:Stream\s*)?((?:(?!\s*https?:|<\/font>).)*)/imsx
REGEX;
// PREG_SET_ORDER groups the results per match, so $matches[0][4] is the stream of the first event.
preg_match_all($pattern, $sourcestring, $matches, PREG_SET_ORDER);
echo "<pre>" . print_r($matches, true);
?>

Matches

[0][0] = <font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5
[0][1] = *TITLE* 
[0][2] = Some Event Name
[0][3] = 1:15pm-5:00pm
[0][4] = 5

[1][0] = <font color="green"> *TITLE* </font> Some: Event Name 1:30pm-5:00pm <font color="gold">Stream 4
[1][1] = *TITLE* 
[1][2] = Some: Event Name
[1][3] = 1:30pm-5:00pm
[1][4] = 4

[2][0] = <font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM
[2][1] = *TITLE* 
[2][2] = Some, Event Name 1 with num
[2][3] = 1:30pm-7:30pm
[2][4] = CHANNEL TWO 2 STREAM

[3][0] = <font color="green"> *TITLE* </font> Event two 2.45pm-4.45pm <font color="gold">Stream 16
[3][1] = *TITLE* 
[3][2] = Event two
[3][3] = 2.45pm-4.45pm
[3][4] = 16

[4][0] = <font color="green"> *TITLE* </font> Event THREE summary 2.45pm-4.45pm <font color="gold">Stream 2
[4][1] = *TITLE* 
[4][2] = Event THREE summary
[4][3] = 2.45pm-4.45pm
[4][4] = 2

[5][0] = <font color="green"> *TITLE* </font> Event with a lot of summary 4:00pm-6:00pm <font color="gold">CHANNEL THREE 3 STREAM
[5][1] = *TITLE* 
[5][2] = Event with a lot of summary
[5][3] = 4:00pm-6:00pm
[5][4] = CHANNEL THREE 3 STREAM

But if you want to do it with a regex, then based on your data you need this:

preg_match_all('/(?:<\/font> )((?:[^0-9]+(?:[0-9](?!\.|:|[0-9]))?(?:[0-9]{2}(?!\.|:))?)*)([^<]+) <[^>]+>(?:Stream )?([^h<]+)/', $data, $matches);

This will put the names in $matches[1], the times in $matches[2] and the channels in $matches[3].


Explanation of the regex:

  1. (?:<\/font> ) match (without capturing) the first closing font tag on the line, including the space after it
  2. ((?:[^0-9]+(?:[0-9](?!\.|:|[0-9]))?(?:[0-9]{2}(?!\.|:))?)*) grab non-digit text, plus any one- or two-digit number that is not followed by a dot or colon (checked with a negative lookahead, so the start of a time is excluded); repeat as needed and capture as one group
  3. ([^<]+) grab everything up to the next "<", but not the trailing space
  4. <[^>]+> skip everything up to and including the next ">"
  5. (?:Stream )? if the first word is "Stream ", skip it
  6. ([^h<]+) grab everything until either a lower-case "h" or a "<"
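As a quick check of the steps above, here is a sketch that runs this pattern over two lines of the question's sample data. The variable names `$names`, `$times` and `$channels` are chosen here for illustration. Groups 1 and 3 can end with a space (the name is followed by a space before the time, and the channel text runs right up to the "h" of the link), so the results are trimmed.

```php
<?php
// Two lines taken from the question's sample data.
$data = <<<'HTML'
<font color="green"> *TITLE* </font> Event two 2.45pm-4.45pm <font color="gold">Stream 16</font><p>
<font color="green"> *TITLE* </font> Event with a lot of summary 4:00pm-6:00pm <font color="gold">CHANNEL THREE 3 STREAM http://domain.com/path/to/page-3-online.html</font><p>
HTML;

preg_match_all('/(?:<\/font> )((?:[^0-9]+(?:[0-9](?!\.|:|[0-9]))?(?:[0-9]{2}(?!\.|:))?)*)([^<]+) <[^>]+>(?:Stream )?([^h<]+)/', $data, $matches);

// Default (pattern) order: $matches[N] lists group N for every match.
$names    = array_map('trim', $matches[1]);
$times    = array_map('trim', $matches[2]);
$channels = array_map('trim', $matches[3]);

print_r($names);    // Event two, Event with a lot of summary
print_r($times);    // 2.45pm-4.45pm, 4:00pm-6:00pm
print_r($channels); // 16, CHANNEL THREE 3 STREAM
```

One caveat with step 6: `[^h<]+` stops at any lower-case "h", not only at the start of "http", so a channel label written in lower case (for example "Channel 5") would be cut short.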

Description

This expression will:

  • find all the font tags which have color=gold
  • skip over the word Stream if it is the first word
  • capture the interesting text
  • stop capturing when it reaches an http:// link

<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?gold['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>(?:Stream\s*)?\K(?:(?!\s*https?:|<\/font>).)*
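What makes this version return the stream text directly is `\K`, which discards everything matched so far from the reported match, so no capture group is needed. Here is a minimal sketch of that behaviour, assuming a simplified pattern that accepts only the exact tag `<font color="gold">` (the full expression above is more tolerant about attributes):

```php
<?php
// Two lines taken from the question's sample data.
$data = <<<'HTML'
<font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5</font><p>
<font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
HTML;

// Simplified for illustration: the tag and an optional leading "Stream" are
// matched first, then \K resets the start of the match, so only the text up
// to the link (or the closing tag) is reported.
preg_match_all('/<font\s+color="gold">(?:Stream\s*)?\K(?:(?!\s*https?:|<\/font>).)*/i', $data, $matches);

print_r($matches[0]); // 5, CHANNEL TWO 2 STREAM
```

The results are in `$matches[0]` (the whole match) rather than in a numbered group.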

Examples

Sample Text

<font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5</font><p>
<font color="green"> *TITLE* </font> Some: Event Name 1:30pm-5:00pm <font color="gold">Stream 4</font><p>
<font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
<font color="green"> *TITLE* </font> Event two 2.45pm-4.45pm <font color="gold">Stream 16</font><p>
<font color="green"> *TITLE* </font> Event THREE summary 2.45pm-4.45pm <font color="gold">Stream 2</font><p>
<font color="green"> *TITLE* </font> Event with a lot of summary 4:00pm-6:00pm <font color="gold">CHANNEL THREE 3 STREAM http://domain.com/path/to/page-3-online.html</font><p>

Matches

[0] => 5
[1] => 4
[2] => CHANNEL TWO 2 STREAM
[3] => 16
[4] => 2
[5] => CHANNEL THREE 3 STREAM