I'm trying to match the following:
This:
HIGH SCHOOL WRESTLING NOTEBOOK: A surge at Delaware Valley, team rankings shakeup and more.
With this:
<pre>
<div class="sum">
<div class="photo_gutter">
<div class="photo">
<a href="http://media.lehighvalleylive.com/brad-wilson/photo/jaryd-flank-b30e919c41bc86b2.jpg">
<img src="http://media.lehighvalleylive.com/brad-wilson/photo/jaryd-flank-b30e919c41bc86b2.jpg" alt="" title="" width="200" border="0"/>
</a>
</div>
</div>
</div>
HIGH SCHOOL WRESTLING NOTEBOOK: A surge at Delaware Valley, team rankings shakeup and more.
</pre>
What I have so far is /<.*>\s/i, but I need the opposite of that. Can someone help me?
Do not use regex to parse HTML; use PHP's DOMDocument instead.
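A minimal sketch of that suggestion, using the HTML from the question. The helper name extractText is made up for illustration, and this assumes the goal is simply all the text that is not markup:

```php
<?php
// Hypothetical helper: returns all text content of an HTML fragment, trimmed.
function extractText(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    // textContent concatenates every text node below the root element.
    $root = $doc->documentElement;
    return $root === null ? '' : trim($root->textContent);
}

$html = <<<'HTML'
<div class="sum">
<div class="photo_gutter">
<div class="photo">
<a href="http://media.lehighvalleylive.com/brad-wilson/photo/jaryd-flank-b30e919c41bc86b2.jpg">
<img src="http://media.lehighvalleylive.com/brad-wilson/photo/jaryd-flank-b30e919c41bc86b2.jpg" alt="" title="" width="200" border="0"/>
</a>
</div>
</div>
</div>
HIGH SCHOOL WRESTLING NOTEBOOK: A surge at Delaware Valley, team rankings shakeup and more.
HTML;

$headline = extractText($html);
echo $headline, "\n";
```

The nested div, a and img elements contain no text of their own, so after trimming only the headline should remain.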
It is not recommended to use regex to parse HTML, but since it's a simple task (and probably meant to learn regex):
You have this: /<.*>\s/i
1- The i modifier does nothing here, since the pattern contains no characters that could be case sensitive. For example, /apple/i makes sense because you want it to also find Apple. /\w+/i does nothing, since \w already includes both lowercase and uppercase characters.
2- If you are parsing HTML, it's better not to assume or rely on any \s unless you are inside a tag.
3- If you want to capture part of the match into a variable, you have to wrap it in ( and ). For example, /(\w+) Apple/ applied to Red Apple would give you Red in $1, or in the array of matches filled in by the preg_match() function.
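Points 1 and 3 can be checked with a short sketch like the following. The variable names are only illustrative:

```php
<?php
// Point 3: parentheses capture part of the match.
preg_match('/(\w+) Apple/', 'Red Apple', $matches);
$whole = $matches[0]; // the full match
$first = $matches[1]; // the first capture group

// Point 1: the i modifier matters only when the pattern contains letters.
$withoutI = preg_match('/apple/', 'Apple');  // no match: case differs
$withI    = preg_match('/apple/i', 'Apple'); // match: case is ignored

// \w already covers both cases, so i changes nothing here.
$wordPlain = preg_match('/\w+/', 'Apple');
$wordWithI = preg_match('/\w+/i', 'Apple');

var_dump($whole, $first, $withoutI, $withI, $wordPlain, $wordWithI);
```

preg_match() returns 1 when the pattern matches and 0 when it does not, which is what the last four variables hold.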
Now, here is how I would do it:
First of all, I would remove any \r or \n from the input string. Regex works much better with only one line of text. You can do this with str_replace().
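As a rough sketch of that first step, assuming the characters to strip are carriage returns and line feeds (the sample input is made up):

```php
<?php
// Sample input spread over several lines (illustrative only).
$input = "<div>\r\n<b>Title</b>\r\n</div>";

// str_replace() accepts an array of search strings; each one is
// replaced with the empty string, leaving a single line of text.
$oneLine = str_replace(["\r", "\n"], '', $input);

echo $oneLine, "\n";
```

An alternative would be to keep the line breaks and add the s modifier so that . also matches newlines, but flattening the input first keeps the patterns simpler.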
If you want to get anything that is not inside <>:
/>(.*?)</
If you want to get the text inside of a certain tag, for example <div>this one</div>:
/<div>(.*?)<\/div>/
The ? character makes the .* match non-greedy, so it will get the fewest characters that still match the pattern.
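The two patterns, and the difference the ? makes, can be tried with a sketch like this one. The sample strings are invented for the demonstration:

```php
<?php
$html = '<div>this one</div><div>that one</div>';

// Non-greedy: stops at the first closing tag.
preg_match('/<div>(.*?)<\/div>/', $html, $lazy);

// Greedy: runs on to the last closing tag in the string.
preg_match('/<div>(.*)<\/div>/', $html, $greedy);

// Text between any > and the next <. Adjacent tags produce
// empty captures, so those are filtered out afterwards.
preg_match_all('/>(.*?)</', '<p>one</p><p>two</p>', $all);
$texts = array_values(array_filter($all[1], 'strlen'));

var_dump($lazy[1], $greedy[1], $texts);
```

Note that />(.*?)</ needs a < after the text, so text that follows the last tag, with nothing after it, is not captured. In that situation a pattern anchored to the end of the string, such as />([^<>]+)$/, may be closer to what is wanted.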
Hope it helped.