I am trying to match the word contact within the content/text of HTML tags. I can get all text between tags:
http://rubular.com/r/IkhG2nhmnS
with:
(?<=\"\>)(.*?)(?=\<\/)
But when I search for only the word contact, it doesn't work:
http://rubular.com/r/We44nHisLf
with:
(?<=\"\>)(contact*?)(?=\<\/)
Can anyone guide me on how to match the word I want within the text/content of HTML tags? In the above case I want to find/match the word contact.
Thanks for your help
You probably want something like this:
(?<=\"\>).*(contact).*(?=\<\/)
Your current regex:
(?<=\"\>)(contact*?)(?=\<\/)
will match:
<a href="contact">contact</a>
but also:
<a href="contact">contactttt</a>
or even:
<a href="contact">contac</a>
since the * applies only to the t preceding it. It will not match when any other text surrounds the word.
The .* in my regex allows for any characters before contact.
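To make the quantifier behaviour concrete, here is a minimal sketch in PHP. The sample strings are made up for illustration, and the second pattern is only one possible relaxation that allows other text around the word; treat it as an assumption rather than the definitive fix.

```php
<?php
// Hypothetical sample inputs; none of these come from the original question.
$samples = [
    '<a href="contact">contact</a>',
    '<a href="contact">contactttt</a>',
    '<a href="contact">contac</a>',
    '<a href="contact">my contact page</a>',
];

// The pattern from the question: *? quantifies only the final "t".
$original = '/(?<=\">)(contact*?)(?=<\/)/';
// A variant that allows other non-tag text before and after the word.
$relaxed = '/(?<=\">)[^<]*(contact)[^<]*(?=<\/)/';

$results = [];
foreach ($samples as $html) {
    // preg_match returns 1 on a match and 0 otherwise.
    $results[] = [preg_match($original, $html), preg_match($relaxed, $html)];
}
print_r($results);
```

The original pattern accepts the first three samples and rejects the fourth, while the relaxed one rejects only the misspelled "contac".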
If you really must use regexes for parsing HTML tags, then try:
(?<=>)[^<]*(contact)[^<]*(?=<\/)
Here is a test. Your match is in group 1.
But take a look at DOM functions instead, for proper parsing of structured documents.
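As a rough illustration of the DOM route, the sketch below uses PHP's DOMDocument and DOMXPath to collect text nodes that contain the word. The markup is a hypothetical sample, the DOM extension is assumed to be enabled, and the XPath contains() test is case-sensitive.

```php
<?php
// Hypothetical sample markup, not taken from the original question.
$html = '<div><a href="contact">contact us</a><a href="about">about</a><p>no match</p></div>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them on imperfect HTML.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select text nodes containing the word; attribute values such as href are not text nodes.
$found = [];
foreach ($xpath->query('//text()[contains(., "contact")]') as $node) {
    $found[] = trim($node->nodeValue);
}
print_r($found);
```

Because the query only looks at text nodes, the href="contact" attribute is never mistaken for tag content, which is the main edge case the regex approaches have to work around.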
This regex will pull all text inside the href in the anchor tag.
<a\b[^>]*?\bhref=(['"])([^'"]*)\1[^>]*?>
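Group 0 will have the entire matched string, from <a to the >.
Group 1 will have the opening quote, which \1 then uses to match the closing quote.
Group 2 will have the value of the href attribute.
Using a regex is probably not a good idea for parsing HTML, as there are many edge cases which can trip up a regex.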
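<?php
// Sample input; replace it with the HTML you want to scan.
$sourcestring = '<a href="contact">contact</a>';
// Group 1 captures the opening quote, group 2 the href value.
preg_match_all('/<a\b[^>]*?\bhref=([\'"])([^\'"]*)\1[^>]*?>/im', $sourcestring, $matches);
// Escape the dump so the matched tag is displayed as text in a browser.
echo "<pre>" . htmlspecialchars(print_r($matches, true)) . "</pre>";
?>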
$matches Array:
(
    [0] => Array
        (
            [0] => <a href="contact">
        )
    [1] => Array
        (
            [0] => "
        )
    [2] => Array
        (
            [0] => contact
        )
)
<a        match <a
\b        the boundary between a word char (\w) and something that is not a word char
[^>]*?    any character except: '>' (0 or more times (matching the least amount possible))
\b        the boundary between a word char (\w) and something that is not a word char
href=     match href=
(         group and capture to \1:
['"]      any character of: ''', '"'
)         end of \1
(         group and capture to \2:
[^'"]*    any character except: ''', '"' (0 or more times (matching the most amount possible))
)         end of \2
\1        what was matched by capture \1
[^>]*?    any character except: '>' (0 or more times (matching the least amount possible))
>         match >
The safest way to make sure you don't run into another tag before matching the text is:
(?<=\"\>)[^<]*(contact)
where [^<]* means: a character that is not a <, as many times as possible.
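The difference between .* and [^<]* shows up as soon as a line holds more than one tag. The following is a small sketch under assumed input (the two-link string is invented for the comparison), not a complete solution:

```php
<?php
// Hypothetical input: the first link has no "contact" in its text, the second does.
$html = '<a href="home">home</a> <a href="contact">contact us</a>';

// .* can run across tag boundaries, so the match starts inside the first link.
preg_match('/(?<=\">).*(contact)/', $html, $greedy);
// [^<]* stops at the next "<", so the match stays inside a single piece of text.
preg_match('/(?<=\">)[^<]*(contact)/', $html, $bounded);

echo $greedy[0], "\n";
echo $bounded[0], "\n";
```

In both cases group 1 holds the word itself; what changes is how much surrounding markup the overall match is allowed to swallow on the way there.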