Regex not finding text between tags [duplicate]

I am trying to match the word contact within the text content of HTML tags. I can get all the text between tags:

http://rubular.com/r/IkhG2nhmnS

with:

(?<=\"\>)(.*?)(?=\<\/)

But when I try to search for only the word contact, it doesn't work:

http://rubular.com/r/We44nHisLf

with:

(?<=\"\>)(contact*?)(?=\<\/)

Can anyone explain how to match a specific word within the text content of HTML tags? In the case above, I want to match the word contact.

Thanks for your help


You probably want something like this:

(?<=\"\>).*(contact).*(?=\<\/)

Your current regex:

(?<=\"\>)(contact*?)(?=\<\/)

Will not only match:

<a href="contact">contact</a>

But also...

<a href="contact">contactttt</a>

Or even...

<a href="contact">contac</a>

This is because the * applies only to the t preceding it.
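
To see the quantifier's scope concretely, here is a minimal sketch that runs the question's pattern against a few anchors. Python is an assumption (the question does not name a language), and the fourth sample string and the helper name matched_text are made up for illustration:

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'(?<=\"\>)(contact*?)(?=\<\/)')

# The first three samples come from the answer; the last is a hypothetical extra.
samples = [
    '<a href="contact">contact</a>',
    '<a href="contact">contactttt</a>',
    '<a href="contact">contac</a>',
    '<a href="contact">contact us</a>',
]

def matched_text(html):
    """Return the text captured by the pattern in html, or None if it fails."""
    m = pattern.search(html)
    return m.group(1) if m else None

for html in samples:
    print(html, "->", matched_text(html))
```

The last sample yields None: after contac, the t*? can only consume further t characters, so the " us" that follows keeps the lookahead (?=\<\/) from ever succeeding.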

The .* in my regex makes the allowance for any characters before contact.

If you really must use a regex to parse HTML tags, then try:

(?<=>)[^<]*(contact)[^<]*(?=<\/)

Here is a test. Your match is in group 1.
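
As a rough illustration of how group 1 is read, here is the same pattern applied in Python. The language choice and the sample markup are assumptions, not part of the original test:

```python
import re

# The pattern from this answer: [^<]* keeps the match inside one text node.
pattern = re.compile(r'(?<=>)[^<]*(contact)[^<]*(?=<\/)')

# Hypothetical sample: "contact" appears both in the href and in the link text.
html = '<li><a href="contact">Please contact us</a></li>'

m = pattern.search(html)
whole_text = m.group(0)  # the full text node that contains the word
word = m.group(1)        # just the word itself
print(repr(whole_text), repr(word))
```

The href value is skipped because the lookbehind (?<=>) requires the match to start right after a closing >, which never happens inside an attribute.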

But take a look at DOM functions instead, for proper parsing of structured documents.
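
A minimal sketch of the parser-based approach, with Python's standard-library html.parser standing in for the DOM functions mentioned above. The class name TextCollector, the helper text_nodes_containing, and the sample markup are all hypothetical:

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects the non-empty text nodes of a document."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        # Called for text between tags, never for attribute values.
        stripped = data.strip()
        if stripped:
            self.texts.append(stripped)

def text_nodes_containing(html, word):
    """Return every text node in html that contains word."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return [text for text in collector.texts if word in text]

# Hypothetical markup: the first link has "contact" only in its href.
html = ('<ul><li><a href="contact">Home</a></li>'
        '<li><a href="about">Please contact us</a></li></ul>')
print(text_nodes_containing(html, "contact"))
```

Because the parser separates attribute values from text, the href="contact" on the first link is never mistaken for content.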

Description

This regex captures the value of the href attribute in an anchor tag.

<a\b[^>]*?\bhref=(['"])([^'"]*)\1[^>]*?>

Groups

  • group 0 holds the entire matched string, from <a to the closing >
  • group 1 holds the opening quote of the href value; it is reused later in the regex as \1 to match the closing quote
  • group 2 holds the content of the href value

Disclaimer

Using a regex is probably not a good idea for parsing HTML, as there are many edge cases that can trip up a regex.

PHP Code Example:

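<?php
// Placeholder input; replace with the HTML you want to search.
$sourcestring = "your source string";
// Group 1 captures the opening quote, group 2 the href value.
preg_match_all('/<a\b[^>]*?\bhref=([\'"])([^\'"]*)\1[^>]*?>/im', $sourcestring, $matches);
// Dump all matches, grouped by capture index.
echo "<pre>" . print_r($matches, true);
?>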

$matches Array:
(
    [0] => Array
        (
            [0] => <a href="contact">
        )

    [1] => Array
        (
            [0] => "
        )

    [2] => Array
        (
            [0] => contact
        )

)

Summary

  • <a match <a
  • \b the boundary between a word char (\w) and something that is not a word char
  • [^>]*? any character except: '>' (0 or more times (matching the least amount possible))
  • \b the boundary between a word char (\w) and something that is not a word char
  • href= match href=
  • ( group and capture to \1:
  • ['"] any character of: ''', '"'
  • ) end of \1
  • ( group and capture to \2:
  • [^'"]* any character except: ''', '"' (0 or more times (matching the most amount possible))
  • ) end of \2
  • \1 what was matched by capture \1
  • [^>]*? any character except: '>' (0 or more times (matching the least amount possible))
  • > match >
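
The same breakdown can be kept next to the pattern itself. The sketch below uses Python's re.VERBOSE flag as one assumed way of doing that; the name HREF_RE and the sample tag are illustrative only:

```python
import re

HREF_RE = re.compile(r"""
    <a\b          # literal <a followed by a word boundary
    [^>]*?        # any characters except >, as few as possible
    \bhref=       # word boundary, then literal href=
    (['"])        # group 1: the opening quote
    ([^'"]*)      # group 2: the href value
    \1            # the same quote character that group 1 matched
    [^>]*?        # any remaining attributes
    >             # end of the opening tag
""", re.VERBOSE | re.IGNORECASE)

m = HREF_RE.search('<a href="contact">contact</a>')
print(m.group(0))  # the whole opening tag
print(m.group(1))  # the quote character
print(m.group(2))  # the href value
```

In verbose mode, whitespace and # comments inside the pattern are ignored, so the regex matches exactly what the one-line version does.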

The safest way to make sure you don't run into another tag before matching the text is:

(?<=\"\>)[^<]*(contact)

where

[^<]* 

means: any character that is not a <, as many times as possible
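
To illustrate why excluding < matters, here is a small comparison in Python (an assumed language). The .* variant is shown only for contrast, and both sample strings and the helper first_match are hypothetical:

```python
import re

greedy_dot = re.compile(r'(?<=\"\>).*(contact)')   # can run across other tags
no_tags = re.compile(r'(?<=\"\>)[^<]*(contact)')   # stops at the next <

# "contact" occurs only inside an href here, never in the link text.
attr_only = '<a href="home">Home</a> <a href="contact">Get in touch</a>'
# "contact" occurs in the link text here.
in_text = '<a href="about">Please contact us</a>'

def first_match(pattern, html):
    """Return the full matched string, or None when the pattern fails."""
    m = pattern.search(html)
    return m.group(0) if m else None

print(first_match(greedy_dot, attr_only))  # runs through </a> into the next tag
print(first_match(no_tags, attr_only))     # no match: the word is not in any text
print(first_match(no_tags, in_text))       # matches within the link text only
```

With .*, the match starts in the first link's text and ends inside the second link's href, which is exactly the false positive that [^<]* rules out.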