preg_match_all: matching subdomains under special conditions

Update:

I'm using this regex:

/[^a-z^-^0-9^@^%^\/^:^\.^-]((?<!w\.)(?!w+\.)([0-9a-z][0-9a-z\-]*\.){2,}[a-z]+)(["|\s|<]|$)/i

The regex has a small problem when the string contains only domains and nothing else, listed one per line.

For example:

$string = 'sub84.example4.com
sub-example.example84.net
sub-84example.example-h1.org
www-example4124.domain.com
sub.example-www.com';

All domains should be matched, but the current regex only matches sub-example.example84.net and www-example4124.domain.com

I'm also looking to add some more conditions:

1) The letters of the domain must be lowercase, never uppercase (the current regex doesn't check this). EX: Sub.example.Com is not ok.

2) None of '= ', '( ', '("' or ': ' directly before the domain, and none of ' )', ' =', '")' or ' :' directly after it.

EX:

$string = '

text = sub1.example.com text
text ( sub2.example.com text
text sub3.example.com = text
text sub4.example.com ) text
text ("sub5.example.com text
text sub6.example.com") text
text : sub7.example.com text
text sub8.example.com : text

';

None of these should match.

3) Exclude .info .biz .tv tlds

Thank you.

I don't quite understand the purpose of this or the large number of conditions, but you might try this (rather long) one:

(?<!http://)(?<!https://)(?<!www\.)(?![^\s<>]*[:@/])(?<![(=:] )(?<!\(")(?:(?<=[\s">])|^)(?:[a-z0-9-]+\.){2}(?!info|biz|tv)[a-z]+(?=[\s"<]|$)(?! [=):])(?!"\))



Breakdown:

(?<!http://)(?<!https://)(?<!www\.)        # Prevent http:// https:// and www.
(?![^\s<>]*[:@/])                          # Prevent : @ /
(?<![(=:] )(?<!\(")                        # Prevent '( ' '= ' etc
(?:(?<=[\s">])|^)                          # Ensure there's ' ', '>', '"' or beginning of line
(?:[a-z0-9-]+\.){2}(?!info|biz|tv)[a-z]+   # Main match, alphanumerics and -, 2 parts + tlds (excluding info, biz, tv)
(?=[\s"<]|$)                               # Ensure there's ' ', '<', '"' or end of line
(?! [=):])(?!"\))                          # Prevent ' )' ' =' etc
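To see how the pieces above behave together, here is a minimal sketch that runs the pattern over the inputs from the question. It makes two assumptions not stated in the answer: '~' is used as the delimiter so that '/' needs no escaping, and the m modifier is added so that ^ and $ match at each line. The helper name find_subdomains and the last two rejected lines are made up for illustration.

```php
<?php
// The pattern from the answer, split across lines only for readability.
// Assumptions of this sketch: '~' as delimiter and the 'm' modifier.
const SUBDOMAIN_PATTERN = '~(?<!http://)(?<!https://)(?<!www\.)(?![^\s<>]*[:@/])'
    . '(?<![(=:] )(?<!\(")(?:(?<=[\s">])|^)'
    . '(?:[a-z0-9-]+\.){2}(?!info|biz|tv)[a-z]+'
    . '(?=[\s"<]|$)(?! [=):])(?!"\))~m';

// Hypothetical helper: returns every full match found in $text.
function find_subdomains(string $text): array
{
    preg_match_all(SUBDOMAIN_PATTERN, $text, $matches);
    return $matches[0];
}

// The newline-separated list from the question.
$listed = implode("\n", [
    'sub84.example4.com',
    'sub-example.example84.net',
    'sub-84example.example-h1.org',
    'www-example4124.domain.com',
    'sub.example-www.com',
]);

// Lines that should produce no match under conditions 1 to 3.
$rejected = implode("\n", [
    'text = sub1.example.com text',
    'text sub3.example.com = text',
    'text ("sub5.example.com text',
    'text sub6.example.com") text',
    'text sub8.example.com : text',
    'Sub.example.Com',
    'text sub.example.info text',
]);

print_r(find_subdomains($listed));   // all five listed domains
print_r(find_subdomains($rejected)); // an empty array
```

Because every boundary check is a lookaround, nothing around the domain is consumed, so the full match in $matches[0] is the domain itself.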

Here's one (PHP's preg functions have no g modifier; preg_match_all already finds every match):

/[^a-z^-^0-9^@^\/^:^\.^-]((?<!w\.)(?!w+\.)([0-9a-z][0-9a-z\-]*\.)+[a-z]+)(["\s<]|$)/i

Look-aheads and look-behinds are expensive as regexes go. You can do it without either with this monstrosity:

(?:^[^0-9a-zA-Z]??|[^=(:][^0-9a-zA-Z\-]|=[^ 0-9a-zA-Z]|\([^ "0-9a-zA-Z]|:[^ 0-9a-zA-Z]|[^0-9a-zA-Z]-)((?:[0-9a-z][0-9a-z\-]*\.){2,}(?:[ac-hj-su-z][a-z]*|b[a-z]?|bi[a-y]|b[a-z]{3,}|i[a-z]{0,2}|inf[a-np-z]|i[a-z]{4,}|t|t[a-uw-z]|t[a-z]{2,}))(?:[^a-zA-Z "]| [^=:)]|"[^)]|[ "]?$)

This will enforce lowercase sub-domains per 1, reject all the pre- and post-conditions you list in 2, and disallow .biz, .info, and .tv per 3.

But you'll need to use it with the m modifier so that ^ and $ match at each line, not just at the beginning and end of the input. So you'd need to do something like:

preg_match_all('/(?:^[^0-9a-zA-Z]??|[^=(:][^0-9a-zA-Z\-]|=[^ 0-9a-zA-Z]|\([^ "0-9a-zA-Z]|:[^ 0-9a-zA-Z]|[^0-9a-zA-Z]-)((?:[0-9a-z][0-9a-z\-]*\.){2,}(?:[ac-hj-su-z][a-z]*|b[a-z]?|bi[a-y]|b[a-z]{3,}|i[a-z]{0,2}|inf[a-np-z]|i[a-z]{4,}|t|t[a-uw-z]|t[a-z]{2,}))(?:[^a-zA-Z "]| [^=:)]|"[^)]|[ "]?$)/m', $string, $foo);
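Since this pattern consumes the characters around the domain instead of looking around them, the domains end up in the capture group rather than in the full match. The following is a minimal sketch of reading the result; the sample lines are a mix of the question's inputs and one hypothetical line (sub9.example.org) added for illustration.

```php
<?php
// The pattern from the answer, split across lines only for readability.
$pattern = '/(?:^[^0-9a-zA-Z]??|[^=(:][^0-9a-zA-Z\-]|=[^ 0-9a-zA-Z]|\([^ "0-9a-zA-Z]|:[^ 0-9a-zA-Z]|[^0-9a-zA-Z]-)'
    . '((?:[0-9a-z][0-9a-z\-]*\.){2,}'
    . '(?:[ac-hj-su-z][a-z]*|b[a-z]?|bi[a-y]|b[a-z]{3,}|i[a-z]{0,2}|inf[a-np-z]|i[a-z]{4,}|t|t[a-uw-z]|t[a-z]{2,}))'
    . '(?:[^a-zA-Z "]| [^=:)]|"[^)]|[ "]?$)/m';

$string = implode("\n", [
    'sub84.example4.com',
    'text sub9.example.org text',    // hypothetical line, not from the question
    'text = sub1.example.com text',  // rejected: '= ' before the domain
    'text sub6.example.com") text',  // rejected: '")' after the domain
    'sub.example-www.com',
]);

preg_match_all($pattern, $string, $foo);

// $foo[0] holds the whole matches, including the consumed neighbours
// (for the second line that is 't sub9.example.org t').
// $foo[1] holds only the captured domains.
print_r($foo[1]);
```

With this input, $foo[1] contains sub84.example4.com, sub9.example.org and sub.example-www.com.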

Let me quickly explain the sections of the regex so this hopefully makes sense:

The first section is the alternative to a look-behind on what precedes the sub-domain; it rejects the pre-conditions you listed in 2: (?:^[^0-9a-zA-Z]??|[^=(:][^0-9a-zA-Z\-]|=[^ 0-9a-zA-Z]|\([^ "0-9a-zA-Z]|:[^ 0-9a-zA-Z]|[^0-9a-zA-Z]-) It is a non-capturing group that matches 1 of 6 options:

  1. The start of a line and possibly 1 non-alpha-numeric character
  2. A character other than '=', '(', or ':' followed by a non-alpha-numeric non-'-' character
  3. An '=' followed by a non-space and non-alpha-numeric character
  4. A '(' followed by a character other than a '"', space, or alpha-numeric
  5. A ':' followed by a non-space and non-alpha-numeric character
  6. A non-alpha-numeric character followed by a '-' character

The second section, which is inside the capture, is your original sub-domain match minus the suffix: (?:[0-9a-z][0-9a-z\-]*\.){2,}

The third section, which is also inside the capture, eliminates sub-domains with the "biz", "info", or "tv" suffixes: (?:[ac-hj-su-z][a-z]*|b[a-z]?|bi[a-y]|b[a-z]{3,}|i[a-z]{0,2}|inf[a-np-z]|i[a-z]{4,}|t|t[a-uw-z]|t[a-z]{2,}) It is a non-capturing group with 10 options:

  1. A character other than 'b', 'i', or 't' followed by any number of lowercase characters
  2. A 'b' optionally followed by one other lowercase character
  3. "bi" followed by a non-'z' character
  4. A 'b' followed by 3 or more lowercase characters
  5. An 'i' followed by 2 or fewer lowercase characters
  6. "inf" followed by a non-'o' character
  7. An 'i' followed by 4 or more lowercase characters
  8. A 't' character
  9. A 't' followed by a non-'v' character
  10. A 't' followed by 2 or more lowercase characters
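One way to sanity-check the ten options is to anchor the alternation on its own and feed it candidate suffixes. This is only a sketch: the helper name suffix_allowed and the choice of sample suffixes are assumptions made here, not part of the answer.

```php
<?php
// The suffix alternation from the pattern, anchored with ^ and $
// so that it has to match the whole candidate string.
$tld = '/^(?:[ac-hj-su-z][a-z]*|b[a-z]?|bi[a-y]|b[a-z]{3,}|i[a-z]{0,2}'
    . '|inf[a-np-z]|i[a-z]{4,}|t|t[a-uw-z]|t[a-z]{2,})$/';

// Hypothetical helper for this sketch.
function suffix_allowed(string $suffix, string $pattern): bool
{
    return preg_match($pattern, $suffix) === 1;
}

foreach (['com', 'net', 'org', 'io', 'info', 'biz', 'tv', 'bar'] as $s) {
    printf("%-5s %s\n", $s, suffix_allowed($s, $tld) ? 'allowed' : 'rejected');
}
```

com, net, org and io come out as allowed, and info, biz and tv as rejected. Note that bar is rejected as well: as written, no option covers a three-letter suffix starting with 'b' other than bi[a-y], and likewise no option covers a four-letter suffix starting with 'i' other than inf[a-np-z], so the alternation excludes somewhat more than the three suffixes named in 3.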

The final section rejects the post-conditions you listed in 2: (?:[^a-zA-Z "]| [^=:)]|"[^)]|[ "]?$) It is a non-capturing group with 4 options:

  1. A character that is not a space, a '"', or a letter (lowercase or uppercase)
  2. A space followed by a character other than '=', ':', or ')'
  3. A '"' followed by a non-')' character
  4. An optional space or '"' followed by the end of the line