Update:
I'm using this regex:
/[^a-z^-^0-9^@^%^\/^:^\.^-]((?<!w\.)(?!w+\.)([0-9a-z][0-9a-z\-]*\.){2,}[a-z]+)(["|\s|<]|$)/i
The regex has a small problem when the string contains only domains and nothing else, listed one per line.
For example:
$string = 'sub84.example4.com
sub-example.example84.net
sub-84example.example-h1.org
www-example4124.domain.com
sub.example-www.com';
All domains should be matched, but the current regex only matches sub-example.example84.net and www-example4124.domain.com.
I am also looking to add some more conditions:
1) Letters in the domain must be lowercase, not capital (the current regex doesn't care about that). EX: Sub.example.Com is not OK.
2) No =SPACE, (SPACE, (" or :SPACE before the domain, and no SPACE), SPACE=, ") or SPACE: after it.
EX:
$string = '
text = sub1.example.com text
text ( sub2.example.com text
text sub3.example.com = text
text sub4.example.com ) text
text ("sub5.example.com text
text sub6.example.com") text
text : sub7.example.com text
text sub8.example.com : text
';
None of them should match.
3) Exclude the .info, .biz and .tv TLDs.
Thank you.
I don't quite understand the purpose of this or why there are so many conditions, but you might try this (a rather long one):
(?<!http://)(?<!https://)(?<!www\.)(?![^\s<>]*[:@/])(?<![(=:] )(?<!\(")(?:(?<=[\s">])|^)(?:[a-z0-9-]+\.){2}(?!info|biz|tv)[a-z]+(?=[\s"<]|$)(?! [=):])(?!"\))
Breakdown:
(?<!http://)(?<!https://)(?<!www\.) # Prevent http:// https:// and www.
(?![^\s<>]*[:@/]) # Prevent : @ /
(?<![(=:] )(?<!\(") # Prevent '( ' '= ' etc
(?:(?<=[\s">])|^) # Ensure there's ' ', '>', '"' or beginning of line
(?:[a-z0-9-]+\.){2}(?!info|biz|tv)[a-z]+ # Main match: alphanumerics and -, 2 parts + TLD (excluding info, biz, tv)
(?=[\s"<]|$) # Ensure there's ' ', '<', '"' or end of line
(?! [=):])(?!"\)) # Prevent ' )' ' =' etc
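As a rough check, the pattern above can be run over the sample strings from the question. The sketch below is only an illustration: the helper name find_subdomains is hypothetical, and "~" is assumed as the delimiter because the pattern contains "/". One plausible reason the original pattern skipped every other line is that its leading and trailing character classes each consume a character, so the newline between two domains can serve only one match; the look-arounds here consume nothing.

```php
<?php
// Hypothetical helper (not part of the answer): returns every sub-domain
// that the look-around pattern accepts in $text.
function find_subdomains(string $text): array
{
    // "~" is the delimiter because the pattern itself contains "/".
    $pattern = '~(?<!http://)(?<!https://)(?<!www\.)(?![^\s<>]*[:@/])(?<![(=:] )(?<!\(")(?:(?<=[\s">])|^)(?:[a-z0-9-]+\.){2}(?!info|biz|tv)[a-z]+(?=[\s"<]|$)(?! [=):])(?!"\))~';
    preg_match_all($pattern, $text, $matches);
    return $matches[0]; // whole-pattern matches
}

// First sample from the question: only domains, one per line.
$onlyDomains = implode("\n", [
    'sub84.example4.com',
    'sub-example.example84.net',
    'sub-84example.example-h1.org',
    'www-example4124.domain.com',
    'sub.example-www.com',
]);
print_r(find_subdomains($onlyDomains));

// Second sample from the question: none of these lines should match.
$rejected = implode("\n", [
    'text = sub1.example.com text',
    'text ( sub2.example.com text',
    'text sub3.example.com = text',
    'text sub4.example.com ) text',
    'text ("sub5.example.com text',
    'text sub6.example.com") text',
    'text : sub7.example.com text',
    'text sub8.example.com : text',
]);
var_dump(find_subdomains($rejected) === []);
```

With these inputs the first call should list all five domains and the second check should print bool(true).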
Here's one:
/[^a-z^-^0-9^@^\/^:^\.^-]((?<!w\.)(?!w+\.)([0-9a-z][0-9a-z\-]*\.)+[a-z]+)(["\s<]|$)/ig
Look-aheads and look-behinds are expensive as regexes go. You can do it without either with this monstrosity:
(?:^[^0-9a-zA-Z]??|[^=(:][^0-9a-zA-Z\-]|=[^ 0-9a-zA-Z]|\([^ "0-9a-zA-Z]|:[^ 0-9a-zA-Z]|[^0-9a-zA-Z]-)((?:[0-9a-z][0-9a-z\-]*\.){2,}(?:[ac-hj-su-z][a-z]*|b[a-z]?|bi[a-y]|b[a-z]{3,}|i[a-z]{0,2}|inf[a-np-z]|i[a-z]{4,}|t|t[a-uw-z]|t[a-z]{2,}))(?:[^a-zA-Z "]| [^=:)]|"[^)]|[ "]?$)
This will enforce lowercase sub-domains per 1, reject all the pre- and post-conditions you list in 2, and disallow .biz, .info, and .tv per 3.
But you'll need to use it with the m modifier so that ^ and $ can match per line, not just at the beginning and end of the input. So you'd need to do something like:
preg_match_all('/(?:^[^0-9a-zA-Z]??|[^=(:][^0-9a-zA-Z\-]|=[^ 0-9a-zA-Z]|\([^ "0-9a-zA-Z]|:[^ 0-9a-zA-Z]|[^0-9a-zA-Z]-)((?:[0-9a-z][0-9a-z\-]*\.){2,}(?:[ac-hj-su-z][a-z]*|b[a-z]?|bi[a-y]|b[a-z]{3,}|i[a-z]{0,2}|inf[a-np-z]|i[a-z]{4,}|t|t[a-uw-z]|t[a-z]{2,}))(?:[^a-zA-Z "]| [^=:)]|"[^)]|[ "]?$)/m', $string, $foo);
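A minimal sketch of how that call might be wrapped and checked against the question's samples follows. The function name match_subdomains is an assumption made for illustration; the pattern is the one above, and the sub-domains are read from capture group 1 because the full match also includes the surrounding context characters.

```php
<?php
// Hypothetical wrapper (not part of the answer) around the call above.
function match_subdomains(string $text): array
{
    $pattern = '/(?:^[^0-9a-zA-Z]??|[^=(:][^0-9a-zA-Z\-]|=[^ 0-9a-zA-Z]|\([^ "0-9a-zA-Z]|:[^ 0-9a-zA-Z]|[^0-9a-zA-Z]-)((?:[0-9a-z][0-9a-z\-]*\.){2,}(?:[ac-hj-su-z][a-z]*|b[a-z]?|bi[a-y]|b[a-z]{3,}|i[a-z]{0,2}|inf[a-np-z]|i[a-z]{4,}|t|t[a-uw-z]|t[a-z]{2,}))(?:[^a-zA-Z "]| [^=:)]|"[^)]|[ "]?$)/m';
    preg_match_all($pattern, $text, $foo);
    // Group 0 holds the match plus its leading/trailing context;
    // group 1 holds only the sub-domain.
    return $foo[1];
}

// Domains listed one per line, as in the question.
$string = "sub84.example4.com\nsub-example.example84.net\nsub-84example.example-h1.org\nwww-example4124.domain.com\nsub.example-www.com";
print_r(match_subdomains($string));

// One of the lines that must be rejected under condition 2.
var_dump(match_subdomains('text sub3.example.com = text') === []);
```

With the m modifier each line start satisfies ^, so all five domains should be returned even though each match consumes the newline that follows it.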
Let me quickly explain the sections of the regex so this hopefully makes sense:
The first section is the alternative to a look-behind at what comes before the sub-domain; it prevents the pre-conditions you listed in 2: (?:^[^0-9a-zA-Z]??|[^=(:][^0-9a-zA-Z\-]|=[^ 0-9a-zA-Z]|\([^ "0-9a-zA-Z]|:[^ 0-9a-zA-Z]|[^0-9a-zA-Z]-)
It is a non-capturing group which will match 1 of 6 options:
- the start of a line, optionally followed by a non-alphanumeric character
- a character other than '=', '(', or ':' followed by a non-alphanumeric, non-'-' character
- '=' followed by a non-space, non-alphanumeric character
- '(' followed by a character other than '"', space, or an alphanumeric
- ':' followed by a non-space, non-alphanumeric character
- a non-alphanumeric character followed by a '-' character
The second section, which is inside the capture, is your original sub-domain match minus the suffix: (?:[0-9a-z][0-9a-z\-]*\.){2,}
The third section, which is also inside the capture, eliminates sub-domains with the "biz", "info", or "tv" suffixes: (?:[ac-hj-su-z][a-z]*|b[a-z]?|bi[a-y]|b[a-z]{3,}|i[a-z]{0,2}|inf[a-np-z]|i[a-z]{4,}|t|t[a-uw-z]|t[a-z]{2,})
It is a non-capturing group with 10 options:
- a lowercase character other than 'b', 'i', or 't' followed by any number of lowercase characters
- 'b' optionally followed by one lowercase character
- "bi" followed by a non-'z' character
- 'b' followed by 3 or more lowercase characters
- 'i' followed by 2 or fewer lowercase characters
- "inf" followed by a non-'o' character
- 'i' followed by 4 or more lowercase characters
- a single 't' character
- 't' followed by a non-'v' character
- 't' followed by 2 or more lowercase characters
The final section prevents the post-conditions you listed in 2: (?:[^a-zA-Z "]| [^=:)]|"[^)]|[ "]?$)
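It is a non-capturing group with 4 options:
- any character other than a space, '"', lowercase, or uppercase character
- a space followed by a character other than '=', ':', or ')'
- '"' followed by a non-')' character
- an optional space or '"' followed by the end of line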
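The suffix alternation from the third section can also be tried on its own. This is only a sketch: the helper name is_allowed_suffix is hypothetical, and the ^ and $ anchors are added here purely so the alternation has to cover the whole string (in the full pattern the final section plays that role).

```php
<?php
// Hypothetical helper: true when $suffix is accepted by the alternation
// from the third section. The ^...$ anchors are an addition for testing.
function is_allowed_suffix(string $suffix): bool
{
    $pattern = '/^(?:[ac-hj-su-z][a-z]*|b[a-z]?|bi[a-y]|b[a-z]{3,}|i[a-z]{0,2}|inf[a-np-z]|i[a-z]{4,}|t|t[a-uw-z]|t[a-z]{2,})$/';
    return preg_match($pattern, $suffix) === 1;
}

// Print one line per suffix: "<suffix> => allowed" or "<suffix> => rejected".
foreach (['com', 'net', 'org', 'io', 'bio', 'travel', 'biz', 'info', 'tv', 'bar'] as $suffix) {
    echo $suffix, ' => ', is_allowed_suffix($suffix) ? 'allowed' : 'rejected', "\n";
}
```

"biz", "info" and "tv" should come out as rejected, as intended. As far as I can tell the alternation as written also rejects some other suffixes, for example three-letter ones starting with 'b' but not "bi" (such as "bar"), because no branch covers exactly 'b' plus two letters; if that matters, the 'b' and 'i' branches would need adjusting.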