This is my current regex (used in parsing an iCal file):
/(.*?)(?:;(?=(?:[^"]*"[^"]*")*[^"]*$))([\w\W]*)/
The current output using preg_match() is this:
//Output 1 - `preg_match()`
Array
(
[0] => TZID="Greenwich Mean Time:Dublin; Edinburgh; Lisbon; London"
[1] => VALUE=DATE;RSVP=FALSE;LANGUAGE=en-gb
)
I would like to extend my regex to output this (i.e. find multiple matches):
//Output 2
Array
(
[0] => TZID="Greenwich Mean Time:Dublin; Edinburgh; Lisbon; London"
[1] => VALUE=DATE
[2] => RSVP=FALSE
[3] => LANGUAGE=en-gb
)
The regex should treat each semicolon that is not inside a quoted substring as a separator and return each piece as a separate match.
I cannot just swap to preg_match_all(), as that gives this unwanted output:
//Output 3 - `preg_match_all()`
Array
(
[0] => Array
(
[0] => TZID="Greenwich Mean Time:Dublin; Edinburgh; Lisbon; London";VALUE=DATE;RSVP=FALSE;LANGUAGE=en-gb
)
[1] => Array
(
[0] => TZID="Greenwich Mean Time:Dublin; Edinburgh; Lisbon; London"
)
[2] => Array
(
[0] => VALUE=DATE;RSVP=FALSE;LANGUAGE=en-gb
)
)
You need to use preg_match_all to get all the matches in the string.
The pattern you use isn't designed to return several results, since [\w\W]* matches everything until the end of the string.
But that is only one of your problems. A pattern designed like this needs to check, for each semicolon, whether the number of quotes up to the end of the string is odd or even: (?=(?:[^"]*"[^"]*")*[^"]*$). Imagine how many times the whole string is parsed with this lookahead.
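For comparison, here is a minimal sketch of how the lookahead from the question could be made to return every piece, by using it as a delimiter for preg_split() instead of as part of a match. The sample string is the one from the question; the variable names are only illustrative. It gives the wanted pieces, but the lookahead still rescans the rest of the string at every semicolon, which is the cost described above.

```php
<?php
// Sample property-parameter string taken from the question.
$str = 'TZID="Greenwich Mean Time:Dublin; Edinburgh; Lisbon; London";VALUE=DATE;RSVP=FALSE;LANGUAGE=en-gb';

// Split on each semicolon that is followed by an even number of quotes,
// i.e. a semicolon that is not inside a quoted substring.
// The lookahead walks to the end of the string for every semicolon it tests.
$parts = preg_split('/;(?=(?:[^"]*"[^"]*")*[^"]*$)/', $str);

print_r($parts);
```

This is fine for short lines, but the repeated scanning grows quickly on long input.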
To avoid the problem, you can use a different approach that doesn't try to find the semicolons, but instead describes everything that is not a semicolon. So you are looking for every run of text that contains no quotes or semicolons, plus any quoted parts, whatever their content.
You can use this kind of pattern:
$pattern = '~[^\n";]+(?:"[^"\\\]*(?:\\\.[^"\\\]*)*"[^\n";]*)*~';
if (preg_match_all($pattern, $str, $matches)) {
    print_r($matches[0]);
}
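To try this end to end, the following self-contained sketch runs the pattern against the string from the question. It assumes the character class excludes a newline written as \n, and $str is simply the sample line hard-coded for the test.

```php
<?php
// Sample property-parameter string taken from the question.
$str = 'TZID="Greenwich Mean Time:Dublin; Edinburgh; Lisbon; London";VALUE=DATE;RSVP=FALSE;LANGUAGE=en-gb';

// Unquoted text, optionally followed by quoted parts and more unquoted text.
// In a single-quoted PHP string, \\\] reaches the regex engine as \\] .
$pattern = '~[^\n";]+(?:"[^"\\\]*(?:\\\.[^"\\\]*)*"[^\n";]*)*~';

if (preg_match_all($pattern, $str, $matches)) {
    // $matches[0] holds the full matches, one per parameter.
    print_r($matches[0]);
}
```

With this input, $matches[0] should contain the four entries listed in Output 2.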
pattern details:
~                      # pattern delimiter
[^\n";]+               # everything that is not a newline, a double quote or a semicolon
(?:                    # non-capturing group: to include possible quoted parts
    "                  # a literal quote
    [^"\\\]*           # everything that is not a quote or a backslash
    (?:\\\.[^"\\\]*)*  # optional group to deal with escaped characters
    "                  # the closing quote
    [^\n";]*           # trailing text that is not a newline, a double quote or a semicolon
)*                     # repeat zero or more times
~
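The annotated layout can also be kept in real code by adding the x (extended) modifier, which makes PCRE ignore whitespace and # comments outside character classes. The sketch below is one way to write it. The second test string, with an escaped quote, is a made-up input used only to exercise the escaped-character group.

```php
<?php
$pattern = '~
    [^\n";]+                # text that is not a newline, a double quote or a semicolon
    (?:                     # optional quoted part followed by more plain text
        "                   # opening quote
        [^"\\\]*            # anything except a quote or a backslash
        (?:\\\.[^"\\\]*)*   # a backslash-escaped character, then more ordinary text
        "                   # closing quote
        [^\n";]*            # plain text after the quoted part
    )*                      # repeat zero or more times
~x';

// Hypothetical input: a quoted value containing an escaped quote and a semicolon.
$str = 'X="a\"b;c";Y=1';

preg_match_all($pattern, $str, $matches);
print_r($matches[0]);
```

Here the semicolon inside the quoted value stays part of the first match, so the result has two entries: X="a\"b;c" and Y=1. Note that the comments inside the pattern must not contain the delimiter (~) or a single quote, since either would end the pattern or the PHP string early.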