I'm trying to create a regular expression for splitting a string, but unfortunately my requirements are a little more complex than a simple split, so I can't use, for example, preg_split() in PHP.
So what I'm doing is matching my delimiters (or rather, part of them) in a sub-expression, and everything preceding it in another sub-expression, and also treating the end of string as a delimiter for this purpose. With this in mind I came up with the following:
([^?;]*)(?|\?([0-9]*)|(;)|$)
As you can hopefully see, the first sub-pattern looks for a chunk of text without any question marks or semi-colons. After this I have a sub-pattern matching either a question mark with an optional number afterwards (which is stored), a semi-colon (which is stored), or the end of string.
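For readability, here is a sketch of the same expression spelled out with the x (extended) modifier, so each part can carry a comment. It is meant to behave identically to the one-line version; the sample input is the one used in this question.

```php
<?php
// Same pattern as above, written with the x modifier so that whitespace
// is ignored and each part can be commented.
$pattern = '/
    ([^?;]*)        # group 1: text containing no question mark or semi-colon
    (?|             # branch-reset group: every alternative reuses group 2
        \?([0-9]*)  # a question mark plus optional digits (digits captured)
      | (;)         # or a semi-colon (captured)
      | $           # or the end of the string (nothing captured)
    )
/x';

preg_match_all($pattern, 'CALL foo(?0, ?1, ?2, ?3)', $matches);
print_r($matches[2]); // the captured numbers, followed by two empty entries
```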
The problem is that I seem to be getting an extraneous, empty, match against the end of string case, like so:
$sql = 'CALL foo(?0, ?1, ?2, ?3)';
preg_match_all('/([^?;]*)(?|\?([0-9]*)|(;)|$)/', $sql, $matches);
print_r($matches);
Producing output that looks like:
Array
(
[0] => Array
(
[0] => CALL foo(?0
[1] => , ?1
[2] => , ?2
[3] => , ?3
[4] => )
[5] =>
)
[1] => Array
(
[0] => CALL foo(
[1] => ,
[2] => ,
[3] => ,
[4] => )
[5] =>
)
[2] => Array
(
[0] => 0
[1] => 1
[2] => 2
[3] => 3
[4] =>
[5] =>
)
)
Note the empty match under $matches[0][5]; I would have expected the end of string case to be satisfied after matching the bracket, resulting in no further matching, yet it's gone on to produce another match and I can't figure out why.
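To pin down where each match sits, the following sketch asks preg_match_all() for offsets as well. The PREG_SET_ORDER and PREG_OFFSET_CAPTURE flags are standard; the $report variable is just a name chosen for this illustration. It shows that the final match is zero-width and starts at the very end of the string, right after the match that consumed the closing bracket.

```php
<?php
$sql = 'CALL foo(?0, ?1, ?2, ?3)';
$pattern = '/([^?;]*)(?|\?([0-9]*)|(;)|$)/';

// With PREG_OFFSET_CAPTURE every capture becomes [text, offset].
preg_match_all($pattern, $sql, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

$report = [];
foreach ($matches as $m) {
    [$text, $offset] = $m[0]; // $m[0] is the whole match
    $report[] = sprintf('offset %d, length %d', $offset, strlen($text));
}
echo implode("\n", $report), "\n";
// offset 0, length 11
// offset 11, length 4
// offset 15, length 4
// offset 19, length 4
// offset 23, length 1
// offset 24, length 0   <- the extra, empty match at the end of the string
```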
So my question is: why is an extra match being produced here, and how do I prevent it?
NOTE: I've already considered requiring that the end of string case have at least one character before it, but this is no good as I do in fact want an empty result if a wildcard is at the end of the string, as I'm trying to emulate the behaviour of a split function. For example, if the input was SELECT ?, I would expect to match SELECT ? plus an empty string. The idea here is that once I've handled any semi-colons matched, I can simply do implode('?', $matches[1]) to reproduce the statement with numeric wildcards.
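A small sketch of that implode() idea shows why the extra match matters. The rebuild() helper is hypothetical (it is not part of any library), and it assumes the input contains no semi-colons, since those would need separate handling first.

```php
<?php
$pattern = '/([^?;]*)(?|\?([0-9]*)|(;)|$)/';

// Hypothetical helper: glue the text chunks (group 1) back together
// with a plain "?" between them. Assumes no semi-colons in $sql.
function rebuild(string $pattern, string $sql): string
{
    preg_match_all($pattern, $sql, $matches);
    return implode('?', $matches[1]);
}

// Trailing wildcard: the empty final chunk is exactly what is wanted.
$trailing = rebuild($pattern, 'SELECT ?');
var_dump($trailing); // string(8) "SELECT ?"

// No trailing wildcard: the extra empty match adds a stray "?" at the end.
$noTrailing = rebuild($pattern, 'CALL foo(?0, ?1, ?2, ?3)');
var_dump($noTrailing); // string(21) "CALL foo(?, ?, ?, ?)?"
```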
I believe I may have figured out an alternative for my specific case that will solve the problem; what I've done is flip the expression around so that a delimiter is matched first or, failing that, the start of string, like so:
(?|\?([0-9]*)|(;)|^)([^?;]*)
This produces the expected results in all cases:
preg_match_all('/(?|\?([0-9]*)|(;)|^)([^?;]*)/', 'CALL foo(?3, ?2, ?1, ?0)', $matches);
print_r($matches);
Produces:
Array
(
[0] => Array
(
[0] => CALL foo(
[1] => ?3,
[2] => ?2,
[3] => ?1,
[4] => ?0)
)
[1] => Array
(
[0] =>
[1] => 3
[2] => 2
[3] => 1
[4] => 0
)
[2] => Array
(
[0] => CALL foo(
[1] => ,
[2] => ,
[3] => ,
[4] => )
)
)
While:
preg_match_all('/(?|\?([0-9]*)|(;)|^)([^?;]*)/', 'SELECT ?', $matches);
print_r($matches);
Produces:
Array
(
[0] => Array
(
[0] => SELECT
[1] => ?
)
[1] => Array
(
[0] =>
[1] =>
)
[2] => Array
(
[0] => SELECT
[1] =>
)
)
However, this only works because I know that input will never include a delimiter as the first character; if I provide one it encounters much the same problem so I'm not sure whether to call it a true solution or not.
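As an illustration of that limitation, here is a sketch with a made-up input that starts with a wildcard. The input string and variable names are assumptions for this example only. A split would be expected to yield an empty first chunk here, but the flipped expression consumes the leading delimiter in its first match, so that chunk never appears and the rebuilt string loses its first wildcard.

```php
<?php
$flipped = '/(?|\?([0-9]*)|(;)|^)([^?;]*)/';

// Made-up input with a delimiter as the very first character.
preg_match_all($flipped, '?0 AND ?1', $matches);

$chunks = $matches[2];            // text chunks are now in group 2
$rebuilt = implode('?', $chunks);

var_dump(count($chunks)); // int(2), where a split would give 3 pieces
var_dump($rebuilt);       // string(6) " AND ?", the leading "?" is lost
```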
I'm also still interested to know why my original expression was getting an extra match, as I would have expected greedy matching to mean that it was impossible, as once the end of string is matched there should be nothing left to find.