Extraneous empty match in regular expression

I'm trying to create a regular expression for splitting a string, but unfortunately my requirements are a little more complex than a simple split, so I can't use for example preg_split() in PHP.

So what I'm doing is matching my delimiters (or rather, part of them) in a sub-expression, and everything preceding it in another sub-expression, and also treating the end of string as a delimiter for this purpose. With this in mind I came up with the following:

([^?;]*)(?|\?([0-9]*)|(;)|$)

As you can hopefully see, the first sub-pattern looks for a chunk of text without any question marks or semi-colons. After this I have a sub-pattern matching either a question mark with an optional number afterwards (which is stored), a semi-colon (which is stored), or the end of string.
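To make the structure described above easier to follow, here is a sketch of the same expression written with the x (extended) modifier so that each part can carry a comment. The layout and comments are illustrative additions; the pattern itself is meant to be equivalent to the compact one.

```php
<?php
// Same expression as above, spread over several lines with the x modifier.
// The pattern comments avoid the delimiter and quote characters on purpose.
$pattern = '/
    ([^?;]*)          # group 1: text up to the next delimiter (may be empty)
    (?|               # branch reset: each alternative below reuses group 2
        \?([0-9]*)    #   a question mark, optional digits stored in group 2
      | (;)           #   or a semi-colon, stored in group 2
      | $             #   or the end of the string, leaving group 2 unset
    )
/x';

$sql = 'CALL foo(?0, ?1, ?2, ?3)';
preg_match_all($pattern, $sql, $matches);
print_r($matches[1]); // the text chunks
print_r($matches[2]); // the stored delimiters (numbers or semi-colons)
```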

The problem is that I seem to be getting an extraneous, empty, match against the end of string case, like so:

$sql = 'CALL foo(?0, ?1, ?2, ?3)';
preg_match_all('/([^?;]*)(?|\?([0-9]*)|(;)|$)/', $sql, $matches);
print_r($matches);

Producing output that looks like:

Array
(
    [0] => Array
        (
            [0] => CALL foo(?0
            [1] => , ?1
            [2] => , ?2
            [3] => , ?3
            [4] => )
            [5] => 
        )

    [1] => Array
        (
            [0] => CALL foo(
            [1] => , 
            [2] => , 
            [3] => , 
            [4] => )
            [5] => 
        )

    [2] => Array
        (
            [0] => 0
            [1] => 1
            [2] => 2
            [3] => 3
            [4] => 
            [5] => 
        )

)

Note the empty match under $matches[0][5]; I would have expected the end-of-string case to be satisfied after matching the closing bracket, resulting in no further matching, yet it has gone on to produce another match and I can't figure out why.

So my question is: why is an extra match being produced here, and how do I prevent it?

NOTE: I've already considered requiring that the end-of-string case have at least one character before it, but this is no good, as I do in fact want an empty result if a wildcard is at the end of the string, since I'm trying to emulate the behaviour of a split function. For example, if the input was SELECT ?, I would expect to match SELECT ? plus an empty string. The idea here is that once I've handled any semi-colons matched, I can simply do implode('?', $matches[1]) to reproduce the statement with numeric wildcards.
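One possible workaround that keeps the original expression is to post-process the matches rather than change the pattern. The sketch below is an assumption-laden illustration, not a definitive fix: splitOnWildcards() is a hypothetical helper name, it relies on PREG_UNMATCHED_AS_NULL (PHP 7.2 or later) to tell "matched the end of string" apart from "matched a ? with no number", and it does not treat semi-colons specially.

```php
<?php
// Hypothetical helper (not part of the original code): collects the text
// chunks and stops at the first match that ended on the end-of-string branch.
function splitOnWildcards(string $sql): array
{
    $pattern = '/([^?;]*)(?|\?([0-9]*)|(;)|$)/';
    preg_match_all($pattern, $sql, $sets, PREG_SET_ORDER | PREG_UNMATCHED_AS_NULL);

    $chunks = [];
    foreach ($sets as $set) {
        $chunks[] = $set[1];
        // Group 2 is null (or absent) only when the "$" branch matched;
        // a "?" without digits yields an empty string instead.
        if (($set[2] ?? null) === null) {
            break; // any later match would be the extra empty one
        }
    }
    return $chunks;
}

echo implode('?', splitOnWildcards('CALL foo(?0, ?1, ?2, ?3)')), "\n"; // CALL foo(?, ?, ?, ?)
echo implode('?', splitOnWildcards('SELECT ?')), "\n";                 // SELECT ?
```

With this approach the trailing empty chunk is still kept for SELECT ?, because there the final match is the first one to hit the end of string.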

I believe I may have figured out an alternative to my specific case that will solve the problem; what I've done is flipped the expression around such that a delimiter is matched first or, failing that, the start of string, like so:

(?|\?([0-9]*)|(;)|^)([^?;]*)

This produces the expected results in all cases:

preg_match_all('/(?|\?([0-9]*)|(;)|^)([^?;]*)/', 'CALL foo(?3, ?2, ?1, ?0)', $matches);
print_r($matches);

Produces:

Array
(
    [0] => Array
        (
            [0] => CALL foo(
            [1] => ?3, 
            [2] => ?2, 
            [3] => ?1, 
            [4] => ?0)
        )
    [1] => Array
        (
            [0] => 
            [1] => 3
            [2] => 2
            [3] => 1
            [4] => 0
        )

    [2] => Array
        (
            [0] => CALL foo(
            [1] => , 
            [2] => , 
            [3] => , 
            [4] => )
        )
)

While:

preg_match_all('/(?|\?([0-9]*)|(;)|^)([^?;]*)/', 'SELECT ?', $matches);
print_r($matches);

Produces:

Array
(
    [0] => Array
        (
            [0] => SELECT 
            [1] => ?
        )
    [1] => Array
        (
            [0] => 
            [1] => 
        )
    [2] => Array
        (
            [0] => SELECT 
            [1] => 
        )
)

However, this only works because I know that the input will never include a delimiter as the first character; if I provide one, it encounters much the same problem, so I'm not sure whether to call it a true solution or not.
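The failing output for that case isn't shown above, so the following is only a sketch of what a leading delimiter appears to do to the flipped expression. The input '?0 foo' is a made-up example; assuming split-like behaviour is still the goal, an empty chunk would be expected before the first wildcard, but none is produced.

```php
<?php
// Flipped expression from above, applied to an input that starts with a delimiter.
$pattern = '/(?|\?([0-9]*)|(;)|^)([^?;]*)/';
preg_match_all($pattern, '?0 foo', $matches);

print_r($matches[2]);                  // only one chunk: " foo"
echo implode('?', $matches[2]), "\n";  // the leading wildcard is lost
```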

I'm also still interested to know why my original expression was getting an extra match. I would have expected greedy matching to make that impossible, since once the end of string has been matched there should be nothing left to find.
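To narrow this down, the offsets of each match can be inspected with PREG_OFFSET_CAPTURE. The sketch below only prints where each match starts and how long it is. As far as I can tell, after the match for the closing bracket the scan resumes at the very end of the string, and because every part of the expression is allowed to match nothing there, one more zero-length match is reported; the same thing seems to happen with a simple pattern such as /a*/ against 'aaa'.

```php
<?php
$sql = 'CALL foo(?0, ?1, ?2, ?3)';
preg_match_all('/([^?;]*)(?|\?([0-9]*)|(;)|$)/', $sql, $matches, PREG_OFFSET_CAPTURE);

// Each entry of $matches[0] is a [text, offset] pair.
foreach ($matches[0] as [$text, $offset]) {
    printf("offset %2d, length %2d: '%s'\n", $offset, strlen($text), $text);
}
echo 'string length: ', strlen($sql), "\n";
```

The last entry starts at an offset equal to the string length and has length zero, which is the extra match in question.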