Irregular RegEx behavior

I have a string:

$day = "11.08.2012 PROC BRE-AMS 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) Y AMS-AMS 13:15-19:15";

And I have a regular expression:

$data = preg_split("/(?=[A-Z]{1,4}[\s]+[A-Z]{3}[\-][A-Z]{3}[\s]+)/", $day);

The expected $data-Array should be:

array
      0 => string '11.08.2012 ' (length=11)
      1 => string 'PROC BRE-AMS 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) ' (length=62)
      2 => string 'Y AMS-AMS 13:15-19:15' (length=21)

But my result is:

array
      0 => string '11.08.2012 ' (length=11)
      1 => string 'P' (length=1)
      2 => string 'R' (length=1)
      3 => string 'O' (length=1)
      4 => string 'C BRE-AMS 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) ' (length=59)
      5 => string 'Y AMS-AMS 13:15-19:15' (length=21)

I cannot retrace what's happening here. Could someone please explain?

In short, the problem is that the (?=...) subexpression in your pattern matches a position, not characters. I understand that was exactly your intention; the catch is that a lookahead consumes nothing, so the next match attempt does not start where the text described inside (?=...) ends. It starts just one character past the position that was matched.

Let's walk through the process in detail. On the first attempt, the engine walks the string until it reaches the position marked by the asterisk:

11.08.2012 *PROC BRE-AMS 08:00-12:00

... where it can match the given pattern. For the next attempt, the starting position 'bumps along' by one character, so now we're here:

11.08.2012 P*ROC BRE-AMS 08:00-12:00

... and voilà, the pattern matches again, because the {1,4} quantifier is just as happy with 'ROC' as it was with 'PROC'. The same thing happens at 'OC' and at 'C', and that's how you got those 'irregular' P, R and O pieces.
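To see the 'bump along' effect for yourself, here is a small sketch that reruns the original pattern with PREG_SPLIT_OFFSET_CAPTURE, so each piece is reported together with the offset where it starts. The variable names $pieces and $offsets are mine, not from the question.

```php
<?php
$day = "11.08.2012 PROC BRE-AMS 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) Y AMS-AMS 13:15-19:15";

// The pattern from the question, unchanged.
$pattern = '/(?=[A-Z]{1,4}[\s]+[A-Z]{3}[\-][A-Z]{3}[\s]+)/';

// With PREG_SPLIT_OFFSET_CAPTURE every element is a [piece, offset] pair.
$pieces = preg_split($pattern, $day, -1, PREG_SPLIT_OFFSET_CAPTURE);

// Keep only the offsets at which the pieces start.
$offsets = array_column($pieces, 1);

foreach ($pieces as [$text, $offset]) {
    printf("%2d: '%s'\n", $offset, $text);
}
```

On the sample string the pieces start at offsets 0, 11, 12, 13, 14 and 73: four consecutive split points inside 'PROC', then the one in front of 'Y'.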


So much for the explanation; now for the "how to fix" part. The easiest way out, I suppose, is adding this little twist to your split pattern:

$data = preg_split('/\b(?=[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)/', $day);

We still match a position, but now it has to be one that separates a 'word' character from a non-word one. The same idea can be expressed with a negative lookbehind:

$data = preg_split('/(?<![A-Z])(?=[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)/', $day);

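... which is actually more precise (\b also counts digits and underscores as word characters, while the lookbehind only cares about A-Z), but less elegant, I suppose.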
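Here is a rough sketch comparing the two fixes. On the string from the question they should give the same three pieces; they only drift apart when the letter code is glued to a digit. The second input, $glued, is a made-up example for illustration and not something from the question.

```php
<?php
$boundary   = '/\b(?=[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)/';
$lookbehind = '/(?<![A-Z])(?=[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)/';

$day = "11.08.2012 PROC BRE-AMS 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) Y AMS-AMS 13:15-19:15";

// Both variants split only in front of 'PROC' and in front of 'Y'.
$fromBoundary   = preg_split($boundary, $day);
$fromLookbehind = preg_split($lookbehind, $day);
var_dump($fromBoundary === $fromLookbehind);

// Hypothetical input: the code 'Y' directly follows a digit.
$glued = "12:00Y AMS-AMS 13:15-19:15";

// '0' and 'Y' are both word characters, so there is no \b between them
// and the string comes back in one piece.
$gluedBoundary = preg_split($boundary, $glued);

// '0' is not in A-Z, so the lookbehind allows a split in front of 'Y'.
$gluedLookbehind = preg_split($lookbehind, $glued);

var_dump($gluedBoundary, $gluedLookbehind);
```

Which behavior you want depends on your data; if the codes are always preceded by a space, either pattern will do.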

Two sidenotes here: 1) don't use character class syntax when you need to specify a single symbol, whether it's a literal one like - or a 'shortcut' one like \s; 2) use single quotation marks to delimit your pattern unless you want to interpolate some variables in it.
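As a quick check of the first sidenote, this sketch splits the same string with the pattern exactly as written in the question and with the slimmed-down version. The names $verbose and $simple are just labels I picked.

```php
<?php
$day = "11.08.2012 PROC BRE-AMS 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) Y AMS-AMS 13:15-19:15";

// As in the question: double quotes, single symbols wrapped in [ ].
$verbose = "/(?=[A-Z]{1,4}[\s]+[A-Z]{3}[\-][A-Z]{3}[\s]+)/";

// Same pattern: single quotes, no one-symbol character classes.
$simple = '/(?=[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)/';

$fromVerbose = preg_split($verbose, $day);
$fromSimple  = preg_split($simple, $day);

// Both patterns describe the same thing, so the results are identical.
var_dump($fromVerbose === $fromSimple);
```

The double-quoted version happens to work here because it contains no $ and no escape sequence that PHP itself would interpret; single quotes spare you from having to check that.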

A hyphen is a metacharacter in a character class: between two characters it defines a range. If you want a literal hyphen in a character class, you have to backslash-escape it or put it first or last in the class (in this specific case it works either way, since your character class contains nothing but a hyphen).

If you need to keep the matched text in the resulting pieces, put a word boundary at the start of the lookahead, so that the pattern can only match at the first letter of the 1-4 letter sequence:
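A minimal sketch of how a hyphen behaves inside a character class, using throwaway patterns that are not part of the original question:

```php
<?php
// Between two characters the hyphen forms a range: [a-c] means a, b or c,
// so a literal '-' does not match.
$asRange = preg_match('/^[a-c]$/', '-');

// Escaped, the hyphen is a literal member of the class.
$escaped = preg_match('/^[a\-c]$/', '-');

// Placed first (or last) in the class, it is literal as well.
$leading = preg_match('/^[-ac]$/', '-');

// On its own, as in the question's [\-], it can only be literal.
$alone = preg_match('/^[\-]$/', '-');

var_dump($asRange, $escaped, $leading, $alone);
```

preg_match returns 1 when the pattern matches and 0 when it does not, so only the first call should report 0.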

$data = preg_split('/(?=\b[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)/', $day);