I have a string:
$day = "11.08.2012 PROC BRE-AMS 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) Y AMS-AMS 13:15-19:15";
And I have a regular expression:
$data = preg_split("/(?=[A-Z]{1,4}[\s]+[A-Z]{3}[\-][A-Z]{3}[\s]+)/", $day);
The expected $data array should be:
array
0 => string '11.08.2012 ' (length=11)
1 => string 'PROC 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) ' (length=22)
2 => string 'Y AMS-AMS 13:15-19:15' (length=21)
But my result is:
0 => string '11.08.2012 ' (length=11)
1 => string 'P' (length=1)
2 => string 'R' (length=1)
3 => string 'O' (length=1)
4 => string 'C BRE-AMS 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) ' (length=59)
5 => string 'Y AMS-AMS 13:15-19:15' (length=21)
I cannot work out what's happening here. Could someone please explain?
In short, the problem is that the (?=...) subexpression in your pattern matches a position, not characters. I understand that was exactly your intention; the catch is that the next match attempt does not start where the pattern inside (?=...) would have ended, but one symbol after the position matched by the lookahead.
Let's check this process in detail. The first time a split is attempted, the engine walks the string until it gets to the position marked by the asterisk:
11.08.2012 *PROC BRE-AMS 08:00-12:00
... where it can match the pattern given. For the next attempt, the starting position 'bumps along' one symbol, so now we're here:
11.08.2012 P*ROC BRE-AMS 08:00-12:00
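... and voila, we can match this pattern again, because of that {1,4} quantifier: [A-Z]{1,4} is just as happy with ROC as it was with PROC. The same thing happens before O and before C. That's how you got these 'irregular' P, R and O pieces.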
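To see the bump-along directly, one option is to list every offset at which the lookahead succeeds. The following is only a sketch: it reuses the original pattern with preg_match_all and PREG_OFFSET_CAPTURE, and the names $pattern and $offsets are made up for the illustration.

```php
<?php
$day = "11.08.2012 PROC BRE-AMS 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) Y AMS-AMS 13:15-19:15";

// The original lookahead, unchanged apart from the single quotes.
$pattern = '/(?=[A-Z]{1,4}[\s]+[A-Z]{3}[\-][A-Z]{3}[\s]+)/';

// Every match is an empty string; only the offsets are interesting.
preg_match_all($pattern, $day, $matches, PREG_OFFSET_CAPTURE);
$offsets = array_column($matches[0], 1);

// Offsets 11, 12, 13 and 14 sit before P, R, O and C; offset 73 sits before Y.
print_r($offsets);
```

Four of the five positions fall inside PROC, which is exactly where the unwanted one-letter pieces come from.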
So much for the explanation; now for the "how to fix" part. The easiest way out, I suppose, is adding this little twist to your split pattern:
$data = preg_split('/\b(?=[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)/', $day);
We still match a position, but now it has to be a word boundary (\b), i.e. one that separates a 'word' symbol from a non-word one. The same idea can be expressed with a negative lookbehind:
$data = preg_split('/(?<![A-Z])(?=[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)/', $day);
... which is actually more precise, but less elegant, I suppose. :)
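As a quick check, here is a small sketch that runs both variants against the sample string and compares the results. The variable names $withBoundary and $withLookbehind are just for illustration.

```php
<?php
$day = "11.08.2012 PROC BRE-AMS 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) Y AMS-AMS 13:15-19:15";

// Variant 1: the split position must also be a word boundary.
$withBoundary = preg_split('/\b(?=[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)/', $day);

// Variant 2: the split position must not be preceded by an uppercase letter.
$withLookbehind = preg_split('/(?<![A-Z])(?=[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)/', $day);

var_dump($withBoundary === $withLookbehind); // bool(true) for this input
print_r($withBoundary);
```

For this input both calls return three pieces: the date, the PROC BRE-AMS block and the Y AMS-AMS block. The two patterns can differ on other inputs, for example when a digit or underscore comes directly before the letters, since \b treats those as word symbols while the lookbehind only looks for A-Z.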
Two sidenotes here: 1) don't use character class syntax when you need to specify a single symbol (a simple one, like -, or a 'shortcut' one, like \s); 2) use single quotation marks to delimit your pattern unless you want to interpolate some variables in it.
A hyphen is a metacharacter in a character class: between two characters it denotes a range. If you want a literal hyphen in a character class, escape it with a backslash or put it first or last in the class (in this specific case it works anyway, since your character class contains nothing but the hyphen).
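A minimal illustration of the difference, using made-up one-character test strings rather than anything from the question:

```php
<?php
// Unescaped hyphen between two characters: a range from a to c, so 'b' matches.
$range = preg_match('/^[a-c]$/', 'b');

// Escaped hyphen: the class holds only 'a', '-' and 'c', so 'b' no longer matches...
$escapedVsB = preg_match('/^[a\-c]$/', 'b');
// ...but a literal hyphen does.
$escapedVsHyphen = preg_match('/^[a\-c]$/', '-');

// A hyphen placed last in the class is also taken literally.
$trailingHyphen = preg_match('/^[ac-]$/', '-');

var_dump($range, $escapedVsB, $escapedVsHyphen, $trailingHyphen); // int(1) int(0) int(1) int(1)
```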
If you need to include the split string, anchor the start of the lookahead to a word boundary, so that only the first letter of the first 1-4 character sequence is tested:
'/(?=\b[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)/'