This appears to be strange behavior, or perhaps I don't understand regular expressions so well...
I'm using this to find all the xref and trailer objects in a PDF file:
preg_match_all('@(
xref?
)|(\strailer\s)@',$pdfcontent,$matches,PREG_OFFSET_CAPTURE);
print_r gives me this:
Array
(
[0] => Array
(
[0] => Array
(
[0] =>
xref
[1] => 13235519
)
[1] => Array
(
[0] =>
trailer
[1] => 13299371
)
)
[1] => Array
(
[0] => Array
(
[0] =>
xref
[1] => 13235519
)
[1] => Array
(
[0] =>
[1] => -1
)
)
[2] => Array
(
[0] =>
[1] => Array
(
[0] =>
trailer
[1] => 13299371
)
)
)
Why is there a position of -1 for xref?
It seems this is the normal behaviour, though it is mostly undocumented. The -1 offset is also used for absent matches.
To answer your title, the -1 offset is returned alternatively, not in addition. You have an alternative (a)|(b) match group in your pattern. So it can very well return offsets and matches for the xref, but a non-match for the trailer.
This is not mentioned explicitly in the PHP manual page. But PCRE documents it cursorily with:
[...] When this happens, both values in the offset pairs corresponding to unused subpatterns are set to -1.
You can reproduce it with a simpler example:
preg_match_all('/(a)|(b)|(c)/', "abc", $m, PREG_OFFSET_CAPTURE)
and print_r($m);
The behaviour is a bit confusing. It seems the -1 is used as the offset for groups that come before the one that matched, while unmatched groups that come after it are just absent from the result array. For the three groups, this example gives the offsets [0,-1,-1], [undef,1,-1] and [undef,undef,2]. I would conclude it's some hazy behaviour in the PHP wrapper.
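As a rough sketch of how to cope with this, the snippet below walks the result of the simpler example and reports only the groups that actually took part in each match. The helper name participated() is made up for illustration. It assumes an unmatched group shows up either as a pair with offset -1 or as a bare empty string, since the exact shape appears to vary between PHP versions.

```php
<?php
// Hypothetical helper: true if a capture-group entry actually took part in the match.
function participated($entry): bool
{
    // A matched group is a [text, offset] pair with offset >= 0.
    // An unmatched group is assumed to be either ['', -1] or a bare empty string.
    return is_array($entry) && $entry[1] >= 0;
}

preg_match_all('/(a)|(b)|(c)/', 'abc', $m, PREG_OFFSET_CAPTURE);

// $m[0] holds the full matches; $m[1] to $m[3] hold the three groups.
foreach ($m[0] as $i => $full) {
    for ($g = 1; $g <= 3; $g++) {
        if (participated($m[$g][$i])) {
            echo "match $i: group $g matched '{$m[$g][$i][0]}' at offset {$m[$g][$i][1]}\n";
        }
    }
}
```

Checking the offset rather than the text avoids mistaking a group that legitimately matched an empty string for one that did not match at all.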
It seems to me you have 2 xref without a trailer in between. Something like:
xref
shgfjqhfkj
xref
shgfjqhfkj
trailer
And the matching groups are wrong.
I'd change the regex to:
'@(
xref?
|\strailer\s)@'
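A minimal sketch of the single-group idea follows. The sample string is invented for illustration, and the pattern uses \s around xref instead of the literal line breaks in the original, which is an assumption about what the delimiters are meant to match. With both alternatives inside one group, group 1 takes part in every match, so no -1 placeholder should appear.

```php
<?php
// Made-up sample input; a real PDF would be read from disk.
$pdfcontent = "junk\nxref\n0 1\ntrailer\n<< >>\n";

// One capturing group holds both alternatives, so it participates in every match.
preg_match_all('@(\sxref\s|\strailer\s)@', $pdfcontent, $matches, PREG_OFFSET_CAPTURE);

// Each entry of $matches[1] is a [text, offset] pair.
foreach ($matches[1] as [$text, $offset]) {
    echo trim($text), " at offset $offset\n";
}
```

The keyword that matched can then be told apart by inspecting the captured text rather than by which group is filled.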