I am interested if it is known what algorithms PHP uses for explode/implode functions, and what are their time complexities?
Thank you in advance.
In string.c
you can see the algorithm. It starts at about 1021 line..
if (p2 == NULL) {
add_next_index_stringl(return_value, p1, Z_STRLEN_P(str), 1);
} else {
do {
add_next_index_stringl(return_value, p1, p2 - p1, 1);
p1 = p2 + Z_STRLEN_P(delim);
} while ((p2 = php_memnstr(p1, Z_STRVAL_P(delim), Z_STRLEN_P(delim), endp)) != NULL &&
--limit > 1);
if (p1 <= endp)
add_next_index_stringl(return_value, p1, endp-p1, 1);
}
Its just a single loop So I'd call it has O(N)
complexity. And check carefully the code. Its scanning the string and adding the result to return_value
. So yes. Its linear.
According to PHP sources on GitHub, it is linear. You can check explode()
here.
Short answer: For a single byte delimiter, explode
’s time complexeity is in Ο(N); but for multiple byte delimiters, its time complexity is Ο(N2).
implode
is clearly in Ο(N) as it simply glues the pieces together.
Extended answer: The basic algorithm of explode
is to search for occurrences of delimiter in string and copy the enclosed substrings into a new array.
To find the positions of delimiter in string, it uses the internal function zend_memnstr
(php_memnstr
is just an alias for zebd_memnstr
). For a single byte, it simply calls memchr
that does a linear search (thus in Ο(N)).
But for delimiter values longer than one byte, it calls memchr
to search for positions of the first byte of delimiter in string, tests if the last byte of delimiter is present at the expected position in string, and calls memcmp
to also check the bytes in between. So it basically checks whether delimiter is contained in string for any possible position. This already sounds suspiciously like Ο(N2).
Now let’s have a look at the worst case for this algorithm where both the first and the last byte of the pattern fit but the second-last doesn’t, e.g.:
string: aaaabaaaa
delimiter: aaaaaa
aaaabaaaa
aaaaXa (1+1+5)
aaaX?a (1+1+4)
aaX??a (1+1+3)
aX???a (1+1+2)
A X
represents a mismatch in memcmp
and ?
unknown bytes. The value in parentheses is the time complexity in uniform measure. This would sum up to
Σ (2+i) for i from M-floor(N/2) to ceil(N/2)
or
(N-M+1)·2 + Σ i - Σ j for i from 1 to ceil(N/2), j from 1 to M-floor(N/2)-1.
Since Σ i for i from 1 to N can be expressed by N·(N+1)/2 = (N2+N)/2, we can also write:
(N-M+1)·2 + (ceil(N/2)2+ceil(N/2))/2 - ((M-floor(N/2)-1)2+(M-floor(N/2)-1))/2
For simplicity, let’s assume both N and M are always even, so we can omit the ‘ceil’s and ‘floor’s:
(N-M+1)·2 + ((N/2+1)2+N/2+1)/2 - ((M-N/2-1)2+(M-N/2)-1)/2
= (N-M+1)·2 + N2/8+3·N/4+1 - ((M-N/2-1)2+(M-N/2)-1)/2
Furthermore, we can estimate the values up: N-M < N and M-N/2-1 < N. Thus we get:
N·2 + N2/8+3·N/4+1 - (N2+N)/2
< N·2 + N2+4·N - N2+N
This proofs that explode
with multiple byte delimiters is in Ο(N2).