I have this loop in PHP (PHP 5.6; I am looking into upgrading to 7.3), running over about 31K records:
foreach ($records as $i => $record) {
    $tuple = call_user_func($processor, $record);
    $keys = array_flip(
        array_filter(
            array_keys($tuple),
            function ($key) {
                return ('.' !== substr($key, 0, 1));
            }
        )
    );
    if (!empty($keys)) {
        $tuple = array_intersect_key($tuple, $keys);
        $list[] = $tuple;
    }
}
Each instance of $record is about 300 bytes, and each instance of $tuple is about 1 KB.
So, I start the cycle with 175 MB of memory allocated, and I'd expect to end up gobbling another 30K x 1 KB = 30 MB. Even allowing for ten times that, it would be around 300 MB.
Instead, after the loop memory_get_peak_usage() reports 670 MB burned, and indeed if I keep memory_limit below 730 MB, the PHP process implodes with a memory-exhausted error.
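For reference, those figures are taken with memory_get_usage() / memory_get_peak_usage() around the loop, along these lines (a simplified sketch, not necessarily the exact instrumentation):

$before = memory_get_usage();           // ~175 MB already allocated at this point

foreach ($records as $i => $record) {
    // ... loop body as above ...
}

printf(
    "before: %.1f MB, after: %.1f MB, peak: %.1f MB\n",
    $before / 1048576,
    memory_get_usage() / 1048576,
    memory_get_peak_usage() / 1048576
);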
As I see things:
foreach ($records as $i => $record) {
    // No O(n) memory usage
    $tuple = call_user_func($processor, $record);
    // No O(n) memory usage
    $keys = array_flip(
        array_filter(
            array_keys($tuple),
            function ($key) {
                return ('.' !== substr($key, 0, 1));
            }
        )
    );
    if (!empty($keys)) {
        // No O(n) memory usage
        $tuple = array_intersect_key($tuple, $keys);
        // HERE I have O(n) memory usage
        $list[] = $tuple;
    }
}
If I comment out the $list[] line, the memory consumption goes back down to those 175 MB. I have confirmed that the serialize() representation of each tuple is about 2.5 KB. So even if PHP held those values in that inefficient human-readable format, it would account for about 75 MB, not 500.
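That 75 MB is simply the tuple count times the serialized size; a quick sanity check along these lines (a sketch, assuming $list holds the kept tuples) gives the same order of magnitude:

$total = 0;
foreach ($list as $tuple) {
    $total += strlen(serialize($tuple));
}
// ~31,000 tuples x ~2.5 KB each ≈ 75 MB
printf("serialized total: %.1f MB\n", $total / 1048576);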
I thought that maybe there was some "slack" in PHP's allocation of objects. So I modified these lines:
$tuple = array_intersect_key($tuple, $keys);
$list[] = $tuple;
To:
$tuple = array_intersect_key($tuple, $keys);
$list[] = unserialize(serialize($tuple));
reasoning that unserialize()
would create a "fresh" PHP associative array (for that is what $tuple
is). Yes, I'm applying the "format and reinstall" voodoo approach to PHP values.
Indeed, the memory consumption decreased from 688 MB to 511 MB, while the returned data remained the same (I dump it to a JSON file and run md5sum on the result).
Unexpectedly, with the two extra calls, the script also becomes about 2-5% faster.
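The effect can be checked in isolation by comparing the growth of memory_get_usage() for the two ways of appending (a minimal sketch using a representative $fullTuple and $keys, not my production code):

function measureAppend(array $fullTuple, array $keys, $roundTrip, $count = 31000)
{
    $list  = array();
    $start = memory_get_usage();
    for ($i = 0; $i < $count; $i++) {
        // fresh array each iteration, as in the real loop
        $tuple  = array_intersect_key($fullTuple, $keys);
        $list[] = $roundTrip ? unserialize(serialize($tuple)) : $tuple;
    }
    return memory_get_usage() - $start;
}

printf("plain append:         %.1f MB\n", measureAppend($fullTuple, $keys, false) / 1048576);
printf("serialize round-trip: %.1f MB\n", measureAppend($fullTuple, $keys, true) / 1048576);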
This tells me that there must be a whole lot of memory management going on behind the curtains that I'm not privy to. Also, I must have only scratched the surface (511 MB is still a long way off the 175 + 75 = 250 MB, which would already be more than the bare necessities, but which I'd settle for). There might be - there is likely to be - yet more memory and speed wasted somewhere in there.
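One more data point that may help characterize that hidden management is the gap between PHP's internal usage and what its allocator actually reserves from the system (a quick sketch using the standard flag of memory_get_usage()):

// false (default): memory used by PHP's own data structures (emalloc)
// true: memory actually reserved from the system by the allocator
printf(
    "internal: %.1f MB, real: %.1f MB\n",
    memory_get_usage(false) / 1048576,
    memory_get_usage(true) / 1048576
);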
Interesting side note: unsurprisingly, adding $keys = unserialize(serialize($keys)); before the intersection, while not affecting memory usage, measurably increases the speed of that array_intersect_key call - I'd dare say another 2% (about 3 seconds on an average command-line response time of 150 seconds).
Is there some way of, I don't know, further improving memory efficiency? Some esoteric PHP.INI setting I could try?
Perhaps most importantly, is this a known bug (I found nothing in the PHP changelogs), and is it maybe fixed in 7.3 (or made irrelevant by memory-management changes), so that it's not worth pursuing?
Using $tuple as an intermediate value, rather than storing directly into $list[], does not measurably change either speed or memory used.

The record is generated by a processor function.
(As you may have surmised, this is a mapping for a LEFT JOIN).
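In spirit, the processor does something along these lines (a purely hypothetical sketch - the real function is not shown here, and $outlets / $vendors and every name below are illustrative only):

// Hypothetical illustration only; the actual processor is different.
$processor = function (array $record) use ($outlets, $vendors) {
    $tuple = $record;   // the Evt_* columns

    // LEFT JOIN semantics: the related rows may be missing
    $outlet = isset($outlets[$record['Evt_IdeOut']]) ? $outlets[$record['Evt_IdeOut']] : array();
    $vendor = isset($vendors[$record['Evt_IdeVnd']]) ? $vendors[$record['Evt_IdeVnd']] : array();

    // merged Evt_* / Out_* / Vnd_* keys
    return $tuple + $outlet + $vendor;
};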
One example of the full $tuple returned by the processor is:
array(56) {
["recid"]=>
int(2019022020175919710)
["Evt_IdeHea"]=>
int(2019022020175919710)
["Evt_IdeVnd"]=>
string(13) "REDACTED....."
["Evt_IdePdc"]=>
string(11) "REDACTED..."
["Evt_IdeRcp"]=>
string(28) "REDACTED...................."
["Evt_IdeOut"]=>
string(40) "REDACTED................................"
["Evt_Dat"]=>
string(10) "20/02/2019"
["Evt_IdeAge"]=>
string(11) "REDACTED..."
["Evt_Des"]=>
string(20) "Some text string at most 60 char long"
["Evt_Not"]=>
string(39) "Some text string at most 512 char long"
["Evt_Con"]=>
string(0) ""
["Evt_IdeMrc"]=>
int(99999)
["Evt_Lck"]=>
int(0)
["Evt_Qua"]=>
int(0)
["Evt_VarUte"]=>
string(11) "REDACTED"
["Evt_VarTot"]=>
int(1)
["Evt_VarIpa"]=>
string(12) "127.0.0.1"
["Evt_VarDat"]=>
string(16) "20/02/2019 20:17"
["Evt_SttRcd"]=>
int(0)
["Evt_CreUte"]=>
string(11) "REDACTED"
["Evt_CreIpa"]=>
string(12) "192.168.999.42"
["Evt_CreDat"]=>
string(16) "20/02/2019 20:17"
["Out_IdeRcp"]=>
string(28) "REDACTED...................."
["Out_IdePdc"]=>
string(11) "REDACTED..."
["Out_IdeHea"]=>
string(40) "REDACTED................................"
string(40) "REDACTED................................"
["Out_Des"]=>
string(20) "REDACTED (MAX 50)..."
["Out_Att"]=>
string(8) "REDACTED"
["Out_Reg"]=>
string(8) "PIEDMONT"
["Out_Pro"]=>
string(2) "CN"
["Out_Not"]=>
NULL
["Out_Lng"]=>
string(9) "8.0000000"
["Out_Lat"]=>
string(10) "44.7000000"
["Out_Ind"]=>
string(13) "STREETADDRESS"
["Out_Cit"]=>
string(4) "Alba"
["Out_Cap"]=>
string(5) "12051"
["Out_VarUte"]=>
string(3) "Sys"
["Out_VarTot"]=>
int(942)
["Out_VarIpa"]=>
string(7) "0.0.0.0"
["Out_VarDat"]=>
string(16) "20/02/2019 23:44"
["Out_SttRcd"]=>
int(0)
["Out_CreUte"]=>
string(3) "Sys"
["Out_CreIpa"]=>
string(7) "0.0.0.0"
["Out_CreDat"]=>
string(16) "26/07/2016 23:44"
["Vnd_IdeHea"]=>
string(13) "REDACTED....."
["Vnd_Lin"]=>
string(2) "L2"
["Vnd_Des"]=>
string(18) "REDACTED.........."
["Vnd_VarUte"]=>
string(6) "SYSTEM"
["Vnd_VarTot"]=>
int(1)
["Vnd_VarIpa"]=>
string(7) "0.0.0.0"
["Vnd_VarDat"]=>
string(16) "04/06/2016 10:16"
["Vnd_SttRcd"]=>
int(0)
["Vnd_CreUte"]=>
string(6) "SYSTEM"
["Vnd_CreIpa"]=>
string(7) "0.0.0.0"
["Vnd_CreDat"]=>
string(16) "16/03/2016 14:21"
["Age_Des"]=>
string(14) "REDACTED......"
["Usr_Des"]=>
string(17) "REDACTED........."
}
So, I start with around 31K records with only Evt_ keys, and end up with the same number of records with all the keys above. The serialized version of the record ranges from 1819 to 4120 bytes, with an average length of around 2350, so the sizes in the record above are pretty typical.
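For completeness, those size figures come from a measurement along these lines (a simplified sketch, not the exact script):

$min = PHP_INT_MAX;
$max = 0;
$sum = 0;
foreach ($list as $tuple) {
    $len = strlen(serialize($tuple));
    $min = min($min, $len);
    $max = max($max, $len);
    $sum += $len;
}
printf("min %d, max %d, avg %d bytes\n", $min, $max, $sum / count($list));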