Massive memory usage in a PHP loop: found one tweak, but looking for more

The setup (please bear with me)

I have this loop in PHP (PHP 5.6, I am looking into upgrading to 7.3), over about 31K records:

    foreach ($records as $i => $record) {
        $tuple  = call_user_func($processor, $record);

        $keys   = array_flip(
             array_filter(
                 array_keys($tuple),
                 function($key) {
                     return ('.' !== substr($key, 0, 1));
                 }
             )
        );
        if (!empty($keys)) {
            $tuple  = array_intersect_key($tuple, $keys);
            $list[] = $tuple;
        }
    }

Each instance of $record is about 300 bytes; each instance of $tuple is about 1 KB.

So, I start the cycle with 175 MB of memory allocated, and I'd expect to end up gobbling another 30K x 1 KB = 30 MB. But even allowing ten times that, it would be around 300 MB.

Instead, after the loop memory_get_peak_usage() reports 670 MB burned, and indeed if I keep memory_limit below 730 MB, the PHP process implodes with a memory-exhausted error.
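To see where that growth comes from, the loop can be instrumented by sampling memory_get_usage() every thousand iterations. This is a minimal, self-contained sketch: the $processor closure and the $records array below are stand-ins of mine, not the real application code.

```php
<?php
// Minimal sketch: a stand-in $processor and a batch of fake records,
// sampling memory_get_usage() as $list grows, to see which line the
// O(n) growth comes from. The real $records/$processor come from
// elsewhere in the application.
$processor = function (array $record) {
    // Pad the record out to roughly tuple size, as the real processor does.
    return $record + ['Out_Des' => str_repeat('x', 50), '.hidden' => 'internal'];
};
$records = array_fill(0, 5000, ['recid' => 1, 'Evt_Des' => str_repeat('e', 60)]);

$list  = [];
$start = memory_get_usage();
foreach ($records as $i => $record) {
    $list[] = call_user_func($processor, $record);
    if (0 === $i % 1000) {
        printf("after %d tuples: +%d bytes\n", $i + 1, memory_get_usage() - $start);
    }
}
```

On the real data, moving the sampling line around (before/after the filtering, before/after the append) pinpoints which statement actually accumulates memory.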

As I see things:

    foreach ($records as $i => $record) {
        // No O(n) memory usage
        $tuple  = call_user_func($processor, $record);

        // No O(n) memory usage
        $keys   = array_flip(
             array_filter(
                 array_keys($tuple),
                 function($key) {
                     return ('.' !== substr($key, 0, 1));
                 }
             )
        );
        if (!empty($keys)) {
            // No O(n) memory usage
            $tuple  = array_intersect_key($tuple, $keys);
            // HERE I have O(n) memory usage
            $list[] = $tuple;
        }
    }

If I comment out the $list[] line, memory consumption goes back down to those 175 MB. I have confirmed that the serialize() representation of each tuple is about 2.5 KB in size. So even if PHP held those values in an inefficient human-readable format, that would account for about 75 MB, not 500.
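For reference, the per-tuple estimate above can be reproduced along these lines (the tuple here is a made-up stand-in, not one of the real records):

```php
<?php
// Sketch: measure the serialized footprint of a tuple, as used for the
// "about 2.5 KB per tuple" estimate. The tuple below is a stand-in.
$tuple = [
    'recid'   => 2019022020175919710,
    'Evt_Des' => str_repeat('d', 60),
    'Evt_Not' => str_repeat('n', 512),
];
$bytes = strlen(serialize($tuple));
printf("serialized tuple: %d bytes\n", $bytes);
```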

The one discovery

I thought that maybe there was some "slack" in PHP's allocation of objects. So I modified these lines:

            $tuple  = array_intersect_key($tuple, $keys);
            $list[] = $tuple;

To:

            $tuple  = array_intersect_key($tuple, $keys);
            $list[] = unserialize(serialize($tuple));

reasoning that unserialize() would create a "fresh" PHP array (for that is what $tuple is). Yes, I'm applying the "format and reinstall" voodoo approach to PHP data structures.

Indeed, memory consumption decreased from 688 MB to 511 MB, while the returned data remained the same (I dump it to a JSON file and run md5sum on the result).

Unexpectedly, despite the two extra calls, the script also becomes about 2-5% faster.
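Wrapped up, the trick looks like this. compact_array() is a name of my own invention, not a PHP built-in:

```php
<?php
// Hypothetical helper wrapping the serialize/unserialize round-trip;
// compact_array() is my own name, not a PHP built-in. It returns a
// freshly allocated copy of the array, leaving behind whatever slack
// the original hashtable had accumulated.
function compact_array(array $a)
{
    return unserialize(serialize($a));
}

$tuple = ['Evt_Des' => 'text', '.internal' => 'hidden'];
$fresh = compact_array($tuple);
var_dump($fresh === $tuple); // bool(true): identical content, new storage
```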

This tells me that there must be a whole lot of memory management going on behind the curtains that I'm not privy to. Also, I must have only scratched the surface: 511 MB is a long way off from 175 + 75 = 250 MB, which would still be more than the bare necessities, but a figure I'd settle for. There might be, and likely is, yet more memory and speed wasted somewhere in there.

Interesting side note: adding $keys = unserialize(serialize($keys));, while not affecting memory usage, measurably increases the speed of that array_intersect_key() call. I'd dare say another 2% (about 3 seconds on an average command-line run time of 150 seconds).

The question(s)

Is there some way of, I don't know, further improving memory efficiency? Some esoteric php.ini setting I could try?

Perhaps most important: is this a known bug (I found nothing in the PHP changelogs), possibly fixed in 7.3 (or made irrelevant by its memory-management changes), so that it's not worth pursuing?

  • I cannot use an iterator instead of the loop over an array.
  • I'm already uncomfortable with the current memory_limit, and the number of records is due to increase; I'd prefer not to keep hogging memory to pamper a design flaw.
  • Using $tuple as an intermediate variable rather than storing directly into $list[] does not measurably change either speed or memory usage.
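For completeness, one refactor I have on the list: since the processor seems to guarantee that every tuple has the same key set, the key whitelist could be computed once instead of 31K times. A sketch, under that assumption, with $tuples standing in for the processed records:

```php
<?php
// Sketch: compute the whitelist of visible keys once, from the first
// tuple, instead of re-running array_keys/array_filter/array_flip on
// every iteration. Assumes all tuples share one key set (the processor
// pads missing keys with NULLs, so this should hold).
$tuples = [
    ['recid' => 1, 'Evt_Des' => 'a', '.hidden' => 'x'],
    ['recid' => 2, 'Evt_Des' => 'b', '.hidden' => 'y'],
];

$keys = null;
$list = [];
foreach ($tuples as $tuple) {
    if (null === $keys) {
        $keys = array_flip(array_filter(
            array_keys($tuple),
            function ($key) { return '.' !== substr($key, 0, 1); }
        ));
    }
    if (!empty($keys)) {
        $list[] = array_intersect_key($tuple, $keys);
    }
}
print_r($list);
```

This mainly saves CPU rather than memory, but it removes three array allocations per iteration.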

Sample tuple record

The tuple is generated by a processor function, which goes like this:

  1. I have a dictionary in each $record, which may or may not have all of the keys starting with "Evt_" ("Event Data"). The processor ensures that all those keys are present, adding NULL-valued keys to the dictionary as needed.
  2. The processor also adds several more keys from other dictionaries, selected based on the Evt_ keys. These dictionaries are complete, though possibly all of their values are defaults. Specifically: if Evt_IdeOut is NULL, then all Out_* values come from $Out_Default, and so on. Out_* data refer to a geographical location, Vnd_* are attributes of that location, Age_ and Usr_ refer to the field agent and the reporting user. The *_Ipa fields are IP addresses, the *_Dat fields are dates in Italian format (d/m/Y H:i).

(As you may have surmised, this is a mapping for a LEFT JOIN).
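The processor's shape, as described above, can be sketched as follows. $Evt_Template, $Out_Default and lookup_out() are stand-in names of mine, not the actual application code:

```php
<?php
// Sketch of the processor: pad missing Evt_ keys with NULLs, then merge
// in Out_* either from a lookup or from $Out_Default when Evt_IdeOut is
// NULL (a LEFT JOIN done in PHP). All names here are hypothetical.
function lookup_out($id)
{
    // Hypothetical lookup into the Out_* dictionary; stubbed here.
    return ['Out_Des' => 'Outlet ' . $id, 'Out_Cit' => 'Alba'];
}

$Evt_Template = ['Evt_IdeOut' => null, 'Evt_Des' => null];
$Out_Default  = ['Out_Des' => null, 'Out_Cit' => null];

$processor = function (array $record) use ($Evt_Template, $Out_Default) {
    $tuple = $record + $Evt_Template;               // pad missing Evt_ keys
    if (null === $tuple['Evt_IdeOut']) {
        $tuple += $Out_Default;                     // no match: defaults
    } else {
        $tuple += lookup_out($tuple['Evt_IdeOut']); // matched Out_ row
    }
    return $tuple;
};

print_r($processor(['Evt_IdeOut' => 'OUT1']));
```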

One example of the full $tuple returned by the processor is:

array(56) {
  ["recid"]=>
  int(2019022020175919710)
  ["Evt_IdeHea"]=>
  int(2019022020175919710)
  ["Evt_IdeVnd"]=>
  string(13) "REDACTED....."
  ["Evt_IdePdc"]=>
  string(11) "REDACTED..."
  ["Evt_IdeRcp"]=>
  string(28) "REDACTED...................."
  ["Evt_IdeOut"]=>
  string(40) "REDACTED................................"
  ["Evt_Dat"]=>
  string(10) "20/02/2019"
  ["Evt_IdeAge"]=>
  string(11) "REDACTED..."
  ["Evt_Des"]=>
  string(20) "Some text string at most 60 char long"
  ["Evt_Not"]=>
  string(39) "Some text string at most 512 char long"
  ["Evt_Con"]=>
  string(0) ""
  ["Evt_IdeMrc"]=>
  int(99999)
  ["Evt_Lck"]=>
  int(0)
  ["Evt_Qua"]=>
  int(0)
  ["Evt_VarUte"]=>
  string(11) "REDACTED"
  ["Evt_VarTot"]=>
  int(1)
  ["Evt_VarIpa"]=>
  string(12) "127.0.0.1"
  ["Evt_VarDat"]=>
  string(16) "20/02/2019 20:17"
  ["Evt_SttRcd"]=>
  int(0)
  ["Evt_CreUte"]=>
  string(11) "REDACTED"
  ["Evt_CreIpa"]=>
  string(12) "192.168.999.42"
  ["Evt_CreDat"]=>
  string(16) "20/02/2019 20:17"
  ["Out_IdeRcp"]=>
  string(28) "REDACTED...................."
  ["Out_IdePdc"]=>
  string(11) "REDACTED..."
  ["Out_IdeHea"]=>
  string(40) "REDACTED................................"
  ["Out_Des"]=>
  string(20) "REDACTED (MAX 50)..."
  ["Out_Att"]=>
  string(8) "REDACTED"
  ["Out_Reg"]=>
  string(8) "PIEDMONT"
  ["Out_Pro"]=>
  string(2) "CN"
  ["Out_Not"]=>
  NULL
  ["Out_Lng"]=>
  string(9) "8.0000000"
  ["Out_Lat"]=>
  string(10) "44.7000000"
  ["Out_Ind"]=>
  string(13) "STREETADDRESS"
  ["Out_Cit"]=>
  string(4) "Alba"
  ["Out_Cap"]=>
  string(5) "12051"
  ["Out_VarUte"]=>
  string(3) "Sys"
  ["Out_VarTot"]=>
  int(942)
  ["Out_VarIpa"]=>
  string(7) "0.0.0.0"
  ["Out_VarDat"]=>
  string(16) "20/02/2019 23:44"
  ["Out_SttRcd"]=>
  int(0)
  ["Out_CreUte"]=>
  string(3) "Sys"
  ["Out_CreIpa"]=>
  string(7) "0.0.0.0"
  ["Out_CreDat"]=>
  string(16) "26/07/2016 23:44"
  ["Vnd_IdeHea"]=>
  string(13) "REDACTED....."
  ["Vnd_Lin"]=>
  string(2) "L2"
  ["Vnd_Des"]=>
  string(18) "REDACTED.........."
  ["Vnd_VarUte"]=>
  string(6) "SYSTEM"
  ["Vnd_VarTot"]=>
  int(1)
  ["Vnd_VarIpa"]=>
  string(7) "0.0.0.0"
  ["Vnd_VarDat"]=>
  string(16) "04/06/2016 10:16"
  ["Vnd_SttRcd"]=>
  int(0)
  ["Vnd_CreUte"]=>
  string(6) "SYSTEM"
  ["Vnd_CreIpa"]=>
  string(7) "0.0.0.0"
  ["Vnd_CreDat"]=>
  string(16) "16/03/2016 14:21"
  ["Age_Des"]=>
  string(14) "REDACTED......"
  ["Usr_Des"]=>
  string(17) "REDACTED........."
}

So, I start with around 31K records holding only Evt_ keys, and end up with the same number of records carrying all the keys above. The serialized version of a record ranges from 1819 to 4120 bytes, with an average of around 2350 bytes, so the sizes in the record above are pretty typical.