Splitting a string into a key and values using a regular expression

I'm having some trouble parsing plain text output from samtools stats.

Example output:

45205768 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
5203838 + 0 duplicates
44647359 + 0 mapped (98.76% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

I'd like to parse the file line-by-line and get the following output in a PHP array like this:

Array(
 "in total" => [45205768,0],
 ...
)

So, long story short, I'd like to get the numeric values at the front of each line as an array of integers, and the text that follows (without the parenthesized part) as the key.

^(\d+)\s\+\s(\d+)\s([a-zA-Z0-9 ]+).*$

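This regex will put the first value, the second value, and the following string (without the parenthesized part) into match groups 1, 2, and 3 respectively. When a parenthesized part follows, group 3 keeps the space before it, so trim it before using it as a key.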

Regex101 demo
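As a rough sketch of how the pattern above could be applied line by line in PHP: the helper name `parseStatLines` and the shortened sample text are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: turn "N + M description" lines into [key => [N, M]].
function parseStatLines(string $text): array
{
    $result = [];
    // Split on any newline style so the pattern is applied to one line at a time.
    foreach (preg_split('/\R/', $text) as $line) {
        if (preg_match('/^(\d+)\s\+\s(\d+)\s([a-zA-Z0-9 ]+).*$/', $line, $m)) {
            // Group 3 ends with a space when "(...)" follows, hence the trim().
            $result[trim($m[3])] = [(int) $m[1], (int) $m[2]];
        }
    }
    return $result;
}

// Three lines taken from the example output in the question.
$sample = "45205768 + 0 in total (QC-passed reads + QC-failed reads)\n"
        . "44647359 + 0 mapped (98.76% : N/A)\n"
        . "0 + 0 read1";

$parsed = parseStatLines($sample);
var_export($parsed);
```

Lines that do not match the pattern are skipped silently, which may or may not be what you want for a real file.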

I think this is what you're after:

^(\d+)(\s\+\s)(\d+)(.+)

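See it work here on Regex101. Pick up the first and third groups for the two numbers; the fourth group holds the rest of the line, including the leading space and any parenthesized part.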
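A small sketch of how those groups might be used in PHP, assuming the parenthesized part should be dropped from the key as the question asks. The clean-up step with `preg_replace` and `trim` is an addition for illustration, not part of the pattern itself.

```php
<?php
$line = '44647359 + 0 mapped (98.76% : N/A)';

$counts = null;
$key = null;
if (preg_match('/^(\d+)(\s\+\s)(\d+)(.+)/', $line, $m)) {
    // Groups 1 and 3 are the two numbers; group 2 is only the " + " separator.
    $counts = [(int) $m[1], (int) $m[3]];
    // Group 4 is " mapped (98.76% : N/A)": cut from the first "(" onward, then trim.
    $key = trim(preg_replace('/\(.*$/', '', $m[4]));
}
```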

This can be solved with just two capture groups and the full-string match.

My pattern accurately extracts the desired substrings and trims the trailing spaces from the to-be-declared "keys": Pattern Demo

^(\d+) \+ (\d+) \K[a-z\d ]+(?=\s)  #244steps

PHP Code: (Demo)

$txt='45205768 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
5203838 + 0 duplicates
44647359 + 0 mapped (98.76% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)';

preg_match_all('/^(\d+) \+ (\d+) \K[a-z\d ]+(?=\s)/m', $txt, $out);
$result = [];  // initialize so the array exists even if nothing matches
foreach ($out[0] as $k => $v) {
    $result[$v] = [(int)$out[1][$k], (int)$out[2][$k]];  // cast the captured strings to integers
}
var_export($result);

Output:

array (
  'in total' => array (0 => 45205768, 1 => 0),
  'secondary' => array (0 => 0, 1 => 0),
  'supplementary' => array (0 => 0, 1 => 0),
  'duplicates' => array (0 => 5203838, 1 => 0),
  'mapped' => array (0 => 44647359, 1 => 0),
  'paired in sequencing' => array (0 => 0, 1 => 0),
  'read1' => array (0 => 0, 1 => 0),
  'read2' => array (0 => 0, 1 => 0),
  'properly paired' => array (0 => 0, 1 => 0),
  'with itself and mate mapped' => array (0 => 0, 1 => 0),
  'singletons' => array (0 => 0, 1 => 0),
  'with mate mapped to a different chr' => array (0 => 0, 1 => 0)
)

Note that the last two lines of the input text generate a duplicate key in the $result array, meaning the earlier line's data is overwritten by the later line's data. If this is a concern, you might restructure your input data or just keep the parenthetical portion as part of the key for uniqueness.
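One possible way to keep both lines, sketched under a few assumptions: the whole remainder of each line is captured, the short key is used by default, and the full description (parenthetical included) is used only when the short key is already taken. This relies on the plain line appearing before its parenthetical variant, as it does in the sample, and on "\n" line endings.

```php
<?php
// The two colliding lines from the sample input.
$txt = "0 + 0 with mate mapped to a different chr\n"
     . "0 + 0 with mate mapped to a different chr (mapQ>=5)";

$result = [];
// Group 3 captures everything after the second number, up to the end of the line.
preg_match_all('/^(\d+) \+ (\d+) (.+)$/m', $txt, $rows, PREG_SET_ORDER);
foreach ($rows as $row) {
    $description = $row[3];
    // Short key: the text before the first "(", trimmed.
    $key = trim(preg_replace('/\(.*$/', '', $description));
    if (isset($result[$key])) {
        // Collision: fall back to the full description so nothing is overwritten.
        $key = $description;
    }
    $result[$key] = [(int) $row[1], (int) $row[2]];
}
var_export($result);
```

With this fallback, only the colliding line gets the longer key; every other key stays in the short form the question asked for.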