How can I optimize preg_match_all, or is there an alternative?

I have this code:

function toDataUri( $html )
{
  # convert css URLs to data URIs
  # note: since the callback is a private method, it must be passed
  # as array( $this, 'method' ), not as a bare string
  $html = preg_replace_callback( "#(url\([\'\"]?)([^\"\'\)]+)([\"\']?\))#", array( $this, 'create_data_uri' ), $html );
  return $html;
}

// callback function
private function create_data_uri( $matches )
{
  $filetype = explode( '.', $matches[ 2 ] );
  $filetype = trim(strtolower( $filetype[ count( $filetype ) - 1 ] ));

  // replace ?whatever=value from extensions
  $filetype = preg_replace('#\?.*#', '', $filetype);

  $datauri = $matches[ 2 ];
  $data = file_get_contents( $datauri );

  if (! $data) return $matches[ 0 ];

  $data = base64_encode( $data );

  //compile and return a data: URI with the encoded image data
  return $matches[ 1 ] . "data:image/$filetype;base64,$data" . $matches[ 3 ];
}

It basically searches the HTML for URLs in the form url(path) and replaces them with base64 data URIs.

The problem is that even when the input HTML is only a few kilobytes, say 10 KB, it takes ages to return the final response. Is there any optimization for this case, or another approach that, given the HTML, finds url(path) matches and converts them to data URIs?

The expression is already cheap: it starts with a fixed string and doesn't need backtracking.

In PCRE there is an `S` modifier that enables extra pattern analysis ("studying"), but it only matters for patterns without a fixed starting prefix.
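For reference, the `S` modifier goes after the closing delimiter like any other flag. A minimal sketch (the sample CSS is my own; note that on PHP 7.3+ with PCRE2 the flag is accepted but effectively a no-op, since patterns are analysed automatically):

```php
<?php
// The S ("study") modifier is appended after the closing delimiter,
// like any other PCRE modifier.
$pattern = '#(url\([\'"]?)([^\'")]+)(["\']?\))#S';

$css = 'body { background: url("img/bg.png"); }';
preg_match_all($pattern, $css, $matches);
print_r($matches[2]); // the captured URL paths, e.g. img/bg.png
```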

It shouldn't be slow: 10 KB isn't much for a simple regex like this. Perhaps the bottleneck is somewhere else?
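One way to confirm that is to time the regex pass in isolation. A quick sketch (the synthetic ~12 KB input stands in for your real HTML): if this runs in milliseconds, the time is being spent in the callback, i.e. in the per-match file reads and base64 encoding, not in `preg_replace_callback` itself.

```php
<?php
// Time just the regex pass over a ~12KB input with 300 url() matches.
$html = str_repeat('div { background: url("img/bg.png"); } ', 300);

$start = microtime(true);
preg_match_all('#(url\([\'"]?)([^\'")]+)(["\']?\))#', $html, $matches);
$elapsed = microtime(true) - $start;

printf("%d matches in %.4f s\n", count($matches[0]), $elapsed);
```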

  • If the parsed file contains an unclosed url( with no ) before the end of the file, the scan will run a bit longer. [^\"\'\)]{0,1000} would limit that, but it's a minor optimisation that only makes a difference when the file has pathological syntax errors.
  • You can remove the ( ) around the whole expression; match 0 always captures the entire matched string.
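A minimal sketch of the bounded-quantifier variant (the 1000-character cap, the sample CSS, and the uppercase placeholder transform are my own assumptions, just to make the effect visible):

```php
<?php
// Bounded path length: an unclosed "url(" can no longer force the
// scan to run all the way to end-of-file before failing.
$pattern = '#(url\([\'"]?)([^\'")]{1,1000})([\'"]?\))#';

$css = 'h1 { background: url(a.png); } .broken { background: url(oops';
$out = preg_replace_callback($pattern, function ($m) {
    // placeholder transform standing in for the data-URI conversion
    return $m[1] . strtoupper($m[2]) . $m[3];
}, $css);
echo $out, "\n"; // the unclosed url( at the end is left untouched
```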