How to detect common string groups in file names

I have been trying to figure out a way to detect series of files. For instance:

If a given directory has the following files:

  • Birthday001.jpg
  • Birthday002.jpg
  • Birthday003.jpg
  • Picknic1.jpg
  • Picknic2.jpg
  • Afternoon.jpg

I would like to condense the listing to something like

  • Birthday ( 3 pictures )
  • Picknic ( 2 pictures )
  • Afternoon ( 1 picture )

How should I go about detecting the groups?

Here's one way you can solve this, which is more efficient than a brute force method.

  • Load all the names into an associative array whose keys are the names and whose values are the names with the digits stripped (preg_replace('/\d/', '', $key)).

You will have something like $arr1 = [Birthday001 => Birthday, Birthday002 => Birthday ...]

  • now make another associative array with keys that are values from the first array and value which is a count. Increment the count when you've already seen the key.
  • In the end, the second array contains the names and counts, just as you wanted. Something like $arr2 = [Birthday => 3, ...]
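The steps above could be sketched roughly as follows. This is only a minimal illustration: the function name countSeries is hypothetical, and dropping the extension with pathinfo() before stripping digits is an assumption on my part, since the answer's example keys (Birthday001, Birthday002) carry no extension.

```php
<?php

// Hypothetical helper, not part of the original answer.
function countSeries(array $filenames): array
{
    // Step 1: map each file name to its base name with digits stripped.
    $baseNames = [];
    foreach ($filenames as $filename) {
        // Assumption: the extension is dropped first, e.g. "Birthday001.jpg" -> "Birthday001".
        $stem = pathinfo($filename, PATHINFO_FILENAME);
        $baseNames[$filename] = preg_replace('/\d/', '', $stem);
    }

    // Step 2: count how often each base name occurs.
    $counts = [];
    foreach ($baseNames as $base) {
        $counts[$base] = ($counts[$base] ?? 0) + 1;
    }

    return $counts;
}

$counts = countSeries([
    'Birthday001.jpg', 'Birthday002.jpg', 'Birthday003.jpg',
    'Picknic1.jpg', 'Picknic2.jpg', 'Afternoon.jpg',
]);
```

Note that stripping every digit also removes digits in the middle of a name, so names that differ only by digits end up in the same group. The second loop could also be replaced by PHP's built-in array_count_values($baseNames).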

Simply build a histogram whose keys are modified by a regex:

<?php

# input
$filenames = array("Birthday001.jpg", "Birthday002.jpg", "Birthday003.jpg", "Picknic1.jpg", "Picknic2.jpg", "Afternoon.jpg");

# create histogram
$histogram = array();
foreach ($filenames as $filename) {
    # strip optional trailing digits plus the extension ("Afternoon.jpg" has no digits)
    $name = preg_replace('/\d*\.[^.]*$/', '', $filename);
    if (isset($histogram[$name])) {
        $histogram[$name]++;
    } else {
        $histogram[$name] = 1;
    }
}

# output
foreach ($histogram as $name => $count) {
    if ($count == 1) {
        echo "$name ($count picture)\n";
    } else {
        echo "$name ($count pictures)\n";
    }
}

?>

Build an array of filler words such as "my" (developing this list will be very important; "my" is the only one in the example you gave) and strip them out of all the file names. Then strip out all numbers and punctuation; extensions should already be gone by this point. Once this is done, put the unique results into an array. You can then use it as a fairly reliable source of keywords to search for any stragglers that the other processing didn't catch.
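A rough sketch of that idea might look like the following. All three function names are hypothetical, and it assumes that the words in a file name are separated by non-letter characters (underscores, hyphens, spaces, digits); names written as one run of letters would need a different way of splitting.

```php
<?php

// Hypothetical helper: reduce one file name to its keyword.
function extractKeyword(string $filename, array $stopWords): string
{
    // Drop the extension, then split on anything that is not an ASCII letter
    // (this removes digits and punctuation in one pass).
    $stem  = pathinfo($filename, PATHINFO_FILENAME);
    $words = preg_split('/[^A-Za-z]+/', $stem, -1, PREG_SPLIT_NO_EMPTY);

    // Remove filler words such as "my" (case-insensitive comparison).
    $kept = array_filter(
        $words,
        fn($word) => !in_array(strtolower($word), $stopWords, true)
    );

    return implode(' ', $kept);
}

// Hypothetical helper: collect the unique keywords of all file names.
function collectKeywords(array $filenames, array $stopWords): array
{
    $seen = [];
    foreach ($filenames as $filename) {
        $keyword = extractKeyword($filename, $stopWords);
        if ($keyword !== '') {
            $seen[$keyword] = true; // array keys keep the list unique
        }
    }
    return array_keys($seen);
}

// Hypothetical helper: find a known keyword inside a straggler's name.
function matchKeyword(string $filename, array $keywords): ?string
{
    foreach ($keywords as $keyword) {
        if (stripos($filename, $keyword) !== false) {
            return $keyword;
        }
    }
    return null;
}

$keywords = collectKeywords(
    ['my_Birthday001.jpg', 'Birthday002.jpg', 'my-Picknic1.jpg', 'Afternoon.jpg'],
    ['my']
);
```

Here matchKeyword('IMG_birthday_final.jpg', $keywords) would map a differently named file back to the Birthday group, because stripos() matches case-insensitively.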