Finding the most similar strings in PHP?

I have an array of 17,000 strings. Many of the strings have similar matches, for example:

User Report XYZ123
Bob Smith
User Report YEI723
User Report
User Report
Number of Hits 27
Frank's Weekly Transaction Report
Transaction Report 123

What is the best way to find the top "similar strings"? For instance, using the example above, I would want to see "User Report" and "Transaction Report" as two of the top "similar strings".

Without writing out all the source code for this: you could go through the array and strip the parts you consider noise, such as tokens that mix letters and digits.

Then you can run array_count_values() on the cleaned-up array and sort the resulting counts (for example with arsort()) to see which strings occur most often.
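A minimal sketch of that idea, using the sample strings from the question. The helper name strip_tokens_with_digits() and the rule "drop any whitespace-separated token containing a digit" are assumptions chosen for illustration; your own definition of a useless component may differ.

```php
<?php
// Hypothetical helper: drop every whitespace-separated token that contains a digit.
function strip_tokens_with_digits($s)
{
    $tokens = preg_split('/\s+/', trim($s), -1, PREG_SPLIT_NO_EMPTY);
    $kept = array_filter($tokens, function ($token) {
        return preg_match('/\d/', $token) !== 1;
    });
    return implode(' ', $kept);
}

$strings = array(
    'User Report XYZ123',
    'Bob Smith',
    'User Report YEI723',
    'User Report',
    'User Report',
    'Number of Hits 27',
    "Frank's Weekly Transaction Report",
    'Transaction Report 123',
);

// Normalize each string, then count how often each normalized form occurs.
$normalized = array_map('strip_tokens_with_digits', $strings);
$counts = array_count_values($normalized);

// Sort by count, highest first, keeping the string keys.
arsort($counts);

print_r($counts);
```

With this rule "User Report" ends up with a count of 4. "Frank's Weekly Transaction Report" is not folded into "Transaction Report", so catching that case would need an extra rule of your own, such as counting word pairs instead of whole strings.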

If you can get all the strings into an array, you can loop over them with foreach() like this:

$string_array = array('string', 'string1', 'string2', 'does-not-match');
$needle = 'string*'; // fnmatch() takes a shell wildcard pattern; without '*' only the exact string matches

$results = array();
foreach($string_array as $key => $val):
    if (fnmatch($needle, $val)):
        $results[] = $val;
    endif;
endforeach;

In the end, $results should hold the entries that match $needle. As an alternative to fnmatch(), you could use preg_match() with the pattern /string/i:

$string_array = array('string', 'string1', 'string2', 'does-not-match');
$needle = '/string/i';

$results = array();
foreach($string_array as $key => $val):
    // preg_match() returns 1 on a match, 0 on no match, and false on error
    if (preg_match($needle, $val) === 1):
        $results[] = $val;
    endif;
endforeach;

Note that wrapping the call in empty(), as in !empty(preg_match($needle, $val)), can cause problems. The PHP manual says:

Prior to PHP 5.5, empty() only supports variables; anything else will result in a parse error. In other words, the following will not work: empty(trim($name)). Instead, use trim($name) == false.

Comparing the return value of preg_match() directly, as in the loop above, raises no errors on PHP 5.3.x or 5.4.

You could compute the Levenshtein distance between each string and all the others, and then sort the strings by that total.

$strings = array('str1', 'str2', 'car', 'dog', 'apple', 'house', 'str3');
$len = count($strings);

$distances = array_fill(0, $len, 0);

for($i=0; $i<$len-1; ++$i)
    for($j=$i+1; $j<$len; ++$j)
    {
        $dist = levenshtein($strings[$i], $strings[$j]);
        $distances[$i] += $dist;
        $distances[$j] += $dist;
    }

// $distances[$i] now holds the summed distance from $strings[$i] to every other string.
// Lower totals mean the string is more "similar" to the rest of the set.
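
The sorting step is not shown above. Here is a sketch of one way to do it, assuming asort() on the totals is acceptable; the $ranked array and its 'string' and 'total' keys are names made up for this example.

```php
<?php
// Same sample data and pairwise loop as in the answer above.
$strings = array('str1', 'str2', 'car', 'dog', 'apple', 'house', 'str3');
$len = count($strings);
$distances = array_fill(0, $len, 0);

for ($i = 0; $i < $len - 1; ++$i) {
    for ($j = $i + 1; $j < $len; ++$j) {
        $dist = levenshtein($strings[$i], $strings[$j]);
        $distances[$i] += $dist;
        $distances[$j] += $dist;
    }
}

// Sort ascending by total distance while keeping the original indexes,
// so each total can still be mapped back to its string.
asort($distances);

// Hypothetical result structure: one row per string, most "similar" first.
$ranked = array();
foreach ($distances as $index => $total) {
    $ranked[] = array('string' => $strings[$index], 'total' => $total);
}

print_r(array_slice($ranked, 0, 3));
```

For this sample the three str* entries come out on top, since each is only one edit away from the other two. Keep in mind that the pairwise loop is quadratic: for 17,000 strings it makes roughly 144 million levenshtein() calls, so you may want to deduplicate or pre-filter the array first.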

I guess you could run a foreach over the strings and eliminate the ones you don't want for that particular search. Then go through the ones you have left (possibly with another foreach) and keep narrowing them down until only a few remain. Then sort those, for example alphabetically.
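
A rough sketch of that narrowing-down approach, assuming each pass keeps only the strings that contain a given search term (case-insensitive). The $terms list is hypothetical; array_filter() stands in for the inner foreach.

```php
<?php
$strings = array(
    'User Report XYZ123',
    'Bob Smith',
    'User Report YEI723',
    'User Report',
    'Number of Hits 27',
    "Frank's Weekly Transaction Report",
    'Transaction Report 123',
);

// Hypothetical search terms; each pass shrinks the remaining set.
$terms = array('report', 'transaction');

$remaining = $strings;
foreach ($terms as $term) {
    $remaining = array_filter($remaining, function ($s) use ($term) {
        // stripos() returns false when the term is absent (0 is a valid position).
        return stripos($s, $term) !== false;
    });
}

// Reindex, then put the survivors in alphabetical order.
$remaining = array_values($remaining);
sort($remaining);

print_r($remaining);
```

Here only the two strings containing both "report" and "transaction" survive the passes.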