I have an array of 17,000 strings. Many of the strings have similar matches, for example:
User Report XYZ123 Bob Smith User Report YEI723 User Report User Report Number of Hits 27 Frank's Weekly Transaction Report Transaction Report 123
What is the best way to find the top "similar strings"? For instance, using the example above, I would want to see "User Report" and "Transaction Report" as two of the top "similar strings".
Without giving you the full source code: you could go through the array and strip the parts you consider noise, such as tokens that mix letters and digits.
Then run array_count_values() on the cleaned-up array
and sort the resulting counts in descending order to see which strings occur most often.
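A minimal sketch of that idea, assuming "noise" means any whitespace-delimited token that contains a digit. The helper name normalize_title() and the sample data are hypothetical; adjust the regular expression to whatever you consider useless in your own strings.

```php
<?php
// Hypothetical helper: remove tokens that contain a digit, then tidy whitespace.
function normalize_title($s)
{
    // Drop any whitespace-delimited token with at least one digit (e.g. "XYZ123", "27").
    $s = preg_replace('/\S*\d\S*/', '', $s);
    // Collapse the whitespace left behind and trim both ends.
    return trim(preg_replace('/\s+/', ' ', $s));
}

// Sample data modelled on the question (assumed item boundaries).
$strings = array(
    'User Report XYZ123',
    'User Report YEI723',
    'User Report',
    'Transaction Report 123',
    'Number of Hits 27',
);

// Count how often each normalized string occurs.
$counts = array_count_values(array_map('normalize_title', $strings));

// Sort by count, highest first, keeping the strings as keys.
arsort($counts);

print_r($counts);
```

This only groups strings that become identical after cleaning. Something like "Frank's Weekly Transaction Report" would still be counted separately from "Transaction Report" unless you also strip the leading words.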
If you can get all the strings into an array, you can loop over them with foreach()
like this:
$string_array = array('string', 'string1', 'string2', 'does-not-match');
// fnmatch() uses shell wildcards; without the trailing * only the exact value 'string' would match
$needle = 'string*';
$results = array();
foreach ($string_array as $key => $val):
    if (fnmatch($needle, $val)):
        $results[] = $val;
    endif;
endforeach;
In the end, $results should contain the entries that match $needle.
As an alternative to fnmatch(),
you could use preg_match()
with the pattern /string/i:
$string_array = array('string', 'string1', 'string2', 'does-not-match');
$needle = '/string/i';
$results = array();
foreach ($string_array as $key => $val):
    // preg_match() returns 1 on a match, 0 on no match, false on error
    if (!empty(preg_match($needle, $val))):
        $results[] = $val;
    endif;
endforeach;
Note that passing the result of preg_match() directly to empty() can cause problems. From the PHP manual:
"Prior to PHP 5.5, empty() only supports variables; anything else will result in a parse error. In other words, the following will not work: empty(trim($name)). Instead, use trim($name) == false."
On PHP versions before 5.5, drop empty() and test the return value directly, for example if (preg_match($needle, $val)); that form raises no error on any version.
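If all you need is the list of matching entries, the preg_match() loop above can likely be replaced by a single call to preg_grep(), which avoids the empty() question altogether. A short sketch using the same sample array:

```php
<?php
$string_array = array('string', 'string1', 'string2', 'does-not-match');

// preg_grep() returns the entries that match the pattern, keeping their original keys.
$results = preg_grep('/string/i', $string_array);

// Re-index from 0 if the original keys are not needed.
$results = array_values($results);

print_r($results);
```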
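You could compute the Levenshtein distance between each string and all the others, sum those distances per string, and then sort the strings by that total. Note that this compares every pair, so for 17,000 strings it means roughly 144 million calls to levenshtein().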
$strings = array('str1', 'str2', 'car', 'dog', 'apple', 'house', 'str3');
$len = count($strings);
$distances = array_fill(0, $len, 0);
for ($i = 0; $i < $len - 1; ++$i)
{
    for ($j = $i + 1; $j < $len; ++$j)
    {
        // Each pair is compared once and the distance is credited to both strings
        $dist = levenshtein($strings[$i], $strings[$j]);
        $distances[$i] += $dist;
        $distances[$j] += $dist;
    }
}
// $distances[$i] is now the total distance from $strings[$i] to every other string
// Lower totals mean the string is more "similar" to the rest of the set
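To actually get a ranking out of those totals, one option is to sort them with asort(), which keeps the indexes so they can be mapped back to the strings. This is only a sketch: total_distances() is a hypothetical wrapper around the loop above, not a built-in function.

```php
<?php
// Hypothetical helper: total Levenshtein distance from each string to all the others.
function total_distances(array $strings)
{
    $len = count($strings);
    if ($len === 0) {
        return array();
    }
    $distances = array_fill(0, $len, 0);
    for ($i = 0; $i < $len - 1; ++$i) {
        for ($j = $i + 1; $j < $len; ++$j) {
            $dist = levenshtein($strings[$i], $strings[$j]);
            $distances[$i] += $dist;
            $distances[$j] += $dist;
        }
    }
    return $distances;
}

$strings = array('str1', 'str2', 'car', 'dog', 'apple', 'house', 'str3');
$distances = total_distances($strings);

// Sort ascending by total distance while preserving the original indexes.
asort($distances);

// Print the strings from most to least "similar" to the rest of the set.
foreach ($distances as $index => $total) {
    echo $strings[$index], ' => ', $total, "\n";
}
```

A low total only says a string is close to the set as a whole. It does not by itself extract a shared phrase such as "User Report", so you may still want to combine it with the counting approach.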
I guess you could loop over the strings with a foreach and eliminate the ones you don't want for that particular search. Then go through the ones you have left (possibly with another foreach) and keep narrowing the set down until only a few strings of interest remain. Finally, sort those, for example alphabetically.
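The narrowing-down described above could be sketched with array_filter() in place of hand-written foreach loops. Both filter criteria here (contains "Report", starts with "User") are hypothetical examples; substitute whatever defines "interesting" for your search.

```php
<?php
$strings = array('User Report XYZ123', 'Bob Smith', 'Transaction Report 123', 'User Report');

// Pass 1: keep only strings containing "Report" (hypothetical criterion).
$remaining = array_filter($strings, function ($s) {
    return stripos($s, 'Report') !== false;
});

// Pass 2: narrow further to strings that start with "User" (hypothetical criterion).
$remaining = array_filter($remaining, function ($s) {
    return stripos($s, 'User') === 0;
});

// Sort what is left alphabetically; sort() also re-indexes from 0.
sort($remaining);

print_r($remaining);
```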