Similar text in PHP

I have a PHP array like this:

$array = array("foo", "bar", "hallo", "world", "fooo", "bar1", "hall_o", "wor1ld", "foo", "bard", "hzallo", "w44orld");

I want to compare each element of the array with all of the remaining elements.

For example, I want to compare "foo" with "bar", "hallo", "world", "fooo", "bar1", "hall_o", "wor1ld", "foo", "bard", "hzallo" and "w44orld".

Then I want to compare "bar" with "foo", "hallo", "world", "fooo", "bar1", "hall_o", "wor1ld", "foo", "bard", "hzallo" and "w44orld", and so on until the last element.

Call the element being compared $var_1 and each of the remaining elements $var_2. If similar_text($var_1, $var_2, $percent) sets $percent to a value above 90, I want to print $var_1 together with every $var_2 whose match percentage is above 90.

Currently I'm planning to use two loops to achieve this: an outer loop for $var_1 and an inner loop for $var_2. Each element of the array can be up to 5000 characters long and there can be 1000 elements in the array, so my current logic is very expensive.
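For reference, the two-loop approach described above could be sketched roughly as follows. The function name findSimilarNaive and the $threshold parameter are made up for illustration, and the sketch assumes the array has consecutive integer keys.

```php
<?php
// Hypothetical helper: for each index, collect the indexes of all other
// elements whose similar_text() percentage is above $threshold.
function findSimilarNaive(array $items, float $threshold = 90.0): array
{
    $matches = [];
    $n = count($items);
    for ($i = 0; $i < $n; $i++) {          // outer loop: $var_1
        for ($j = 0; $j < $n; $j++) {      // inner loop: $var_2
            if ($i === $j) {
                continue; // do not compare an element with itself
            }
            similar_text($items[$i], $items[$j], $percent);
            if ($percent > $threshold) {
                $matches[$i][] = $j;
            }
        }
    }
    return $matches;
}

$array = array("foo", "bar", "hallo", "world", "fooo", "bar1", "hall_o", "wor1ld", "foo", "bard", "hzallo", "w44orld");
foreach (findSimilarNaive($array) as $i => $similar) {
    $texts = array();
    foreach ($similar as $j) {
        $texts[] = $array[$j];
    }
    echo $array[$i], ' => ', implode(', ', $texts), PHP_EOL;
}
```

With 1000 elements this makes 999,000 calls to similar_text(), each on strings of up to 5000 characters, which is where the cost comes from.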

Is there a better way to handle this?

In order for the indexing to work, the array $arr must have unique values and consecutive integer keys starting at 0 (array_values(array_unique($array)) gives you both):

$arr = array("foo", "bar", "hallo", "world", "fooo", "bar1", "hall_o", "wor1ld", "bard", "hzallo", "w44orld");
$dexed = array();
foreach ($arr as $key => $value){
    $dexed[$key]['val'] = $value;
    $dexed[$key]['key'] = $key;
}
$out = array();//output
$rev = array();//reverse lookup array
$t = 80;//threshold value in percent (use 90 to match the question)
$cnt = count($dexed);
$k = 0;
for ($i=0; $i<$cnt-1; $i++){
    for ($j=$i+1; $j<$cnt; $j++){
        //similar_text calculates differently depending on order of arguments
        similar_text($dexed[$i]['val'], $dexed[$j]['val'], $percent1);
        similar_text($dexed[$j]['val'], $dexed[$i]['val'], $percent2);
        if (($percent1 >= $t) || ($percent2 >= $t)){
            //check if value already exists under different key
            if (isset($rev[$dexed[$i]['val']])){
                if ( ! isset($rev[$dexed[$j]['val']])){
                    $fkey = $rev[$dexed[$i]['val']];//key found
                    $next = count($out[$fkey]);
                    $out[$fkey][$next]['val'] = $dexed[$j]['val'];
                    $out[$fkey][$next]['key'] = $dexed[$j]['key'];
                    $rev[$dexed[$j]['val']] = $fkey;
                }
            } else {
                $out[$k][0]['val'] = $dexed[$i]['val'];
                $out[$k][0]['key'] = $dexed[$i]['key'];
                $out[$k][1]['val'] = $dexed[$j]['val'];
                $out[$k][1]['key'] = $dexed[$j]['key'];
                $rev[$dexed[$i]['val']] = $k;
                $rev[$dexed[$j]['val']] = $k;
                $k++;
            }
        }
    }
}

Once $out is generated, use the following to generate an index array:

$index = array();
foreach ($out as $key => $group){
    $cnt = count($group);
    foreach ($group as $key2 => $word){
        for ($i=0; $i<$cnt; $i++){
            if ($i != $key2){
                $index[$word['key']][] = $key.':'.$i;
            }
        }
    }
}

To access all similar words for a given key (the key of the word in the original array $arr):

$key = 2;
foreach ($index[$key] as $value){
    $parts = explode(':', $value);
    echo '<p>'.$out[$parts[0]][$parts[1]]['val'].'</p>';
}

Unfortunately, what you're proposing will be slow once the list grows beyond a trivial size, and it won't work very well. Here's something that might, and that is also algorithmically efficient.

First, create an inverted index of letter bigrams (http://en.wikipedia.org/wiki/Bigram). For example (assuming case insensitivity):

  1. "foo" => ^f,fo,oo,o$
  2. "hzallo" => ^h,hz,za,al,ll,lo,o$

Here ^ and $ are pseudocharacters that mark the start and end of the word; you can use an underscore instead. I think they'll help you rank the results.
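As a rough sketch, the bigram extraction could look like this in PHP. The function name bigrams is my own, and the code assumes single-byte text; for multibyte strings you'd want the mb_* functions instead.

```php
<?php
// Hypothetical helper: split a word into letter bigrams, using ^ and $
// as pseudocharacters for the start and end of the word.
function bigrams(string $word): array
{
    $padded = '^' . strtolower($word) . '$'; // case-insensitive
    $grams = [];
    for ($i = 0, $n = strlen($padded) - 1; $i < $n; $i++) {
        $grams[] = substr($padded, $i, 2);   // two adjacent characters
    }
    return $grams;
}

echo implode(',', bigrams('foo')), PHP_EOL;    // ^f,fo,oo,o$
echo implode(',', bigrams('hzallo')), PHP_EOL; // ^h,hz,za,al,ll,lo,o$
```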

Now to find similar words you can use a typical ranking algorithm (see tf*idf and simpler token-count-based algorithms) to rank the best matches. So, given "hallo,"

QUERY(^h,ha,al,ll,lo,o$) AGAINST index_of_words

and you'll get a good match for "hzallo" because ^h,al,ll,lo,o$ all match.

You'll need something like Solr or your database's TEXT index to do this unless you want to write a simple inverted index, but it's worth it. The lookup will be orders of magnitude faster than what you're entertaining, and the results will be ranked by closeness.
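If you do want to write the simple inverted index yourself, a minimal in-memory sketch might look like the following. The names toBigrams, buildIndex and rankMatches are made up for illustration, and the scoring uses the Dice coefficient over shared bigrams, which is my assumption of a reasonable token-count ranking rather than tf*idf.

```php
<?php
// Hypothetical helper: unique letter bigrams of a word, with ^ and $
// marking the start and end (single-byte text assumed).
function toBigrams(string $word): array
{
    $padded = '^' . strtolower($word) . '$';
    $grams = [];
    for ($i = 0, $n = strlen($padded) - 1; $i < $n; $i++) {
        $grams[] = substr($padded, $i, 2);
    }
    return array_values(array_unique($grams)); // count each bigram once
}

// Map each bigram to the list of word IDs (array keys) that contain it.
function buildIndex(array $words): array
{
    $index = [];
    foreach ($words as $id => $word) {
        foreach (toBigrams($word) as $gram) {
            $index[$gram][] = $id;
        }
    }
    return $index;
}

// Score every word that shares at least one bigram with $query.
// Score = 2 * shared / (bigrams in query + bigrams in word), best first.
function rankMatches(string $query, array $index, array $words): array
{
    $queryGrams = toBigrams($query);
    $shared = [];
    foreach ($queryGrams as $gram) {
        foreach ($index[$gram] ?? [] as $id) {
            $shared[$id] = ($shared[$id] ?? 0) + 1;
        }
    }
    $scores = [];
    foreach ($shared as $id => $count) {
        $total = count($queryGrams) + count(toBigrams($words[$id]));
        $scores[$id] = 2 * $count / $total;
    }
    arsort($scores); // sort by score, descending, keeping the word IDs
    return $scores;
}

$words = array("foo", "bar", "hallo", "world", "fooo", "bar1", "hall_o", "wor1ld", "bard", "hzallo", "w44orld");
$index = buildIndex($words);
foreach (rankMatches('hallo', $index, $words) as $id => $score) {
    printf("%-8s %.2f\n", $words[$id], $score);
}
```

Only words that share a bigram with the query are ever touched, which is what makes the lookup cheap. Because repeated bigrams are counted once here, "foo" and "fooo" produce the same set of bigrams, so this kind of index is best treated as a candidate filter.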

Afterwards, you can use something like levenshtein() to re-check the top candidates, but I don't think you'll need to in many cases.
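If you do want that second pass, here is a sketch of how it might look. The function name refineWithLevenshtein and the normalisation (1 minus distance divided by the longer length) are my own assumptions. Also note that before PHP 8.0 levenshtein() does not accept strings longer than 255 characters, so for 5000-character elements on older versions you'd have to fall back to similar_text() for this step.

```php
<?php
// Hypothetical helper: keep only the candidates whose normalised
// Levenshtein similarity to $query is at least $minSimilarity (0.0 to 1.0).
function refineWithLevenshtein(string $query, array $candidates, float $minSimilarity): array
{
    $kept = [];
    foreach ($candidates as $id => $candidate) {
        $longest = max(strlen($query), strlen($candidate));
        if ($longest === 0) {
            $kept[$id] = 1.0; // two empty strings are identical
            continue;
        }
        $similarity = 1 - levenshtein($query, $candidate) / $longest;
        if ($similarity >= $minSimilarity) {
            $kept[$id] = $similarity;
        }
    }
    arsort($kept); // most similar first, keeping the original keys
    return $kept;
}

// Candidates as they might come back from the bigram lookup (key => word).
$candidates = array(2 => "hallo", 6 => "hall_o", 9 => "hzallo", 0 => "foo");
print_r(refineWithLevenshtein("hallo", $candidates, 0.8));
```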