I'm using a list of synonyms to direct a process of query expansion. The format looks like this:
fu=foo
ba=bar
etc=etcetera
werd=word
I'm using a straightforward binary search algorithm to run each of the user's input words against this list. The problem arises when the list contains phrases:
quick brown fox=alphabet
out of this world=space
why hello there=hello
Typical input: why hello there, where can I get an out of this world hopper?
And the desired output is: hello, where can I get an space hopper?
I don't want to run each word pair or triple through the search too, and I want to avoid a linear search of the thesaurus list against the input, as this is inefficient (although the list should be quite small, so this is an option).
Therefore I'm looking for ways to run binary search on phrases, or to construct the thesaurus in such a way as to compensate for phrases.
I'm using PHP for this. Any suggestions most welcome.
The simple approach would be using str_replace. I don't know about the performance though.
$list = array('out of this world' => 'space');
$str = 'why hello there, where can I get an out of this world hopper?';
foreach ($list as $old => $new) {
    $str = str_replace($old, $new, $str);
}
Edit: I've often noticed that it's more efficient to use built-in functions than to write your own: the built-ins are already compiled, whereas your own optimized algorithm has to be interpreted, which can be a significant slowdown.
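As a hedged aside to the loop above: the built-in strtr also accepts a phrase => replacement map and handles the whole list in one call. It tries the longest matching key at each position and does not rescan text it has already replaced, which matters when one phrase contains another. The phrases below come from the question; the 'world' => 'planet' entry is a made-up one, added only to show the overlap case.

```php
<?php
// Phrase => replacement map. 'world' => 'planet' is a hypothetical entry
// added only to show how overlapping keys are handled.
$list = array(
    'world'             => 'planet',
    'out of this world' => 'space',
    'why hello there'   => 'hello',
);
$str = 'why hello there, where can I get an out of this world hopper?';

// strtr() with an array prefers the longest matching key at each position
// and never re-replaces its own output.
$result = strtr($str, $list);
echo $result, "\n";
```

With the same map, the str_replace loop would replace 'world' first and the longer phrase would then no longer match, so the order of the list matters there but not here.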
My first idea would be to use an associative array like this:
$thesaurus = array(
    'alphabet' => 'quick brown fox',
    'space' => 'out of this world',
    'hello' => 'why hello there'
);
That way you can use the built-in array_search function, which looks for a phrase among the values and returns its key (the replacement). It will probably be faster than anything you could write in PHP yourself (I think).
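A minimal sketch of that lookup, using the thesaurus array from this answer. The phrase 'lazy dog' is a made-up input that is not in the list. Note that array_search scans the values one by one, so it is a linear search over the list rather than a binary one; for a small list that may well be acceptable.

```php
<?php
// Replacement => phrase, as in the answer above.
$thesaurus = array(
    'alphabet' => 'quick brown fox',
    'space'    => 'out of this world',
    'hello'    => 'why hello there',
);

// array_search() looks for the phrase among the values and returns the
// matching key, or false when the phrase is not in the list.
$found   = array_search('out of this world', $thesaurus, true);
$missing = array_search('lazy dog', $thesaurus, true);

// Compare with false explicitly, because a valid key such as 0 is also falsy.
echo ($found !== false) ? $found : '(no synonym)', "\n";
echo ($missing !== false) ? $missing : '(no synonym)', "\n";
```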
Use preg_replace_callback instead of whatever you are doing now. PCRE happens to be quite efficient at string searching, because that's what it was made for.
You just need to build a single alternatives list, then do the actual replacing via the original map/dictionary in the callback.
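$phrases = array(...);
// Escape each phrase so regex metacharacters in it match literally.
$rx = implode("|", array_map(function ($p) { return preg_quote($p, "/"); }, array_keys($phrases)));
// Keys of $phrases are assumed to be lowercase, since /i matches any case.
$text = preg_replace_callback("/\b($rx)\b/musi", function ($m) use ($phrases) { return $phrases[strtolower($m[1])]; }, $text);
This uses a callback rather than the /e modifier, which was deprecated in PHP 5.5 and removed in PHP 7.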
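A fuller sketch of the callback approach, run against the sample input from the question. The function name expand_phrases is hypothetical, and the code assumes the phrase keys are stored in lowercase. It also sorts the phrases longest-first, since a regex alternation takes the first alternative that matches and a short phrase could otherwise shadow a longer one.

```php
<?php
// Hypothetical helper: replaces every phrase found in $text with its synonym.
// Keys of $phrases are assumed to be lowercase.
function expand_phrases($text, array $phrases)
{
    if (!$phrases) {
        return $text;
    }
    $keys = array_keys($phrases);
    // Longest phrases first, so a short key cannot shadow a longer one.
    usort($keys, function ($a, $b) {
        return strlen($b) - strlen($a);
    });
    // Escape regex metacharacters, then join into one alternation.
    $quoted = array_map(function ($k) {
        return preg_quote($k, '/');
    }, $keys);
    $rx = '/\b(' . implode('|', $quoted) . ')\b/ui';

    return preg_replace_callback($rx, function ($m) use ($phrases) {
        // /i may match in any case, so normalise before the lookup.
        return $phrases[strtolower($m[1])];
    }, $text);
}

$phrases = array(
    'quick brown fox'   => 'alphabet',
    'out of this world' => 'space',
    'why hello there'   => 'hello',
);
$input = 'Why hello there, where can I get an out of this world hopper?';
echo expand_phrases($input, $phrases), "\n";
```

The whole input is scanned once by the regex engine, and each match costs one hash lookup in the map, so no per-word or per-pair search is needed.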