I'm looking for a smart, lightweight way to convert a title string into an array of tokens while keeping known two-word entries from a predefined dictionary intact.
For example, the dictionary contains over 300 words and word sets, such as: sheet set, jacket, suit, oxford shoes
The string may contain something like: 4-Piece 1000TC 100% Cotton Queen Sheet Set in Ivory
I would like the resulting array to be stripped of all noisy words (i.e. words that contain numbers or are not long enough),
so first I run a regex and strip everything that is not a run of letters (a-zA-Z) at least two characters long ({2,}).
then I want to receive the following array:
where sheet set would remain as a single token since it is contained in our dictionary.
The solution needs to be very fast: there are thousands of parallel processes, so I'm trying to save as many iterations as possible, and the dictionary keeps growing as well.
If you need something really fast, consider building a tree-based structure from your dictionary (each character linked down to the next one); then, at each space, try to walk down the tree.
Have a look at http://en.wikipedia.org/wiki/Trie
However, if speed is a primary concern, you have to avoid PHP.
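If you stay in PHP anyway, the trie idea might look roughly like the sketch below. It is a simplification, not a tuned implementation: the trie is keyed by whole words rather than single characters, the names `buildTrie`, `tokenize`, `$minLength` and the `#end` marker are made up for illustration, and it assumes dictionary words are separated by single spaces and are not glued to punctuation in the title.

```php
<?php
// Hypothetical helper: build a trie keyed by lowercase words.
function buildTrie(array $dictionary): array
{
    $trie = [];
    foreach ($dictionary as $entry) {
        $node = &$trie;
        foreach (explode(' ', strtolower($entry)) as $word) {
            if (!isset($node[$word])) {
                $node[$word] = [];
            }
            $node = &$node[$word];
        }
        $node['#end'] = true; // marks the end of a complete dictionary entry
        unset($node);         // break the reference before the next entry
    }
    return $trie;
}

// Hypothetical helper: greedy longest match against the trie, then noise filtering.
function tokenize(string $title, array $trie, int $minLength = 3): array
{
    $words = preg_split('/\s+/', strtolower(trim($title)), -1, PREG_SPLIT_NO_EMPTY);
    $count = count($words);
    $tokens = [];
    for ($i = 0; $i < $count; ) {
        // Walk down the trie while consecutive words match,
        // remembering where the longest complete entry ended.
        $node = $trie;
        $matchEnd = -1;
        for ($j = $i; $j < $count && isset($node[$words[$j]]); $j++) {
            $node = $node[$words[$j]];
            if (isset($node['#end'])) {
                $matchEnd = $j;
            }
        }
        if ($matchEnd >= $i) {
            $tokens[] = implode(' ', array_slice($words, $i, $matchEnd - $i + 1));
            $i = $matchEnd + 1;
            continue;
        }
        // Not a dictionary entry: keep only purely alphabetic words that are long enough.
        if (preg_match('/^[a-z]{' . $minLength . ',}$/', $words[$i])) {
            $tokens[] = $words[$i];
        }
        $i++;
    }
    return $tokens;
}

$trie = buildTrie(array('sheet set', 'jacket', 'suit', 'oxford shoes'));
$tokens = tokenize('4-Piece 1000TC 100% Cotton Queen Sheet Set in Ivory', $trie);
// $tokens: cotton, queen, sheet set, ivory
```

The trie only needs to be built once per process; each title then costs one pass over its words, regardless of how large the dictionary grows.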
Let's assume you have your dictionary stored in a simple array. Then a regexp comes in handy:
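<?php
$dictionary = array('sheet set', 'jacket', 'suit', 'oxford shoes');
// Escape each entry so that any regex metacharacters in the dictionary are matched literally.
$quoted = array_map(function ($word) { return preg_quote($word, '/'); }, $dictionary);
// Dictionary entries come first, so they are tried before the generic word pattern.
$regexp = implode('|', $quoted);
$regexp .= '|[a-z]{2,}';
$regexp = '/(?<=[^\w-]|^)('.$regexp.')(?=[^\w-]|$)/i';
// final regexp looks like this:
// /(?<=[^\w-]|^)(sheet set|jacket|suit|oxford shoes|[a-z]{2,})(?=[^\w-]|$)/i
$subject = '4-Piece 1000TC 100% Cotton Queen Sheet Set in Ivory';
preg_match_all($regexp, $subject, $matches);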
The matches are (full pattern, i.e. the first index of the $matches array, $matches[0]):
array(5) {
[0]=>
string(6) "Cotton"
[1]=>
string(5) "Queen"
[2]=>
string(9) "Sheet Set"
[3]=>
string(2) "in"
[4]=>
string(5) "Ivory"
}
PS: 'in' matches the pattern because the minimum length is 2 characters; you can change it to 3 to get the desired result.
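The tweak from the PS can be sketched as a small reusable helper. This is only an illustration under a few assumptions: the function names `buildTokenPattern` and `tokenizeTitle` and the `$minLength` parameter are hypothetical, the extra dictionary entry 'sheet' is added purely to show why ordering matters, and lowercasing the result is a choice made here, not something the question requires.

```php
<?php
// Hypothetical helper: compile the dictionary into a single pattern, once per process.
function buildTokenPattern(array $dictionary, int $minLength = 3): string
{
    // Longer entries first, so "sheet set" is tried before a shorter entry such as "sheet".
    usort($dictionary, function ($a, $b) {
        return strlen($b) - strlen($a);
    });
    // Escape entries so regex metacharacters are matched literally.
    $alternatives = array_map(function ($entry) {
        return preg_quote($entry, '/');
    }, $dictionary);
    // Generic fallback: any run of letters that is at least $minLength long.
    $alternatives[] = '[a-z]{' . $minLength . ',}';
    return '/(?<=[^\w-]|^)(?:' . implode('|', $alternatives) . ')(?=[^\w-]|$)/i';
}

// Hypothetical helper: apply the compiled pattern and normalise case.
function tokenizeTitle(string $title, string $pattern): array
{
    preg_match_all($pattern, $title, $matches);
    return array_map('strtolower', $matches[0]);
}

$pattern = buildTokenPattern(array('sheet', 'sheet set', 'jacket', 'suit', 'oxford shoes'));
$tokens = tokenizeTitle('4-Piece 1000TC 100% Cotton Queen Sheet Set in Ivory', $pattern);
// $tokens: cotton, queen, sheet set, ivory
```

Sorting by length matters only when one dictionary entry is a prefix of another; without it, the regex engine would stop at the first alternative that matches and split "Sheet Set" into two tokens.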
Brief explanation:
The i modifier ensures that the string is matched case-insensitively.
(?<=[^\w-]|^) and (?=[^\w-]|$) are lookarounds that ensure there is nothing interesting directly outside the matched word.
And the performance test: http://3v4l.org/siK9h/perf#tabs