My intention is to eliminate junk words and keep an array of useful phrases. E.g. 'I like to eat marshmallows while also listening to Metallica'. I would like to eliminate the words I, to, while, also. This in turn would produce an array: 0 - like, 1 - eat marshmallows, 2 - listening, 3 - Metallica.
I tried preg_split, separating each word with a | and enclosing each word in brackets:
$arr = preg_split("/ (\bwhere\b)| (\bany\b) |(\bfacebook\b)|(\bthe\b)|(
)|()|(
)|(,) | (\band\b)| (\bundefined\b) /", $bigString);
Problems I encountered: a) if the first word in the string is matched by the regex, it is still not eliminated. For some reason it is kept in the string and stored in the array. b) consecutive matches are sometimes ignored. E.g. take the string 'I eat a lot'. Even though all 4 words should be caught by the regex, the word 'a' is still stored in the array.
The two problems (a and b) have the same cause: each token in your pattern is surrounded by spaces. Consequences: a) it doesn't work when one of the tokens is at the start or at the end of the string, since there is no space on that side. b) it doesn't work when you have consecutive tokens in your string, since the same space can't be matched twice.
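To make this concrete, here is a minimal sketch with a simplified, hypothetical pattern (not your exact one) in which every word is wrapped in literal spaces, run against your 'I eat a lot' example:

```php
<?php
// Simplified stand-in for the original pattern (an assumption for
// illustration): every word needs a literal space on both sides.
$pattern = '/ (?:I|eat|a|lot) /';

$pieces = preg_split($pattern, 'I eat a lot');

// "I" has no space before it and "lot" has no space after it, so neither
// can match. " eat " matches and consumes both of its spaces, so the
// space needed in front of "a" is no longer available.
print_r($pieces);
// Array
// (
//     [0] => I
//     [1] => a lot
// )
```

Only "eat" is removed, which shows both symptoms at once.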
In any case, building one alternation out of all these words is not a good approach, because the pattern's performance degrades each time you add a new branch to the alternation (at each position in the string, in the worst case, the regex engine has to try all the branches).
That's why I suggest another approach: split the string on non-letter characters (more precisely, at each sequence of white-space characters and at each sequence of characters that are neither letters nor white-space). Once that's done, I use array_diff to remove the words you don't want. The main advantage of array_diff is that it preserves the keys, so you only have to look for the gaps in the keys to build the result array.
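As a small illustration of that key-preserving behaviour (the word lists here are just sample values):

```php
<?php
$tokens = ['I', 'like', 'to', 'eat', 'marshmallows'];

// array_diff keeps the original keys of the values that survive
$kept = array_diff($tokens, ['I', 'to']);

print_r($kept);
// Array
// (
//     [1] => like
//     [3] => eat
//     [4] => marshmallows
// )
```

Keys 3 and 4 are consecutive, so 'eat' and 'marshmallows' belong to the same phrase, while the jump from 1 to 3 marks a removed word. Note that array_diff compares values as strings and is case-sensitive, so 'To' would not be removed by 'to'.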
Even if it looks longer and more complicated, this approach is far more scalable:
$str = 'I like to eat marshmallows while also listening to Metallica';
$words = [ '',
'also', 'and', 'any',
'I',
'facebook',
'the', 'to',
'where', 'while' ];
$parts = array_diff(preg_split('~(?=\PL)(?:\s+|[^\pL\s]+)~u', $str), $words);
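$previousKey = false;
$temp = '';
$result = [];
foreach ($parts as $k => $v) {
    if ($previousKey === $k - 1) {
        // consecutive key: the word continues the current phrase
        $temp .= " $v";
    } else {
        // gap in the keys: store the finished phrase and start a new one
        // (strict test, because 0 is a valid key)
        if ($previousKey !== false)
            $result[] = $temp;
        $temp = $v;
    }
    $previousKey = $k;
}
// store the last phrase
if ($previousKey !== false)
    $result[] = $temp;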
print_r($result);
pattern details:
~
(?=\PL) # optimization: fail quickly at positions that hold a letter,
# without testing the whole pattern
(?:
\s+ # any sequence of white-spaces
| # OR
[^\pL\s]+ # any sequence of characters that are not letters or white-spaces
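#
# This way, "eat marshmallows" is split into consecutive items
# and ends up as a single phrase:
# [0] => eat marshmallows
# but "eat, marshmallows" is split into:
# [0] => eat
# [1] =>
# [2] => marshmallows
# and the empty item, once removed by array_diff, leaves a gap that
# keeps the two words apart, as with your original pattern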
)
~u # lets the pattern deal with multibyte UTF-8 strings
Better pattern: ~\PL(?(?<=\s)\s*|[^\pL\s]*)~u
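This version consumes one non-letter character with \PL, then uses a conditional: if that character was a white-space (the lookbehind (?<=\s)), it takes the rest of the white-space run, otherwise it takes the rest of the run of non-letter, non-white-space characters. A quick sketch to check that both patterns split the same way (the sample strings are mine, not from your data):

```php
<?php
$first  = '~(?=\PL)(?:\s+|[^\pL\s]+)~u';
$better = '~\PL(?(?<=\s)\s*|[^\pL\s]*)~u';

// Sample inputs chosen for illustration only
$samples = ['eat, marshmallows', 'rock & roll!!', "tabs\t and  spaces"];

foreach ($samples as $sample) {
    $a = preg_split($first, $sample);
    $b = preg_split($better, $sample);
    // compare the two lists of pieces item by item, keys included
    echo $sample, ': ', ($a === $b ? 'same' : 'different'), "\n";
}

// The comma and the space are two separate delimiters, hence the empty item
print_r(preg_split($better, 'eat, marshmallows'));
// Array
// (
//     [0] => eat
//     [1] =>
//     [2] => marshmallows
// )
```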