I am trying to group words of 4 or more characters with words of 3 or fewer characters using preg_match_all() in PHP. I am doing this for a keyword search function where users can enter things like "An elephant", and I cannot have any results come back that match just "An".
Therefore, instead of breaking the keywords apart at spaces (e.g. "An", "elephant"), I need to attach each keyword of three or fewer characters to the next or previous keyword (e.g. "An elephant", "History of").
To accomplish this I am trying to use conditional subpatterns, but I am not sure I am on the right track.
Here's the best I've got so far:
(\s\S{1,3}\s*)?(?(1)\S+)
Yet I also seem to be matching a whole bunch of empty spaces. Can someone please point me in the right direction?
In the case of "History of elephants" I am trying to get it to create two matches: "History of", and "elephants".
I cannot simply omit the "stop words" because they are important in this case. The real-life use case is searching for course titles such as "Calculus A" and in that case "A" is important.
See if this would match your needs:
\b(?:[\w'-]{1,3}\W+[\w'-]{4,}|[\w'-]{4,}\W+[\w'-]{1,3}|[\w'-]{4,})\b
\b ... \b
word boundaries on both sides of the match.
[\w'-]{1,3}\W+[\w'-]{4,}
matches a word of 1-3 characters (word characters, apostrophe or hyphen), followed by one or more non-word characters (\W+), followed by a word of 4 or more characters.
|[\w'-]{4,}\W+[\w'-]{1,3}
or matches a word of 4+ characters first, followed by a shorter one.
|[\w'-]{4,}
or matches any word with at least 4 characters (reduce if needed).
Test at regex101.com.
Also note the problem with input such as "I visited Calculus A, you in Calculus B?". It outputs "I visited", "Calculus A", "in Calculus", because a short word that precedes a long word claims it first, so "in" takes "Calculus" and "B" is left unmatched.
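A minimal sketch to reproduce that edge case, using the same pattern written on one line (the variable names are only for illustration):

```php
<?php
// Same three alternatives as above: short+long, long+short, long alone.
$pattern = '~\b(?:[\w\'-]{1,3}\W+[\w\'-]{4,}|[\w\'-]{4,}\W+[\w\'-]{1,3}|[\w\'-]{4,})\b~';
$input = "I visited Calculus A, you in Calculus B?";

preg_match_all($pattern, $input, $out);
print_r($out[0]);
// Array
// (
//     [0] => I visited
//     [1] => Calculus A
//     [2] => in Calculus
// )
```

Note that "you" and the final "B" end up in no match at all: "you" is followed by another short word, and "B" has no long neighbour left to pair with.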
And a PHP example ($out[0] holds the matches):
$str = "
An elephant in the garden
history of elephants
Algebra A B-movies";

$pattern = '~\b(?:
    [\w\'-]{1,3}\W+[\w\'-]{4,}|
    [\w\'-]{4,}\W+[\w\'-]{1,3}|
    [\w\'-]{4,}
)\b~x';

if (preg_match_all($pattern, $str, $out)) {
    print_r($out[0]);
}
outputs to:
Array
(
[0] => An elephant
[1] => the garden
[2] => history of
[3] => elephants
[4] => Algebra A
[5] => B-movies
)
There are some complications with what you're trying to do: it gives rise to ambiguities. Is "History of elephants" [History of] [elephants] or [History] [of elephants]? You're probably better off just excluding a set of specific stop words, or words that meet some criterion.
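As a rough sketch of the stop-word approach (the stop-word list and variable names here are hypothetical, chosen only for illustration), you could filter against an explicit list instead of a length rule. That way a meaningful short keyword such as the "A" in "Calculus A" survives as long as it is not on the list:

```php
<?php
// Hypothetical stop-word list; adjust it to the search domain.
$stopWords = array('an', 'of', 'the', 'in');

// Split the query on whitespace and drop empty pieces.
$keywords = preg_split('/\s+/', 'History of Calculus A', -1, PREG_SPLIT_NO_EMPTY);

// Keep every keyword that is not a stop word (case-insensitive for ASCII),
// then reindex the result with array_values().
$kept = array_values(array_filter($keywords, function ($word) use ($stopWords) {
    return !in_array(strtolower($word), $stopWords, true);
}));

print_r($kept);
// Array
// (
//     [0] => History
//     [1] => Calculus
//     [2] => A
// )
```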
If you want to exclude words of 3 or fewer characters, you might try the following. You say you're already splitting the keywords at spaces, so you should have an array of words. You can just array_filter that array based on word length (> 3 chars), and you should have the list of words you want to use. Note that array_filter preserves the original keys.
$words = array('no', 'na', 'sure', 'definitely');

function length_filter($word) {
    return mb_strlen($word) > 3;
}

$longer_than_3 = array_filter($words, 'length_filter');
print_r($longer_than_3);
// Array
// (
//     [2] => sure
//     [3] => definitely
// )