Trying to use preg_match_all to group words of 3 or fewer characters with words of 4 or more characters

I am trying to group words of three or fewer characters with words of four or more characters using preg_match_all() in PHP. This is for a keyword search function where users can enter things like "An elephant", and I cannot have any results come back that match only "An".

Therefore, instead of breaking the keywords apart at spaces (e.g. "An", "elephant"), I need to join each keyword of three or fewer characters with the next or previous keyword (e.g. "An elephant", "History of").

To accomplish this I am trying to use conditional subpatterns, but I am not sure I am on the right track.

Here's the best I've got so far:

(\s\S{1,3}\s*)?(?(1)\S+)

Yet I seem to also be matching a whole bunch of empty spaces as well. Can someone please point me in the right direction?

In the case of "History of elephants" I am trying to get it to create two matches: "History of", and "elephants".

I cannot simply omit the "stop words" because they are important in this case. The real-life use case is searching for course titles such as "Calculus A" and in that case "A" is important.

See if this would match your needs:

\b(?:[\w'-]{1,3}\W+[\w'-]{4,}|[\w'-]{4,}\W+[\w'-]{1,3}|[\w'-]{4,})\b
  • Starts and ends at \b word boundaries, and between them tries three alternatives in order. The class [\w'-] treats apostrophes and hyphens as part of a word.
  • [\w'-]{1,3}\W+[\w'-]{4,} matches a word of 1-3 characters, then \W+ (one or more non-word characters), then [\w'-]{4,} (a word of 4 or more characters).
  • |[\w'-]{4,}\W+[\w'-]{1,3} or matches a word of 4+ characters followed by a shorter one.
  • |[\w'-]{4,} or matches a single word of at least 4 characters (lower the minimum if needed).

Test at regex101.com; Regex FAQ

Note the problem with input such as "I visited Calculus A, you in Calculus B?": the output is "I visited", "Calculus A", "in Calculus", because a short word that precedes a long word takes priority. "you" and "B" are not matched at all.


And a PHP example ($out[0] holds the matches):

$str = "
An elephant in the garden 
history of elephants
Algebra A B-movies";

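$pattern = '~\b(?:
[\w\'-]{1,3}\W+[\w\'-]{4,}|   # short word (1-3 chars) followed by a long word (4+)
[\w\'-]{4,}\W+[\w\'-]{1,3}|   # long word followed by a short word
[\w\'-]{4,}                   # long word on its own
)\b~x';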

if(preg_match_all($pattern, $str, $out)) {
  print_r($out[0]);
}

This outputs:

Array
(
    [0] => An elephant
    [1] => the garden
    [2] => history of
    [3] => elephants
    [4] => Algebra A
    [5] => B-movies
)

Test at eval.in (link expires soon)
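
The priority problem mentioned earlier can be reproduced with the same pattern. This is only a quick sketch; the input sentence is the one from the note above and the variable names are arbitrary.

```php
<?php
// Same pattern as in the example above; alternatives are tried left to right
// at each starting position.
$pattern = '~\b(?:
[\w\'-]{1,3}\W+[\w\'-]{4,}|
[\w\'-]{4,}\W+[\w\'-]{1,3}|
[\w\'-]{4,}
)\b~x';

$str = "I visited Calculus A, you in Calculus B?";

preg_match_all($pattern, $str, $out);
$matches = $out[0];

// "in" grabs the second "Calculus", so the trailing "B" has nothing to pair with.
print_r($matches);
```

Here "you" and "B" are left out entirely, so whether this behavior is acceptable depends on how the matches feed into the search.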

There are some complications with what you're trying to do, because it gives rise to ambiguities. Is "History of elephants" [History of] [elephants] or [History] [of elephants]? You're probably better off just excluding a set of specific stop words, or words that meet some criteria.
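
To make the ambiguity concrete, here is a small sketch comparing the pattern from the other answer with a hypothetical variant of my own ($attachToNext) in which short words only attach to the long word that follows them. Both variable names are made up for illustration.

```php
<?php
$str = "History of elephants";

// Pattern from the other answer: a long word may absorb the short word after it.
$attachToPrevious = '~\b(?:[\w\'-]{1,3}\W+[\w\'-]{4,}|[\w\'-]{4,}\W+[\w\'-]{1,3}|[\w\'-]{4,})\b~';

// Hypothetical variant: the "long word followed by short word" alternative is removed,
// so a short word can only be grouped with the long word that comes next.
$attachToNext = '~\b(?:[\w\'-]{1,3}\W+[\w\'-]{4,}|[\w\'-]{4,})\b~';

preg_match_all($attachToPrevious, $str, $a);
preg_match_all($attachToNext, $str, $b);

print_r($a[0]); // History of, elephants
print_r($b[0]); // History, of elephants

// The variant has its own cost: a trailing short word is never matched.
preg_match_all($attachToNext, "Calculus A", $c);
print_r($c[0]); // Calculus
```

Neither grouping is more correct than the other; the regex just encodes whichever rule you pick.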

If you want to exclude words of three or fewer characters, you might try the following. You say you're already splitting the keywords at spaces, so you should have an array of words. You can run array_filter on that array based on word length (more than 3 characters) to get the list of words you want to use.

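$words = array('no', 'na', 'sure', 'definitely');

// Keep only words longer than 3 characters (mb_strlen counts characters, not bytes).
function length_filter($word) {
    return mb_strlen($word) > 3;
}

// array_filter preserves the original keys, hence [2] and [3] in the output.
$longer_than_3 = array_filter($words, 'length_filter');
print_r($longer_than_3);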

// Array
// (
//     [2] => sure
//     [3] => definitely
// )
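
The stop-word alternative mentioned above could look roughly like this. It is a minimal sketch: the $stopWords list is a made-up example, and which words belong in it depends entirely on your data.

```php
<?php
// Hypothetical stop-word list; adjust it to the application.
$stopWords = array('an', 'the', 'of', 'in');

$query = "An elephant in the garden";

// Split on runs of whitespace, dropping empty pieces.
$words = preg_split('/\s+/', $query, -1, PREG_SPLIT_NO_EMPTY);

// Keep every word that is not in the stop-word list (compared in lower case),
// then reindex so the keys run 0, 1, 2, ...
$keywords = array_values(array_filter($words, function ($word) use ($stopWords) {
    return !in_array(strtolower($word), $stopWords, true);
}));

print_r($keywords);
// Array
// (
//     [0] => elephant
//     [1] => garden
// )
```

Because "a" is not in this list, a course title such as "Calculus A" keeps its "A", which filtering purely by length would throw away.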