Counting occurrences of words in a text

I have a text in which I would like to count occurrences of the phrase "lorem ipsum dolor".

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.

The algorithm should count occurrences even if the words of the search phrase appear in a different order. The expected matches are listed below. Is there a better way to achieve this than using a regular expression with every possible combination?

In this case the result should be equal to 3:

Lorem ipsum dolor
Ipsum lorem dolor
Dolor ipsum lorem

The phrase will have about 3-4 words, and the string will be the content of a web page.

$haystack = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';
$needle = 'Lorem ipsum dolor';

$hayWords = str_word_count(
    strtolower($haystack), 
    1
);
$needleWords = str_word_count(
    strtolower($needle), 
    1
);
$needleWordsCount = count($needleWords);

$foundWords = array_intersect(
    $hayWords, 
    $needleWords
);

$count = array_reduce(
    array_keys($foundWords),
    function($counter, $item) use ($foundWords, $needleWordsCount) {
        for($i = $item; $i < $item + $needleWordsCount; ++$i) {
            if (!isset($foundWords[$i]))
                return $counter;
        }
        return ++$counter;
    },
    0
);

var_dump($count);
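The snippet above only checks that each word in a window belongs to the needle's word set, so a run such as "lorem lorem lorem" would also be counted. If each needle word must appear exactly once, a sliding window over sorted words is one possible variation. This is a minimal sketch, not the answer's method; `countPhraseAnyOrder` is a hypothetical helper name, and overlapping windows are each counted.

```php
<?php
// Hypothetical helper (not part of the answer above): counts windows of
// consecutive words that contain exactly the needle's words, in any order.
function countPhraseAnyOrder(string $haystack, string $needle): int
{
    $hayWords = str_word_count(strtolower($haystack), 1);
    $needleWords = str_word_count(strtolower($needle), 1);
    $size = count($needleWords);
    if ($size === 0) {
        return 0;
    }
    // Sorting both sides makes the comparison independent of word order.
    sort($needleWords);

    $count = 0;
    $last = count($hayWords) - $size;
    for ($i = 0; $i <= $last; ++$i) {
        $window = array_slice($hayWords, $i, $size);
        sort($window);
        if ($window === $needleWords) {
            ++$count;
        }
    }
    return $count;
}

$haystack = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';

echo countPhraseAnyOrder($haystack, 'Lorem ipsum dolor'), "\n";            // 3
echo countPhraseAnyOrder('lorem lorem lorem', 'Lorem ipsum dolor'), "\n";  // 0
```

Sorting every window costs more than the key lookup used above, which should not matter for a 3-4 word phrase.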

You could try the regex:

/(?:(?:(?:lorem|ipsum|dolor)\s?)+)/i

with preg_match_all and then count the number of matches. PHP has no g modifier; preg_match_all already finds every match. From your sample, you should get 3 matches.
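A minimal sketch of how this could be run in PHP, using only the `i` modifier (PHP's preg_* functions have no `g` modifier). The variable names are illustrative. The second call shows a caveat: the pattern matches any run of the three words, so a single word on its own is already a match.

```php
<?php
// Case-insensitive pattern: one or more of the three words, each optionally
// followed by a single whitespace character.
$pattern = '/(?:(?:(?:lorem|ipsum|dolor)\s?)+)/i';

$haystack = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';

// preg_match_all returns the number of full pattern matches.
$matchCount = preg_match_all($pattern, $haystack, $matches);
echo $matchCount, "\n"; // 3

// Caveat: a lone "Lorem" is a complete match for this pattern.
$loneCount = preg_match_all($pattern, 'Lorem sit amet', $loneMatches);
echo $loneCount, "\n"; // 1
```

If partial runs should not count, the matches in `$matches[0]` would need an extra check on their word count.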

I'm not too good at algorithms or at PHP, but here's an attempt...

<?php

$string = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';

$lower_string = strtolower($string);

$text = array('lorem', 'ipsum', 'dolor');

$perms = AllPermutations($text);
$result = 0;
foreach ($perms as $piece) {
    $phrase = join(' ', $piece);
    $result += substr_count($lower_string, $phrase);
}

# From http://stackoverflow.com/a/12749950/1578604
function AllPermutations($InArray, $InProcessedArray = array())
{
    $ReturnArray = array();
    foreach($InArray as $Key=>$value)
    {
        $CopyArray = $InProcessedArray;
        $CopyArray[$Key] = $value;
        $TempArray = array_diff_key($InArray, $CopyArray);
        if (count($TempArray) == 0)
        {
            $ReturnArray[] = $CopyArray;
        }
        else
        {
            $ReturnArray = array_merge($ReturnArray, AllPermutations($TempArray, $CopyArray));
        }
    }
    return $ReturnArray;
}

echo $result;
?>


I think you are looking for this: http://nl1.php.net/substr_count

$text = 'This is a test';
echo strlen($text); // 14

echo substr_count($text, 'is'); // 2

// the string is reduced to 's is a test', so it prints 1
echo substr_count($text, 'is', 3);

// the text is reduced to 's i', so it prints 0
echo substr_count($text, 'is', 3, 3);

// generates a warning because 5+10 > 14 (throws a ValueError as of PHP 8.0)
echo substr_count($text, 'is', 5, 10);


// prints only 1, because it doesn't count overlapped substrings
$text2 = 'gcdgcdgcd';
echo substr_count($text2, 'gcdgcd');
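Applied to the sample text from the question, substr_count on its own is both case-sensitive and order-sensitive, so each ordering of the phrase has to be counted separately. A small sketch of that, with illustrative variable names:

```php
<?php
// Lowercase the text first, because substr_count is case-sensitive.
$haystack = strtolower('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.');

$orderings = [
    'lorem ipsum dolor',
    'ipsum lorem dolor',
    'dolor ipsum lorem',
    'dolor lorem ipsum', // this ordering does not occur in the sample
];

$counts = [];
foreach ($orderings as $ordering) {
    $counts[$ordering] = substr_count($haystack, $ordering);
}

print_r($counts);              // 1, 1, 1, 0
echo array_sum($counts), "\n"; // 3
```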

Note: Works with "Lorem ipsum dolor dolor" as well.

Good evening everyone. I have figured out another technique. It takes a different approach from Mark Baker's answer, which I appreciate very much. Memory usage is compared at the end.

In a nutshell, it takes the basic string to be matched (lorem ipsum dolor) and shuffles its words into all possible orderings (in this case 3! = 6).

All 6 orderings are then added to an array, and substr_count is run against each of them. I am also using shuffle(), in_array and array_push.

The code is self-explanatory, and if you are curious, here's my IDEONE. This is Mark Baker's solution on IDEONE. They both take the same amount of time and memory, and my solution is 4 lines shorter, if not more elegant :)

<?php

    $string = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';

//convert main string to lowercase to have an even playing field
    $string2 = strtolower($string);
    $substring = 'lorem ipsum dolor';

//add the first lorem ipsum dolor to launch the array
    $arr = array($substring);

//number of orderings to collect: n! (6 for 3 words); assumes the words are all different
    $total = array_product(range(1, count(explode(" ",$substring))));

//run until the array is full with all possible combinations i.e. 6 (factorial of 3)
    while (count($arr) < $total) {
        $wordArray = explode(" ",$substring);
        shuffle($wordArray);
        $randString= implode(" ",$wordArray);

//push the new value only if the random string isn't in the array yet
        if (! (in_array($randString,$arr)) ) {
            array_push($arr,$randString);
        }

    }

//var_dump($arr);

//here, we do the matching: add up the occurrences of every ordering
    $sum = 0;
    $n = sizeof($arr);
    for ($q=0; $q<$n; $q++) {
        $sum += substr_count($string2,$arr[$q]);
    }

    echo "Total occurrences: ".$sum;

?>

Memory Usage

As you can see, Mark's code beats mine on two counts, but the difference is negligible given how small this program and its data are. The gap could grow with a more complex program, but this is what it is.
