How can I improve my algorithm to find hot topics, like Twitter does?

I have created a cron job for my website that runs every 2 hours; it counts the words in the feeds and then displays the 10 highest-count words as the hot topics.

Something like what Twitter does on its homepage, showing the most popular topics currently being discussed.

What my cron job does right now is count every word except for the ones I have blacklisted, words like:

array('of', 'a', 'an', 'also', 'besides', 'equally', 'further', 'furthermore', 'in', 'addition', 'moreover', 'too',
      'after', 'before', 'when', 'while', 'as', 'by', 'the', 'that', 'since', 'until', 'soon', 'once', 'so', 'whenever', 'every', 'first', 'last',
      'because', 'even', 'though', 'although', 'whereas', 'if', 'unless', 'only', 'whether', 'or', 'not', 'next',
      'likewise', 'however', 'contrary', 'other', 'hand', 'contrast', 'nevertheless', 'brief', 'summary', 'short',
      'for', 'example', 'for instance', 'fact', 'finally', 'in brief', 'in conclusion', 'in other words', 'in short', 'in summary', 'therefore',
      'accordingly', 'as a result', 'consequently', 'for this reason', 'afterward', 'in the meantime', 'later', 'meanwhile', 'second', 'earlier', 'still', 'then', 'third'); // words that are negligible

But this does not completely solve the problem of eliminating all the unwanted words and returning only the ones that are useful.

Can someone please guide me on this and tell me how I can improve my algorithm?

Regards, Zeeshan

Here's how we implemented this for the DjangoDose live feed during DjangoCon (note: this is a hack job; we wrote it in one afternoon with no testing, and it may occasionally be yelling "Bifurcation" — as best I can tell, bifurcation has nothing to do with anything). All that being said, it more or less worked for us (meaning that in the evenings, beer was tracked appropriately).

import os
import string
from collections import defaultdict

from django.conf import settings
from django.http import HttpResponse
from django.utils import simplejson

# One stop word per whitespace-separated token in ignores.txt.
IGNORED_WORDS = set(open(os.path.join(settings.ROOT_PATH, 'djangocon', 'ignores.txt')).read().split())

def trending_topics(request):
    # LOG_DIRECTORY (defined elsewhere) holds log files with one
    # JSON-encoded tweet per line; take the 4 most recent files.
    logs = sorted(os.listdir(LOG_DIRECTORY), reverse=True)[:4]
    tweets = []
    for log in logs:
        f = open(os.path.join(LOG_DIRECTORY, log), 'r')
        for line in f:
            tweets.append(simplejson.loads(line)['text'])
    words = defaultdict(int)
    for text in tweets:
        prev = None
        for word in text.split():
            word = word.strip(string.punctuation).lower()
            if word and word not in IGNORED_WORDS:
                words[word] += 1
                # Prefer bigrams: credit the two-word phrase and take back
                # the point each constituent word just earned.
                if prev is not None:
                    words['%s %s' % (prev, word)] += 1
                    words[prev] -= 1
                    words[word] -= 1
                prev = word
            else:
                prev = None
    trending = sorted(words.items(), key=lambda o: o[1], reverse=True)[:15]
    # Staff see the counts alongside the words; everyone else gets bare words.
    if request.user.is_staff:
        trending = ['%s - %s' % (word, count) for word, count in trending]
    else:
        trending = [word for word, count in trending]
    return HttpResponse(simplejson.dumps(trending))

Welcome to the wonderful world of language processing. Basically, anything like trending topics and friends is a search for anomalies in language use.

Theoretically, by analyzing the frequency of words over time, you should be able to filter out the noise (common words, like the ones you listed above). This is not trivial to implement, but it is definitely a possibility.

Another approach would be to concentrate not on the raw number of words in a specific period of time, but rather on the pattern in which trending topics develop. They usually grow somewhat exponentially, and it should be possible to revalidate the results of your existing search by applying a filter that drops all "hot words" that do not conform to this kind of growth.

Just some thoughts :-)

Edit:

To further outline what I meant by filtering by frequency: maybe you should check on dictionaries containing frequency information about words. It's not so hard to build one, and with a solid text corpus (Wikipedia is free to download; I used it for a test) you'll get remarkably good results.
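For instance, here's a rough PHP sketch of such a filter; the corpus file name, its format (one "word per-million-frequency" pair per line), and the threshold are all assumptions on my part:

function loadCorpusFrequencies($path)
{
    // Each line of the (hypothetical) corpus file: "word per-million-frequency".
    $freq = array();
    foreach (file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        list($word, $perMillion) = explode(' ', $line);
        $freq[$word] = (float) $perMillion;
    }
    return $freq;
}

function filterCommonWords(array $words, array $corpusFreq, $threshold = 500.0)
{
    $kept = array();
    foreach ($words as $word) {
        // Keep a word only if the corpus uses it rarely; unknown words are
        // kept too, since they are often names or new coinages.
        if (!isset($corpusFreq[$word]) || $corpusFreq[$word] < $threshold) {
            $kept[] = $word;
        }
    }
    return $kept;
}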

What you are looking for is generally called a stop word list. Here's a blog post with a list of them (even in PHP array format for you) and another text file version.

A bit more googling should find you some other examples.

A few more potential ideas for improving the algorithm in general:

  1. Weight the word usage by recency. You already do this by recalculating every 2 hours, but you could also factor in the exact time since each word was used. So, rather than each mention of a word being worth 1 "point", its point value would be determined by the time in minutes since the message containing it was posted (see the sketch after this list).

  2. Create a database table of words and their average frequency in messages on your site. When you examine the messages created in the last X hours, compare each word's frequency to its average frequency in your database. Words with a frequency significantly higher than average would be considered "hot". Make sure you recalculate the average frequencies on a semi-regular basis (once a day, maybe?).
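Here's a rough PHP sketch of idea 1, assuming each message is an array with a 'text' string and a Unix 'posted_at' timestamp (the two-hour half-life is an arbitrary choice):

function recencyWeightedCounts(array $messages, $halfLifeMinutes = 120)
{
    $scores = array();
    $now = time();
    foreach ($messages as $message) {
        $ageMinutes = ($now - $message['posted_at']) / 60;
        // Exponential decay: a mention loses half its weight
        // every $halfLifeMinutes minutes.
        $weight = pow(0.5, $ageMinutes / $halfLifeMinutes);
        foreach (preg_split('/\s+/', strtolower($message['text'])) as $word) {
            $word = trim($word, '.,!?;:"\'()');
            if ($word !== '') {
                $scores[$word] = (isset($scores[$word]) ? $scores[$word] : 0) + $weight;
            }
        }
    }
    arsort($scores);
    return $scores;
}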

I don't really know if you're looking for a better way to filter unimportant words or for a way to generate the actual top ten list from a set of valid words.

For filtering, I would suggest a blacklisting approach if you want to keep it simple. If you want something more sophisticated, you could build up statistics to identify words that are used fairly often and filter them from your word list.

For counting, sorting and truncating the actual "trending topic" list, I'd suggest this:

function buildTopTen($word = null)
{
    // Word counts persist across calls via the static variable.
    static $wordList = array();
    if (isset($word)) {
        // Called with a word: increment its count and return the new total.
        if (!isset($wordList[$word])) {
            $wordList[$word] = 0;
        }
        $wordList[$word]++;
        return $wordList[$word];
    } else {
        // Called without arguments: sort by count, descending, and
        // return the ten most frequent words.
        arsort($wordList);
        return array_slice($wordList, 0, 10);
    }
}

Just call the function with a word parameter until you're finished. It will return the current count of that word, if that's of any use to you.

Call it once without parameters and it will give you the ten most frequently used words from the words you fed to it.
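For example (the input here is made up):

// Count a batch of words, then fetch the top ten.
foreach (array('beer', 'django', 'beer', 'conf') as $word) {
    buildTopTen($word);
}
print_r(buildTopTen()); // no argument: returns the ten highest counts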

I tested it and performance seems OK so far.

Of course this is just a suggestion and can be refined much further.

Here's an idea:

Compute the average usage frequency of every word in the English language. Put these in a lookup table. You will likely want to only keep the most common words. Pick a number of words (say, 5000), or a minimum frequency, that makes sense. You will likely still want to have a list of words that you never show. If you sort your list of words by frequency, it won't take you long to look through them and choose which words to always exclude.

To compute the frequency, you need an input sample. Your choice of input sample will affect the outcome. For example, Twitter could use every twitter message that has ever been posted as its input sample. Topics that are constantly being discussed on Twitter (for example, "Twitter" itself) would lose importance. If you want Twitter-specific topics to keep their significance, then find another input sample.

The algorithm for computing frequency is simple:

  • For each word in the sample:
    • Look that word up in the dictionary
    • Add one to a counter associated with that word
  • To normalize the data, divide each word's count by the total number of words in the input sample.

Now, if you run the same algorithm on today's Twitter posts, you can compare today's word frequencies with the expected word frequencies; a rough sketch follows.
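A rough PHP sketch of both steps; the input format (a flat array of already-tokenized words) and the 3x threshold are assumptions:

function wordFrequencies(array $words)
{
    $total = count($words);
    if ($total === 0) {
        return array();
    }
    $frequencies = array();
    // Normalize: divide each word's count by the total number of words.
    foreach (array_count_values($words) as $word => $count) {
        $frequencies[$word] = $count / $total;
    }
    return $frequencies;
}

function hotWords(array $todayFreq, array $expectedFreq, $ratioThreshold = 3.0)
{
    $hot = array();
    foreach ($todayFreq as $word => $freq) {
        $expected = isset($expectedFreq[$word]) ? $expectedFreq[$word] : 0.0;
        // A word is "hot" if the baseline has never seen it, or if it is
        // well above its expected frequency.
        if ($expected == 0.0 || $freq / $expected >= $ratioThreshold) {
            $hot[$word] = $freq;
        }
    }
    arsort($hot);
    return $hot;
}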

If you want the statistically significant outliers, you may want to calculate a z-score for each word in a recent subset relative to the overall text.

So if

t = the number of occurrences of the word in the subset
o = the number of occurrences of the word overall
n_t = the number of words in the subset
n_o = the number of words overall

then calculate:

p_hat = t / n_t
p_0 = o / n_o

z = (p_hat - p_0) / sqrt((p_0 * (1 - p_0)) / n_t)

The higher the z, the more statistically significant the mention of the word in the subset is relative to the overall text. This can also be used to calculate words that are oddly rare in the subset relative to the overall text.
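In PHP, the formula is a direct translation (the numbers in the example call are made up):

// Arguments follow the definitions given above.
function zScore($t, $o, $n_t, $n_o)
{
    $p_hat = $t / $n_t; // observed proportion in the subset
    $p_0   = $o / $n_o; // expected proportion from the overall text
    return ($p_hat - $p_0) / sqrt(($p_0 * (1 - $p_0)) / $n_t);
}

// A word seen 150 times in a 10,000-word subset, but only 2,000 times
// in a 10,000,000-word corpus, scores a very large positive z.
echo zScore(150, 2000, 10000, 10000000);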

You might want to research the usage of Markov Chains / Hidden Markov Models.

These mathematical models have been rather successful in Natural Language Processing.

Your accuracy on trending topics would be much higher. (And you can let it learn...)

You might want to check out NLTK (Natural Language Toolkit). There is a free book that will teach you how to use it, available at http://www.nltk.org/book. The only downside is that it's in Python, and I assume you need a PHP solution. Don't be too scared, because the book doesn't expect you to know any Python beforehand.

NLTK is very powerful and definitely worth looking into.

Here's a cheap-and-cheerful way to do it.

Every 2 hours, create a histogram of your word frequencies, e.g.:

Snow, 150
Britney, 100
The, 150000

Keep all of these files.

Every so often, grab 50 files from your history and average the frequencies. This flattens out trends that have developed over time. So, from these three example files:

Snow, 10
Britney, 95
...

Snow, 8
Britney, 100
...

Snow, 12
Britney, 105
...

you get this baseline set:

Snow, 10
Britney, 100

Work out the ratio between this baseline set and the most recent one:

Snow 1500%
Britney 100%

Your trends are the ones with the highest ratios. Watch out for divide-by-zeros here.
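Here's a rough PHP sketch of the ratio step; the histogram format (word => count maps) and the 0.1 floor for unseen words are my assumptions:

function trendRatios(array $current, array $baseline)
{
    $ratios = array();
    foreach ($current as $word => $count) {
        // Divide-by-zero guard: give unseen words a tiny baseline count so
        // brand-new words surface at the top instead of crashing the script.
        $base = (isset($baseline[$word]) && $baseline[$word] > 0) ? $baseline[$word] : 0.1;
        $ratios[$word] = 100 * $count / $base; // percentage, as in the example above
    }
    arsort($ratios);
    return $ratios;
}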

What's nice about this is that you can tune your trends by picking data from a longer or shorter time period. E.g., you can see this month's trends by averaging over the year, and this afternoon's trends by averaging over the last 24 hours.

Edit -- with this algorithm, you don't need to worry about stop words, because they'll all have relatively stable ratios, at about 100%, so they will always be boring (i.e. off-trend).

Is it not easier to scan each feed entry at creation time, rather than doing one big, massive search every 2 hours?