Creating a natural language search engine in PHP

I'm trying to code up a natural language parser and search engine in PHP. All of the approaches I have thought of so far have been either cumbersome to implement or use, or not very efficient.

One of my ideas was a script that runs regular expressions against a simplified string, i.e. the query with various filler words removed. The resulting string would be checked first for what the user is looking for (e.g. "opening times"), and then, if possible, for the venue they're searching for (let's say "Derngate"). The rest is similar to that.
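
Roughly, the idea would look something like the sketch below. The stop-word list, the topic phrases, the venue names and the function names are all made up for illustration; it is only meant to show the shape of the approach, not a finished parser.

```php
<?php
// Hypothetical data, invented for this example.
const STOP_WORDS = ['what', 'are', 'the', 'for', 'at', 'of', 'is', 'a', 'please'];
const TOPICS = ['opening times' => 'opening_times', 'ticket prices' => 'ticket_prices'];
const VENUES = ['derngate', 'roadmender'];

// Lowercase the query, strip punctuation and drop the stop words.
function simplifyQuery(string $query): string
{
    $clean = preg_replace('/[^a-z0-9\s]/', '', strtolower($query));
    $words = preg_split('/\s+/', trim($clean), -1, PREG_SPLIT_NO_EMPTY);
    return implode(' ', array_diff($words, STOP_WORDS));
}

// Check the simplified string for a known topic first, then for a known venue.
function parseQuery(string $query): array
{
    $simple = simplifyQuery($query);
    $result = ['topic' => null, 'venue' => null];
    foreach (TOPICS as $phrase => $id) {
        if (preg_match('/\b' . preg_quote($phrase, '/') . '\b/', $simple)) {
            $result['topic'] = $id;
            break;
        }
    }
    foreach (VENUES as $venue) {
        if (preg_match('/\b' . preg_quote($venue, '/') . '\b/', $simple)) {
            $result['venue'] = $venue;
            break;
        }
    }
    return $result;
}

print_r(parseQuery('What are the opening times for the Derngate?'));
```

Every extra topic or venue adds another regular expression to the loop, which is exactly the part I'd like to avoid.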

Can anyone point me in the direction of a more efficient way of doing things? I don't want to be running 25 different regular expressions - or whatever the count is - on each page load if I can help it.

Many thanks!

Edit: I'm just curious, that's all. I'd rather make my own (to see how it works) rather than jumping into something like Lucene.

I think that after a review of the state of the art, I'd look at root/stem word extraction as a start. (Not too heavy a task if your document corpus is relatively static, since this can be done at document-capture time.)

There's a PHP extension for that, stem. http://pecl.php.net/package/stem

There's also the Porter Stemmer implemented in pure PHP as a function; stemming is the key operation the extension above performs.
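
To make the "stem at document-capture time" idea concrete, here is a minimal sketch of a stemmed inverted index. Note the assumptions: `crudeStem()` is a deliberately naive suffix stripper written for this example (it is not the Porter algorithm and not the API of the stem extension), and the documents and function names are hypothetical. In practice you would swap in a real stemmer.

```php
<?php
// Crude stand-in for a real stemmer: strips a few common English suffixes.
// This is NOT the Porter algorithm; replace it with a proper stemmer.
function crudeStem(string $word): string
{
    foreach (['ing', 'es', 'ed', 's'] as $suffix) {
        $len = strlen($suffix);
        if (strlen($word) > $len + 2 && substr($word, -$len) === $suffix) {
            return substr($word, 0, -$len);
        }
    }
    return $word;
}

// Split text into lowercase words and reduce each one to its stem.
function tokenize(string $text): array
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_map('crudeStem', $words);
}

// Done once, when documents are captured: stem => [docId => true].
function buildIndex(array $documents): array
{
    $index = [];
    foreach ($documents as $docId => $text) {
        foreach (tokenize($text) as $stem) {
            $index[$stem][$docId] = true;
        }
    }
    return $index;
}

// At query time: stem the query the same way and intersect the posting lists.
function search(array $index, string $query): array
{
    $hits = null;
    foreach (tokenize($query) as $stem) {
        $docs = $index[$stem] ?? [];
        $hits = $hits === null ? $docs : array_intersect_key($hits, $docs);
    }
    return array_keys($hits ?? []);
}

$index = buildIndex([
    1 => 'Derngate opening times',
    2 => 'Derngate ticket prices',
    3 => 'Roadmender opens at seven',
]);
print_r(search($index, 'derngate opened')); // only document 1 contains both stems
```

The point of the design is that a query costs a few array lookups instead of a batch of regular expressions, and "opened", "opens" and "opening" all land on the same index entry.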

You should look into MapReduce and parallelization:

http://code.google.com/edu/parallel/mapreduce-tutorial.html

That's how Google does it, I believe. Of course, you don't have a billion computers to help you.
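
For what it's worth, the pattern itself is small enough to sketch in a single PHP process. This is only an illustration of the map and reduce steps for building a word index; the function names and sample documents are hypothetical, and nothing here is actually distributed or parallel.

```php
<?php
// Map step: one document in, a list of [word, docId] pairs out.
function mapDocument(int $docId, string $text): array
{
    $pairs = [];
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach (array_unique($words) as $word) {
        $pairs[] = [$word, $docId];
    }
    return $pairs;
}

// Reduce step: group the pairs by word into word => sorted list of docIds.
function reducePairs(array $pairs): array
{
    $index = [];
    foreach ($pairs as [$word, $docId]) {
        $index[$word][] = $docId;
    }
    foreach ($index as &$docIds) {
        sort($docIds);
    }
    unset($docIds);
    ksort($index);
    return $index;
}

$documents = [1 => 'Derngate opening times', 2 => 'Derngate ticket prices'];

$pairs = [];
foreach ($documents as $docId => $text) {
    // Each map call is independent of the others, which is what would let
    // a real system hand them to separate workers.
    $pairs = array_merge($pairs, mapDocument($docId, $text));
}

$index = reducePairs($pairs);
print_r($index['derngate']); // document ids 1 and 2
```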

(I would also say that doing this in pure PHP is going to be terribly slow.)

You will certainly need to study a bit of Information Retrieval and Natural Language Processing. You won't get anywhere close to the performance or effectiveness of Google or Bing with regular expressions.

Also, if you want to do serious work in this area, you should probably move up to a more "efficient" language (C#, Java, C/C++...).