I'm trying to code up a natural language parser and search engine in PHP. All of the approaches I have thought of thus far have been either cumbersome to implement and use, or not very efficient.
One of my ideas was a script that would run regular expressions against a simplified string, i.e. one with various stop words removed, then check the resulting string first for what the user is looking for (e.g. "opening times"), then, if possible, for the venue they're searching for, let's say "Derngate". The rest follows the same pattern.
Can anyone point me in the direction of a more efficient way of doing things? I don't want to be running 25 different regular expressions, or whatever the count is, on each page load if I can help it.
Many thanks!
Edit: I'm just curious, that's all. I'd rather make my own (to see how it works) rather than jumping into something like Lucene.
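For reference, the approach described above (strip stop words, then match the remainder against known intents and venues) could be sketched roughly like this — the intent patterns, venue list, and function name are all hypothetical placeholders:

```php
<?php
// Hypothetical sketch of the approach described in the question:
// strip stop words, then match the remaining string against known
// intents and a venue list (which would come from a database in practice).
function parseQuery(string $query): array
{
    $stopWords = ['the', 'a', 'an', 'for', 'of', 'at', 'what', 'are'];
    $words = preg_split('/\s+/', strtolower(trim($query)));
    $words = array_diff($words, $stopWords);
    $simplified = implode(' ', $words);

    $result = ['intent' => null, 'venue' => null];

    // Intent patterns (assumed examples, not an exhaustive list)
    if (preg_match('/opening (times|hours)/', $simplified)) {
        $result['intent'] = 'opening_times';
    }

    // Hypothetical venue list
    $venues = ['derngate', 'royal theatre'];
    foreach ($venues as $venue) {
        if (strpos($simplified, $venue) !== false) {
            $result['venue'] = $venue;
            break;
        }
    }
    return $result;
}

print_r(parseQuery('What are the opening times for Derngate?'));
```

Even simplified like this, every intent still costs at least one regex per query, which is the scaling problem the question is asking about.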
I think that after a review of the state of the art, I'd look at root/stem word extraction as a start. (Not too heavy a task if your document corpus is relatively static, since this can be done at document-capture time.)
There's a PHP extension for that: stem (http://pecl.php.net/package/stem).
There's also the Porter Stemmer implemented in pure PHP; stemming is the key operation in the above, implemented as a single function.
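To illustrate stemming at document-capture time, here's a minimal sketch with a pluggable stemming function. The toy suffix-stripping closure below is NOT a real Porter stemmer — in practice you would drop in the PECL extension's function or a pure-PHP Porter implementation instead:

```php
<?php
// Sketch of stemming at document-capture time. The stemmer is pluggable:
// swap in the PECL stem extension or a pure-PHP Porter stemmer here.
function stemTokens(array $words, callable $stemmer): array
{
    return array_map($stemmer, $words);
}

// Toy suffix-stripping stand-in, NOT a real Porter stemmer;
// used here only so the example runs without the extension.
$toyStemmer = function (string $w): string {
    return preg_replace('/(ing|ed|s)$/', '', $w);
};

print_r(stemTokens(['opening', 'times', 'booked'], $toyStemmer));
// with this toy rule: open, time, book
```

Since the corpus is relatively static, the stemmed tokens can be stored alongside each document once, rather than being recomputed per query.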
You should look into MapReduce and parallelization:
http://code.google.com/edu/parallel/mapreduce-tutorial.html
That's how Google does it, I believe. Of course, you don't have a billion computers to help you.
(I would also say that doing this in pure PHP is going to be terribly slow.)
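To make the pattern concrete, here's a toy single-machine illustration of map/reduce as described in the tutorial linked above — word counting, with the map phase emitting (word, 1) pairs and the reduce phase summing per word. A real MapReduce system distributes these two phases across many machines:

```php
<?php
// Map phase: turn each document into (word, 1) pairs.
function mapPhase(array $docs): array
{
    $pairs = [];
    foreach ($docs as $doc) {
        $words = preg_split('/\W+/', strtolower($doc), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            $pairs[] = [$word, 1];
        }
    }
    return $pairs;
}

// Reduce phase: sum the counts for each word.
function reducePhase(array $pairs): array
{
    $counts = [];
    foreach ($pairs as [$word, $n]) {
        $counts[$word] = ($counts[$word] ?? 0) + $n;
    }
    return $counts;
}

$counts = reducePhase(mapPhase(['the show starts soon', 'the show is sold out']));
print_r($counts); // e.g. 'the' => 2, 'show' => 2, 'soon' => 1, ...
```

On one machine this is just a grouped count; the value of the pattern only appears when the map and reduce steps run in parallel on separate workers.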
You will certainly need to study Information Retrieval and Natural Language Processing a bit. You won't get anywhere near Google or Bing performance/effectiveness with regular expressions.
Also, if you want to do serious work in this area, you should probably move up to a more "efficient" language (C#, Java, C/C++...).