I'm currently building a web application that needs to access a (really) big database of strings and compare them to a given 'request string'. This has to be done over and over again using different comparison methods (number of identical words, order of words...) and should be scalable and, more importantly, fast.
I thought about implementing the comparison method itself in C, because it's obviously much faster than interpreted, more 'webby' languages like PHP. This brought me to three questions:
1) How am I supposed to 'connect' the C application to the web server (currently Apache)? I thought about the usual CGI way, but because it needs to create one new process per request it would be less scalable and slower - at least that's what I've read about it.
2) Which database technology is best to use with C for this use-case?
3) Last but not least, do you think it's worth the struggle, or would it be enough to go the usual way by building a PHP script that connects to a MySQL database - how big is the speed difference?
Thanks in advance, David
Apache-2.4.x has mod_dbd, which can talk to various database back-ends (using precompiled queries for speed) straight out of Apache config-files. This can be used in, for example, mod_rewrite's rules to alter incoming requests depending on results of the queries.
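A minimal sketch of what that might look like (the driver, connection parameters, and table/column names here are purely illustrative):

```apache
# mod_dbd connection setup
DBDriver mysql
DBDParams "host=localhost,dbname=app,user=web,pass=secret"

RewriteEngine On
# "fastdbd" runs the query through mod_dbd and caches lookups
RewriteMap redirects "fastdbd:SELECT target FROM redirects WHERE source = %s"
RewriteRule ^/r/(.+)$ ${redirects:$1|/not-found} [R=302,L]
```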
In other words, depending on the details of your application, you may be able to avoid coding altogether.
If you do have to write code, however, I'd suggest you first implement the prototype in a language you know best. There is a good chance it will be "fast enough" -- or that the bulk of the time spent per request will be on the database side, making the choice of the language irrelevant.
Connecting to a database back-end from C is easy -- each database vendor has (at least one) client-library implementation for C-programs.
If your app does not require multiple HTTP-servers to talk to the same (single) database, then your best DB would be a local one: SleepyCat's (now Oracle's) "db", gdbm, or SQLite3.
Also, if the updates (changes to the database) are infrequent, you can use flat-files and build the database in memory... This will limit the database's maximum size, but will be the fastest and allow you to avoid handling (re)connections, etc.
EDIT: memcache is an external database -- whether it is running on the same system or a remote one, you have to connect to it, talk to it, and copy data from its memory into yours. If database-changes aren't frequent, you are better off keeping stuff in your own memory and using it straight from there.
Bad application architecture, bad database design and bad code will always run slowly and won't scale.
If you get that out of the way, most "very high demand" purposes can be served with any of the interpreted languages - remember, they're optimized at what they do (wasteful with memory, for example, but usually pretty fast even for high-demand use).
In database design there is no perfect approach for all use-cases. You may need to structure your database one way to achieve the best read speed, another way to achieve the best write speed, and yet another to achieve the best flexibility (sacrificing both read and write speed) - and different sections of the same database may each need a different one of these trade-offs.
Think of the way you designed the database and ask yourself: "Do I need to connect to the database 300 times in a session to gather all of the data, or could I write one big statement that reads it all at once?" If that is not easily achievable, think about whether an SQL stored procedure could do it. If you come up empty again, think about whether you could change the structure (sacrificing flexibility, for example) to allow for a one-statement read or a stored-procedure read.
In any case, many, many connections from PHP to MySQL, each sending one query, means you're spending a hefty amount of time waiting for PHP to connect to the MySQL server over the network (even if it's local), for MySQL to process the request and supply an answer, and so on. If you can batch-generate all of the statements you intend to send into one string, that's great; if not, grouping them into smaller batches is OK as well.
You should consider writing a C MySQL module defining some UDFs - user-defined functions. They can have access to row-level data at the moment MySQL is reading it, and they can process it for output, aggregate it, and all that jazz.
Writing an Apache module or a PHP module will deliver data in much the same format as the MySQL module, but the processing won't be running inside the MySQL server - it will be requesting data from MySQL, processing it, and requesting more data.