Let's say I have a MySQL database with a large number of entries in it: 10k rows, for the sake of argument.
I have a task that I want to perform on each row of the table. The task can take anywhere from under a second to a few seconds per row, but it can be done in parallel on different rows. It involves reading the row, loading a URL via cURL or loading data from disk, updating some of the fields, and saving the updated data back to the row along with a timestamp for when it was processed.
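For concreteness, the per-row task looks roughly like this; the table name ("items") and column names ("source_url", "result", "processed_at") are placeholders rather than my real schema:

    <?php
    // Rough shape of the per-row task. Table and column names are placeholders.
    $pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

    function processRow(PDO $pdo, array $row): void
    {
        // Fetch the remote resource for this row; this is the part that can take
        // anything from well under a second to a few seconds.
        $ch = curl_init($row['source_url']);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $body = curl_exec($ch);
        curl_close($ch);

        // Derive the new field values from the fetched data (application-specific).
        $newValue = strlen((string) $body);

        // Save the updated data back to the row with a processing timestamp.
        $stmt = $pdo->prepare(
            'UPDATE items SET result = :result, processed_at = NOW() WHERE id = :id'
        );
        $stmt->execute([':result' => $newValue, ':id' => $row['id']]);
    }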
My question is how I should best structure the execution of this task so that the rows are processed in parallel and the whole set finishes as quickly as possible.
What I'm thinking is to have one process which reads the IDs of the set of rows that need processing. These can then be array_chunked into manageable sections which are handed to worker processes spawned on the shell with exec, roughly as in the sketch below. The handoff could be done either through the database (mark rows 1-2000 for execution by process 1), on the command line, or by saving a CSV file.
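Something like this rough dispatcher sketch; again, worker.php and the table/column names are made up for illustration:

    <?php
    // Dispatcher sketch: read the IDs of unprocessed rows, chunk them, and hand
    // each chunk to a worker script spawned on the shell.
    $pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

    $ids = $pdo->query('SELECT id FROM items WHERE processed_at IS NULL')
               ->fetchAll(PDO::FETCH_COLUMN);

    $workers = 4;
    $chunks  = array_chunk($ids, max(1, (int) ceil(count($ids) / $workers)));

    foreach ($chunks as $chunk) {
        // Pass the chunk of IDs on the command line; redirecting output and
        // backgrounding with "&" stops exec() from blocking on each worker.
        $arg = escapeshellarg(implode(',', $chunk));
        exec("php worker.php $arg > /dev/null 2>&1 &");
    }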
The problem I see with that is that it'll leave some of the processes idle for a lot of the time. One process might finish its batch of 1000 only to find that another process has been a lot slower and still has 500 to go. The faster process could easily speed things up by taking, say, 250 of those rows off the slower one.
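One thing I've wondered about as a way around that is to skip the fixed batches entirely and have each worker claim rows one at a time straight from the table, roughly like this (the "claimed_by" column is something I'd have to add, and processRow is the per-row function from the first sketch):

    <?php
    // Each worker repeatedly claims one free row; the single UPDATE acts as the
    // lock, so two workers should never grab the same row.
    $pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
    $me  = getmypid();

    while (true) {
        // Atomically mark one unclaimed, unprocessed row as belonging to this worker.
        $claim = $pdo->prepare(
            'UPDATE items SET claimed_by = :pid
             WHERE claimed_by IS NULL AND processed_at IS NULL
             LIMIT 1'
        );
        $claim->execute([':pid' => $me]);

        if ($claim->rowCount() === 0) {
            break; // nothing left to claim, so this worker is done
        }

        // Fetch the row we just claimed and run the per-row task on it.
        $stmt = $pdo->prepare(
            'SELECT * FROM items WHERE claimed_by = :pid AND processed_at IS NULL'
        );
        $stmt->execute([':pid' => $me]);
        $row = $stmt->fetch(PDO::FETCH_ASSOC);

        processRow($pdo, $row);
    }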
I'm thinking there's probably a standard architecture for this sort of processing that I'm missing here, or am I barking up the wrong tree?
Please stick to technologies that would be available on a standard LAMP server - I'm not really interested in setting up Hadoop or rewriting my code in another language. Still, if there's some other technology that would work on a LAMP server, go ahead and suggest it.