Run command scripts in the right order to minimize run time

I am using scrapy and scrapyd to crawl some content. I have 28 crawlers, but only 8 run at a time. Each crawler takes from 10 minutes to several hours to complete. So I'm looking for a way to order them correctly, in order to minimize the time the server is active.

I already gather information about how long each crawl takes, so what's left is only the minimization problem, or rather how to formulate it.

The script is started using PHP, so the solution should preferably run in PHP.

The best way I've found is to set them up as cron jobs that execute at specific times. I have around 30 cron jobs configured to start at various times, meaning you can set a specific time per scrape.

Executing a PHP command via cron job at 5pm every day:

0 17 * * * php /opt/test.php

If you execute the Scrapy Python command via cron job, it's:

0 17 * * * cd /opt/path1/ && scrapy crawl site1

If you're using a virtualenv for your Python, then it's:

0 17 * * * source /opt/venv/bin/activate && cd /opt/path1/ && scrapy crawl site1

Sorry to disappoint you, but there's nothing clever here, nor any minimization problem, because you don't state anything about dependencies between the crawling jobs. Independent jobs will take roughly TOTAL_TIME/THROUGHPUT regardless of how you order them.

scrapyd will start processing the next job as soon as one finishes. The "8 at a time" isn't some kind of fixed bucket, so there's no combinatorial/dynamic programming problem here. Just throw all 28 jobs at scrapyd and let it run. When you poll it and find it idle, you can shut down your server.
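If you want to automate that shutdown step, a minimal polling sketch in PHP could look like the following - it assumes scrapyd listens on localhost:6800 and that the project is called myproject, both of which are placeholders for your own setup:

<?php
// Minimal sketch: poll scrapyd's listjobs.json until nothing is pending or
// running, then trigger whatever "shut down the server" means in your setup.
// Host, port and project name are placeholders - adjust to your install.
$scrapyd = 'http://localhost:6800';
$project = 'myproject';

while (true) {
    $raw  = file_get_contents($scrapyd . '/listjobs.json?project=' . urlencode($project));
    $jobs = json_decode($raw, true);

    $busy = count($jobs['pending']) + count($jobs['running']);
    if ($busy === 0) {
        // all crawls are done - e.g. exec('sudo shutdown -h now');
        break;
    }

    echo $busy . " job(s) still pending/running\n";
    sleep(60); // check again in a minute
}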

You might get a small benefit from scheduling the longest jobs first: you can squeeze a few tiny jobs into the idle slots while the last few long jobs finish. But unless you're in some pathological case, that benefit shouldn't be major.
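If you want the longest-first ordering anyway, it's just a sort before you submit the jobs. Here's a sketch in PHP, assuming you keep your measured durations in an array and queue the spiders through scrapyd's schedule.json endpoint (project and spider names, and the durations, are made-up examples):

<?php
// Sketch: queue all spiders on scrapyd, longest measured duration first.
// scrapyd runs up to its process limit at once and picks up the rest
// as slots free up. Durations are example values from previous runs.
$durations = [
    'site1' => 480,  // minutes
    'site2' => 300,
    'site3' => 45,
    // ... the remaining spiders
];
arsort($durations); // longest first

foreach (array_keys($durations) as $spider) {
    $ch = curl_init('http://localhost:6800/schedule.json');
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query([
        'project' => 'myproject',   // placeholder
        'spider'  => $spider,
    ]));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    echo $spider . ': ' . curl_exec($ch) . "\n";
    curl_close($ch);
}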

Note also that this number "8" - I guess enforced by max_proc_per_cpu and/or max_proc - is somewhat arbitrary. Unless that's the number at which you hit 100% CPU or something, a larger number might be better suited.
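For reference, that cap lives in scrapyd's configuration file (scrapyd.conf); something along these lines raises it, assuming you've checked that CPU and memory can actually handle more parallel crawls:

[scrapyd]
# 0 means no absolute cap; the effective limit is then
# max_proc_per_cpu * number of CPUs
max_proc         = 0
max_proc_per_cpu = 8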

If you want major benefits, find the 2-3 largest jobs and find a way to cut them in half, e.g. if you're crawling a site with vehicles, split the single crawl in two: one for cars and one for motorbikes. This is usually possible and will yield more significant benefits than reordering. For example, if your longest job is 8 hours and the next longest is 5, then by splitting the longest into two 4-hour crawls you make the 5-hour one the bottleneck, potentially saving your server 3 hours.
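Scrapyd forwards any extra parameter you pass to schedule.json as a spider argument, so if the spider is written to accept e.g. a category argument and restrict its crawl to it, the split is just two queue calls. A sketch in PHP, again with placeholder names:

<?php
// Sketch: run one big spider as two smaller jobs by passing a spider
// argument. Assumes the spider accepts "category" (e.g. in __init__)
// and limits the crawl accordingly - project/spider names are placeholders.
foreach (['cars', 'motorbikes'] as $category) {
    $ch = curl_init('http://localhost:6800/schedule.json');
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query([
        'project'  => 'myproject',
        'spider'   => 'vehicles',
        'category' => $category,   // forwarded to the spider as an argument
    ]));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    echo curl_exec($ch) . "\n";
    curl_close($ch);
}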