I realize I'll probably need to bring in an expert to assess the code, but I would love some first-round thoughts if possible.
I am running load tests on a website prior to deployment on a dedicated 32-core/96 GB/SSD server. I had done some load tests before, but those only measured CPU, memory, etc.
I was able to add a metric for a specific user input to map load times.
I ramped up to a conservative 500 users over 10 minutes, held that plateau for 10 minutes, and ramped down over 10 minutes.
CPU never goes above 45%, and memory never above 8%.
Response times are fine up to about 100 users (250 ms), then start spiking to 25 seconds.
The weird thing is that no matter how many users I put in the test (20, 50, 100, 250, 500), I get the same out-of-control spikes, and always on a 5-minute interval.
The point being that clearly (a layman's "clearly") the server has enough CPU and memory; neither is getting overwhelmed.
The only consistent things across all the tests are: a) the spikes happen on 5-minute intervals, and b) the network bandwidth simultaneously TANKS.
I could understand it if the bandwidth spiked, we went over server capacity, and response times suffered too; but the correlation runs the other way: when bandwidth tanks, response times spike; when bandwidth climbs, response times go back (more or less) to normal.
We have a) optimized queries, b) optimized tables, c) optimized the database, d) resized the server twice, and e) checked the logs for the core query, which uses Sphinx, to confirm that those queries complete in milliseconds.
The problem appears to be on the browser side: during a load test we confirmed that a specific query took 1 second, while the time to display the result was 2 minutes.
ANY thoughts on the right direction to go in before bringing in the big guns would be highly appreciated.
The server is CentOS 64-bit; the site is PHP/MySQL/JavaScript/Sphinx.
Do you have a monitoring system such as Munin installed? If these spikes happen every 5 minutes as you state, Munin should reveal which resources are being choked when the spikes happen, and you can then debug further based on that.
Also, which logs, if any, are you reviewing? Simply running stress tests will not tell you much in and of itself.
I would also suggest reading my answer to a similar server-related question I provided today. There are tons of factors to take into account. Get your server in gear and then the spikes should disappear.
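Until a monitoring system is in place, a rough stand-in is to sample the kernel's per-interface byte counters during a load test and see exactly when throughput drops. This is only a sketch, assuming a Linux host that exposes /proc/net/dev in its usual layout (two header lines, then one line per interface, with received bytes in the first column and transmitted bytes in the ninth). The interface name eth0 and all function names are assumptions for illustration.

```python
import time


def parse_net_dev(text):
    """Parse /proc/net/dev contents into {iface: (rx_bytes, tx_bytes)}."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 9:
            continue
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def throughput(prev, cur, seconds):
    """Bytes per second (rx, tx) between two counter samples of one interface."""
    return ((cur[0] - prev[0]) / seconds, (cur[1] - prev[1]) / seconds)


def watch(iface="eth0", interval=1.0):
    """Print one timestamped throughput line per interval; stop with Ctrl-C."""
    with open("/proc/net/dev") as f:
        prev = parse_net_dev(f.read())[iface]
    while True:
        time.sleep(interval)
        with open("/proc/net/dev") as f:
            cur = parse_net_dev(f.read())[iface]
        rx, tx = throughput(prev, cur, interval)
        print(f"{time.strftime('%H:%M:%S')} rx={rx:.0f} B/s tx={tx:.0f} B/s")
        prev = cur
```

Running watch("eth0") in a terminal for the length of a test gives timestamps that can be lined up against the response-time spikes and against the server logs for the same moments.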