How to debug a slow website on Microsoft Azure

We have had a website up and running for almost 3 years now. About 2-3 months ago we implemented a new feature, "multilanguage", and right after that we started noticing lag issues. The first one was very severe and kept the server down nearly the entire time. We discovered an infinite loop somewhere in our code (which we think search robots got stuck in, using up all resources and causing the server to crash).

However, that issue was fixed and we are (FAIRLY) certain that there is no longer a code issue causing this. But we are not 100% certain.

Once in a while (note: more than just a couple of times a day), our server (hosted on Microsoft Azure) will randomly take about 2-3 minutes to respond to anything. This affects everything from showing a web page to querying our database (using MySQL Workbench). It just takes about 2-3 minutes for anything to load.

We have looked in Google Analytics and our Apache logs to try to find a pattern in what causes this issue, but we cannot find one. There is nothing unusual happening in our Apache logs right before the lag strikes. On top of that, we even have these lag issues when there are 0 visitors on our website (in the middle of the night, according to Google Analytics).

Our largest MySQL table contains about 50K records, so it's really not that big of a database. We have around 100 tables in total.

When the server is performing fine, I run some of our 'heavier' queries manually in MySQL to see if they are really slow, but none of them take longer than 0.5 seconds. When the server is lagging, the same queries can easily take 30-60 seconds.
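To capture these stalls with timestamps instead of catching them by chance, a small probe could time the same operation at a fixed interval and keep only the slow samples, which can then be lined up against the Apache and cron logs. This is a minimal sketch under assumptions: `time_call`, `find_stalls`, and `probe` are hypothetical names, the 5-second threshold is arbitrary, and `fn` stands in for whatever is being measured (a page fetch or one of the heavier queries).

```python
import time
from typing import Callable, List, Tuple

Sample = Tuple[float, float]  # (wall-clock timestamp, duration in seconds)


def time_call(fn: Callable[[], object]) -> float:
    """Return how many seconds fn() took, measured with a monotonic clock."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def find_stalls(samples: List[Sample], threshold_s: float) -> List[Sample]:
    """Keep only the samples whose duration exceeds threshold_s."""
    return [(ts, dur) for ts, dur in samples if dur > threshold_s]


def probe(fn: Callable[[], object], rounds: int, interval_s: float,
          threshold_s: float = 5.0) -> List[Sample]:
    """Time fn() `rounds` times, sleeping interval_s between runs,
    and return the samples slower than threshold_s (an assumed cutoff)."""
    samples: List[Sample] = []
    for _ in range(rounds):
        samples.append((time.time(), time_call(fn)))
        time.sleep(interval_s)
    return find_stalls(samples, threshold_s)
```

The timestamps of the returned samples are what matter: if the stalls cluster at multiples of 5 minutes, the mail cron becomes a suspect again; if they do not, the cause is more likely outside the application.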

We have some CRON jobs running in the background, and 2 of them in particular might be causing issues, but again, I'm very uncertain about this. The first one is a mailing CRON. We have a queue in our database which contains all of our emails, with a boolean (0 or 1) indicating whether each one has been sent. This CRON job runs every 5 minutes, fetches the emails with sent set to 0, and tries to send them. Next to this we have another CRON job which generates emails and sends them to our user base. This can send out up to 500 emails at a time (it only runs once every 2 weeks). Sometimes we reach the maximum number of emails per day that is set by Outlook. This causes our emails to stop sending for a day, but the next day they are sent out again.
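For reference, the mailing CRON follows roughly this pattern. This is a minimal sketch, not the real job: it uses Python's built-in `sqlite3` as a stand-in for MySQL, and the table `email_queue`, its columns, the `batch_size` cap, and the `send` callback are all hypothetical names chosen for illustration.

```python
import sqlite3
from typing import Callable


def process_queue(conn: sqlite3.Connection,
                  send: Callable[[str, str], bool],
                  batch_size: int = 100) -> int:
    """Try to send every queued email with sent = 0 (up to batch_size).

    `send` returns True on success. Emails that fail (for example because
    the daily limit was reached) keep sent = 0 and are retried next run.
    Returns the number of emails marked as sent.
    """
    rows = conn.execute(
        "SELECT id, recipient, body FROM email_queue "
        "WHERE sent = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()

    delivered = 0
    for email_id, recipient, body in rows:
        if send(recipient, body):
            conn.execute(
                "UPDATE email_queue SET sent = 1 WHERE id = ?", (email_id,)
            )
            delivered += 1
    conn.commit()
    return delivered
```

Note that in this pattern a rejected email stays in the queue, so on a day where the limit is reached every 5-minute run retries the whole backlog.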

At first I thought the issue might be an emailing CRON job that runs for too long, so that 2 CRON jobs overlap. However, I ran a test where a CRON job should send 500 emails, knowing that we were at the limit and none of them would actually be sent. I started the CRON job manually, and about 15 seconds later it was finished and none were sent. The website ran perfectly smoothly during these 15 seconds. In another test, 100 emails were sent (knowing we had not reached the daily limit yet). It took about 20 seconds for all emails to be generated and sent. So an overlap of these CRON jobs is very unlikely.
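Overlap could also be ruled out completely rather than just judged unlikely, by having each job take an exclusive lock and exit immediately if a previous run still holds it. The following is a minimal sketch, assuming a Linux/POSIX server where Python's `fcntl` module is available; `try_acquire`, `release`, and the lock path are hypothetical names, not part of the existing jobs.

```python
import fcntl
import os
from typing import Optional


def try_acquire(lock_path: str) -> Optional[int]:
    """Try to take an exclusive, non-blocking lock on lock_path.

    Returns the open file descriptor on success, or None if another
    run already holds the lock.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Someone else holds the lock: give up instead of waiting.
        os.close(fd)
        return None
    return fd


def release(fd: int) -> None:
    """Release the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

A job would call `try_acquire` first, log and exit if it returns `None`, and call `release` when done. If the skipped-run log line never appears, overlap is confirmed not to be the cause.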

I'm completely hard-stuck right now. We are trying to contact Microsoft and see if they can figure out if there's an issue on their side, but no luck so far.