与AWS DynamoDB通信时的全秒延迟

I am working to diagnose a strange symptom that I am seeing in my system's performance logs where occasionally, writes to DynamoDB are taking multiple seconds to complete.

To give some background on the infrastructure:

  • This is a web application hosted in AWS (East region)
  • An ELB covering two EC2 instances (c3.2xlarge)
  • Single DynamoDB table provisioned for 1000 reads/writes per sec

I have a PHP web service running on the EC2 instances that receives a small request from the connecting user, writes a record to dynamo, and returns an empty response. The amount of data transferred is well under 1kb. The traffic volumes to the ELB vary throughout the day, but range from a few requests per second to over 100. An important note here is that I wrote a small CURL client directly for the dynamo operations I need since the AWS SDK was introducing a 70ms overhead on top of our processing times.

The only options I am using are:

CURLOPT_POST
CURLOPT_RETURNTRANSFER
CURLOPT_TCP_NODELAY

On to the behavior I'm seeing: For 99% of cases, the response times recorded by the ELB are under 30ms which is our target. However if I change the graph to show maximums over the same range, I see a much different picture (graph), with times falling into near-whole second intervals. This graph does not correlate with load, since it happens even in the middle of the night where traffic is minimal. These trends prompted me to dig deeper and I was able to isolate the delay to the connection to DynamoDB.

These seem like programmatic thresholds, and not general noise. My first thought was that it was exponential backoff, but I'm not seeing an indication in the responses from the dynamo web service that would suggest this (all 200's), and our throughput right now is 10% of our provisioning, even at peak. And again, we see this performance trend even in the middle of the night when traffic is minimal. If it's not backoff, then it definitely feels like some kind of throttling. Ideas?

Using CURL's verbose output for my Dynamo API calls, I see entries like this:

Line 151190: 2014-03-12T16:48:35-04:00 - INFO - Time: 1.001436, (Start Transfer: 1.001416, DNS: 2.6E-5, Connect: 0.998141, Pre-Transfer: 0.998199)

Line 196871: 2014-03-12T16:48:42-04:00 - INFO - Time: 1.001528, (Start Transfer: 1.001488, DNS: 3.1E-5, Connect: 0.99694,  Pre-Transfer: 0.996981)

Line 430508: 2014-03-12T16:49:19-04:00 - INFO - Time: 1.002823, (Start Transfer: 1.002807, DNS: 3.2E-5, Connect: 0.998972, Pre-Transfer: 0.999009)

Line 870236: 2014-03-12T16:50:31-04:00 - INFO - Time: 1.000663, (Start Transfer: 1.000642, DNS: 3.0E-5, Connect: 0.001506, Pre-Transfer: 0.001537)

Line 950109: 2014-03-12T16:50:43-04:00 - INFO - Time: 1.010762, (Start Transfer: 1.010737, DNS: 3.3E-5, Connect: 0.001357, Pre-Transfer: 0.001394)

The key find here is it seems like in 50% of cases it's connection time, and for the other 50%, it's unclear where the time is being spent. It seems like there's another component here that may be contributing to the end-to-end time, but I'm struggling to pinpoint what that might be.

Any help would be greatly appreciated!

Make you are not sending an Expect: 100-continue header, which is common cause for a ~1 second delay like this. I believe cURL sends this header by default, so you will need to do something to remove it.

If you end up using the AWS SDK for PHP, make sure you are also using an opcode cache like APC or Zend Opcache.