Analyzing AWS Fargate

AWS Fargate is an interesting service in that it allows you to run containers without having to think (much) about cluster management, task scheduling, instance provisioning etc. It transforms the application deployment process into one where the application is packaged into a container and resources (CPU, RAM, Networking, access policies etc) are defined as part of a Fargate task. As such, it’s an attractive option that we’re evaluating as a deployment foundation for Cribl labs.

fargate

A Fargate task/container can be configured with a vCPU range of 0.25-4 and a RAM(GB) range of 0.5-30. As it’s typical of us, we went ahead and analyzed its physics to get a better understanding of what we’re dealing with. We have also done a similar analysis on Lambda.

We were specifically interested in the following measurements: 

  • IO :: Are latency and iops affected by CPU and RAM allocations?
  • Network :: Is throughput from S3 related in any way to CPU or RAM allocations?
  • CPU :: How do various configurations affect compute (gunzip performance)?
  • Container Ready Time :: How long does it take for a task/container to be ready?

Note that at this point we are not interested in the absolute performance of Fargate instances. We want to primarily get an understanding of their performance relative to each other; feel free to ignore the y-axis units.

Testing and Analysis


IO

Test: Run fio (direct IO, non-buffered) with block sizes of 1k, 4k, and 8k, across instances with these vCPUxRAM(GB) configurations: 1×2, 2×4, 4×8, and measure iops and clat perc95 .

Screen Shot 2018-05-27 at 12.13.56 PM.png

  • IOPS are pretty consistent across all instances. Their almost-perfect stability at just a tad over 3000, indicates that disk is served by EBS.
  • Latency seems to be affected by vCPU allocation. Small instances hover between 4-4.5ms and larger ones clock significantly better, between 1.5-2ms

Network Throughput

Test: download a file from an S3 bucket using curl to the local filesystem across instances with these vCPU x RAM(GB) configurations: 1×2, 4×8, 4×16, 4×30 and analyze throughput (wall clock time was used)

Screen Shot 2018-05-27 at 12.50.55 PM.png

Screen Shot 2018-05-27 at 12.51.09 PM.png

  • Network throughput is significantly better on instances with more vCPUs
  • Network throughput and variability tend to be better on instances with more RAM.

CPU Performance – Gunzip a Compressed File 

Test: gunzip a local file across instances with these vCPU x RAM(GB) configurations: 1×2, 4×8, 4×16, 4×30 and analyze throughput (compressed file size / wall clock time was used)

Screen Shot 2018-05-27 at 12.52.57 PM.png

  • Decompression performance is significantly better on instances with more vCPUs (we expect a step at 2 vCPU, ie a real core)
  • CPUs are ~5 year old toddlers Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz 🙂 … this fact + memory / vCPU ratio indicate that Fargate is powered by R3 instances.

Container Ready Time

Test: pull images of various sizes from ECR, track pull time, convert it to throughput (imageSize/Pulltime) and see how it varies against different vCPUs and RAM(GB) allocations (1×2, 4×16, 4×30).

Screen Shot 2018-05-27 at 12.40.14 PM.png

Screen Shot 2018-05-27 at 12.48.42 PM.png

Screen Shot 2018-05-27 at 12.46.46 PM.png

  • Note: Container Ready Time can be approximated as:
    • Task request/scheduling time + image pull time + container startup time
    • Task Request/scheduling time, not shown in charts, ranged from 11-39s 
  • Container startup time is consistent across the board, hovering around 500-700ms.
  • Image pull rate is significantly better on instances with more vCPUs
  • Image pull rate is worse for instances with 2GB RAM than those with 16GB or 30GB

Overall Conclusions: 

  • Instance vCPU count seem to govern the network performance more so than RAM
    • More vCPU = Better
  • Instance RAM impact on overall performance plateaus over 8GB;
    • More RAM = Better (but less so than more vCPUs) 
  • Image size matters
    •  Smaller = better (smile) 

How will these findings affect our architectural design?

  • Given a container ready time in the neighborhood of ~1 minute (or multiple-tens of second), we’ll have to consider a pool of warm/standby containers in order to minimize wait and response time for our tasks.
  • Given the performance charts (storage latency, network and CPU decompression throughput), we’ll very likely consider a system with fewer larger instances than one with a lot of smaller ones.

 


If you are interested in Cribl, please visit us and download your copy to get started. If you’d like more details on installation or configuration, see our documentation. If you need assistance, please join us in Slack #cribl, tweet at us @cribl_io, or contact us via hello@cribl.io. We’d love to help you!

Enjoy it!  — The Cribl Team

Get Cribl LogStream Now!

5 thoughts on “Analyzing AWS Fargate

  1. Thanks for the article.

    Regarding the analysis of “container ready time”, the report states the task request/scheduling time “ranged from 11-39s”. What is this time value dependant on? In other words, how do containers with an 11s task request differ from containers with a 39s task request?

    I’m interested in knowing what can a developer do to reduce the task request/scheduling time.

    Thank you.

    Like

  2. Thanks for posting this detailed analysis! How did you measure the ‘Task request/scheduling time & image pull time & container startup time’?

    Like

    1. Hey Ben,
      At the time of the test, the task request/scheduling time was not available programmatically (not sure if it still is) so we had to resort in measuring it manually by refreshing the UI, hence, the variability and coarseness. For image pull start & stop time and container created & started times we had our tasks get the information listed here: https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_Task.html before concluding. Hope it helps.

      Like

      1. Awesome thank you! I couldn’t find anything around the task request/scheduling time either hence my question but this link is super helpful cheers.

        Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s