AWS Well-Architected Framework Applied: Operational Excellence

Published in Good Audience · Jul 23, 2019 · 4 min read

According to the TechJury 2019 report, AWS is still the most popular cloud provider, which makes the question of how to build an AWS infrastructure more relevant than ever. Unfortunately, the AWS Well-Architected Framework (AWS WAFR) and its related materials contain far more theory than real-world experience and examples.

So, in this article series we’ll explain how AWS WAFR applies to real-world web applications, showing how each pillar is used in a specific application we’ve developed recently.

This article focuses on the first pillar, defined by AWS as “Operational Excellence”.

The Application

First of all, we need to give you some information about the application.

All the examples shown in this series are taken from the architecture and code of the collaborative platform currently being developed at Quantum.

The purpose of this project is to create a platform where owners of unique, specialized equipment, services and expertise can communicate with startups, which in turn gain access to the resources they need to implement innovative technological ideas.

The web application contains plenty of features that help platform users communicate with each other: personal and project chats, project forums, social media-like connections, and publishing of documents and media files. The platform has a comprehensive notification system (internal and by email), which is the basis for its core collaboration workflows: connections between people, invitations to project managers and security managers, invitations for people to join various social and business initiatives, requests for necessary resources, notifications about new posts and messages, etc. The platform lets you own your data and keeps it secure. You can even assign a Security Manager to review the content you put on the site, which adds another layer of protection for your digital intellectual property.

The tech stack backing all these features is pretty simple: Python/Django on the server side, PostgreSQL as the database, React.js/Redux on the front end, plus Elasticsearch to speed up search across different types of data.

Knowing all the requirements, both business and technical, we can start applying WAFR to our application, so let’s talk about the first pillar: Operational Excellence.

Operational Excellence: Definition and Design Principles

In its whitepaper, Amazon defines operational excellence as “the ability to run systems and gain insight into their operations in order to deliver business value and to continually improve supporting processes and procedures”. Simply put, an application built with operational excellence in mind should be able to update, scale, collect business insights and technical metrics, and notify the team about anything unusual (servers down, application exceptions, intrusion attempts). All of this should happen in a well-documented, automated manner, with all the needed setup defined in code. Changes to that code should be made in small increments, with each step and each change documented. That seems like a lot of requirements, but with AWS they can be implemented pretty easily.
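To make “setup defined in code” a bit more concrete, here is a minimal sketch of what such a definition might look like: a small CloudFormation template, built as a Python dictionary, that declares an SNS topic and a CloudWatch CPU alarm that notifies it. This is an illustration, not our actual template. The resource names (AlertTopic, HighCpuAlarm), the threshold values and the e-mail address are assumptions made for the example.

```python
import json


def build_template(alert_email: str) -> dict:
    """Build a minimal CloudFormation template as a plain dict.

    All names and numbers below are illustrative assumptions.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Sketch: SNS topic plus a CPU alarm, defined in code",
        "Parameters": {
            # The instance to watch is passed in when the stack is created.
            "InstanceId": {"Type": "AWS::EC2::Instance::Id"},
        },
        "Resources": {
            "AlertTopic": {
                "Type": "AWS::SNS::Topic",
                "Properties": {
                    "Subscription": [
                        {"Protocol": "email", "Endpoint": alert_email}
                    ]
                },
            },
            "HighCpuAlarm": {
                "Type": "AWS::CloudWatch::Alarm",
                "Properties": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [
                        {"Name": "InstanceId", "Value": {"Ref": "InstanceId"}}
                    ],
                    "Statistic": "Average",
                    "Period": 300,            # seconds per data point
                    "EvaluationPeriods": 2,   # two breaching periods in a row
                    "Threshold": 80,          # percent CPU, assumed value
                    "ComparisonOperator": "GreaterThanThreshold",
                    # Ref on an SNS topic resolves to the topic ARN.
                    "AlarmActions": [{"Ref": "AlertTopic"}],
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical address; the JSON output can be committed to Git.
    print(json.dumps(build_template("oncall@example.com"), indent=2))
```

Because the template is plain text, every change to a threshold or a subscriber shows up as a small, reviewable diff in version control.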

Our application of Operational Excellence

Quantum uses the following main services to maintain Operational Excellence: CloudFormation, CloudWatch/CloudWatch Agent/CloudTrail and AWS Lambda. Let’s discuss each one in detail.

CloudFormation (CF): all the infrastructure is managed and updated via CloudFormation templates stored in a Git repository. This helps us track and document each change and lets us run a code review on every one of them. All of the resources (EC2, RDS, S3, ELB, etc.) are managed solely via CloudFormation templates, with no manual setup. The OS on EC2 instances is configured by passing user data through the CF templates.
CloudWatch/CloudWatch Agent and CloudTrail: CloudWatch is used for monitoring EC2-level metrics (CPU load, IO load, network load, etc.) and application-level metrics (request rate, error codes, etc.). Notifications about drastic changes in service load or about incidents are sent via SNS: to the e-mail addresses of the responsible engineers in case of an incident, and to an AWS Lambda function when we need to scale up or down. We also use CloudTrail, which helps us track access and actions in the AWS console and the AWS API. This information is analyzed to detect and prevent suspicious access attempts and uncontrolled scaling of a resource.
AWS Lambda: we use AWS Lambda for automated scaling of the infrastructure. Scaling is triggered via SNS. In response, a Lambda function launches a new CloudFormation stack from a template or destroys stacks that are redundant for the current load, updates Route 53, and sends a notification about the result to the responsible engineer.
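The SNS-to-Lambda step described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not our production function: the alarm names (app-high-cpu, app-low-cpu) are hypothetical, and the actual CloudFormation, Route 53 and notification calls are left out so that the example runs without AWS credentials. It only shows how a handler might read a CloudWatch alarm delivered through SNS and decide whether to scale.

```python
import json

# Hypothetical alarm names; the real ones are not part of this article.
SCALE_UP_ALARM = "app-high-cpu"
SCALE_DOWN_ALARM = "app-low-cpu"


def decide_action(alarm: dict) -> str:
    """Map a CloudWatch alarm notification to a scaling decision."""
    if alarm.get("NewStateValue") != "ALARM":
        # Transitions to OK or INSUFFICIENT_DATA need no scaling.
        return "none"
    name = alarm.get("AlarmName")
    if name == SCALE_UP_ALARM:
        return "scale_up"
    if name == SCALE_DOWN_ALARM:
        return "scale_down"
    return "none"


def handler(event, context):
    """Lambda entry point for alarm notifications delivered by SNS."""
    actions = []
    for record in event.get("Records", []):
        # SNS passes the alarm to Lambda as a JSON string in Sns.Message.
        alarm = json.loads(record["Sns"]["Message"])
        actions.append(decide_action(alarm))
    # A real function would act on each decision here: create or delete
    # a CloudFormation stack, update Route 53, and notify the engineer.
    # Those calls are omitted to keep the sketch runnable offline.
    return {"actions": actions}
```

Keeping the decision in a small pure function such as decide_action makes it easy to unit-test the scaling rules separately from the AWS calls.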

Possible improvements

Applying all the best practices contained in WAFR to your application is a never-ending task that can take an indefinite amount of time, so everything described above is what we implemented from the start. Right now we are also considering using AWS Trusted Advisor to see what else can be improved in the infrastructure, and we are thinking about moving from Bitbucket and Jenkins to CodeCommit and CodePipeline to keep all of our services on Amazon.

This is the end of our first AWS WAFR article. Thanks for your time and interest!

The next part is coming soon and it’ll be about the Security pillar of our application.

P.S.: Share your ideas and insights in the comments, and tell us what else you would like to read about.

All the best, your Quantum team.

Sources

Amazon Operational Excellence Whitepaper
Explanation of all AWS WAFR pillars on Devopedia: a nice article explaining the whole concept of AWS WAFR in the abstract
The AWS Well Architected Framework in a Nutshell

Written by Danil Lytovchenko
Proofread by Igor Korotach