What I learned as an AWS rookie.
A couple of years ago I took my first steps into the AWS universe. After several battles with a single LAMP VPS we decided it was time to move on from that single point of failure to a more scalable platform.
Now, in 2016, after years of working with AWS (among other cloud platforms) I consider myself an advanced user, but that wasn’t the case when I started building the platform 2 years ago. Not only did I learn a lot about AWS, but also about the ‘cloud’ way of designing and building infrastructure.
Among a bunch of minor details there are important lessons that might help you understand the way AWS sees things and might help you get started.
Design for failure.
Before getting in depth about ‘designing for failure’ it is important to understand that this is not necessarily limited to working with AWS. It is a way of designing your infrastructure that will help you be successful within any cloud provider or platform.
Although the hardware infrastructure a cloud provider uses is probably far more reliable than anything you could build yourself, that does not mean it cannot break or go down. A good cloud provider (like AWS) is not built upon the idea of being up 100% of the time; it is built upon the idea of giving you the opportunity and resources to work around hardware and/or network failure. This is where the term ‘Design for failure’ comes into play. It is a way of designing your infrastructure around the simple fact that every resource could fail. That applies not only to a single EC2 node, but also to an entire availability zone, which can go down as well.
Amazon Web Services offers a multi-AZ option for many of its services. It is built-in functionality that detects failure in an availability zone and automatically activates a passive replica in another location. The RDS service of AWS is an example of that: it keeps a passive replica of the database in a different AZ.
The AWS ELB (Elastic Load Balancing) service also offers a way of load balancing traffic across multiple AZs. Combine it with an auto-scaling group that is configured to deploy EC2 instances across all AZs and you have a winning combination!
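To make the ‘an entire availability zone can go down’ point concrete, here is a small back-of-the-envelope sketch. It assumes instances are spread evenly across AZs and that you want to keep full capacity when any single AZ is lost; the function name and the numbers are made up for illustration.

```python
import math


def instances_per_az(required_instances: int, az_count: int) -> int:
    """Instances to run in each AZ so that losing any one AZ still
    leaves at least `required_instances` running."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive the loss of one")
    surviving_azs = az_count - 1
    return math.ceil(required_instances / surviving_azs)


# Hypothetical example: normal load needs 6 instances, spread over 3 AZs.
per_az = instances_per_az(required_instances=6, az_count=3)
total = per_az * 3
print(f"{per_az} per AZ, {total} in total")
```

The gap between the total and the number of instances you actually need is the price of surviving an AZ outage, which leads straight into the next topic: cost.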
Amazon works with a pay-as-you-go model, just like most other cloud providers. It is a simple model that makes you pay for resources only while they are active. Although the pricing of individual services seems very low, at the end of the month you will have a significant bill to pay, especially if you have a great ‘Design for failure’ mindset.
To keep your platform within a decent price range it is important to figure out how many active resources you need at what specific time. Some money-saving tips:
Are you running a test environment on AWS? Does it have to be online after office hours? Probably not!
Does your application have peak usage at night? Lower the number of active resources during the quiet periods. Auto-scaling groups offer time-based scheduling, very handy!
Does that service really need the multi-AZ option? Or can your application survive without it for a while? Caching mechanisms are a good example.
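The first tip is easy to put into numbers. The sketch below compares a test environment that runs around the clock with one that only runs during office hours. The hourly price, the instance count and the office hours are all hypothetical; plug in your own figures.

```python
HOURS_PER_WEEK = 24 * 7


def weekly_cost(hourly_price: float, instances: int, hours_per_week: float) -> float:
    """Pay-as-you-go cost: you only pay for the hours instances are active."""
    return hourly_price * instances * hours_per_week


HOURLY_PRICE = 0.10       # hypothetical price per instance-hour
INSTANCES = 4             # hypothetical size of the test environment
OFFICE_HOURS = 5 * 10     # five working days of ten hours each

always_on = weekly_cost(HOURLY_PRICE, INSTANCES, HOURS_PER_WEEK)
office_only = weekly_cost(HOURLY_PRICE, INSTANCES, OFFICE_HOURS)
savings = 1 - office_only / always_on

print(f"always on: {always_on:.2f}, office hours only: {office_only:.2f}")
print(f"savings: {savings:.0%}")
```

With these made-up numbers, switching the environment off outside office hours cuts the bill by roughly 70%.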
The cost-effectiveness of your infrastructure is a challenge you never thought you would face, but you will love it!
Once you have finally convinced your manager to start using the AWS cloud infrastructure, one of the many things they will want to discuss with you is the ‘exit strategy’. If your company or application has some bad experiences, you could be faced with migrating away from AWS. That’s why you want your ‘vendor lock-in’ to be kept to a minimum.
I have learned that if you want to use the AWS cloud to the fullest, there will be some minor vendor lock-in. It is a simple fact you have to accept. Not only could your application benefit from using the rich AWS API, but your infrastructure-as-code (Chef, Puppet, Ansible, etc.) will profit from it as well. Fortunately many AWS services are based on open source applications. An example of that is ElastiCache, which runs Redis (or Memcached) and can act as a drop-in replacement for a self-hosted instance.
Traditionally you (the System Engineer) would be in charge of fitting the application into the infrastructure. In this case the application must also fit the cloud platform. This is a situation where Dev really meets Ops!
Auto-scaling is cool, but mostly too late.
The auto-scaling functionality within AWS has been one of the best toys I have played with in years, although there are pitfalls.
Auto-scaling works very well for steadily increasing workloads. Simple batch processing is an example of that. In our case we normally had a steady number of visitors. But with marketing campaigns (especially on social media) the number of visitors would skyrocket to 10–20 times the normal level within a couple of minutes. Because we wanted to be cost-effective we did not want to over-provision, which meant we only had enough resources active for ‘normal’ usage. Auto-scaling would kick in after a couple of minutes, but because of the grace period combined with the time a node needs to provision, the result would be downtime of the website. That is especially bad when your marketing campaign is hitting the top 10 of posts on the front page of your followers. Downtime is killing!
To work around the fact that auto-scaling was always too late in our specific situation, we implemented a custom marketing calendar. It is an application that lets marketing personnel schedule a marketing campaign and assign it an estimated ‘weight’. A scheduled task picks up the estimate and schedules the appropriate number of nodes to be provisioned by the auto-scaling group. With this implementation we are now able to provision enough resources before the actual campaign starts, which has saved us many times! This is an example of a little bit of vendor lock-in that you can afford.
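A rough way to see why reactive scaling loses this race is to add up the delays. The sketch below is only illustrative arithmetic: the three delays and the ramp-up time of the spike are hypothetical numbers, not measurements from our platform.

```python
def scale_out_delay(alarm_s: int, provisioning_s: int, grace_s: int) -> int:
    """Seconds between the start of a spike and new nodes taking traffic."""
    return alarm_s + provisioning_s + grace_s


# Hypothetical figures: the alarm needs two minutes of high load to fire,
# a node takes five minutes to provision, and the grace period before it
# counts as healthy is three minutes.
delay = scale_out_delay(alarm_s=120, provisioning_s=300, grace_s=180)

spike_ramp_s = 180  # assumed: traffic reaches its peak within three minutes
shortfall_s = max(0, delay - spike_ramp_s)

print(f"new capacity after {delay}s, under-provisioned for {shortfall_s}s")
```

As long as the spike ramps up faster than the total delay, there is a window in which the existing nodes have to absorb the full peak on their own.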
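A minimal sketch of what such a scheduled task could compute is shown below. It is not our actual implementation: the `Campaign` class, the meaning of ‘weight’ as a traffic multiplier, the 30 minute lead time and the group name are all assumptions made for illustration. The keys of the returned dictionary follow the parameter names of the boto3 Auto Scaling call `put_scheduled_update_group_action`, so in practice the dictionary could be passed to that call as keyword arguments.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Campaign:
    name: str
    starts_at: datetime
    weight: float  # assumed meaning: expected traffic as a multiple of normal


def scheduled_action(campaign: Campaign, baseline_nodes: int, group_name: str,
                     lead_time: timedelta = timedelta(minutes=30)) -> dict:
    """Build the parameters for a scheduled scale-out ahead of a campaign."""
    desired = math.ceil(baseline_nodes * campaign.weight)
    return {
        "AutoScalingGroupName": group_name,
        "ScheduledActionName": f"campaign-{campaign.name}",
        # Start early so the nodes are provisioned before the traffic arrives.
        "StartTime": campaign.starts_at - lead_time,
        # The group's maximum size must allow this capacity.
        "DesiredCapacity": desired,
    }


# Hypothetical campaign expecting ten times the normal traffic.
campaign = Campaign("spring-sale",
                    datetime(2016, 5, 1, 12, 0, tzinfo=timezone.utc), 10)
action = scheduled_action(campaign, baseline_nodes=4, group_name="web-asg")
print(action)
```

A second scheduled action after the campaign would be needed to scale back in, otherwise you keep paying for the extra nodes.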
Pretty good support!
I don’t have very good experiences with the support departments of very large companies, so I was pleasantly surprised by the support department of AWS. Most of my support requests are handled within an hour, with an in-depth answer from one of the engineers. Pretty great!