AWS and complex costs: an example

By Eelko de Vos

AWS has tons of ways to bill you for its great (no sarcasm here!) services. However, small changes on its side can cost you a lot of money if you don’t pay attention.

The Crime

We experienced this again just a few days ago. The costs of AWS Config had risen dramatically. AWS Config is a service that keeps track of all changes to your AWS resources. Each configuration change is recorded and stored, and each record eats up some of your budget. Not much, but a little.

The Suspects

Around the same time, our company had started to work with AWS Control Tower. As Control Tower is tightly integrated with AWS Config, we suspected that Control Tower had somehow increased the number of recorded AWS resource changes.

We had not changed our MO nor had we introduced anything that might have caused the costs for AWS Config to increase. But still, the costs were there.

The Investigation

When we dove into the actual records, we found that on the EC2 instances something called the “ssm-agent” was doing odd stuff. It was ping-ponging between two versions every few minutes on each and every EC2 instance. This caused tons of configuration records to be created in AWS Config and seemed to be the cause of our cost increase. But what was happening here?
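To make the ping-pong pattern concrete, here is a minimal sketch of how such flapping could be spotted in a pile of change records. The record shape (`resource_id`, `captured_at`, `agent_version`) and the version labels are hypothetical placeholders for illustration; they are not the AWS Config schema and not the tooling we actually used.

```python
from collections import defaultdict


def find_flapping_resources(records, min_flips=3):
    """Return {resource_id: flip_count} for resources whose recorded value
    keeps alternating between exactly two values.

    `records` is a list of dicts with hypothetical keys:
    resource_id, captured_at (any sortable value), agent_version.
    """
    history = defaultdict(list)
    # Order the records in time so consecutive entries can be compared.
    for rec in sorted(records, key=lambda r: r["captured_at"]):
        history[rec["resource_id"]].append(rec["agent_version"])

    flapping = {}
    for resource_id, versions in history.items():
        # A "flip" is any change between two consecutive records.
        flips = sum(1 for prev, cur in zip(versions, versions[1:]) if prev != cur)
        if len(set(versions)) == 2 and flips >= min_flips:
            flapping[resource_id] = flips
    return flapping


# Illustrative data only: "v1"/"v2" stand in for the two competing versions.
sample_records = [
    {"resource_id": "i-aaa", "captured_at": 1, "agent_version": "v1"},
    {"resource_id": "i-aaa", "captured_at": 2, "agent_version": "v2"},
    {"resource_id": "i-aaa", "captured_at": 3, "agent_version": "v1"},
    {"resource_id": "i-aaa", "captured_at": 4, "agent_version": "v2"},
    {"resource_id": "i-bbb", "captured_at": 1, "agent_version": "v1"},
    {"resource_id": "i-bbb", "captured_at": 2, "agent_version": "v1"},
    {"resource_id": "i-bbb", "captured_at": 3, "agent_version": "v2"},
]

print(find_flapping_resources(sample_records))
```

A resource that upgrades once (like `i-bbb` here) is not flagged; only one that keeps bouncing between the same two values is.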

We had followed AWS’s guidelines on how to install the ssm-agent to the letter. There are numerous pages at docs.aws.amazon.com that show you how to script or manually install this agent. So we had included it in our deployment pipelines as a separate Argo CD application.

But somehow a second ssm-agent had crept into all our EC2 instances. Where did it come from?

The Culprit

A quick search with the right terms turned up an AWS AMI release page. AWS AMI release 20210621 suddenly had a different “notable change” than usual. Usually these releases contain security updates and kernel updates. But now the AMI changelog showed this change:

  • The SSM Agent will now be automatically installed

This caused the two ssm-agents to fight over the configuration on each EC2 instance, resulting in thousands and thousands of configuration records in AWS Config per day. When we rolled out new nodes for our clusters, usually through autoscaling, each of these new EC2 instances started to suffer from this problem. And when we updated entire nodegroups, the effect was more than significant.

AWS Cost Explorer showed that the costs indeed increased sharply each time we updated nodegroups. In the end, the costs for AWS Config went up a thousandfold.
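A back-of-the-envelope sketch shows how quickly this adds up. Every number below is an assumption for illustration, not our actual figures: a price of $0.003 per recorded configuration item (check the current AWS Config pricing for your region), 100 instances, and one recorded change per instance every 5 minutes.

```python
# Assumption: per-item price; verify against current AWS Config pricing.
PRICE_PER_CONFIG_ITEM_USD = 0.003


def daily_config_cost(instances, flip_interval_minutes,
                      price_per_item=PRICE_PER_CONFIG_ITEM_USD):
    """Estimate configuration items and cost per day, assuming every flip
    on every instance records exactly one configuration item."""
    items_per_instance = (24 * 60) // flip_interval_minutes
    total_items = instances * items_per_instance
    return total_items, total_items * price_per_item


# Hypothetical fleet: 100 instances, one flip every 5 minutes.
items, cost = daily_config_cost(instances=100, flip_interval_minutes=5)
print(f"{items} configuration items/day, about ${cost:.2f}/day")
# prints: 28800 configuration items/day, about $86.40/day
```

A fraction of a cent per record looks harmless, until every node in every nodegroup produces hundreds of records a day.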

The Remedy

The remedy was easy: uninstall our own ssm-agent and from now on rely on the one that ships with the AMI. The costs dropped sharply and immediately.
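If you still need an install step for images that do not ship the agent, one possible guard is to check for an existing agent first. This is only a sketch, not what we did: it assumes the agent binary is called `amazon-ssm-agent` and is discoverable on the PATH, which may not hold for every distribution or packaging (snap installs, for example).

```python
import shutil


def agent_already_installed(binary="amazon-ssm-agent", which=shutil.which):
    """Return True if an agent binary is found on the PATH.

    `which` is injectable so the check can be exercised without a real host.
    The binary name is an assumption; adjust it for your image.
    """
    return which(binary) is not None


def should_install_agent(which=shutil.which):
    """Only install our own agent when the image does not bring one."""
    return not agent_already_installed(which=which)


if __name__ == "__main__":
    if should_install_agent():
        print("No ssm-agent found: safe to install our own.")
    else:
        print("ssm-agent already present: skipping our install.")
```

The point of the guard is simply that there is never more than one agent managing the same instance.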

The Next Step

The next step is to contact AWS Support and negotiate about the costs of this problem, which AWS itself introduced. Luckily I’m an engineer, not a commercial manager. So we leave that part to the commercial people and dive into the next intriguing technical issue.

A second thing we learned is not to blindly trust routine changes from your partners. Sometimes they contain uncommon ones that definitely require your attention. Or to quote from an old but great movie: “Big things have small beginnings.” So you’d better pay attention to these small beginnings, no matter how insignificant they seem.
