AWS and complex costs: an example

By Eelko de Vos

AWS has tons of ways to bill you for its great (no sarcasm here!) services. However, small changes on AWS’s side can cost you tons of money if you don’t pay attention.

The Crime

We experienced this again just a few days ago. Our AWS Config costs had risen dramatically. AWS Config is a service that keeps track of all changes to your AWS resources. Each configuration change is recorded and stored, and each record eats up some of your budget. Not much, but a little.

The Suspects

Around the same time our company had started to work with AWS Control Tower. As Control Tower is thoroughly linked to AWS Config, we suspected that Control Tower had somehow increased the number of AWS resource changes.

We had not changed our MO nor had we introduced anything that might have caused the costs for AWS Config to increase. But still, the costs were there.

The Investigation

When we dug into the actual records, we found that something called the “ssm-agent” was doing odd stuff on the EC2 servers. It was ping-ponging between two versions every few minutes on each and every EC2 instance. This created tons of configuration records in AWS Config and seemed to be the cause of our cost increase. But what was happening here?
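The ping-pong pattern described above can be spotted mechanically once the records are exported. The following is a minimal sketch, not the tooling we actually used: it assumes each record has been flattened into a dict with the hypothetical keys resource_id, captured_at (any sortable timestamp) and agent_version, and the version labels "A" and "B" are placeholders.

```python
from collections import defaultdict


def find_flapping(records, min_flips=3):
    """Return {resource_id: flip_count} for resources whose recorded
    version alternates between exactly two values.

    The record keys used here are hypothetical; adapt them to however
    you export your configuration history.
    """
    by_resource = defaultdict(list)
    # Process records in time order so consecutive versions can be compared.
    for rec in sorted(records, key=lambda r: r["captured_at"]):
        by_resource[rec["resource_id"]].append(rec["agent_version"])

    flapping = {}
    for resource_id, versions in by_resource.items():
        # A "flip" is any pair of consecutive records with different versions.
        flips = sum(1 for a, b in zip(versions, versions[1:]) if a != b)
        if len(set(versions)) == 2 and flips >= min_flips:
            flapping[resource_id] = flips
    return flapping


# Made-up sample data: i-aaa flaps between two versions, i-bbb is stable.
records = [
    {"resource_id": "i-aaa", "captured_at": t, "agent_version": v}
    for t, v in enumerate(["A", "B", "A", "B", "A"])
] + [
    {"resource_id": "i-bbb", "captured_at": t, "agent_version": "A"}
    for t in range(5)
]

print(find_flapping(records))
```

With the sample data only i-aaa is reported. A normal upgrade shows up as a single flip, which is why the sketch asks for several flips before flagging a resource.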

We had followed AWS’s guidelines on how to install the ssm-agent to the letter. There are numerous pages at docs.aws.amazon.com that show you how to install this agent manually or with a script. So we had included it in our deployment pipelines as a separate argocd-application.

But somehow this second ssm-agent crept up into all our EC2s. Where did it come from?

The Culprit

A quick search with the right terms turned up an AWS AMI release page. AWS AMI release 20210621 suddenly had a different “notable change” than usual. Usually these releases contain security updates and kernel updates. But now the AMI changelog showed this change:

  • The SSM Agent will now be automatically installed

This caused the two ssm-agents to fight over the configuration on the EC2 instances, resulting in thousands and thousands of configuration records in AWS Config per day. When we rolled out new nodes for our clusters (usually through autoscaling), each of these new EC2 instances started to suffer from this problem. And when we updated entire nodegroups, the effect was more than significant.

AWS Cost Explorer showed that the costs indeed increased sharply each time we updated nodegroups. In the end, the costs for AWS Config went up a thousandfold.
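To get a feel for how a tiny per-record charge adds up, here is a rough back-of-the-envelope sketch. Every number in it is a made-up assumption, not a figure from our account: the instance count, the flip rate, and the price per recorded configuration item are placeholders, so check the current AWS Config pricing page for your region before relying on any of them.

```python
def estimate_daily_config_cost(instances, flips_per_hour, price_per_item):
    """Estimate recorded configuration items per day and their cost.

    Assumes every version flip on every instance produces exactly one
    recorded configuration item, which is a simplification.
    """
    items_per_day = instances * flips_per_hour * 24
    return items_per_day, items_per_day * price_per_item


# Hypothetical inputs: 50 instances, one flip every 5 minutes,
# and an assumed price of 0.003 per configuration item.
items, cost = estimate_daily_config_cost(
    instances=50, flips_per_hour=12, price_per_item=0.003
)
print(f"{items} items per day, roughly {cost:.2f} per day")
```

With these placeholder inputs that is 14,400 configuration items per day from the flapping alone, and the number scales linearly with every node that autoscaling or a nodegroup update adds.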

The Remedy

The remedy was easy: uninstall our own ssm-agent and from now on rely on the one that ships with the AMI. The costs dropped sharply and immediately.
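If you would rather keep an install step in your pipeline, one way to avoid a second agent is to make that step conditional. This is only a sketch of an alternative, not what we did: it assumes the agent binary is called amazon-ssm-agent and is on the PATH when installed, which may not hold for every distribution or packaging method.

```python
import shutil


def should_install_agent(binary="amazon-ssm-agent", which=shutil.which):
    """Return True only when no agent binary is found on the PATH.

    The binary name is an assumption; `which` is injectable so the
    check can be exercised without touching the real system.
    """
    return which(binary) is None


if should_install_agent():
    print("No agent found, the install step would run here")
else:
    print("Agent already present, skipping install")
```

The check is deliberately naive: it only looks for a binary and knows nothing about versions or about agents installed outside the PATH.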

The Next Step

The next step is to contact AWS Support and negotiate about the costs of this problem, which AWS itself introduced. Luckily I’m an engineer, not a commercial manager. So we leave that part to the commercial managers and dive into the next intriguing technical issue.

A second thing we learned is not to blindly trust routine changes from your partners. Sometimes they contain uncommon ones that definitely require your attention. Or to quote from an old but great movie: “Big things have small beginnings.” So you’d better pay attention to these small beginnings, no matter how insignificant they seem.