Introducing the FinOps engineer

Infrastructure Financial Engineer at Wix

May 24, 2020

It’s 2020 and running infrastructure on the cloud has become quite standard for many companies. This is underscored by a Canalys report that shows 2019 global cloud spend at $107 billion. This spend on cloud services is spread across AWS with 32.4% of the market, Azure at 17.6%, Google Cloud at 6%, Alibaba Cloud at 5.4%, and other smaller cloud players with a combined 38.5% of market share.

Yet despite all the growth, deploying infrastructure on the cloud comes with many challenges. Perhaps the biggest, usually discovered after receiving increasingly large cloud bills, is the financial aspect.

In this post, based on our experience at Wix, we will introduce the role of FinOps engineers along with some of the technical and financial approaches they employ for successful cloud cost management.

What really is driving your cloud bill sky-high

It has been said that the cost of your cloud infrastructure is not a derivative of the number of customers you have, but rather the number of engineers. The mere introduction of a new feature, even without any additional customers, can significantly increase costs simply because of its architectural design, resource utilization and the like.

To ensure that rapid innovation and increasing scale remain within budget, a new role has been born, that of the FinOps engineer. The FinOps engineer combines intimate knowledge of software development requirements with expertise in finding optimal and cost-efficient solutions for deployment in the cloud.

As this field is growing and starting to better define itself, we can point to the main areas of responsibility of the FinOps engineer:

Creating full visibility for any financial aspect related to the company’s hosting solutions
Creating governance on spend on a daily basis
Reviewing architectures and deployment topologies
Creating optimization tasks and data driven recommendations
Working closely with the finance team for budgeting and forecasting activities
Establishing and maintaining the connection between the operations, procurement and finance teams

These FinOps activities require companies to culturally adapt as they have in the past for DevOps and agile development methodologies.

The goal for FinOps engineers is to create visibility into cloud spend across the organization and to lead the technical optimization that improves performance and efficiency and reduces cost while activity scales up. To do that, FinOps engineers need to work on two fronts:

Architecture and deployment topology
Financial tools and reports

Let’s explore each one.

Cloud architecture, deployment topology and how money makes all the decisions

When we approach the exciting assignment of deploying (or reviewing) our infrastructure in the cloud, we often face a great number of options. With the phrase “anything is technically possible” often thrown around, it is sometimes very hard to understand and choose the right architecture for your organization. Common practice in the cloud is to choose between dozens of instance types and sizes, several storage options, different network architectures based on regions and availability zones, and much more (e.g. managed services, licenses, marketplace offerings etc.).

Surprisingly, when we add cost as a KPI for cloud architectural excellence, things become a lot clearer when choosing which tools we want to use. Let’s look at some examples.

Choosing the right EBS volumes

When you need fast storage for your instance, many folks will tell you to use provisioned IOPS SSD (io1) volumes. If you require only the IOPS and can live with less throughput, deploying a larger gp2 volume, sized so that it supports the IOPS you need, is usually more affordable.
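To make the trade-off concrete, here is a minimal sketch of the comparison. It relies on gp2 baseline performance of 3 IOPS per GiB with a ceiling of 16,000 IOPS per volume. The prices are assumptions for illustration only (roughly what us-east-1 listed around the time of writing), and the function names are ours, so plug in the current price list before drawing conclusions.

```python
import math

# Assumed, illustrative us-east-1 prices; check the current AWS price list.
GP2_PER_GIB_MONTH = 0.10
IO1_PER_GIB_MONTH = 0.125
IO1_PER_IOPS_MONTH = 0.065

GP2_IOPS_PER_GIB = 3    # gp2 baseline performance scales with volume size
GP2_MAX_IOPS = 16_000   # a single gp2 volume cannot go beyond this


def gp2_size_for_iops(required_iops, data_gib):
    """Smallest gp2 size (GiB) that holds the data and reaches the IOPS baseline."""
    if required_iops > GP2_MAX_IOPS:
        raise ValueError("gp2 tops out at 16,000 IOPS per volume")
    return max(data_gib, math.ceil(required_iops / GP2_IOPS_PER_GIB))


def gp2_monthly_cost(required_iops, data_gib):
    """Monthly cost of a gp2 volume oversized to deliver the required IOPS."""
    return gp2_size_for_iops(required_iops, data_gib) * GP2_PER_GIB_MONTH


def io1_monthly_cost(required_iops, data_gib):
    """Monthly cost of an io1 volume: storage plus provisioned IOPS.

    This sketch does not check io1's own limits on IOPS per GiB.
    """
    return data_gib * IO1_PER_GIB_MONTH + required_iops * IO1_PER_IOPS_MONTH


if __name__ == "__main__":
    iops, data = 6000, 500  # hypothetical workload: 6,000 IOPS, 500 GiB of data
    print(f"gp2: {gp2_size_for_iops(iops, data)} GiB, ${gp2_monthly_cost(iops, data):.2f}/month")
    print(f"io1: {data} GiB, ${io1_monthly_cost(iops, data):.2f}/month")
```

Under these assumed prices the oversized gp2 volume comes out cheaper, but the comparison says nothing about throughput or latency consistency, which is exactly the part you have to be able to live without.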

Choosing the right instance types

Sometimes with the huge selection of instances, we can get lost in choosing the right one for us. Things get more complicated when we’re not just looking at the general information of the instance (number of cores, RAM, IO) but rather the actual hardware and performance metrics to check how well the instance performs.

But with patience and research, you can find the computing power you need at the right price. For example, when AWS released the c5 family we got really excited: it was not only about 15% less expensive than the c4 family, it also came with Skylake CPUs, more memory and better IO. In practical terms, with one thousand c5.2xlarge hosts deployed, you could save over $40,000 monthly at on-demand prices compared to the older c4.2xlarge. And that’s without even talking about reserved instances.

Another incredible example is the release of the c5n type, which can reach up to 25 Gbps of network bandwidth, something that was barely possible on the biggest sizes of previous generations.

For simulations, in-memory caches, data lakes, and other communication-intensive applications the savings are absolutely astronomical. For example, you can enjoy even better network performance with the c5n.large than you would with the older c4.8xlarge AND you would pay about 93% less, with the c5n.large costing only around $79/month and the c4.8xlarge still super pricey at around $1,164/month (in us-east-1).
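The arithmetic behind both comparisons is simple enough to script. The sketch below is only an illustration: the hourly rates are our assumption of the us-east-1 on-demand Linux prices at the time, 730 is the usual approximation for hours in a month, and the monthly figures are the rounded ones quoted above.

```python
HOURS_PER_MONTH = 730  # common approximation for one month of runtime

# Assumed us-east-1 on-demand Linux hourly rates; verify against current pricing.
C4_2XLARGE_HOURLY = 0.398
C5_2XLARGE_HOURLY = 0.340


def monthly_fleet_savings(old_hourly, new_hourly, hosts):
    """Monthly savings from moving a whole fleet to a cheaper instance type."""
    return (old_hourly - new_hourly) * HOURS_PER_MONTH * hosts


def percent_cheaper(old_monthly, new_monthly):
    """How much cheaper the new monthly price is, as a percentage of the old one."""
    return (1 - new_monthly / old_monthly) * 100


fleet = monthly_fleet_savings(C4_2XLARGE_HOURLY, C5_2XLARGE_HOURLY, hosts=1000)
reduction = percent_cheaper(old_monthly=1164, new_monthly=79)

print(f"c4.2xlarge -> c5.2xlarge, 1,000 hosts: ${fleet:,.0f} saved per month")
print(f"c4.8xlarge -> c5n.large: {reduction:.1f}% cheaper per instance")
```

Keeping the rates as named constants makes it easy to rerun the numbers whenever a new instance generation or a price cut is announced.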

Choosing the right locations

Here things get a bit trickier, as the decision no longer depends only on technical and financial factors. Things like proximity to your users, GDPR regulations, privacy, security, on-premises locations that need a connection to the cloud and other issues, all need to be considered.

While cloud providers offer several “regions” where you can deploy, each region typically has different pricing. Sometimes deploying the exact same infrastructure in a specific region can translate into thousands of dollars in monthly savings. Within regions themselves, there are usually multiple availability zones, each made up of one or more separate data centers.

Based on this, there are several architectures you can use, such as “single availability zone – multiple regions”, “multiple availability zone – single region”, “multiple availability zone – multiple regions” and the like. When you’re deploying your infrastructure based on these architectures you need to also consider the availability of your services, how fast you can shift traffic, the hardware capacity of the desired availability zone and more.

For example, running on a single availability zone might damage your HA architecture, as you might face a capacity limit or availability zone downtime, but the financial benefits can be really great because traffic within the same availability zone is typically free. It is definitely something to consider.
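To get a feel for what that traffic is worth, here is a rough sketch. The rate is an assumption: AWS has charged around $0.01 per GB in each direction for traffic that crosses availability zones within a region, so check the current data transfer pricing for your provider and region. The function name and the traffic volume are hypothetical.

```python
# Assumed rate in USD per GB, billed on both the sending and the receiving side.
CROSS_AZ_PER_GB_EACH_WAY = 0.01


def cross_az_monthly_cost(gb_per_month, rate=CROSS_AZ_PER_GB_EACH_WAY):
    """Estimated monthly bill for traffic crossing availability zones.

    The factor of 2 reflects that each GB is charged once when it leaves
    one zone and once when it enters the other.
    """
    return gb_per_month * rate * 2


if __name__ == "__main__":
    # Hypothetical service exchanging 50 TB a month between zones.
    print(f"${cross_az_monthly_cost(50_000):,.2f} per month")
```

A number like this is what you weigh against the availability risk of running in a single zone.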

Let’s now take a look at some available cloud tools and reports for better managing your cloud costs.

Cloud knowledge is cloud power

To gain visibility into your cloud spend you can use several tools and reports to investigate. I’ll be focusing on AWS and GCP as Wix is mainly working with these two vendors.

AWS Tools

You have “out-of-the-box” tools such as Cost Explorer, which shows graphs and trends; Trusted Advisor, which provides right-sizing recommendations and other cost recommendations such as purchasing plans (reservations, Savings Plans, spot instances, etc.); budget alerts; and more.

The most powerful tools you can use for cost management are tags. Cost allocation tags can create complete visibility into your infrastructure, separating a single account’s activity into business units, service descriptions, agenda owners, and more. You can enable the tags that interest you in the cost reports, thereby gaining huge insights into your organization’s cloud activities.

For more technical engineers who want to automate things, you have the Cost Explorer API and SDK libraries supporting Go, Python, PHP, Ruby, .NET, Java and more to connect to your EC2 activities, CloudWatch metrics, etc.
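As an illustration of combining tags with the Cost Explorer API, here is a minimal sketch using boto3’s `get_cost_and_usage` call. The tag key `team`, the dates and the helper names are hypothetical, the tag has to be activated as a cost allocation tag first, and the parsing assumes grouped keys come back in the form `tagkey$value`. Pagination via `NextPageToken` is left out for brevity.

```python
from collections import defaultdict
from decimal import Decimal


def build_request(start, end, tag_key):
    """Request parameters for monthly unblended cost grouped by one tag key."""
    return {
        "TimePeriod": {"Start": start, "End": end},  # dates as YYYY-MM-DD
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "TAG", "Key": tag_key}],
    }


def cost_by_tag(response):
    """Sum the cost per tag value across all periods in a response."""
    totals = defaultdict(Decimal)
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            # Keys are assumed to look like "team$platform"; an empty value
            # after the "$" means the resource carried no such tag.
            _, _, value = group["Keys"][0].partition("$")
            amount = Decimal(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[value or "(untagged)"] += amount
    return dict(totals)


def fetch_cost_by_tag(start, end, tag_key):
    """Call Cost Explorer; needs boto3 and AWS credentials (first page only)."""
    import boto3  # imported lazily so the parsing above works without it

    client = boto3.client("ce")
    response = client.get_cost_and_usage(**build_request(start, end, tag_key))
    return cost_by_tag(response)
```

Calling something like `fetch_cost_by_tag("2020-04-01", "2020-05-01", "team")` would then give one total per team, which is a good starting point for a daily governance report.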

AWS Reports

DBR or Detailed Billing Report – this is the legacy report that contains your usage information
CUR or Cost and Usage Report – this is the main billing report that contains not only usage and cost information, but also technical information (such as the CPU clock speed and the number of virtual cores), as well as details on any existing reservations and Savings Plans.

Both reports are exported to S3 and can be used as a data source for your own dashboarding tool.
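If you don’t have a dashboarding tool yet, even a few lines of code over a downloaded report go a long way. The sketch below assumes a CUR file in CSV format that has already been fetched from S3 and decompressed, and it relies on the `lineItem/UnblendedCost` and `lineItem/ProductCode` columns. The function name is ours, and the column to group by can be swapped for a tag column if your report includes one.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def cost_by_column(cur_csv, group_column="lineItem/ProductCode",
                   cost_column="lineItem/UnblendedCost"):
    """Total unblended cost per value of group_column in a CUR CSV file object."""
    totals = defaultdict(Decimal)
    for row in csv.DictReader(cur_csv):
        group = row.get(group_column) or "(unknown)"
        totals[group] += Decimal(row.get(cost_column) or "0")
    return dict(totals)


if __name__ == "__main__":
    # Tiny made-up sample standing in for a real report file.
    sample = io.StringIO(
        "lineItem/ProductCode,lineItem/UnblendedCost\n"
        "AmazonEC2,12.50\n"
        "AmazonS3,1.25\n"
        "AmazonEC2,7.50\n"
    )
    for product, cost in sorted(cost_by_column(sample).items()):
        print(product, cost)
```

For real reports, which can run to millions of rows, a query engine is usually a better fit than a script, but the idea of grouping line items by a column stays the same.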

GCP Tools

Google has a “Compute Engine rightsizing recommendation” out of the box and a “Report” page with a utilization graph where you can see your activities. It also has different purchasing plans, such as committed use discounts (similar to AWS’s reservations and Savings Plans) and preemptible VMs (like AWS’s spot instances). You can manage your budget and alerts with automated budget actions, billing export and billing APIs.

Google also supports tags (known as labels in GCP), but they are somewhat more limited: keys and values can contain only lowercase letters, numeric characters, underscores, and dashes, and all characters must use UTF-8 encoding.
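If you tag resources on both clouds, these restrictions mean AWS-style tags often need to be normalized before they can be used as GCP labels. Here is a simplified sketch that handles ASCII only. It also assumes the documented 63-character limit and the rule that keys must start with a lowercase letter. The helper names and the `k_` prefix are our own convention, not anything GCP prescribes.

```python
import re

MAX_LABEL_LENGTH = 63  # assumed limit for both keys and values
_INVALID_CHARS = re.compile(r"[^a-z0-9_-]")


def to_gcp_label_value(text):
    """Lowercase the text and replace every disallowed character with '_'."""
    return _INVALID_CHARS.sub("_", text.lower())[:MAX_LABEL_LENGTH]


def to_gcp_label_key(text):
    """Like a value, but keys must be non-empty and start with a lowercase letter."""
    key = to_gcp_label_value(text)
    if not key or not ("a" <= key[0] <= "z"):
        key = "k_" + key  # hypothetical prefix to satisfy the first-character rule
    return key[:MAX_LABEL_LENGTH]


def to_gcp_label(tag_key, tag_value):
    """Convert one AWS-style tag into a (key, value) pair usable as a GCP label."""
    return to_gcp_label_key(tag_key), to_gcp_label_value(tag_value)


if __name__ == "__main__":
    print(to_gcp_label("Cost Center", "Prod/EU West"))
```

Note that different tags can collapse into the same label after normalization, so it is worth checking for collisions before applying labels in bulk.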

Google also has client libraries integrated with the major services of GCP supporting Go, Java, Node.js, Python, Ruby and more.

GCP Reports

GCP doesn’t offer a traditional report file. Instead, the billing information is exported to BigQuery (BQ), spread across several related billing tables. If you want to use all the detailed information on your billing, you’ll have to query BQ to get it.
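A query against the billing export might look like the sketch below. It assumes the standard usage cost export schema, with `service.description`, `cost` and `invoice.month` fields, and uses the google-cloud-bigquery client. The table name you pass in is whatever your own export is called, and the helper names are hypothetical.

```python
import re

# Only allow characters that normally appear in project.dataset.table names.
_TABLE_NAME = re.compile(r"^[A-Za-z0-9_.\-]+$")


def cost_by_service_sql(table):
    """Build a query that totals cost per service for one invoice month."""
    if not _TABLE_NAME.match(table):
        raise ValueError(f"unexpected characters in table name: {table!r}")
    return (
        "SELECT service.description AS service_name,\n"
        "       ROUND(SUM(cost), 2) AS total_cost\n"
        f"FROM `{table}`\n"
        "WHERE invoice.month = @invoice_month\n"
        "GROUP BY service_name\n"
        "ORDER BY total_cost DESC"
    )


def fetch_cost_by_service(table, invoice_month):
    """Run the query; needs google-cloud-bigquery and GCP credentials.

    invoice_month is a string such as "202004".
    """
    from google.cloud import bigquery  # imported lazily, only needed to run the query

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("invoice_month", "STRING", invoice_month)
        ]
    )
    rows = client.query(cost_by_service_sql(table), job_config=job_config).result()
    return {row["service_name"]: row["total_cost"] for row in rows}
```

Passing the month as a query parameter rather than formatting it into the SQL keeps the query reusable and avoids quoting mistakes.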

FinOps is forever

FinOps (or, as we call it at Wix, Cloud Financial Engineering) is not a one-time effort, but an ongoing, challenging, day-to-day effort.

We have just given a basic overview of some of the most common questions you need to ask when evaluating your deployment architecture. As we have shown, it’s not enough just to know what will solve your problem technically. You really need to think about what the most cost-effective solution is before making a decision. Likewise, we have covered some of the more basic tools and reports that you need to be familiar with to understand your cloud spend.

But this is just the beginning. The value that a FinOps engineer brings to a company is huge, and the role is an integral part of either a CCOE (cloud center of excellence) or any other team dedicated to monitoring, governing and setting KPIs for the organization’s cloud expenses.

The best part is that you don’t need to know it all or do it all yourself.

Many companies, such as Spot.io, can provide the knowledge and automations to help you succeed in this journey, from creating deep cost visibility to actually running your infrastructure for you in the most optimal way. That way you can enjoy the savings without having to worry about availability or getting involved in managing all the underlying infrastructure yourself.

If you have read this far, you have already taken your first step in your journey to FinOps excellence. Safe travels!