How is AWS EMR Priced?
Amazon Elastic MapReduce (EMR) is a managed service for big data processing and analysis. EMR is an Amazon Web Services (AWS) offering based on Apache Hadoop, a programming framework built for processing large data sets across distributed computing environments.
Amazon EMR can process big data across a Hadoop cluster of virtual servers running on Amazon Elastic Compute Cloud (EC2), typically with data stored in Amazon Simple Storage Service (S3). EMR comes with dynamic resizing capabilities, which enable the system to increase or reduce resource usage according to current demand.
There are several ways to run Amazon EMR, each with its own pricing. EMR can run directly on Amazon EC2 or on Amazon Elastic Kubernetes Service (EKS), with the actual instances running on EC2 or Fargate. EMR is priced per second of usage, on top of the regular costs for EC2 compute instances, Fargate vCPUs, and other services needed to run EMR jobs, such as storage.
This is part of our series of articles on AWS pricing.
In this article, you will learn:
- AWS EMR: Pricing for 3 Deployment Options
- AWS EMR Cost Optimization
- Amazon EMR Pricing Optimization with Spot by NetApp
AWS EMR: Pricing for 3 Deployment Options
There are three models for running EMR – on Amazon EC2, on AWS Outposts, which lets you run AWS resources on-premises, and on Amazon Elastic Kubernetes Service (EKS). For up-to-date pricing information, refer to the official pricing page.
EMR Pricing on Amazon EC2
Amazon EMR is priced at a per-second rate, billed in addition to the regular service prices. When deploying EMR on Amazon EC2, you pay for your chosen EC2 instances as well as an EMR surcharge.
For example, for an m4.16xlarge instance, the cost in the US East region is $3.20 per hour for the EC2 instance, and an additional $0.27 per hour for EMR (there is a corresponding EMR cost for all instance types).
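The arithmetic above can be sketched in a few lines. This is a simplified illustration using the example rates for an m4.16xlarge in US East; actual bills also include EBS storage, data transfer, and other services.

```python
# Illustrative cost sketch for EMR running on EC2. Rates are the
# example figures above (m4.16xlarge, US East) and may change;
# both EC2 and EMR are billed per second of usage.
EC2_RATE_PER_HOUR = 3.20   # m4.16xlarge on-demand rate
EMR_RATE_PER_HOUR = 0.27   # EMR surcharge for this instance type

def emr_on_ec2_cost(instance_count: int, runtime_seconds: float) -> float:
    """Combined EC2 + EMR cost for a cluster, billed per second."""
    hourly = (EC2_RATE_PER_HOUR + EMR_RATE_PER_HOUR) * instance_count
    return hourly * runtime_seconds / 3600

# A 10-node cluster running for 90 minutes:
print(round(emr_on_ec2_cost(10, 90 * 60), 2))  # prints 52.05
```

Note that the EMR surcharge alone adds roughly 8% to the EC2 bill at these rates, which is why instance selection matters for EMR cost planning.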
You can choose from any of the regular EC2 instance pricing models, including reserved instances and spot instances. If you have Amazon EBS volumes attached to your EC2 instances, you are also charged for EBS storage.
Related content: read our guide to AWS ECS pricing
EMR Pricing on AWS Outposts
AWS Outposts is a managed appliance that allows you to run AWS cloud services in your local data center. You can purchase a variety of AWS Outposts configurations featuring a combination of EC2 instance types, EBS gp2 volume, and S3 on Outposts. Pricing includes delivery of the appliance, installation, maintenance, and software updates.
Once you deploy an EC2 instance on AWS Outposts, the extra charge for running EMR on the instance is the same as in the Amazon cloud.
EMR Pricing on Amazon EKS
You can run EMR on Amazon Elastic Kubernetes Service (EKS) containers in two deployment models:
- EKS on Amazon EC2 – you pay for EC2 instance costs, with an additional charge of $0.10 per hour for each EKS cluster, and an additional charge for EMR, according to the EC2 instance type. This is the same price as you would pay for EMR when running directly on Amazon EC2.
- EKS on Fargate – you pay for Fargate according to the number of virtual CPUs (vCPUs) and the amount of RAM required for EMR. Fargate bills for EMR workloads based on the resources used from the time the EMR application image starts downloading, until the EMR job completes and the Amazon EKS Pod terminates, with a minimum charge of one minute.
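The Fargate billing rules above can be sketched as follows. The per-vCPU and per-GB rates are illustrative assumptions (approximate us-east-1 Fargate rates at time of writing); check the official pricing page for current figures.

```python
# Sketch of EMR-on-EKS Fargate billing. Rates are illustrative
# assumptions, not quotes. Billing runs from image download until
# the pod terminates, with a one-minute minimum charge.
VCPU_RATE_PER_HOUR = 0.04048   # assumed rate per vCPU-hour
GB_RATE_PER_HOUR = 0.004445    # assumed rate per GB of RAM per hour

def fargate_emr_cost(vcpus: float, memory_gb: float, seconds: float) -> float:
    billed_seconds = max(seconds, 60)  # one-minute minimum
    hours = billed_seconds / 3600
    return (vcpus * VCPU_RATE_PER_HOUR + memory_gb * GB_RATE_PER_HOUR) * hours

# A 4 vCPU / 16 GB job that finishes in 30 seconds is billed
# the same as one that runs a full minute:
print(fargate_emr_cost(4, 16, 30) == fargate_emr_cost(4, 16, 60))
```

The one-minute minimum matters mainly for very short jobs; for long-running EMR jobs, cost scales linearly with the vCPU and memory you request.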
Related content: read our guide to Amazon Fargate pricing
AWS EMR Cost Optimization
Here are a few tips and tricks you can use to save on Amazon EMR costs.
AWS Spot Instances
With EMR workloads, it is a great idea to use AWS spot instances instead of on-demand instances. Spot instances let you purchase unused EC2 capacity at a steep discount, with the price you pay varying according to current supply and demand on the Amazon spot market.
The cost of using spot instances can be up to 90% lower than equivalent on-demand instances. However, you need to manage spot instances carefully: AWS can reclaim them with only two minutes' notice when the capacity is needed for on-demand, reserved instance, or savings plans customers.
To improve resilience for EMR workloads, use the following strategies:
- Mix on-demand instances and spot instances—it is especially important to run an EMR master node on an on-demand instance to ensure the resilience of the cluster.
- Mix different instance types—to avoid having an entire cluster shut down in case of demand shifts, mix different instance types in the same cluster.
- Use a fallback mechanism—if a cluster using spot instances fails to launch, provide a backup mechanism that switches instance types to on demand, or look for spot instances with other instance types or running in other Amazon availability zones.
- Use EMR instance fleets—this is an Amazon feature that lets you mix spot and on-demand instances, and use up to five instance types in the same fleet.
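The strategies above map directly onto the EMR instance fleets API. Below is a hedged sketch of an `InstanceFleets` configuration as it might be passed to boto3's `run_job_flow`; the field names follow the EMR API, while the instance types, weights, and capacities are example values.

```python
# Illustrative InstanceFleets configuration for
# emr.run_job_flow(..., Instances={"InstanceFleets": instance_fleets}).
# Instance types and capacities are examples, not recommendations.
instance_fleets = [
    {
        # Keep the master node on on-demand capacity for cluster resilience.
        "InstanceFleetType": "MASTER",
        "TargetOnDemandCapacity": 1,
        "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
    },
    {
        # Core fleet: mix on-demand and spot, and diversify instance
        # types (up to five per fleet) to ride out spot demand shifts.
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": 2,
        "TargetSpotCapacity": 6,
        "InstanceTypeConfigs": [
            {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
            {"InstanceType": "m5a.xlarge", "WeightedCapacity": 1},
            {"InstanceType": "r5.xlarge", "WeightedCapacity": 1},
        ],
        # Fallback mechanism: switch to on-demand if spot capacity
        # cannot be provisioned within the timeout.
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": 10,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    },
]

# Stays within the five-instance-type limit per fleet:
assert len(instance_fleets[1]["InstanceTypeConfigs"]) <= 5
```

The `SWITCH_TO_ON_DEMAND` timeout action implements the fallback mechanism described above without any manual intervention.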
EMR Reserved Instances
EMR itself does not offer reserved instance pricing, but if you need to run EMR workloads for long periods of time, you can use EC2 reserved instances.
This means you have the same pricing options as EC2 reserved instances. The main difference is that, in addition to the reserved instance price, you will have to pay an additional charge for EMR, associated with the EC2 instance type you choose.
You can commit to EC2 reserved instances for a 1-year or 3-year period. The following payment options are available:
- All upfront—one-time payment for the entire reservation period, which provides the biggest discount.
- Partial upfront—part of the total paid in advance and the rest paid monthly during the reservation period.
- No upfront payment—a monthly fee with a commitment to continue using the instance for 1 or 3 years. This provides the smallest discount, but the difference can be as small as 3-5%.
EMR clusters typically perform heavy computing tasks, so they tend to require powerful EC2 instances and multiple compute nodes. Upfront payment options can therefore turn into a large investment.
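The trade-off can be sketched numerically. The reserved price below is an illustrative assumption, not a real quote; note that the EMR surcharge is billed the same way regardless of which EC2 pricing model you choose.

```python
# Hedged comparison of on-demand vs. 1-year all-upfront reserved
# pricing for an EMR cluster. All rates are illustrative assumptions.
ON_DEMAND_RATE = 3.20        # assumed EC2 on-demand rate per hour
RESERVED_1Y_UPFRONT = 18000  # assumed all-upfront price per instance
EMR_RATE = 0.27              # EMR surcharge per hour (applies in both cases)
HOURS_PER_YEAR = 8760

nodes = 10
on_demand_total = nodes * (ON_DEMAND_RATE + EMR_RATE) * HOURS_PER_YEAR
reserved_total = nodes * (RESERVED_1Y_UPFRONT + EMR_RATE * HOURS_PER_YEAR)

print(round(on_demand_total), round(reserved_total))
```

At these assumed rates, the reservation saves roughly a third over a year of continuous use, but it only pays off if the cluster actually runs most of the time.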
EMR Cluster Sharing
Instead of starting a separate cluster for each EMR task, it is more efficient to create one system that shares the cluster among several smaller tasks.
Although EMR is billed per second with a one-minute minimum, every new cluster spends several minutes provisioning and bootstrapping before it can run jobs, and you pay for that startup time on every cluster. If you have several short jobs taking a few minutes to run, perform all of them on the same cluster rather than paying the startup overhead again for each one.
Another advantage of sharing clusters is turnaround time: by reusing an existing cluster you skip bootstrapping entirely, and can use that time to run additional jobs.
EMR Auto Scaling
Auto scaling is very useful for managing EMR clusters that run continuously over long periods of time. It can help you automatically adjust the cluster size to the jobs you are currently running. Scaling actions take effect at a resolution of around 5 minutes, which is roughly the time required to set up a new EMR node.
Amazon EMR can programmatically scale up applications such as Apache Spark and Apache Hive, adding nodes to improve performance. Clusters can be scaled based on Amazon EMR CloudWatch metrics, including YARN utilization metrics.
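One way to enable this is EMR managed scaling. Below is a hedged sketch of the policy structure as it might be passed to boto3's `emr.put_managed_scaling_policy`; the field names follow the EMR API, while the capacity limits are example values.

```python
# Illustrative managed scaling policy for
# emr.put_managed_scaling_policy(ClusterId=..., ManagedScalingPolicy=policy).
# Capacity limits are example values.
managed_scaling_policy = {
    "ComputeLimits": {
        "UnitType": "Instances",            # scale by instance count
        "MinimumCapacityUnits": 2,          # never shrink below 2 nodes
        "MaximumCapacityUnits": 10,         # cap the cluster at 10 nodes
        "MaximumOnDemandCapacityUnits": 4,  # the rest can be spot capacity
    }
}
```

Capping the on-demand portion of the limit pairs naturally with the spot instance strategies described earlier: the cluster scales out on cheap spot capacity but keeps a guaranteed on-demand baseline.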
Amazon EMR Pricing Optimization with Spot by NetApp
The cost savings best practices above are not easy to implement. For example, to improve the resiliency of EMR workloads running on spot instances, you will need to use different instance types. However, doing so requires configuring and managing multiple auto scaling groups.
It can take a major effort, and specialized technical expertise, to set up proper auto scaling that automatically provisions instances with the right configurations, minimal standup time and no human intervention.
Spot by NetApp can help AWS EMR users take advantage of these cost savings strategies automatically:
- Intelligently provision an optimal mix of spot, on-demand and reserved instances to keep clusters running at optimal performance
- Monitor and predict spot instance behavior, capacity, pricing and interruption rates to proactively replace at-risk spot instances
- Predictive auto scaling simplifies the process of defining scaling policies, and automatically scales to ensure workloads have the right capacity
- Manage different types of workloads on the same cluster and across AZs; use mixed instance types and sizes in the same node group