Amazon Elastic MapReduce (EMR) is an Amazon Web Services (AWS) service for big data processing and analysis. It is based on Apache Hadoop, a programming framework for processing large data sets across distributed computing environments.
Amazon EMR processes big data across a Hadoop cluster of virtual servers running on Amazon Elastic Compute Cloud (EC2), with data typically stored in Amazon Simple Storage Service (S3). EMR comes with dynamic resizing capabilities, which enable the system to increase or reduce resource usage according to current demand.
There are several ways to run Amazon EMR, each with its own pricing. EMR can run directly on Amazon EC2 or on Amazon Elastic Kubernetes Service (EKS), with the actual instances running on EC2 or Fargate. EMR is priced per second of usage, on top of the regular costs for EC2 compute instances, Fargate vCPUs, and other services needed to run EMR jobs, such as storage.
This is part of our series of articles on AWS pricing.
There are three models for running EMR: on Amazon EC2, on AWS Outposts, which lets you run AWS resources on-premises, and on Amazon Elastic Kubernetes Service (EKS). For up-to-date pricing information, refer to the official pricing page.
Amazon EMR is priced at a per-second rate, billed in addition to the regular prices of the underlying services. When deploying EMR on Amazon EC2, you pay for your chosen EC2 instance as well as for EMR processing.
For example, for an m4.16xlarge instance, the cost in the US East region is $3.20 per hour for the EC2 instance, and an additional $0.27 per hour for EMR (there is a corresponding EMR cost for all instance types).
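To make the arithmetic concrete, here is a minimal sketch that combines the two hourly rates quoted above under per-second billing. The function name and the 10-node, 30-minute scenario are illustrative assumptions, and the estimate ignores EBS storage, data transfer, and any minimum billing increment.

```python
# Rates quoted in the text for an m4.16xlarge in US East.
# Always check the official pricing page for current values.
EC2_RATE_PER_HOUR = 3.20
EMR_RATE_PER_HOUR = 0.27


def emr_on_ec2_cost(instance_count: int, seconds: int,
                    ec2_rate: float = EC2_RATE_PER_HOUR,
                    emr_rate: float = EMR_RATE_PER_HOUR) -> float:
    """Estimate the cost in USD of a cluster billed per second.

    The EMR charge is added on top of the EC2 charge for every instance.
    """
    hours = seconds / 3600
    return instance_count * hours * (ec2_rate + emr_rate)


# Hypothetical scenario: 10 nodes running for 30 minutes.
cost = emr_on_ec2_cost(instance_count=10, seconds=30 * 60)
print(f"${cost:.2f}")  # $17.35
```

Because the EMR charge is tied to the instance type, the same function can be reused for other instance types by passing their rates explicitly.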
You can choose from any of the regular EC2 instance pricing models, including reserved instances and spot instances. If you have Amazon EBS volumes attached to your EC2 instances, you are also charged for EBS storage.
AWS Outposts is a managed appliance that allows you to run AWS cloud services in your local data center. You can purchase a variety of AWS Outposts configurations featuring a combination of EC2 instance types, EBS gp2 volumes, and S3 on Outposts. Pricing includes delivery of the appliance, installation, maintenance, and software updates.
Once you deploy an EC2 instance on AWS Outposts, the extra charge for running EMR on the instance is the same as in the Amazon cloud.
You can run EMR on Amazon Elastic Kubernetes Service (EKS) in two deployment models: with the underlying compute running on Amazon EC2 instances, or on AWS Fargate.
Here are a few tips and tricks you can use to save on Amazon EMR costs.
EMR workloads are often a good fit for EC2 spot instances instead of on-demand instances. Spot instances let you use spare Amazon EC2 capacity at a discount. The price you pay depends on the current supply and demand on the Amazon spot market.
The cost of using spot instances can be up to 90% lower than equivalent on-demand instances. However, you need to manage spot instances carefully. They can be interrupted with only a two-minute warning when EC2 needs the capacity back, for example when the same type of instance is requested by on-demand, reserved instance, or savings plans customers.
To improve resilience for EMR workloads, use the following strategies:
EMR itself does not offer reserved instance pricing, but if you need to run EMR workloads for long periods of time, you can use EC2 reserved instances.
This means you have the same pricing options as EC2 reserved instances. The main difference is that, on top of the reserved instance price, you pay the EMR charge associated with the EC2 instance type you choose.
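Because the EMR charge stays the same while only the EC2 portion is discounted, the effective saving on the combined bill is smaller than the headline EC2 discount. The sketch below illustrates this with the m4.16xlarge rates quoted earlier; the 40% discount is a hypothetical figure chosen for illustration, not an actual reserved instance price.

```python
def effective_hourly_rate(ec2_on_demand: float, emr_fee: float,
                          ec2_discount: float) -> float:
    """Hourly rate when only the EC2 portion of the bill is discounted."""
    if not 0.0 <= ec2_discount <= 1.0:
        raise ValueError("ec2_discount must be between 0 and 1")
    return ec2_on_demand * (1.0 - ec2_discount) + emr_fee


def effective_savings(ec2_on_demand: float, emr_fee: float,
                      ec2_discount: float) -> float:
    """Fraction saved on the combined EC2 + EMR hourly bill."""
    full_rate = ec2_on_demand + emr_fee
    discounted = effective_hourly_rate(ec2_on_demand, emr_fee, ec2_discount)
    return 1.0 - discounted / full_rate


# Hypothetical 40% EC2 discount applied to the rates from the earlier example.
rate = effective_hourly_rate(3.20, 0.27, ec2_discount=0.40)
saving = effective_savings(3.20, 0.27, ec2_discount=0.40)
print(f"effective rate: ${rate:.2f}/hour, overall saving: {saving:.1%}")
```

The larger the EMR charge is relative to the EC2 price for a given instance type, the more it dilutes the discount.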
You can commit to EC2 reserved instances for a 1-year or 3-year period. Three payment options are available: all upfront, partial upfront, and no upfront.
EMR clusters are typically used to perform heavy computing tasks, so they tend to require powerful EC2 instances and multiple compute nodes. As a result, the upfront payment options can turn into a large investment.
Instead of starting a separate cluster for each EMR task, it is more efficient to create one system that shares the cluster among several smaller tasks.
Remember that EMR is billed per second, with a one-minute minimum, for as long as a cluster is running. If you have several short jobs that take a few minutes to run, perform all of them on the same cluster so that the cluster stays busy. If you run each job on a separate cluster, each cluster adds its own overhead on top of the actual execution time.
Another advantage of sharing clusters is that you avoid the time it takes to bootstrap a new EMR cluster. Reusing an existing cluster saves that bootstrapping time, which you can use to run additional jobs.
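As a rough illustration of the bootstrapping overhead, the sketch below compares total cluster time for a shared cluster against one cluster per job. The 8-minute bootstrap time and the 5-minute jobs are hypothetical numbers, and the shared case assumes the jobs run one after another on the same cluster.

```python
def cluster_minutes(job_minutes: list[float], bootstrap_minutes: float,
                    shared: bool) -> float:
    """Total cluster time in minutes, including startup overhead.

    shared=True  -> one cluster is bootstrapped once and runs every job.
    shared=False -> every job bootstraps its own cluster.
    """
    if not job_minutes:
        return 0
    startups = 1 if shared else len(job_minutes)
    return startups * bootstrap_minutes + sum(job_minutes)


# Hypothetical workload: four 5-minute jobs, 8 minutes to bootstrap a cluster.
jobs = [5, 5, 5, 5]
print(cluster_minutes(jobs, bootstrap_minutes=8, shared=True))   # 28
print(cluster_minutes(jobs, bootstrap_minutes=8, shared=False))  # 52
```

In this made-up scenario, the separate clusters spend more time bootstrapping (32 minutes) than running jobs (20 minutes).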
AWS Auto Scaling is very useful for managing EMR clusters that run continuously over long periods of time. It can help you automatically adjust the cluster size to the jobs you are currently running. You can auto-scale at a resolution of 5 minutes, which is roughly the time required to set up an EMR node.
Amazon EMR can programmatically scale up applications such as Apache Spark and Apache Hive, adding nodes to improve performance. Clusters can be scaled based on Amazon EMR CloudWatch metrics, including YARN utilization metrics.
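The logic behind metric-based scaling can be sketched in a few lines. The example below is not the EMR API: it is a plain Python illustration of a threshold rule driven by a YARN utilization metric (here, the percentage of YARN memory still available). The `ScalingRule` class, its thresholds, step size, and node limits are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingRule:
    """Hypothetical thresholds for a simple scale-out / scale-in rule."""
    scale_out_below: float = 15.0  # add nodes when free YARN memory % drops below this
    scale_in_above: float = 75.0   # remove nodes when free YARN memory % rises above this
    step: int = 2                  # nodes added or removed per adjustment
    min_nodes: int = 2
    max_nodes: int = 20


def desired_nodes(current_nodes: int, yarn_memory_available_pct: float,
                  rule: ScalingRule = ScalingRule()) -> int:
    """Return the target node count for the observed metric value."""
    if yarn_memory_available_pct < rule.scale_out_below:
        target = current_nodes + rule.step   # cluster is under memory pressure
    elif yarn_memory_available_pct > rule.scale_in_above:
        target = current_nodes - rule.step   # cluster is mostly idle
    else:
        target = current_nodes               # within the comfortable band
    # Never leave the configured capacity limits.
    return max(rule.min_nodes, min(rule.max_nodes, target))


print(desired_nodes(4, yarn_memory_available_pct=10.0))  # 6
print(desired_nodes(4, yarn_memory_available_pct=90.0))  # 2
```

Keeping a gap between the two thresholds avoids the cluster flapping between scale-out and scale-in when the metric hovers around a single value.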
The cost savings best practices above are not easy to implement. For example, to improve the resiliency of EMR workloads running on spot instances, you will need to use different instance types. However, doing so requires configuring and managing multiple auto scaling groups.
It can take a major effort, and specialized technical expertise, to set up proper auto scaling that automatically provisions instances with the right configurations, minimal standup time and no human intervention.
Spot by NetApp can help AWS EMR users take advantage of these cost savings strategies automatically: