Big data applications require distributed systems to process, store and analyze the massive amounts of information that companies are collecting. Apache Spark has become a go-to framework for this, powering use cases from AI and machine learning to data analysis, by providing a unified interface for distributing data processing tasks across a cluster of machines. Spark typically relies on a separate cluster manager to run the cluster, with YARN and Mesos as two well-known options. Recently, Kubernetes has become an increasingly common way to manage big data clusters in the cloud, affording users the agility, speed and scalability of microservices.
Kubernetes is well suited to big data workloads for many reasons, and users will benefit from deploying their applications with it. It also introduces complexities, however, that present real challenges to building a fully optimized big data platform.
Read on to learn more about some of the challenges you’ll face when building a cloud-native Spark application with Kubernetes, and how to overcome them.
Time and expertise
To run Spark on Kubernetes with reliability and high performance, especially in production, it is critical to set up and configure big data infrastructure that can handle the scale of the system. This, however, is not easy or straightforward, and it requires significant time and expertise. From creating the K8s cluster and its nodes to setting up autoscaling, Kubernetes is full of complexities that can make just getting started a challenge for novices. Data scientists don’t want anything to do with managing infrastructure, and since Kubernetes is relatively nascent in the space, data operations teams are left with a steep learning curve. DevOps teams, on the other hand, do have experience with Kubernetes but still struggle with its complexities and have limited knowledge of big data and machine learning tools.
For large-scale big data applications, choosing a DIY Kubernetes deployment instead of a managed service like EKS or GKE can offer users more flexibility and control over clusters, but it is difficult to stand up and implement, especially for a team with little Kubernetes experience. A broad and complex set of components, plugins and services needs to be configured, taking into account the specific requirements of the application that will affect the installation. For Spark applications, users will not only need to set up Kubernetes and figure out how they will monitor and troubleshoot their application, but also decide how they will deploy a Spark operator, connect Jupyter notebooks (or another IDE), and build and manage workflows.
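To give a sense of what the Spark-specific part of that setup involves, here is a minimal sketch that assembles a spark-submit command line targeting a Kubernetes cluster. The helper name build_spark_submit, the API server address, the image name and the application path are all hypothetical placeholders; only the spark-submit flags and the configuration keys shown are standard Spark options. A real deployment would usually need more settings (service accounts, resource requests, storage) than appear here.

```python
from typing import Dict, List, Optional


def build_spark_submit(
    api_server: str,
    image: str,
    app_resource: str,
    executors: int = 2,
    namespace: str = "default",
    extra_conf: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Return the argv list for a cluster-mode spark-submit on Kubernetes.

    Hypothetical helper for illustration; it only builds the command and
    does not run it.
    """
    conf = {
        "spark.kubernetes.container.image": image,
        "spark.kubernetes.namespace": namespace,
        "spark.executor.instances": str(executors),
    }
    conf.update(extra_conf or {})

    # The k8s:// prefix tells Spark to use Kubernetes as the cluster manager.
    cmd = ["spark-submit", "--master", f"k8s://{api_server}", "--deploy-mode", "cluster"]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_resource)
    return cmd


if __name__ == "__main__":
    # Placeholder values; substitute your own cluster address, image and job.
    print(" ".join(build_spark_submit(
        "https://k8s.example.invalid:6443",
        "example/spark:latest",
        "local:///opt/app/job.py",
        executors=4,
    )))
```

Keeping the command as a list rather than a single string makes it straightforward to hand to a process runner or a workflow tool without worrying about shell quoting.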
Maintainers of self-managed Kubernetes deployments also need an understanding of cloud pricing models for compute and need to be able to choose from a variety of on-demand, reserved and spot instances in order to maximize data processing while controlling costs. The nuances of these pricing models, however, make purchasing decisions difficult and can often have a direct impact on application performance.
Dynamic infrastructure management
The resource demands of big data systems require flexible infrastructure to match the scalability of the containers running on it. Kubernetes will scale pods as application requests come in, but it doesn’t scale the underlying infrastructure. As long as there is a healthy node with enough capacity, Kubernetes will schedule the pod onto it. This often results in over-provisioned pods running on more capacity than they need, which can leave an imbalance between limited cluster resources and the ever-increasing compute demands of big data. Optimizing this in a DIY way can become a never-ending trial-and-error task.
The architecture of Spark running on Kubernetes also often requires users to implement automated processes to ensure the reliability of the system. When a job is submitted, a Spark driver pod is created, which launches executor pods to carry out tasks. If an executor pod fails, the driver pod is responsible for replacing it. However, if the driver pod fails, it can bring down your application, and users will have to troubleshoot the failure and restart pods. To avoid this, it is best practice to build in automation (e.g. a Lambda function or Airflow) to handle failures, which requires heavy lifting from data engineers and scientists to learn, deploy, manage, and scale workflow tooling on top of this architecture.
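The gap between pod scheduling and infrastructure scaling can be illustrated with a toy model. The sketch below places pods on a fixed set of nodes using a simple first-fit rule and reports which pods stay pending. It is a deliberate simplification: the real Kubernetes scheduler filters and scores nodes rather than taking the first fit, and the Node class, the schedule function and all the numbers are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Node:
    name: str
    cpu: float          # allocatable CPU cores
    mem_gib: float      # allocatable memory in GiB
    used_cpu: float = 0.0
    used_mem_gib: float = 0.0

    def fits(self, cpu: float, mem_gib: float) -> bool:
        return (self.used_cpu + cpu <= self.cpu
                and self.used_mem_gib + mem_gib <= self.mem_gib)


def schedule(
    pods: List[Tuple[str, float, float]], nodes: List[Node]
) -> Tuple[Dict[str, str], List[str]]:
    """First-fit placement of (name, cpu, mem_gib) pods on a fixed node set.

    Returns the placements and the pods left pending. No node is ever added,
    which mirrors the point that scheduling alone does not grow the cluster.
    """
    placed: Dict[str, str] = {}
    pending: List[str] = []
    for name, cpu, mem_gib in pods:
        target = next((n for n in nodes if n.fits(cpu, mem_gib)), None)
        if target is None:
            pending.append(name)
            continue
        target.used_cpu += cpu
        target.used_mem_gib += mem_gib
        placed[name] = target.name
    return placed, pending


if __name__ == "__main__":
    nodes = [Node("node-1", cpu=4, mem_gib=16), Node("node-2", cpu=4, mem_gib=16)]
    executors = [("exec-1", 3, 8), ("exec-2", 3, 8), ("exec-3", 3, 8)]
    print(schedule(executors, nodes))
```

In this made-up case the third executor stays pending even though two CPU cores sit idle across the cluster, because the free capacity is fragmented over two nodes.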
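As a rough sketch of the kind of failure-handling automation described above, the function below resubmits a job with exponential backoff until it succeeds or a retry limit is reached. The name run_with_retries and the convention that the submit callable returns a process-style exit code are assumptions for illustration; in practice this logic usually lives in a workflow tool rather than hand-written code.

```python
import time
from typing import Callable


def run_with_retries(
    submit: Callable[[], int],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call submit() until it returns exit code 0; return the attempt number.

    submit is a hypothetical callable that launches the Spark job and returns
    its exit code. The delay doubles after each failed attempt.
    """
    for attempt in range(1, max_attempts + 1):
        if submit() == 0:
            return attempt
        if attempt < max_attempts:
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise RuntimeError(f"job failed after {max_attempts} attempts")


if __name__ == "__main__":
    # Simulated job that fails twice (e.g. the driver pod died), then succeeds.
    outcomes = iter([1, 1, 0])
    attempts = run_with_retries(lambda: next(outcomes), sleep=lambda s: None)
    print(f"succeeded on attempt {attempts}")
```

Passing the sleep function in as a parameter keeps the retry policy easy to test without waiting for real delays.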
Cost control with spot instances
Processing large amounts of data is often resource intensive, and managing the demands and requirements of a big data system can be costly. While cloud providers offer spot instances as a cost-efficient way to consume cloud capacity, unlocking these savings is not straightforward.
By their nature, spot instances are interruptible. If a Spark job is disrupted by a spot instance termination, the tasks executed on the instance can be lost, and application owners may need to restart the job from the beginning. This adds to already complex cloud environments, and users must solve several challenges to ensure the reliability of Spark applications running on spot instances.
Application-aware selection of spot markets in real time
Having the flexibility to run workloads on a mix of spot, on-demand and reserved instances increases application reliability, performance and cost efficiency. Determining which model and market to use depends on the application itself: is it fault tolerant? Does it experience frequent traffic spikes?
For applications that run on spot instances, it’s important to pick the markets that are least likely to be interrupted. It is paramount to understand the real-time availability of each instance market (instance type, size and availability zone) and ensure that workloads run on the most stable markets. However, choosing instances based on real-time application requirements, not on pre-configured targets, is an operational burden that involves actively monitoring markets and a detailed understanding of specific application infrastructure needs.
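The selection logic described above can be sketched as a filter followed by a sort: keep only the markets that satisfy the application's resource needs and an interruption threshold, then prefer the most stable and, among equals, the cheapest. Everything here is hypothetical, including the SpotMarket fields, the rank_markets function, the market names and the figures; real interruption and price data would have to come from your cloud provider and be refreshed continuously.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SpotMarket:
    instance_type: str
    zone: str
    vcpus: int
    mem_gib: int
    hourly_price: float        # illustrative figure, not real pricing
    interruption_rate: float   # illustrative fraction interrupted per period


def rank_markets(
    markets: List[SpotMarket],
    min_vcpus: int,
    min_mem_gib: int,
    max_interruption_rate: float = 0.10,
) -> List[SpotMarket]:
    """Return eligible markets, most stable first, cheaper first on ties."""
    eligible = [
        m for m in markets
        if m.vcpus >= min_vcpus
        and m.mem_gib >= min_mem_gib
        and m.interruption_rate <= max_interruption_rate
    ]
    return sorted(eligible, key=lambda m: (m.interruption_rate, m.hourly_price))


if __name__ == "__main__":
    markets = [
        SpotMarket("type-a", "zone-1", 8, 32, 0.12, 0.05),
        SpotMarket("type-a", "zone-2", 8, 32, 0.10, 0.20),   # too unstable
        SpotMarket("type-b", "zone-1", 4, 16, 0.05, 0.02),   # too small
        SpotMarket("type-c", "zone-3", 16, 64, 0.20, 0.05),
    ]
    for market in rank_markets(markets, min_vcpus=8, min_mem_gib=32):
        print(market.instance_type, market.zone)
```

The hard part in practice is not the ranking itself but keeping the inputs current, which is why the paragraph above describes this as an ongoing operational burden.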
Application resiliency on an interruption-prone infrastructure
With the risk of interruption, applications need to be fault tolerant. There are best practices for architecting reliable applications; for example, a microservices approach helps prevent a single failure from impacting the entire application. To reliably run workloads on spot instances, it is recommended to spread instances across availability zones and use multiple instance types. This strategy, however, adds operational tasks, including the need to manage autoscaling groups (ASGs) for each availability zone. It becomes more complex still if you want to use instances of different sizes, as Kubernetes node autoscaling generally expects every instance in a single ASG to offer the same CPU and RAM.
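To make the diversification idea concrete, the sketch below distributes a desired instance count round-robin over every combination of availability zone and instance type, so that no single pool holds most of the capacity. The spread function and the zone and type names are hypothetical, and a real allocation would also weigh price and current availability.

```python
from collections import Counter
from itertools import cycle
from typing import List, Tuple


def spread(
    count: int, zones: List[str], instance_types: List[str]
) -> "Counter[Tuple[str, str]]":
    """Allocate count instances round-robin over (zone, instance_type) pools.

    Pools are ordered so that consecutive instances land in different zones
    before an instance type is repeated.
    """
    if not zones or not instance_types:
        raise ValueError("at least one zone and one instance type are required")
    pools = cycle([(zone, itype) for itype in instance_types for zone in zones])
    return Counter(next(pools) for _ in range(count))


if __name__ == "__main__":
    allocation = spread(5, ["zone-1", "zone-2"], ["type-a", "type-b"])
    for (zone, itype), n in sorted(allocation.items()):
        print(zone, itype, n)
```

With this layout, losing one zone or one instance type's spot capacity removes only a fraction of the executors rather than all of them.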
Ensuring instantaneous Spark batch, streaming and ML application processing
When a big data project expands, a common bottleneck is insufficient resources (CPU/RAM/GPU), resulting in low performance or even disruption of service. While operating in the cloud enables users to spin up more resources when they need to scale, there is no guarantee of how quickly newly provisioned infrastructure will be ready and healthy for the application to run on. This results in pod scheduling delays and an increase in job processing time. Users can maintain extra capacity, or headroom, to prepare infrastructure for potential scaling so applications don’t have to wait for new capacity. Determining the appropriate amount of headroom while limiting wasted resources can often be a game of trial and error, requiring users to have a good understanding of resource utilization, historical scaling trends and other factors.
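One simple way to turn historical scaling trends into a headroom figure is to size the spare capacity for a chosen percentile of past scale-up bursts. The sketch below does that using a nearest-rank percentile. The function name headroom_nodes, the choice of the 90th percentile and the sample history are assumptions for illustration, not a recommended policy; a higher percentile reduces scheduling delays at the price of more idle capacity.

```python
import math
from typing import List


def headroom_nodes(
    burst_cpu_history: List[float], node_cpu: float, percentile: int = 90
) -> int:
    """Number of spare nodes needed to absorb a typical scale-up burst.

    burst_cpu_history holds the extra CPU cores requested in past bursts.
    Uses the nearest-rank percentile of that history, rounded up to whole
    nodes of node_cpu cores each.
    """
    if not burst_cpu_history:
        return 0
    ordered = sorted(burst_cpu_history)
    rank = math.ceil(percentile * len(ordered) / 100)
    burst = ordered[max(0, min(len(ordered), rank) - 1)]
    return math.ceil(burst / node_cpu)


if __name__ == "__main__":
    history = [4, 8, 6, 20, 10, 12, 7, 9, 5, 11]  # made-up burst sizes in cores
    print(headroom_nodes(history, node_cpu=8))
    print(headroom_nodes(history, node_cpu=8, percentile=100))
```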
Avoiding the pitfalls
There’s no doubt that using Kubernetes to run Spark applications has significant benefits. However, even with the increasing adoption of Kubernetes for big data, many of these challenges have yet to be solved. In upcoming posts, we’ll tackle some of these issues, including how to run your Spark application on spot instances with resiliency and efficiency.