Guest authors Marcin Cuber, Senior Cloud DevOps Engineer, and Kerry Kamil, Head of Digital Infrastructure, at News UK explain why they turned to managed Kubernetes from Amazon Web Services and infrastructure monitoring from New Relic for faster time to market. Marcin was the architect of News UK’s Amazon EKS environment.
With the goal of educating, entertaining, informing, and inspiring its readers, News UK’s newspapers and digital products include some of the most powerful media brands in the English-speaking world, reaching over 31 million people each week. Our brands include national newspapers—The Sun, The Times, The Sunday Times, and The TLS—as well as Unruly, a global ad tech company, and Wireless, a leading UK and Irish media company boasting independent local and national radio stations such as talkSPORT, talkRADIO, and Virgin Radio.
The ultra-competitive media landscape we’re working in changes extremely rapidly. That means that technology organizations like ours at News UK must be able to adapt and innovate at high speed. That’s why we’ve embraced DevOps and cloud native technologies to help us roll out new features, platforms, and products continuously.
Speed isn’t our only challenge; we also need our applications and environment to be instantly scalable and reliable. If there’s a breaking news event, we have to make sure that every reader can get updates on that event in real time, with a flawless experience. At the same time, we must find ways to deliver a great experience as cost effectively as possible.
We’ve been using Kubernetes to orchestrate our container environment on Amazon Web Services (AWS) for two years now. Why did we choose containers and Kubernetes to manage them? Because they help us meet all of our challenges by allowing us to:
- Scale easily: Whether we’re introducing a new product, delivering breaking news, growing our readership, or some other growth or peak usage event, we can scale our environment on demand.
- Deploy faster: Containers and Kubernetes help us accelerate the application pipeline and get new capabilities to market faster.
- Take advantage of open source technologies: Using open source helps us attract modern engineers that care about cloud native technologies. We also benefit from the vast ecosystem of other open source tools that work with Kubernetes.
- Segment applications into microservices: A microservices architecture significantly increases the overall agility and maintainability of our applications, and Kubernetes allows us to package microservices as containers for reproducibility, transparency, and resource isolation.
Moving to Kubernetes-as-a-Service
While our custom Kubernetes clusters have been a boon to our DevOps teams, deploying, repairing, and managing them was taking increasingly more time and effort. That’s why we decided to move to Amazon Elastic Kubernetes Service (EKS). Amazon EKS provides us with a stable, managed Kubernetes control plane which includes managed masters and etcd layers. Operationally, this greatly simplifies what we need to maintain and monitor.
Amazon EKS runs in its own virtual private cloud (VPC), allowing us to define our own security groups and network access control lists (ACLs). This provides us with a high level of isolation and helps us build highly secure and reliable applications.
As we migrate teams from our custom Kubernetes environment to our EKS environment, we’re currently running on four Amazon Elastic Compute Cloud (EC2) M5.4xlarge instances, which gives us 234 pods that we can run inside of each node. While we currently have a maximum of 10 nodes, they are large enough for 50 to 100 applications, and can autoscale easily. (For those interested in seeing the entire EKS configuration written in Terraform and CloudFormation, you can find it at AWS EKS – kubernetes project.)
Using templates to support fast iteration
We fully templated the deployment of the infrastructure, including custom VPC configuration, EKS control plane, worker nodes, and bastion host using Terraform and CloudFormation. For example, we automatically update worker nodes when a new version of the AMI is available.
We used CloudFormation to automate the process and give us the option to perform rolling updates on the worker nodes. We then use a lifecycle hook to mark each instance that is going to be terminated. At that point, we have an AWS Lambda function that listens to these events and gracefully drains nodes.
(For template examples, check out posts at Marcin’s GitHub account.)
Controlling costs and capacity
We’re using Amazon EC2 Spot instances in our EKS cluster to keep costs down. Spot instances allow you to purchase unused spare compute capacity. This provides us with a substantial cost benefit compared to on-demand pricing, saving us roughly 70%.
However, to use Spot instances, you must make sure that your cluster can scale quickly and that workloads are always fully functioning. Because a Spot instance can be reclaimed by AWS, you need to gracefully drain nodes before they get terminated during interruptions. We solved this by implementing custom interruption handling, which waits for a termination signal and then drains the node.
To manage capacity, we define a resource quota for all namespaces by specifying the CPU and memory that can be consumed. When several users or teams share a cluster with a fixed number of nodes, there is a concern that one team could use more than its fair share of resources. Resource quotas are there to address this concern. Within a namespace, we use resource limits at the pod level to guarantee that one pod or container does not consume all resources from the namespace.
Gaining clear visibility within the cluster
To fully understand what’s going on in our environment, we use New Relic Kubernetes cluster explorer, which is part of New Relic Infrastructure. It gives us visibility into our EKS infrastructure and monitors the health of our AWS environment.
We also use New Relic APM, which is fully integrated with the Kubernetes cluster explorer, allowing our DevOps teams to understand whether an underlying infrastructure issue is impacting performance of their application. In the past, we’ve had applications max out the I/O of a particular host, while the memory and CPU were unaffected. New Relic gives our infrastructure and application teams the ability to dive through the entire stack, diagnose the issue, and then resolve it as rapidly as possible.
We use New Relic Alerts for notifications when something is not working as expected, which is applicable to both infrastructure and pods running inside the cluster. We also send Kubernetes logs and events to New Relic, which allows us to have all the information about our cluster in a centralized place. EKS control plane logs, which are normally exported to CloudWatch, are also sent to New Relic.As we move our custom Kubernetes environments to EKS, the DevOps teams will find it much easier and faster to deploy software. With New Relic, they’ll have full visibility into the health of the infrastructure and how their microservices and applications are performing. Once all the applications are running in EKS, we’ll decommission the existing Kubernetes clusters, which will further reduce our infrastructure costs.
For more explanation of some of the topics mentioned here, read these articles from Marcin: