Powering Progressive Deployment in Kubernetes with NGINX and Argo Rollouts

Recently, the Kubernetes community celebrated the platform's 10th year in existence. Inarguably, it has come a long way since its early days at Google and has established itself as the de facto platform for delivering modern, cloud-native applications. However, will it solve all your delivery challenges out of the box? In this article, we will leverage ecosystem tooling to improve the quality and frequency of your application deployments in Kubernetes.

In your application modernization journey, you may have discovered that some inherent complexities remain unsolved. For instance, you may find it relatively easy to set up your first production cluster, deploy your applications to it, and start accepting live traffic. You may then wonder:

  • How is your application performing?
  • What are users experiencing when they consume your applications and APIs?
  • When you inevitably need to update your application, how do you do so while minimizing disruption?

If you can identify with at least one of the above concerns, you are not alone. By the end of this article, you will understand how to:

  • Enhance traffic management: make dynamic routing adjustments during deployments.
  • Increase availability and reliability: achieve high availability through intelligent traffic handling and failover mechanisms.
  • Improve the user experience: leverage progressive delivery patterns to minimize downtime.
  • Roll back efficiently: easily roll back changes in the event of issues, ensuring quick recovery and minimal impact on users.

To achieve the above, we need to understand and employ proven patterns that assist us in improving our delivery pipelines.

 

Deployment Patterns

The practice of software development has benefited from a wealth of "design patterns": repeatable blueprints for successfully and sustainably building applications. Patterns also exist for the deployment and operation of these applications. You may have heard of deployment patterns such as "Blue-green", "Canary deployments", and "A/B testing", which help route and shape application traffic over time. While these deployment patterns seem promising, until the last several years you were left to your own devices (no pun intended) to implement them. However, proxies such as NGINX and Envoy, and service meshes such as Istio, can assist in executing these patterns as application rollout strategies. While these solutions fit neatly into a Kubernetes environment and can route and process traffic to the applications residing in the cluster, many of them have their own specific behaviors and configuration differences.

Fortunately, the Kubernetes community responded to this shared need to unify these disparate implementations. Specifically, the SIG Network community released a common specification for modern application delivery services in Kubernetes: the Gateway API. Gateway API is an official Kubernetes project focused on L4 and L7 routing in Kubernetes, and it represents the next generation of Kubernetes Ingress, load balancing, and service mesh APIs. From the outset, it has been designed to be generic, expressive, and role-oriented.

The overall resource model focuses on three separate personas and the resources each is expected to manage: infrastructure providers define GatewayClasses, cluster operators deploy Gateways, and application developers create Routes (such as HTTPRoute) to expose their applications:

Kubernetes Gateway API Resources (credit: k8s.io)

We will explore using some of these resources later in this article.

This new specification has been supported by a number of projects and vendors, most notably the F5 NGINX Gateway Fabric.

At KubeCon 2023, F5 NGINX unveiled version 1.0 of NGINX Gateway Fabric, which supports traffic-splitting patterns such as Canary and Blue-green out of the box.

NGINX Gateway Fabric high-level architecture

 

While the capability to execute these application rollouts in a vendor-neutral way is a win for the community, some challenges remain:

  • How can we be assured that our applications are working after deploying new versions?
  • What if we want to orchestrate the progressive introduction of application changes rather than a "big-bang" deployment?
  • If our deployment is not successful, how will we know whether we need to roll back?

We are getting there. Read on...

Progressive Delivery

Progressive delivery has emerged as a preferred approach for modern applications for a number of reasons. It can be a means to orchestrate a gradual feature rollout for an application. It can help reduce risk by deploying changes only to a subset or test group of users first. You can also use progressive delivery to gather early feedback on a new application deployment and other use cases. This is not meant to be an exhaustive list of use cases, but only an introduction.

 

A Solution

NGINX is the most widely used API gateway, application proxy, and web server in the world. It has long been popular in the Kubernetes community as the de facto Ingress controller. With the NGINX Gateway Fabric packaging, this workhorse renews its reputation for seamless integration with Kubernetes.

But how do we orchestrate progressive delivery? We still need one more piece.

Our friends at the Argo Project (famous for their GitOps platform, Argo CD) released Argo Rollouts back in 2019 to focus on this progressive delivery problem. Argo Rollouts enables the orchestration of Canary deployments, Blue-green deployments, and traffic-splitting experiments.

Canary Deployments (credit: argoproj.io)

However, for ingress proxies and service meshes to execute the traffic-shaping portion of application rollouts, each integration had to be built directly into the Argo Rollouts code. This added complexity and slowed down contributions. As you might imagine, the effort required to develop and maintain all of these vendor- and implementation-specific extensions has been a chore, to say the least.

Fortunately, on April 5, 2023, the Argo Project announced that a plugin system for Argo Rollouts had been developed. This would enable future extensibility without having to contribute to the core Argo Rollouts project itself. This was a welcome addition, and would set up the Argo Project for yet another important innovation.

To much celebration, the Argo Project announced on June 20, 2024 that Argo Rollouts now supports ingresses, gateways and service meshes that implement the Kubernetes Gateway API for progressive delivery. Unsurprisingly, NGINX Gateway Fabric, powered by the most popular proxy in the world, is one of the supported providers.

 

The Details

Without further ado, let's set this solution stack up for a test drive. You will need a Kubernetes cluster, ideally version 1.25 or greater, so we can take advantage of features in the latest version of the Gateway API. While the installation could be performed entirely by script or with a tool like Argo CD, we will perform each step manually for learning purposes.

Here is the overall architecture of the demo environment. We will deploy and configure everything you see here step-by-step.

Demo diagram

 

Local tools needed (each is used in the steps that follow):

  • kubectl
  • Helm
  • git
  • A web browser

We will begin by installing NGINX Gateway Fabric. We will be using the NGINX Plus edition to take advantage of its extended metrics and seamless configuration reload capabilities.

Note: It is not recommended to perform the demo steps in a production Kubernetes cluster without prior validation in a less-critical environment.

 

Installation

  1. Create the namespace for NGINX Gateway Fabric:
    kubectl create namespace nginx-gateway

     

  2. Create a Secret to pull the NGINX Gateway Fabric container from the F5 private registry. The secret is based on the contents of the trial JWT from MyF5. If you do not have a trial JWT, you can request one here.
    kubectl create secret docker-registry nginx-plus-registry-secret --docker-server=private-registry.nginx.com --docker-username=`cat the_full_path_to_your_jwt_here` --docker-password=none -n nginx-gateway

     

  3. The Kubernetes Gateway API is not included in clusters by default. We need to install its Custom Resource Definitions (CRDs) in order to use it:
    kubectl kustomize "https://github.com/nginxinc/nginx-gateway-fabric/config/crd/gateway-api/standard?ref=v1.3.0" | kubectl apply -f -

     

  4. Next, we will install NGINX Gateway Fabric via its Helm chart:
    helm install ngf oci://ghcr.io/nginxinc/charts/nginx-gateway-fabric --set nginx.image.repository=private-registry.nginx.com/nginx-gateway-fabric/nginx-plus --set nginx.plus=true --set serviceAccount.imagePullSecret=nginx-plus-registry-secret -n nginx-gateway --version 1.3.0

     

  5. Run the following command to wait until NGINX Gateway Fabric has been verified as deployed:
    kubectl wait --timeout=5m -n nginx-gateway deployment/ngf-nginx-gateway-fabric --for=condition=Available

     

  6. We will use Prometheus metrics to inform Argo Rollouts of the state of our application's health during rollouts. Add the Prometheus community Helm repository, then update your local Helm repo cache:
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update

     

  7. Install Prometheus on your cluster in its own namespace:
    helm install prometheus prometheus-community/prometheus -n prometheus --create-namespace --set server.global.scrape_interval=15s

     

  8. Create a namespace for Argo Rollouts, and install Argo Rollouts using the manifests from the project's GitHub repo:
    kubectl create namespace argo-rollouts
    kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml

     

  9. Install the Argo Rollouts CLI using the instructions for your client platform.

  10. Install the Gateway API Plugin for Argo Rollouts:
    kubectl apply -f gateway-plugin.yml -n argo-rollouts
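
    The gateway-plugin.yml manifest is not shown in this article; it is a small ConfigMap that tells the Argo Rollouts controller where to download the Gateway API traffic router plugin. A minimal sketch, assuming the v0.2.0 plugin release that appears in the logs in step 12, might look like this:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      # Argo Rollouts reads its plugin configuration from this well-known ConfigMap name
      name: argo-rollouts-config
    data:
      trafficRouterPlugins: |-
        - name: "argoproj-labs/gatewayAPI"
          location: "https://github.com/argoproj-labs/rollouts-plugin-trafficrouter-gatewayapi/releases/download/v0.2.0/gateway-api-plugin-linux-amd64"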

     

  11. Restart the Argo Rollouts controller so that it detects the presence of the plugin:
    kubectl rollout restart deployment -n argo-rollouts argo-rollouts

     

  12. Then, check the controller logs. You should see lines verifying that the plugin has been loaded:
    time="XXX" level=info msg="Downloading plugin argoproj-labs/gatewayAPI from: https://github.com/argoproj-labs/rollouts-plugin-trafficrouter-gatewayapi/releases/download/v0.2.0/gateway-api-plugin-linux-amd64"
    time="YYY" level=info msg="Download complete, it took 7.792426599s"

     

Demo Setup

Now that our cluster services are in place, we will use NGINX Gateway Fabric and Argo Rollouts to perform a Canary deployment from a fork of the Argo Rollouts demo application. This type of deployment assumes we start with a known good, or "stable", deployment in our cluster. Once this is deployed, we can update it to introduce new versions of the application's pods in a gradual manner. I will highlight how this works along the way.

 

  1. With git, clone the demo repo:
    git clone https://github.com/aknot242/rollouts-demo.git

     

  2. Change directory to your repo clone's nginx-gateway-fabric directory:
    cd rollouts-demo/examples/nginx-gateway-fabric

     

  3. Apply the Gateway manifest:
    kubectl apply -f gateway.yaml


    A Gateway is a resource defined by the Kubernetes Gateway API; NGINX Gateway Fabric will use it to accept requests coming into the cluster (a representative manifest is sketched below).

    This and all additional resources are deployed into the default namespace. If you need to install into a different namespace, add -n your_namespace_here to each of the kubectl commands.
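
    For reference, a minimal Gateway manifest for NGINX Gateway Fabric might look roughly like the sketch below; the resource name and listener are illustrative, and the actual gateway.yaml in the repo may differ:

    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: gateway              # illustrative name
    spec:
      gatewayClassName: nginx    # the GatewayClass installed with NGINX Gateway Fabric
      listeners:
      - name: http
        port: 80
        protocol: HTTP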

  4. Apply the HTTPRoute resource manifest:
    kubectl apply -f httproute.yaml

    The Kubernetes Gateway API also defines the HTTPRoute resource. This resource references the Gateway object and contains the L7 routing rules that direct traffic to the configured Kubernetes services (a simplified example is sketched below).
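
    A simplified HTTPRoute for this demo might resemble the sketch below; the route and service names are illustrative (check httproute.yaml for the real values). During a rollout, Argo Rollouts adjusts the weights on the two backendRefs to split traffic:

    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: rollouts-demo-route                # illustrative name
    spec:
      parentRefs:
      - name: gateway                          # must reference the Gateway created in the previous step
      rules:
      - backendRefs:
        - name: argo-rollouts-stable-service   # illustrative service names; see the next step
          port: 80
        - name: argo-rollouts-canary-service
          port: 80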

  5. Next, deploy two services: a service to represent the "stable" version of our application, and one to represent our "canary" version:
    kubectl apply -f stable-service.yaml
    kubectl apply -f canary-service.yaml

    Why do we need two different services that contain the same app selector?
    When implementing a Canary deployment, Argo Rollouts requires two different Kubernetes services: a "stable" service and a "canary" service. The "stable" service directs traffic to the initial deployment of the application. In subsequent ("canary") deployments, Argo Rollouts transparently configures the "canary" service to use the endpoints exposed by the pods in the new deployment. NGINX Gateway Fabric then splits traffic between the two services according to the Argo Rollouts rules (see the sketch below).
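
    To make the relationship concrete, here is a rough sketch of what the two Service manifests might contain; the names and ports are illustrative, and both use the same pod selector (see stable-service.yaml and canary-service.yaml for the real definitions):

    apiVersion: v1
    kind: Service
    metadata:
      name: argo-rollouts-stable-service    # illustrative name
    spec:
      selector:
        app: rollouts-demo                  # same selector as the canary service below
      ports:
      - port: 80
        targetPort: 8080                    # assumed application port
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: argo-rollouts-canary-service    # illustrative name
    spec:
      selector:
        app: rollouts-demo                  # Argo Rollouts narrows this selector to the canary pods during a rollout
      ports:
      - port: 80
        targetPort: 8080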

  6. Now we will deploy the AnalysisTemplate resource, provided by Argo Rollouts. This resource contains the rules for assessing a deployment's health and how to interpret that data. In this demo, we will use a Prometheus query to check the canary service's upstream pods for the absence of 4xx and 5xx HTTP response codes as an indication of health (a rough sketch of such a template follows the command below).

    kubectl apply -f analysis-success-rate.yaml
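
    As a rough sketch only (the metric name, PromQL, and threshold below are assumptions; refer to analysis-success-rate.yaml for the actual query), an AnalysisTemplate of this kind is structured roughly as follows:

    apiVersion: argoproj.io/v1alpha1
    kind: AnalysisTemplate
    metadata:
      name: success-rate                    # illustrative name
    spec:
      args:
      - name: service-name                  # the canary service whose upstream metrics are inspected
      metrics:
      - name: success-rate
        interval: 1m                        # run the query every minute, as described above
        successCondition: result[0] >= 0.95
        failureLimit: 1
        provider:
          prometheus:
            # service created by the Prometheus Helm chart installed earlier
            address: http://prometheus-server.prometheus.svc.cluster.local:80
            # placeholder PromQL: the fraction of upstream responses that are not 4xx/5xx;
            # the real metric and label names depend on what NGINX Gateway Fabric exposes
            query: >
              sum(rate(nginxplus_upstream_server_responses{code!~"4xx|5xx",upstream=~"{{args.service-name}}.*"}[1m]))
              /
              sum(rate(nginxplus_upstream_server_responses{upstream=~"{{args.service-name}}.*"}[1m]))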

     

  7. Finally, deploy the app itself. This step introduces the Rollout resource, provided by Argo Rollouts. This resource is an extension of the familiar Deployment resource, with additional fields that instruct Argo Rollouts how to deploy subsequent versions of this application while monitoring their health (a trimmed sketch follows the command below):

    kubectl apply -f rollout.yaml
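
    Here is a trimmed sketch of what rollout.yaml might contain; the names and step values are illustrative, and the real file defines all eight steps:

    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    metadata:
      name: rollouts-demo
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: rollouts-demo
      template:
        metadata:
          labels:
            app: rollouts-demo
        spec:
          containers:
          - name: rollouts-demo
            image: argoproj/rollouts-demo:blue            # the initial "stable" image
            ports:
            - containerPort: 8080
      strategy:
        canary:
          stableService: argo-rollouts-stable-service     # must match the Services deployed in step 5
          canaryService: argo-rollouts-canary-service
          trafficRouting:
            plugins:
              argoproj-labs/gatewayAPI:
                httpRoute: rollouts-demo-route            # the HTTPRoute whose backend weights are adjusted
                namespace: default
          analysis:
            templates:
            - templateName: success-rate                  # the AnalysisTemplate deployed in step 6
            args:
            - name: service-name
              value: argo-rollouts-canary-service
          steps:
          - setWeight: 30
          - pause: {}        # pause indefinitely until promoted
          # ...the real file continues with additional setWeight / pause steps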

     

  8. Use `kubectl` to verify that there are five replicas of the application running:

    kubectl get pods

     

Monitoring the Stable Rollout

  1. Now that the application is fully deployed, use Argo Rollouts' kubectl plugin to observe the initial state of the application. Open a new shell window, and run the following:

    kubectl argo rollouts get rollout rollouts-demo -w


    Note: Leave this running in its own window, as we will use it through the rest of the demo.

    You should see the following:


    In the image above, you will observe several things (in no particular order):

    - The rollout of the "stable" version of the application was successful, with five replicas of the rollouts-demo:blue container image created.
    - There are eight steps to this progressive rollout. Refer to the rollout.yaml file to see the configured stages.
    - You should see that the rollout is Progressing, awaiting a canary rollout.

  2. Open a new shell window, and set up port forwarding to the NGINX Gateway Fabric pod's NGINX container:
    NGINX_POD=`kubectl get pods --selector=app.kubernetes.io/name=nginx-gateway-fabric -n nginx-gateway -o=name`
    kubectl port-forward $NGINX_POD -n nginx-gateway 8080:80


    Note: In a production scenario, we would not use a port forward. Rather, we would likely use a Service of type LoadBalancer to reach the NGINX Gateway Fabric instance. However, this requires additional setup and varies greatly depending on where your cluster is hosted.

  3. Open a browser to `http://localhost:8080`. You will be presented with the demo application:

    In the grid, each blue square represents a connection being made through NGINX Gateway Fabric to one of the demo pods hosted in the rollout deployment. Since we are using the rollouts-demo:blue container image, the service responds with that color. The area at the bottom represents the portion of responses by a particular color over time.

    Important: Certain browsers, such as Chrome, will put a tab to sleep if it is not the actively selected one. As a result, if you switch to another tab, the HTTP calls to the demo app will cease until the tab is activated again.

 

Initiating a Canary Rollout

Now that we have a stable version of our rollout running, we need to introduce an application change. The demo project contains different variants of the color-producing service, which we will use so you can easily see what is happening. To update the deployment, you can either edit the rollout.yaml file to use a different image (as sketched below), or use the Argo Rollouts CLI. We will do the latter.
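
If you prefer editing the manifest instead of using the CLI, the change amounts to bumping the image tag in rollout.yaml and re-applying it with kubectl; a hypothetical fragment:

    # fragment of rollout.yaml - only the image tag changes
    spec:
      template:
        spec:
          containers:
          - name: rollouts-demo
            image: argoproj/rollouts-demo:yellow   # previously :blue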

  1. In the original shell window you opened, use the CLI to start a Canary rollout of the rollouts-demo:yellow image:

    kubectl argo rollouts set image rollouts-demo "*=argoproj/rollouts-demo:yellow"

     

  2. Switch to the shell window that is monitoring the rollout status. You should see something similar to the following:

    The canary rollout has begun, and has paused at the 30% traffic stage, as directed by the rollout rules in rollout.yaml.

  3. Switch back to your browser and you should see something similar to this:

     

  4. Switch back to the rollout status shell window.

    Note that the AnalysisRun is continually running. Why? Since this canary rollout has only partially progressed, Argo Rollouts continues to monitor the results of the Prometheus query configured in the AnalysisTemplate resource (in analysis-success-rate.yaml) to ensure there are no 4xx or 5xx HTTP errors present in the NGINX upstream metrics produced for this canary service. The analysis template is configured to run every minute. This rollout has a rule set to pause indefinitely at 30% traffic, so how do we complete the rollout?

  5. Switch to your original shell and run the following to continue (or, "promote") the rollout:

    kubectl argo rollouts promote rollouts-demo

     

    In the shell window where the kubectl argo rollouts plugin is running, you will see the rollout progress over the next several minutes, gradually shifting traffic over to the canary service represented in the demo app. Once complete, the canary service becomes the new stable release.


  6. Once the rollout completes, you should see that the UI of the demo application has shifted entirely over to using the yellow service.

     

  7. Switch to the kubectl rollouts plugin window. You will notice that there are two ReplicaSets with five pods each: one representing the new stable release, and another representing the previous release. Since all the traffic is being routed to the new stable release, why keep the previous pods around? If an issue is later discovered in the new stable release, you can switch back to the previous revision of the application without having to wait for new pods to initialize.

     

 

Simulate a Failed Rollout

We have just seen what an ideal rollout looks like. What about a rollout where failures are detected by the AnalysisTemplate based on Prometheus metrics? We will deploy a new Canary version of this app by selecting an image that intentionally throws HTTP errors.

 

  1. In the original shell window you opened, use the CLI to start a Canary rollout of the rollouts-demo:bad-red container image:

    kubectl argo rollouts set image rollouts-demo "*=argoproj/rollouts-demo:bad-red"

     

  2. Switch back to your browser and you will see something similar to this:

     

  3. Wait at least a minute, and you should see something like this:

    You will see a portion of red service responses for about a minute before the traffic reverts to 100% yellow. Why? Look at the kubectl argo rollouts plugin output to see what is going on. You may observe that the AnalysisRun has failed at least once, triggering Argo Rollouts to perform an automatic rollback to the last successful rollout version. Cool, right?


     

Conclusion

This was only a taste of what can be achieved with Argo Rollouts and NGINX Gateway Fabric. However, I hope you have witnessed the benefits of adopting progressive delivery patterns using tools such as these. I encourage you to further explore what you can accomplish in your own environments.

Published Jul 30, 2024
Version 1.0
