Implement circuit breaking in Consul service mesh with Envoy
When services in a service mesh communicate with each other, a quiet service failure can increase latency in one service or cause cascading failures in downstream services. For example, a web server can still pass its health checks when its network and transport stacks are working, even though it returns HTTP 5xx errors. Circuit breaking is a pattern designed to limit the impact of quiet failures by removing service instances that persistently return errors.
Circuit breaking is typically achieved by modifying the service's code or configuring the network. While some libraries and frameworks offer circuit breaking through code, the implementations may not be consistent if you have services written in different languages. In other cases, it may not be desirable or possible to modify the service code. Implementing circuit breaking through the service mesh does not require you to modify application code and decouples infrastructure concerns from application logic.
In this tutorial, you will implement circuit breaking in the Consul service mesh by applying a ServiceDefaults configuration entry to configure Envoy proxies.
Scenario overview
HashiCups is a coffee shop demo application. It has a microservices architecture and uses Consul service mesh to securely connect the services. The Terraform deployment sets up HashiCups microservices (nginx, frontend, public-api, product-api, product-db, and payments) in the Kubernetes environment.
In this tutorial, you will deploy HashiCups and three public-api service instances. You will use a traffic generator to create a steady flow of requests towards the public-api service.
Next, you will simulate failure by configuring two instances of the public-api service to respond to HTTP requests with HTTP 500 errors.
Finally, you will configure circuit breaking by modifying the ServiceDefaults configuration entry for the traffic-generator service. This will instruct the traffic-generator service not to send requests to the public-api instances that respond with HTTP 500 errors.
Prerequisites
The tutorial assumes that you are familiar with Consul and its core functionality. If you are new to Consul, refer to the Consul Getting Started tutorials collection.
For this tutorial, you will need:
- An AWS account configured for use with Terraform
- kubectl >= 1.28
- aws-cli >= 2.13.19
- terraform >= 1.5.7
- consul-k8s v1.3.0
- helm >= 3.12.3
Clone GitHub repository
Clone the GitHub repository containing the configuration files and resources.
$ git clone https://github.com/hashicorp-education/learn-consul-circuit-breaking
Change into the directory with the newly cloned repository.
$ cd learn-consul-circuit-breaking
This repository contains Terraform configuration to spin up the initial infrastructure and all files to deploy Consul, the HashiCups sample application, and the traffic-generator.
Here, you will find the following Terraform configuration:
- eks-cluster.tf defines Amazon EKS cluster deployment resources
- outputs.tf defines outputs you will use to authenticate and connect to your Kubernetes cluster
- providers.tf defines AWS provider definitions for Terraform
- variables.tf defines variables you can use to customize the tutorial
- vpc.tf defines the AWS VPC resources
Additionally, you will find the following directories:
- api-gw contains the Kubernetes custom resource definitions (CRDs) required to deploy and configure the API gateway resources
- consul contains the Helm chart that configures your Consul instance
- k8s-services contains the Kubernetes definitions that deploy the sample application
Deploy infrastructure, Consul, and sample applications
Initialize your Terraform configuration to download the necessary providers and modules.
$ terraform init
Initializing the backend...
Initializing provider plugins...
## ...
Terraform has been successfully initialized!
## ...
Then, create the infrastructure. Confirm the run by entering yes. This will take about 15 minutes to deploy your infrastructure. Feel free to explore the next sections of this tutorial while waiting for the resources to deploy.
$ terraform apply
## ...
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
## ...
Apply complete! Resources: 99 added, 0 changed, 0 destroyed.
Configure your terminal to communicate with EKS
Now that you have deployed the Kubernetes cluster, configure kubectl to interact with it.
$ aws eks --region $(terraform output -raw region) update-kubeconfig --name $(terraform output -raw cluster_name)
Explore Kubernetes services
Open the Consul UI URL printed by the following command and observe the request metrics for traffic sent from the traffic-generator service towards the public-api service. The continuous gray area in the graph represents a steady flow of successful requests between the two services.
$ export CONSUL_UI=https://$(kubectl --namespace consul get services consul-ui -o jsonpath='{ .status.loadBalancer.ingress[].hostname }') && echo ${CONSUL_UI}/ui/dc1/services/public-api/topology
Simulate failure
The public-api service has a configurable setting that makes it return HTTP error 500 on demand. To simulate failure, you will now configure the public-api-v2 and public-api-v3 instances of the public-api service to return an error 500 for 100% of the served requests.
$ kubectl apply --filename k8s-services/failing-service-public-api.yaml
By default, Consul service mesh load balances requests across all available service instances. The public-api service now has two instances that return HTTP error 500, which leads to a high error rate. Wait a couple of minutes for the metrics to populate, then observe the errors on the Consul service page for public-api. The graph plots errors in red.
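To get a feel for how high the error rate can climb, the following minimal Python sketch models a plain round-robin rotation over three instances, two of which always fail. The instance name public-api-v1 and the strict round-robin distribution are assumptions made for illustration; the real load balancing policy and the numbers you observe in the Consul UI may differ.

```python
from itertools import cycle

# Hypothetical pool mirroring the tutorial: one healthy instance and two
# instances that answer every request with HTTP 500.
INSTANCES = {
    "public-api-v1": 200,  # assumed name for the healthy instance
    "public-api-v2": 500,
    "public-api-v3": 500,
}


def error_rate(instances, total_requests):
    """Send requests round-robin and return the fraction answered with 5xx."""
    rotation = cycle(instances.items())
    errors = 0
    for _ in range(total_requests):
        _name, status = next(rotation)
        if status >= 500:
            errors += 1
    return errors / total_requests


if __name__ == "__main__":
    print(f"error rate: {error_rate(INSTANCES, 300):.1%}")
```

Under these assumptions, roughly two out of every three requests fail until the failing instances are taken out of the rotation.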
Set up circuit breaking
To implement circuit breaking, you must configure two Envoy settings:
- The Envoy circuit breaking feature implements the bulkhead pattern by limiting the maximum connections, pending requests, and concurrent requests for a pool of upstream service instances.
- Envoy outlier detection handles the ejection of service instances that are flagged as failing. If an upstream service instance returns the maximum allowed number of consecutive HTTP 5xx errors, Envoy ejects the service instance from the pool of upstream service instances.
The combination of these two settings effectively implements the circuit breaker pattern, first by detecting failures and then by ejecting the failed service instance.
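The detection half of this pattern can be sketched in a few lines of Python. This is a toy model, not Envoy's implementation: the class name, method names, and host names are hypothetical, and it only tracks consecutive 5xx responses per host and ejects a host once the streak reaches a threshold.

```python
class ConsecutiveErrorDetector:
    """Toy model of consecutive-5xx outlier detection for a pool of hosts."""

    def __init__(self, hosts, max_failures):
        self.max_failures = max_failures
        self.consecutive = {host: 0 for host in hosts}
        self.ejected = set()

    def healthy_hosts(self):
        """Return the hosts that are still eligible to receive traffic."""
        return [host for host in self.consecutive if host not in self.ejected]

    def record(self, host, status):
        """Record one response; return True if this response ejects the host."""
        if status >= 500:
            self.consecutive[host] += 1
        else:
            self.consecutive[host] = 0  # any non-5xx response resets the streak
        if self.consecutive[host] >= self.max_failures and host not in self.ejected:
            self.ejected.add(host)
            return True
        return False


if __name__ == "__main__":
    detector = ConsecutiveErrorDetector(
        ["public-api-v1", "public-api-v2", "public-api-v3"], max_failures=10
    )
    for attempt in range(1, 11):
        if detector.record("public-api-v2", 500):
            print(f"public-api-v2 ejected after {attempt} consecutive errors")
    print("still in rotation:", detector.healthy_hosts())
```

The sketch leaves out re-admission of ejected hosts and the connection limits, which the proxy handles separately.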
In Consul service mesh, you can set the values for these settings by creating and applying a ServiceDefaults configuration entry on the source (downstream) service. These settings will then apply for requests sent towards the destination (upstream) service.
The k8s-services/servicedefaults-traffic-generator.yaml file contains a sample ServiceDefaults configuration entry for the traffic-generator service that trips the circuit breaker after 10 public-api service failures (with the maxFailures setting) and retries an ejected public-api instance after five seconds (with the interval setting).
The same circuit breaking settings can also be configured for every service globally by applying them to a ProxyDefaults configuration entry instead. For more information and examples, refer to the ProxyDefaults documentation.
k8s-services/servicedefaults-traffic-generator.yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceDefaults
metadata:
  name: traffic-generator
spec:
  protocol: http
  upstreamConfig:
    defaults:
      connectTimeoutMs: 5000
      limits:
        maxConnections: 30
        maxPendingRequests: 40
        maxConcurrentRequests: 50
      passiveHealthCheck:
        interval: "5s"
        maxFailures: 10
        enforcingConsecutive5xx: 100
        maxEjectionPercent: 100
        baseEjectionTime: "10s"
The following are configurable parameters for circuit-breaking in the Consul service mesh.
- maxConnections: Specifies the maximum number of connections a service instance can establish against the upstream for HTTP/1.1 traffic
- maxPendingRequests: Specifies the maximum number of requests that are queued while waiting for a connection to establish
- maxConcurrentRequests: Specifies the maximum number of concurrent requests for HTTP/2 traffic
- interval: Specifies the time between checking the upstream service instance availability
- maxFailures: Specifies the number of consecutive failures allowed per check interval. If exceeded, Consul removes the host from the list of upstream service instances
- enforcingConsecutive5xx: Specifies a percentage that indicates how many times out of 100 that Consul ejects the service instance when it detects a failed status. The failed status is determined by consecutive errors in the HTTP 500-599 response range
- maxEjectionPercent: Specifies the maximum percentage of upstream service instances that Consul ejects when the health check reports a failure. Consul ejects at least one service instance when a failure is detected regardless of this value
- baseEjectionTime: Specifies the minimum amount of time that an ejected service instance must remain considered as failed before being considered healthy. The effective ejection time is equal to the value of the baseEjectionTime multiplied by the number of times the host has been ejected already
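To illustrate how the two HTTP/1.1 limits interact, here is a deliberately simplified Python model of the bulkhead decision for a single request. The function and class names are hypothetical and the three outcomes are an approximation: the sketch assumes a request uses a free connection if one exists, otherwise waits in the pending queue, and otherwise fails fast. Envoy's actual connection pooling is more involved.

```python
from dataclasses import dataclass


@dataclass
class UpstreamLimits:
    """Limits taken from the sample ServiceDefaults entry in this tutorial."""

    max_connections: int = 30
    max_pending_requests: int = 40


def admit(limits, active_connections, pending_requests):
    """Decide what happens to a new request in this simplified model."""
    if active_connections < limits.max_connections:
        return "connect"  # a connection slot is free
    if pending_requests < limits.max_pending_requests:
        return "queue"  # wait for a connection to become available
    return "reject"  # both limits are exhausted, so fail fast


if __name__ == "__main__":
    limits = UpstreamLimits()
    for active, pending in [(29, 0), (30, 39), (30, 40)]:
        print(active, pending, admit(limits, active, pending))
```

Failing fast once both limits are reached is what keeps a slow or failing upstream from tying up resources in the calling service.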
Note
If you do not specify limits or passiveHealthCheck, Consul uses Envoy's outlier detection defaults.
Apply circuit-breaking parameters for the traffic-generator service.
$ kubectl apply --filename k8s-services/servicedefaults-traffic-generator.yaml
Verify ejection of failing service instances
Next, observe the Consul UI traffic metrics for public-api. Once the circuit-breaking trigger is reached, the two instances of the public-api service that respond with HTTP error 500 are ejected from the upstream destination pool and the error rate stops. After the ejection time elapses, the instances are reintroduced into the upstream destinations. When they respond with errors again, the passive health check activates the circuit breaking pattern, which ejects them again.
The Prometheus metrics collector keeps track of active ejections per service and can graph these metrics. Next, open a port forward to the Prometheus service.
$ kubectl port-forward -n consul service/prometheus-server 9090:80 > /dev/null 2>&1 &
Observe the metrics in the Prometheus UI. The graph shows that there are currently two active ejections for the public-api service.
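If you prefer to read the same value from a script, the Prometheus HTTP API exposes instant queries at /api/v1/query. The sketch below assumes the port forward from the previous step is listening on localhost:9090 and that the active ejections are exported under the metric name envoy_cluster_outlier_detection_ejections_active. That metric name is an assumption based on Envoy's outlier detection statistics, so adjust the query if your deployment labels it differently.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = "http://localhost:9090"  # matches the port forward above
# Assumed metric name; verify it in the Prometheus UI before relying on it.
QUERY = "sum(envoy_cluster_outlier_detection_ejections_active)"


def build_query_url(base_url, query):
    """Build the instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': query})}"


def extract_values(payload):
    """Return the numeric sample values from an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return [float(item["value"][1]) for item in payload["data"]["result"]]


def active_ejections(base_url=PROMETHEUS_URL, query=QUERY):
    """Query Prometheus and return the current values for the query."""
    with urlopen(build_query_url(base_url, query), timeout=10) as response:
        return extract_values(json.load(response))
```

Calling active_ejections() while the port forward is running should return a list with one value per result series, which you can compare with the graph in the Prometheus UI.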
The graph may differ for each deployment because requests to the two failing instances can either sync up or overlap, so the graph may show either one or two active ejections at the beginning of the metrics. After a few minutes, however, it should stabilize at two active ejections. This is because the active ejection period for each failing instance becomes longer as the instance continues to return errors. The ejection period is calculated by multiplying the baseEjectionTime by the number of times that the instance has already been ejected. For more information, refer to the Envoy ejection algorithm documentation.
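As a worked example of that calculation, the following Python sketch lists the ejection periods for an instance that keeps failing, using the tutorial's baseEjectionTime of 10 seconds. It assumes the multiplier counts the current ejection (so the first ejection lasts one base period) and that the period is capped at 300 seconds, which mirrors Envoy's default maximum ejection time. Both points are assumptions, and the function name is hypothetical.

```python
def ejection_seconds(base_ejection_time, ejection_count, max_ejection_time=300.0):
    """Return how long the Nth ejection lasts under the assumptions above."""
    if ejection_count < 1:
        raise ValueError("ejection_count starts at 1 for the first ejection")
    # The cap is never allowed to fall below the base ejection time.
    cap = max(max_ejection_time, base_ejection_time)
    return min(base_ejection_time * ejection_count, cap)


if __name__ == "__main__":
    base = 10.0  # baseEjectionTime from the sample ServiceDefaults entry
    for count in range(1, 6):
        print(f"ejection {count}: {ejection_seconds(base, count):.0f}s")
```

Because each period is longer than the last, a persistently failing instance spends a growing share of time ejected, which is why the graph settles at two active ejections.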
Clean up environment
Destroy the Terraform resources to clean up your environment. Enter yes to confirm the destroy operation.
$ terraform destroy
## ...
Destroy complete! Resources: 99 destroyed.
Due to race conditions with the various cloud resources created in this tutorial, you may need to run the destroy operation twice to ensure all resources have been properly removed.
Next step
In this tutorial, you used circuit breaking as a solution for routing traffic to the healthy instances of the services running on your HashiCorp Consul service mesh. In the process, you learned the benefits of fine-tuning the health checking and the outlier detection for service mesh, and how circuit breaking helps keep services healthy and functional.
Feel free to explore these tutorials and collections to learn more about Consul service mesh, microservices, and Kubernetes.