Collect Terraform Enterprise metrics with Prometheus and Grafana

15min
|
Enterprise
Terraform

Terraform Enterprise is critical infrastructure for many businesses. You can proactively monitor your Terraform Enterprise instance's performance by enabling its built-in metrics endpoint and connecting it to your monitoring system. These steps will help you prevent issues and outages, which can be disruptive and costly.

The metrics provided by Terraform Enterprise let you identify which worker containers exist at any given time, monitor their performance in order to troubleshoot and prevent run issues, and make informed run concurrency and capacity constraint decisions.

In this tutorial you will enable usage-metric collection in Terraform Enterprise and provision an instance running Prometheus and Grafana. Then you will configure Prometheus to scrape metrics from your Terraform Enterprise deployment and visualize them on a Grafana dashboard.

Note

You will send Terraform Enterprise metrics including run_type, run_id, workspace_name, and organization_name to a demo Prometheus and Grafana instance. If your organization treats this data as sensitive, check with your security team before running the tutorial.

Prerequisites

This tutorial assumes that you are familiar with Terraform Enterprise and the standard Terraform workflow. If you are new to Terraform, complete the Get Started tutorials first. If you are new to Terraform Enterprise, read the Terraform Enterprise documentation first.

For this tutorial, you will need:

An existing Terraform Enterprise deployment in standalone mode in AWS, running version 202207-1 or newer, with ports 9090 and 9091 available. You will also need administrator access to your Terraform Enterprise dashboard.
The Terraform CLI (1.1.0+) installed.
An AWS account with credentials set as local environment variables.

Note

This tutorial will provision resources that qualify under the AWS free-tier. If your account does not qualify under the AWS free-tier, we are not responsible for any charges that you may incur.

Enable TFE metrics endpoint

Open the Settings page, and click on Metrics.

The Terraform Enterprise settings screen with the word "Metrics" highlighted

Click Enable Metrics Endpoint. Leave the default port settings set to 9090 and 9091 for HTTP and HTTPS respectively. In this tutorial, you will only use port 9090.

Tip

If ports 9090 and 9091 are unavailable in your network, you can change the port settings to use a different port. If you do so, replace 9090 with the new port as you follow the rest of this tutorial.

Scroll down to the bottom of the settings page and click on the Save button. Then, click on Restart Now to restart Terraform Enterprise and apply the settings.

Clone the example repository

Clone the example code repository, which you will use to deploy demonstration instances of Prometheus and Grafana.

$ git clone https://github.com/hashicorp-education/learn-terraform-enterprise-metrics-prometheus

The configuration in this repository deploys a demonstration instance running both Prometheus and Grafana.

Warning

Do not use this instance configuration in production. The security group allows inbound and outbound traffic to the Prometheus and Grafana instance from any IP address.

Navigate to the cloned repository.

$ cd learn-terraform-enterprise-metrics-prometheus

Review configuration

Open main.tf, which configures AWS resources for your Prometheus and Grafana deployment, including:

an AWS security group
an IAM role and role policy
instance profile
an EC2 instance

The prometheus security group lets your browser connect to the instance on port 80 and 443, and Grafana and Prometheus to connect on ports 9090 and 3000.

resource "aws_security_group" "prometheus" {
  name        = "prometheus"
  description = "Learn tutorial Security Group for prometheus instance"

  ingress {
    description = "Allow port 9090 inbound"
    from_port   = 9090
    to_port     = 9090
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "Allow port 3000 inbound"
    from_port   = 3000
    to_port     = 3000
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "Allow port 80 inbound"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "Allow port 443 inbound"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    description = "Allow port 9090 outbound"
    from_port   = 9090
    to_port     = 9090
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    description = "Allow port 3000 outbound"
    from_port   = 3000
    to_port     = 3000
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    description = "Allow port 80 outbound"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    description = "Allow port 443 outbound"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Next, the configuration defines an IAM role, role policy, and instance profile which grant AmazonEC2ReadOnlyAccess permissions to the Prometheus instance. This will let the Prometheus EC2 instance use the AWS API to find your Terraform Enterprise EC2 instance without needing your AWS credentials.

First the configuration creates the IAM role resource, because the IAM instance profile depends on the role.

resource "aws_iam_role" "prometheus_aws_iam_role" {
  name = "prometheus_aws_iam_role"

  assume_role_policy = <<EOF
{
   "Version": "2012-10-17",
   "Statement": [
     {
       "Action": "sts:AssumeRole",
       "Principal": {
         "Service": "ec2.amazonaws.com"
       },
       "Effect": "Allow",
       "Sid": ""
     }
   ]
 }
 EOF
}

Next the configuration creates the IAM instance profile and associates it with the IAM role.

resource "aws_iam_instance_profile" "prometheus_iam_instance_profile" {
  name = "prometheus_iam_instance_profile"
  role = aws_iam_role.prometheus_aws_iam_role.name
}

Then, the configuration defines an IAM role policy associated with the IAM role.

resource "aws_iam_role_policy" "prometheus_iam_role_policy" {
  name = "prometheus_iam_role_policy"
  role = aws_iam_role.prometheus_aws_iam_role.id

  policy = <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "ec2:Describe*",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "elasticloadbalancing:Describe*",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:ListMetrics",
                "cloudwatch:GetMetricStatistics",
                "cloudwatch:Describe*"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "autoscaling:Describe*",
            "Resource": "*"
        }
    ]
}
 EOF
}

Next, the configuration defines an EC2 instance that will host the Prometheus and Grafana applications.

resource "aws_instance" "prometheus_instance" {
  ami                    = data.aws_ami.amazon_linux.id
  instance_type          = "t2.micro"
  vpc_security_group_ids = [aws_security_group.prometheus.id]
  subnet_id              = data.aws_instance.get_existing_tfe_subnet_id.subnet_id
  iam_instance_profile   = aws_iam_instance_profile.prometheus_iam_instance_profile.name
  user_data              = templatefile("prometheus-install.sh.tftpl", { tfe_tag_name = var.tfe_tag_name, aws_region = var.aws_region })
  tags = {
    "Name" = "prometheus_instance"
  }
}

Notice that the iam_instance_profile attribute grants this instance read-only access to the AWS API, as defined in the IAM role policy. The user_data attribute uses the templatefile() function and template variables to generate a script that installs Prometheus and Grafana.

Now open prometheus-install.sh.tftpl, which is a Terraform template for a script that will run when the EC2 instance boots. The script downloads, configures, and starts Prometheus and Grafana.

Review the portion of the script that generates a Prometheus configuration file.

sudo cat > /etc/prometheus/prometheus.yml << "EOF"
global:
  scrape_interval: 10s

scrape_configs:
  - job_name: tfe
    params:
      format:
        - prometheus
    relabel_configs:
    - source_labels: [__meta_ec2_instance_id]
      regex: (.*)
      target_label: instance
      replacement: ${1}
      action: replace
    ec2_sd_configs:
    - endpoint: ""
      region: ${aws_region}
      refresh_interval: 1m
      port: 9090
      filters:
        - name: tag:Name
          values:
            - ${tfe_tag_name}
EOF
##...

The scrape configuration file contains the Prometheus job_name for the Terraform Enterprise scrape target. The ec2_sd_configs block identifies the Terraform Enterprise instance in the correct region while filtering on the Name tag value.

The script configuration is intended only for demonstration environments which do not have a Prometheus and Grafana deployment. Your existing Prometheus instance may require alternative configuration options, which you can review in the Prometheus configuration documentation.

Set input variables

Open terraform.tfvars in your code editor, and update its values:

Set aws_region to the same region as your existing Terraform Enterprise instance. Prometheus will search for your Terraform Enterprise instance within the specified region.
Set tfe_tag_name to the name tag of your Terraform Enterprise instance. Prometheus will identify the instance based on the name tag.

# Update this value to the same region as your existing Terraform Enterprise instance.
aws_region = "example_region"

# Update this value to match your existing Terraform Enterprise 'name' tag value.
tfe_tag_name = "example-tag-name"

Deploy Prometheus and Grafana

In your terminal, initialize your configuration.

$ terraform init

Initializing the backend...

Initializing provider plugins...
- Finding hashicorp/aws versions matching "~> 4.4.0"...
- Installing hashicorp/aws v4.4.0...
- Installed hashicorp/aws v4.4.0 (signed by HashiCorp)

Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.

Now, apply the configuration to create your infrastructure. Respond yes to the prompt to confirm the operation.

$ terraform apply

##...

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

Plan: 5 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + grafana_dashboard_url    = (known after apply)
  + prometheus_dashboard_url = (known after apply)

##...
aws_iam_role.prometheus_aws_iam_role: Creation complete after 1s [id=prometheus_aws_iam_role]
aws_iam_role_policy.prometheus_iam_role_policy: Creation complete after 1s [id=prometheus_aws_iam_role:prometheus_iam_role_policy]
aws_iam_instance_profile.prometheus_iam_instance_profile: Creation complete after 1s [id=prometheus_iam_instance_profile]
aws_security_group.prometheus: Creation complete after 3s [id=sg-0740d374a0bffff55]
##...
aws_instance.prometheus_instance: Creation complete after 1m11s [id=i-0e2e4ee0249363d87]

Apply complete! Resources: 5 added, 0 changed, 0 destroyed.

Outputs:

grafana_dashboard_url = "http://54.152.163.141:3000"
prometheus_dashboard_url = "http://54.152.163.141:9090/graph"

Verify Prometheus and Grafana instance

Navigate to the endpoints in your Terraform output values to verify your deployment.

Note

Allow a few minutes for Prometheus and Grafana to complete the setup. Both services will report as unresponsive until this setup is complete.

First, visit the Prometheus dashboard. Copy the prometheus_dashboard_url output value into your browser, leaving off the double quotes.

The default Prometheus screen with no active query

Then, visit the Grafana dashboard. Copy the grafana_dashboard_url output value into your browser, excluding the double quotes. Log in with the default username and password: admin and admin. You will be prompted to create a new password. Click Skip below the input fields.

The Grafana login page showing no username or password

View metrics in Prometheus

In your Prometheus dashboard, type tfe in the expression field. If Prometheus has successfully scraped the Terraform Enterprise metrics endpoint, Prometheus will display a list of available Terraform Enterprise metrics.

Search for tfe_container_cpu_usage_kernel_ns and press the Execute button. The table and graph tabs now display Terraform Enterprise’s CPU usage metrics.

An example Prometheus query showing tfe_container_cpu_usage_kernel_ns states, with a graph of results below

Terraform Enterprise runs take place in ephemeral containers which can make metrics difficult to understand and parse. Terraform Enterprise metric labels such as run_type, run_id, workspace_name and organization_name can help you identify resource-heavy workspaces or runs.

Paste the following query in the expression field. Replace ORG_NAME with the name of your Terraform Enterprise organization. This query will group memory usage by run_type for the specified organization name. The results will help you identify workspaces that may be using more memory than expected during their Terraform runs. You can then troubleshoot the related Terraform configuration.

sum by (run_type)(tfe_container_memory_used_bytes{organization_name="ORG_NAME"}) / 1024 / 1024

The above query with accompanying graph showing stats over time

Display the CPU usage per Terraform Enterprise container by entering the following query.

sum by (name)((rate(tfe_container_cpu_usage_kernel_ns{run_type=""}[1m]) + rate(tfe_container_cpu_usage_user_ns{run_type=""}[1m])) / 1e7)

Display the memory usage per Terraform Enterprise container by entering the following query.

sum by (name)(tfe_container_memory_used_bytes{run_type=""}) / 1024 / 1024

You can learn more about creating your own Prometheus queries in the Prometheus documentation.

View metrics in Grafana

Grafana is an open-source tool that interactively displays metrics from many different platforms. It provides charts, graphs, and alerts when connected to supported data sources like Prometheus.

Go to your Grafana dashboard and add the Prometheus application as a data source for Grafana.

Click on the cogwheel in the left sidebar to open the Configuration menu.
Click on Data Sources.
Click on Add data source.
Select Prometheus as the type.
Set the appropriate prometheus_dashboard_url endpoint returned in your Terraform outputs without the /graph path (i.e. http://<IP>:9090). Click Save & Test to save the new data source.

The configuration screen for the Prometheus data source, with a highlight box around the URL field showing the URL of the Prometheus server

Now add a sample dashboard that is maintained by the Hashicorp Terraform engineering team. The dashboard contains several panels that are populated with Terraform Enterprise metrics.

Hover over the Dashboards icon (the icon with four squares) in the left sidebar
Click on + Import
Paste the ID 15630 into the import field and click the Load button next to the field
On the next screen, select your Prometheus data source from the drop down menu and click the Import button

The Grafana dashboard import screen showing the Terraform Enterprise Grafana dashboard being imported and pointing to the Prometheus data source

The dashboard loads and displays scraped metrics, which show you how your Terraform Enterprise instance is performing. The dashboard also displays usage data, including the number of current plans and applys, the number of run per workspace, and runs per organization. You can use this data to optimize your Terraform Enterprise instance's performance.

Note

By default, graphs in this dashboard are created by grouping values in 10-second intervals. If you see multiple panels reporting "No data", increase the rangeInterval by using the drop-down menu at the top of the screen.

For example, by default Terraform Enterprise executes up to 10 parallel Terraform runs at a time. If you attempt to execute more than 10 runs at a time, Terraform Enterprise will queue them until it has finished executing previous runs. If you frequently see a large number of runs in the "Runs In Queue" Grafana dashboard panel it may indicate that your current capacity settings should be optimized. To reduce wait times, you could increase the number of parallel runs in the capacity and concurrency settings menu.

You could also monitor which workspaces or organizations execute the most runs by monitoring the current run breakdown per organization and top run counts per workspace and adjust their user workload for better distribution of tasks.

A section of the imported Grafana dashboard showing the top run counts per workspace, current run breakdown per organization, runes in queue, runs breakdown, and run completion rate per minute

The dashboard also shows per-container resource consumption, including memory consumption, disk I/O read and write, and container network ingress/egress. Terraform Enterprise runs take place in Docker containers that only exist during the duration of the run. If your Terraform configuration causes a run to use too much memory, disk I/O, or transmit too much data, this dashboard will help you identify it and determine if the problem is caused by inefficient Terraform configuration, or if you need to increase the capacity of the run containers.

A section of the imported Grafana dashboard showing multiple line graphs, such as the container CPU usage, container memory usage, disk I/O read, disk I/O write and network traffic

You can also customize the dashboard by following the Grafana documentation.

Destroy Prometheus and Grafana

Now that you have completed this tutorial, destroy the demonstration instance hosting Prometheus and Grafana. Use production deployments of Prometheus and Grafana for official monitoring and reporting of scraped metrics.

Destroy the instance and respond yes to the prompt to confirm the operation.

$ terraform destroy
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  - destroy

Terraform will perform the following actions:
##...
Plan: 0 to add, 0 to change, 5 to destroy.

Changes to Outputs:
  - grafana_dashboard_url    = "http://54.152.163.141:3000" -> null
  - prometheus_dashboard_url = "http://54.152.163.141:9090/graph" -> null

Do you really want to destroy all resources?
  Terraform will destroy all your managed infrastructure, as shown above.
  There is no undo. Only 'yes' will be accepted to confirm.

  Enter a value: yes

##...
Destroy complete! Resources: 5 destroyed.

Next steps

In this tutorial, you used Prometheus and Grafana to display scraped metrics from Terraform Enterprise. You enabled the Terraform Enterprise metrics endpoint in the admin dashboard, set up Prometheus to scrape emitted data, set up Grafana to import the scraped data from Prometheus, and imported a sample dashboard to visualize the data.

Learn more about Prometheus and Grafana:

Visit the Datadog log forwarding Learn pages to learn more about Terraform Enterprise log forwarding.
Review the Prometheus Documentation to learn more about different configuration possibilities.
Review the Grafana Documentation to learn more about interacting with the prometheus data source within Grafana.
Refer to the Monitoring a Terraform Enterprise Instance documentation to learn about the metrics available for your version of Terraform Enterprise.
Visit the Terraform Enterprise Documentation pages to learn more about capacity and concurrency settings for your Terraform Enterprise instance.

Use hcdiag

Next Collection

CDK for Terraform