Prometheus with Alertmanager hosted on AWS Fargate.

Simon Bulmer
11 min read · Jun 15, 2020


Hopefully, after reading this article you should be able to set up Prometheus in a Docker container, using Node Exporter and the ECS exporter to scrape metrics.

In an attempt to do the above I made many mistakes and this is by no means a complete guide, only my experience in building a working Prometheus environment hosted using AWS Fargate.

Creating the Docker images

For this build, I created two separate images and pushed them to two separate ECRs. The assumption is that you have already created two repositories in AWS Elastic Container Registry, called alertmanager and prometheus, for the images to be stored in.

Create a working folder called Prometheus, followed by a subfolder called Docker. Inside Docker, a further two folders will need to be created, called prometheus and alertmanager.
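As a sketch, the whole tree can be created in one command from a POSIX shell:

```shell
# Create the Prometheus working folder with the two Docker build contexts
mkdir -p Prometheus/Docker/alertmanager Prometheus/Docker/prometheus
```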

In the alertmanager folder you will need to create a file called Dockerfile with the contents:

FROM prom/alertmanager:latest
ADD alertmanager.yml /etc/alertmanager/

I also created a very simple route listening for an alert rule called ExporterDown in a file called alertmanager.yml:

global:
  resolve_timeout: 5m
route:
  group_by: ['instance','severity']
  routes:
    - match:
        alertname: ExporterDown
      receiver: 'pushover'
receivers:
  - name: 'pushover'

Please note that at the time of writing this the alerting is still in its infancy and a work in progress. I do intend on linking to PagerDuty but need to ensure all required metrics are being scraped before continuing.

In the empty prometheus folder you will again create a file called Dockerfile with the contents:

FROM prom/prometheus:v2.14.0
ADD prometheus.yml /etc/prometheus/
ADD first_rules.yml /etc/prometheus/
ADD targets/ /etc/prometheus/targets/

Then create a file called prometheus.yml:

global:
  scrape_interval: 5s
  evaluation_interval: 10s
rule_files:
  - "first_rules.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    ec2_sd_configs:
      - port: 9100
        region: 'us-east-1'
        profile: '<enter profile name>'
    relabel_configs:
      - source_labels: [__meta_ec2_tag_service_name]
        action: keep
        regex: '<enter regex string>'
      - source_labels: [__meta_ec2_tag_name]
        target_label: instance
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance_id
  - job_name: 'ecs-exporter'
    static_configs:
      - targets: ['localhost:9222']
  - job_name: 'jmx-exporter'
    ec2_sd_configs:
      - port: 9404
        region: 'us-east-1'
        profile: '<enter profile name>'
    relabel_configs:
      - source_labels: [__meta_ec2_tag_service_name]
        action: keep
        regex: '<enter regex string>'
      - source_labels: [__meta_ec2_tag_name]
        target_label: name
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance_id
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

Breaking away from the above for just a second, you will have noticed a few entries in angle brackets. These entries are specific to the AWS profile I have set on my system and the environment I am scraping metrics from. I use a tool called aws-runas when working with AWS, due to the number of accounts I switch between.
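For reference — and this is only an illustrative sketch, with the profile name, account ID and role name as placeholder assumptions — a profile entry in ~/.aws/config might look like:

```ini
[profile <name of account>]
role_arn       = arn:aws:iam::<accountID>:role/<role name>
source_profile = default
region         = us-east-1
```

The profile name is what you pass to aws-runas and what goes in the `profile` field of the ec2_sd_configs above.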

Create a file called first_rules.yml; as you can see, it is referenced in the above prometheus.yml file. I currently have five rules set in this file:

groups:
  - name: alert
    rules:
      - alert: PrometheusNotConnectedToAlertManager
        expr: prometheus_notifications_alertmanagers_discovered < 1
        for: 5m
        labels:
          severity: error
        annotations:
          summary: "Prometheus is not connected to an alertmanager (instance {{ $labels.instance }})"
          description: "Prometheus cannot connect to the alertmanager\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: ExporterDown
        expr: up{job="node-exporter"} == 0
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Exporter down (instance {{ $labels.instance }})"
          description: "Prometheus exporter down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: OutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Out of memory (instance {{ $labels.instance }})"
          description: "Node memory is filling up (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: OutOfDiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/rootfs"} * 100) / node_filesystem_size_bytes{mountpoint="/rootfs"} < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Out of disk space (instance {{ $labels.instance }})"
          description: "Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HighCpuLoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load (instance {{ $labels.instance }})"
          description: "CPU load is > 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

And that is pretty much it for the image creation. Next, build and push the two images to Elastic Container Registry (ECR).

Building and pushing to Elastic Container Registry

The folder structure should look like the following:

Prometheus
  Docker
    alertmanager
    prometheus

Before you carry out a Docker pull or push against ECR in AWS, you need to grab the credentials for authentication. I use the following:

aws-runas <name of account> aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <accountID>.dkr.ecr.us-east-1.amazonaws.com

where the account profile on my system is contained within the ~/.aws/config file and named accordingly.

In order to build the alertmanager image, navigate to Prometheus → Docker → alertmanager, then:

docker build -t alertmanager .

Tag the image accordingly:

docker tag alertmanager:latest <account>.dkr.ecr.<region>.amazonaws.com/alertmanager:latest

where the account is your AWS account number and region is the region you are using.

And finally, push the tagged image to the repository:

docker push <account>.dkr.ecr.<region>.amazonaws.com/alertmanager:latest

Navigate to Prometheus → Docker → prometheus and do the same as above:

docker build -t prometheus .
docker tag prometheus:latest <account>.dkr.ecr.<region>.amazonaws.com/prometheus:latest
docker push <account>.dkr.ecr.<region>.amazonaws.com/prometheus:latest

The above two images should be present in their respective repositories within ECR and ready to use.
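If you want to confirm the pushes landed, the ECR API can list the images in each repository (shown wrapped in aws-runas, as before; the account name is a placeholder):

```
aws-runas <name of account> aws ecr list-images --repository-name prometheus --region us-east-1
aws-runas <name of account> aws ecr list-images --repository-name alertmanager --region us-east-1
```

Each command should return an imageIds list containing the latest tag.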

Creating the ECS module

So far we have created the images and pushed them to the Elastic Container Registry in an AWS account. We now need to move on to the module creation and then, finally, the deployment. All Terraform code mentioned in this article uses version 0.12.9.

To keep things neat, navigate back to your Prometheus folder and create a new folder called modules. Within modules, create another folder called ecs. Once inside the ecs folder, create main.tf containing the following:

/*==== IAM roles =====*/
resource "aws_iam_role" "ecs-service-role" {
  name               = "${var.tags["environment"]}-ecs-service-role"
  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "ecs.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}

resource "aws_iam_role" "ecs-instance-role" {
  name               = "${var.tags["environment"]}-ecs-instance-role"
  path               = "/"
  assume_role_policy = "${data.aws_iam_policy_document.ecs-instance-policy.json}"
}

data "aws_iam_policy_document" "ecs-instance-policy" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role_policy_attachment" "ecs-instance-role-attachment" {
  role       = "${aws_iam_role.ecs-instance-role.name}"
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
}

resource "aws_iam_instance_profile" "ecs-instance-profile" {
  name = "${var.env["environment"]}-ecs-instance-profile"
  path = "/"
  role = "${aws_iam_role.ecs-instance-role.id}"
  provisioner "local-exec" {
    command = "sleep 10"
  }
}

/*==== Create IAM Task Definition Role =====*/
data "aws_iam_policy_document" "ecs-service-policy" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_policy" "policy" {
  name   = "${var.env["environment"]}-ecs-exporter"
  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Action": [
        "ecs:ListServices",
        "ecs:ListContainerInstances",
        "ecs:ListClusters",
        "ecs:DescribeServices",
        "ecs:DescribeContainerInstances",
        "ecs:DescribeClusters"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}

resource "aws_iam_role" "task-definition-role" {
  name               = "${var.env["environment"]}-task-definition"
  assume_role_policy = "${data.aws_iam_policy_document.ecs-service-policy.json}"
}

resource "aws_iam_role_policy_attachment" "ecs-exporter" {
  role       = "${aws_iam_role.task-definition-role.name}"
  policy_arn = "${aws_iam_policy.policy.arn}"
}

resource "aws_iam_role_policy_attachment" "ecs-service-role-attachment" {
  role       = "${aws_iam_role.task-definition-role.name}"
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

resource "aws_iam_role_policy_attachment" "prometheus-read-access" {
  role       = "${aws_iam_role.task-definition-role.name}"
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ReadOnlyAccess"
}

/*==== ECS Fargate Cluster =====*/
resource "aws_ecs_cluster" "main" {
  name = "${var.env["environment"]}"
}

/*==== CloudWatch =====*/
resource "aws_cloudwatch_log_group" "logs" {
  name              = "/${var.env["environment"]}/prometheus"
  retention_in_days = "7"
  tags              = "${merge(map("Name", format("%s-cloudwatch", var.tags["environment"])), var.tags)}"
}

resource "aws_cloudwatch_log_group" "alertmanager" {
  name              = "/${var.env["environment"]}/alertmanager"
  retention_in_days = "7"
  tags              = "${merge(map("Name", format("%s-cloudwatch", var.tags["environment"])), var.tags)}"
}

Create another file called outputs.tf:

output "ecs_cluster" {
  value = "${aws_ecs_cluster.main.name}"
}

output "cluster_id" {
  value = "${aws_ecs_cluster.main.id}"
}

output "task_definition_role_arn" {
  value = "${aws_iam_role.task-definition-role.arn}"
}

And finally, create a file called vars.tf:

variable "env" { type = "map" }
variable "tags" { type = "map" }
variable "vpc_id" {}

Creating the Security Group module

Move up a level in the folder structure and create a folder called security-groups. Create a file called main.tf with the following contents:

resource "aws_security_group" "alb-security-group" {
  name        = "${var.env["environment"]}-alb"
  description = "ALB security group for environment ${var.tags["environment"]}"
  vpc_id      = "${var.vpc_id}"
  ingress {
    from_port   = 9090
    to_port     = 9090
    protocol    = "TCP"
    cidr_blocks = ["<enter your cidr block>"]
    description = "Prometheus access"
  }
  ingress {
    from_port   = 9093
    to_port     = 9093
    protocol    = "TCP"
    cidr_blocks = ["<enter your cidr block>"]
    description = "AlertManager access"
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
  lifecycle {
    create_before_destroy = true
  }
  tags = "${merge(map("Name", format("%s-alb", var.tags["environment"])), var.tags,
    map("service-name", format("%s-security-group", var.tags["environment"])),
    map("service-type", "security-group"))}"
}

resource "aws_security_group" "worker-security-group" {
  name        = "${var.env["environment"]}-worker"
  description = "Worker security group for environment ${var.tags["environment"]}"
  vpc_id      = "${var.vpc_id}"
  ingress {
    from_port       = 9090
    to_port         = 9090
    protocol        = "TCP"
    security_groups = ["${aws_security_group.alb-security-group.id}"]
    description     = "Load balancer to Prometheus"
  }
  ingress {
    from_port       = 9093
    to_port         = 9093
    protocol        = "TCP"
    security_groups = ["${aws_security_group.alb-security-group.id}"]
    description     = "Load balancer to AlertManager"
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
  lifecycle {
    create_before_destroy = true
  }
  tags = "${merge(map("Name", format("%s-worker", var.tags["environment"])), var.tags,
    map("service-name", format("%s-security-group", var.tags["environment"])),
    map("service-type", "security-group"))}"
}

Create another file called outputs.tf:

output "alb-security-group" { value = "${aws_security_group.alb-security-group.name}" }
output "alb-security-group-id" { value = "${aws_security_group.alb-security-group.id}" }
output "worker-security-group" { value = "${aws_security_group.worker-security-group.name}" }
output "worker-security-group-id" { value = "${aws_security_group.worker-security-group.id}" }

And finally, create a file called vars.tf:

variable "vpc_id" {}
variable "tags" { type = "map" }
variable "env" { type = "map" }

Creating the application module

This is the final module required by the build, containing the Terraform files plus an additional templates subfolder for the task definition. In the modules folder, create app, then a subfolder called templates. At this stage your folder structure should look like this:

Prometheus
  Docker
    alertmanager
    prometheus
  modules
    ecs
    security-groups
    app
      templates

Starting with the app folder, create a file called main.tf:

data "aws_subnet_ids" "private" {
  vpc_id = "${var.vpc_id}"
  tags = {
    network = "private"
  }
}

data "aws_subnet_ids" "public" {
  vpc_id = "${var.vpc_id}"
  tags = {
    network = "public"
  }
}

/*==== Create ECS Task Definition =====*/
data "template_file" "container_definition" {
  template = "${file("${path.module}/templates/task_definition.json")}"
  vars = {
    app_name       = local.app.app_name
    name           = "${var.env["environment"]}-${local.app.app_name}"
    prom_image     = local.app.prom_image
    alrt_image     = local.app.alrt_image
    app_cpu        = local.app.app_cpu
    app_memory     = local.app.app_memory
    awslogs-group  = "${var.env["environment"]}"
    awslogs-region = "${var.env["region"]}"
  }
}

resource "aws_ecs_task_definition" "app" {
  family                   = "${var.env["environment"]}-${var.app["app_name"]}"
  network_mode             = "${var.network_mode}"
  requires_compatibilities = ["FARGATE"]
  cpu                      = "${var.app["fargate_cpu"]}"
  memory                   = "${var.app["fargate_memory"]}"
  execution_role_arn       = "${var.task-definition-role-arn}"
  task_role_arn            = "${var.task-definition-role-arn}"
  container_definitions    = "${data.template_file.container_definition.rendered}"
}

/*==== Create ECS Service =====*/
resource "aws_ecs_service" "main" {
  name            = "${var.env["environment"]}-${var.app["app_name"]}"
  cluster         = "${var.ecs_cluster}"
  task_definition = "${aws_ecs_task_definition.app.family}:${aws_ecs_task_definition.app.revision}"
  desired_count   = "${var.app_count}"
  launch_type     = "FARGATE"
  network_configuration {
    security_groups = ["${var.worker-security-group-id}"]
    subnets         = "${data.aws_subnet_ids.private.ids}"
  }
  load_balancer {
    target_group_arn = "${aws_alb_target_group.server.arn}"
    container_name   = "${var.env["environment"]}-${var.app["app_name"]}"
    container_port   = 9090
  }
  load_balancer {
    target_group_arn = "${aws_alb_target_group.alertmanager.arn}"
    container_name   = "alertmanager"
    container_port   = 9093
  }
  depends_on = [
    "aws_alb_listener.server",
  ]
}

/*==== Create Route53 record =====*/
data "aws_route53_zone" "hosted_zone" {
  name = "${var.hosted_zone}"
}

resource "aws_route53_record" "frontend_alb_r53" {
  name       = "${var.app["route53_name"]}"
  depends_on = ["aws_alb.main"]
  zone_id    = "${data.aws_route53_zone.hosted_zone.zone_id}"
  type       = "A"
  alias {
    name                   = "${aws_alb.main.dns_name}"
    zone_id                = "${aws_alb.main.zone_id}"
    evaluate_target_health = true
  }
  lifecycle {
    create_before_destroy = true
  }
}

/*==== Create Application Load Balancer =====*/
resource "aws_alb" "main" {
  name               = "${var.env["environment"]}-${var.app["app_name"]}"
  load_balancer_type = "application"
  subnets            = "${data.aws_subnet_ids.public.ids}"
  security_groups    = ["${var.alb-security-group-id}"]
  tags               = "${merge(map("Name", format("%s-${var.app["app_name"]}-app-alb", var.tags["environment"])), var.tags)}"
}

resource "aws_alb_target_group" "server" {
  name        = "${var.env["environment"]}-${var.app["app_name"]}"
  port        = "${var.app["app_port"]}"
  protocol    = "HTTP"
  target_type = "ip"
  vpc_id      = "${var.vpc_id}"
  health_check {
    matcher = "200,302"
  }
}

resource "aws_alb_target_group" "alertmanager" {
  name        = "${var.env["environment"]}-alrtmgr"
  port        = 9093
  protocol    = "HTTP"
  target_type = "ip"
  vpc_id      = "${var.vpc_id}"
  health_check {
    matcher = "200,302"
  }
}

resource "aws_alb_listener" "server" {
  load_balancer_arn = "${aws_alb.main.id}"
  port              = "${var.app["app_port"]}"
  protocol          = "HTTP"
  default_action {
    target_group_arn = "${aws_alb_target_group.server.id}"
    type             = "forward"
  }
}

resource "aws_alb_listener" "node" {
  load_balancer_arn = "${aws_alb.main.id}"
  port              = 9100
  protocol          = "HTTP"
  default_action {
    target_group_arn = "${aws_alb_target_group.server.id}"
    type             = "forward"
  }
}

resource "aws_alb_listener" "alertmanager" {
  load_balancer_arn = "${aws_alb.main.id}"
  port              = 9093
  protocol          = "HTTP"
  default_action {
    target_group_arn = "${aws_alb_target_group.alertmanager.id}"
    type             = "forward"
  }
}

and finally the vars.tf:

variable "env" { type = "map" }
variable "tags" { type = "map" }
variable "vpc_id" {}
variable "app" { type = "map" }
variable "app_port" {}
variable "ecs_cluster" {}
variable "app_count" {}
variable "worker-security-group-id" {}
variable "alb-security-group-id" {}
variable "task-definition-role-arn" {}
variable "fargate_cpu" {}
variable "fargate_memory" {}
variable "network_mode" {}
variable "hosted_zone" {}

locals {
  defaults = {
    ecr          = ""
    route53      = ""
    prom_image   = "<account>.dkr.ecr.<region>.amazonaws.com/prometheus:latest"
    alrt_image   = "<account>.dkr.ecr.<region>.amazonaws.com/alertmanager:latest"
    app_port     = 9090
    app_cpu      = 1024
    app_memory   = 2056
    hosted_zone  = "<enter hosted zone name>"
    network_mode = "awsvpc"
    app_name     = "server"
  }
  app = merge(
    local.defaults,
    var.app
  )
}

Moving on to the final piece of the app module: the task definition. In the templates subfolder, create a file called task_definition.json:

[
  {
    "essential": true,
    "cpu": 128,
    "image": "${prom_image}",
    "memory": 128,
    "name": "${name}",
    "networkMode": "awsvpc",
    "portMappings": [
      {
        "containerPort": 9090,
        "hostPort": 9090,
        "protocol": "tcp"
      }
    ],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/${awslogs-group}/prometheus",
        "awslogs-region": "${awslogs-region}",
        "awslogs-stream-prefix": "prometheus"
      }
    }
  },
  {
    "essential": true,
    "cpu": 128,
    "image": "${alrt_image}",
    "memory": 64,
    "name": "alertmanager",
    "networkMode": "awsvpc",
    "portMappings": [
      {
        "containerPort": 9093,
        "hostPort": 9093,
        "protocol": "tcp"
      }
    ],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/${awslogs-group}/alertmanager",
        "awslogs-region": "${awslogs-region}",
        "awslogs-stream-prefix": "alertmanager"
      }
    }
  },
  {
    "essential": false,
    "cpu": 10,
    "image": "coveo/ecs-exporter",
    "memory": 64,
    "name": "ecs-exporter",
    "networkMode": "awsvpc",
    "command": ["-aws.region=<enter region>"],
    "portMappings": [
      {
        "containerPort": 9222,
        "hostPort": 9222,
        "protocol": "tcp"
      }
    ]
  }
]

Bring the build together

The app module is now complete and, at the moment, uses a pre-built exporter image from https://hub.docker.com/r/coveo/ecs-exporter. The last part of the Prometheus deployment brings all the modules together. Create a folder called infrastructure in the top-level Prometheus location, with the final structure looking like this:

Prometheus
  infrastructure
  Docker
    alertmanager
    prometheus
  modules
    ecs
    security-groups
    app
      templates

Create a file called main.tf:

module "security-groups" {
  source = "../modules/security-groups"
  env    = "${var.env}"
  tags   = "${var.tags}"
  vpc_id = "${var.vpc_id}"
}

module "ecs" {
  source = "../modules/ecs"
  env    = "${var.env}"
  tags   = "${var.tags}"
  vpc_id = "${var.vpc_id}"
}

module "prometheus" {
  source                   = "../modules/app"
  app                      = "${var.prometheus}"
  env                      = "${var.env}"
  tags                     = "${var.tags}"
  vpc_id                   = "${var.vpc_id}"
  fargate_cpu              = "${var.prometheus["fargate_cpu"]}"
  fargate_memory           = "${var.prometheus["fargate_memory"]}"
  app_count                = "${var.prometheus["app_count"]}"
  app_port                 = "${var.prometheus["app_port"]}"
  network_mode             = "${var.prometheus["network_mode"]}"
  hosted_zone              = "${var.prometheus["hosted_zone"]}"
  ecs_cluster              = "${module.ecs.ecs_cluster}"
  task-definition-role-arn = "${module.ecs.task_definition_role_arn}"
  alb-security-group-id    = "${module.security-groups.alb-security-group-id}"
  worker-security-group-id = "${module.security-groups.worker-security-group-id}"
}

and the variables.tf as follows:

variable "env" {
  type = map(string)
  default = {
    environment = "prometheus-test"
    app         = "prometheus"
    region      = "<enter region>"
  }
}

variable "tags" {
  type = map(string)
  default = {
    name        = "prometheus-test"
    terraform   = "0.12.9"
    description = "Prometheus - built by me"
  }
}

variable "prometheus" {
  type = map(string)
  default = {
    prom_image     = "<account>.dkr.ecr.<region>.amazonaws.com/prometheus:latest"
    alrt_image     = "<account>.dkr.ecr.<region>.amazonaws.com/alertmanager:latest"
    app_name       = "server"
    route53_name   = "prometheus"
    app_port       = 9090
    fargate_cpu    = 1024
    fargate_memory = 2048
    app_count      = 1
    hosted_zone    = "<hosted_zone name>"
  }
}

variable "vpc_id" { default = "<VPC>" }
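With all the pieces in place, the deployment itself is the standard Terraform workflow, run from the infrastructure folder (again wrapped in aws-runas to pick up credentials; the account name is a placeholder):

```
aws-runas <name of account> terraform init
aws-runas <name of account> terraform plan
aws-runas <name of account> terraform apply
```

Review the plan output carefully before applying: the apply creates the IAM roles, security groups, load balancer, ECS cluster and service described above.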

Final thoughts

The above build isn't perfect by any means, but it works, and I've used it with some tweaks quite successfully in non-prod and production environments. I've been meaning to write this document for some time, and a few things have changed in recent Terraform versions that make it easier to achieve the same goals, which I plan on adopting in the future.

The snippets of code that require your own values are marked with angle-bracket placeholders, and some of the values can be tweaked further to suit your needs. The jmx-exporter entry has been left in the prometheus.yml file, and before you can successfully scrape from the Node Exporter you need to allow the Prometheus worker security group in on port 9100 on each instance you wish to monitor.
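As an illustrative sketch only (the monitored instance's security group ID is an assumption you must supply), that ingress rule could be expressed in the same Terraform style as the rest of the build:

```
resource "aws_security_group_rule" "node-exporter-scrape" {
  type                     = "ingress"
  from_port                = 9100
  to_port                  = 9100
  protocol                 = "tcp"
  security_group_id        = "<monitored instance security group id>"
  source_security_group_id = "${aws_security_group.worker-security-group.id}"
  description              = "Prometheus worker to Node Exporter"
}
```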

Thank you for taking the time to read this document.


Simon Bulmer

Senior Site Reliability Engineer, Cyclist and occasional runner