GPU Metrics Exporter — Explained

A plain-English guide to what this project does, how data flows, and when to use it.

Ansible Fundamentals & How It Relates

Ansible is an automation tool that configures remote servers (or "hosts") by connecting over SSH and running tasks defined in YAML. No agent is installed on targets — Ansible pushes modules, runs them, and collects results.

Key Concepts

  • Inventory: the list of managed hosts and their variables (e.g. inventory/hosts.yml)
  • Playbook: a YAML file describing which tasks and roles run on which hosts (e.g. playbooks/gpu_metrics.yml)
  • Role: a reusable bundle of tasks, templates, and defaults; this project ships four of them
  • Task: a single action, such as installing a package, templating a config file, or starting a service
  • Vault: encrypted storage for secrets such as remote-write credentials (hence --ask-vault-pass)

How This Project Uses Ansible

This repo is Ansible-based: a playbook runs four roles in sequence on each GPU node. Ansible deploys exporters, Alloy config, and systemd services — no manual SSH commands needed. You run one command; every host gets the full monitoring stack.
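
For orientation, a playbook of this shape might look like the sketch below. The four role names come from the deployment flow later in this document; the actual contents of playbooks/gpu_metrics.yml may differ.

# Illustrative sketch only; not the verbatim contents of playbooks/gpu_metrics.yml
- name: Deploy GPU monitoring stack
  hosts: all
  become: true
  roles:
    - nvidia_driver_check   # fail early if nvidia-smi / the driver is missing
    - gpu_exporter          # dcgm-exporter or nvidia_gpu_exporter, auto-selected
    - node_exporter         # host CPU, memory, disk, network metrics
    - alloy_agent           # scrape, label, remote-write to Mimir/Loki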

What Is This?

GPU Metrics Exporter is an Ansible-based deployment system that:

  1. Collects NVIDIA GPU metrics (utilization, memory, temp, power) and host metrics (CPU, RAM, disk, network)
  2. Forwards them to a central stack (Grafana Alloy → Mimir for metrics, Loki for logs)
  3. Visualizes them in Grafana for monitoring, capacity planning, and troubleshooting

Think of it as: automated deployment + monitoring for GPU fleets.

Simulation 1: High-Level Data Flow

┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│                               GPU NODE (each server with GPUs)                               │
│                                                                                              │
│   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────────────────────────┐   │
│   │ nvidia-smi  │   │   NVIDIA    │   │    Host     │   │                                 │   │
│   │   (drivers) │──▶│  GPU Info   │   │   Metrics   │   │   Grafana Alloy Agent           │   │
│   └─────────────┘   └──────┬──────┘   │ CPU,RAM,    │   │   (scrapes, adds labels,        │   │
│                            │          │ disk, net   │   │    remote writes)               │   │
│   ┌────────────────────────┼──────────┼──────┬──────┼───┤                                 │   │
│   │  dcgm-exporter OR      │          │      │      │   │   instance, datacenter,         │   │
│   │  nvidia_gpu_exporter   │◀─────────┘      │      │   │   cluster, exporter_mode       │   │
│   │  (port 9400)           │                 │      │   │                                 │   │
│   │  node_exporter         │◀────────────────┘      │   │                                 │   │
│   │  (port 9100)           │                         │   │                                 │   │
│   └───────────────────────┼─────────────────────────┘   └─────────────────────────────────┘   │
└───────────────────────────┼───────────────────────────────────────────────────────────────────┘
                            │  Remote Write (HTTPS)
                            ▼
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│   Grafana Alloy ──▶ Mimir (metrics)   │   Loki (logs: XID errors, Docker)   │   Grafana      │
└─────────────────────────────────────────────────────────────────────────────────────────────┘

Example: Verify metrics on a node

# On a deployed GPU node — check GPU metrics endpoint
curl http://127.0.0.1:9400/metrics | head -30

# Check node (host) metrics
curl http://127.0.0.1:9100/metrics | head -20
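
If the endpoints answer, the services behind them can be checked as well. The systemd unit names below are assumptions (they depend on how the roles name their services), and DCGM_FI_DEV_GPU_UTIL only appears in dcgm-exporter mode; adjust as needed.

# Check the exporter and Alloy services (unit names assumed)
systemctl status node_exporter --no-pager
systemctl status alloy --no-pager

# Confirm a GPU utilization metric is exposed (dcgm-exporter mode)
curl -s http://127.0.0.1:9400/metrics | grep -m 5 DCGM_FI_DEV_GPU_UTIL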

Simulation 2: Deployment Flow (Ansible Roles)

                    ┌─────────────────────┐
                    │   You run Ansible   │
                    │   playbook (push    │
                    │   or pull mode)     │
                    └──────────┬──────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────────────┐
│  ROLE 1: nvidia_driver_check                                          │
│  • Checks nvidia-smi exists  • Counts GPUs  • Fails early if no driver │
└──────────────────────────┬───────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────────┐
│  ROLE 2: gpu_exporter (auto-selects)                                  │
│  Container? → nvidia_gpu_exporter  │  Bare-metal/VM? → dcgm-exporter   │
└──────────────────────────┬───────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────────┐
│  ROLE 3: node_exporter                                                │
│  • CPU, memory, disk, network metrics                                 │
└──────────────────────────┬───────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────────┐
│  ROLE 4: alloy_agent                                                  │
│  • Scrape GPU + node  • Add labels  • Remote write to Mimir/Loki      │
└──────────────────────────┬───────────────────────────────────────────┘
                           │
                           ▼
                    ┌──────────────┐
                    │  All done!   │
                    └──────────────┘

Example: Deploy with Ansible

# Full deployment (push mode)
ansible-playbook -i inventory/hosts.yml playbooks/gpu_metrics.yml --ask-vault-pass

# Dry run — preview changes only
ansible-playbook -i inventory/hosts.yml playbooks/gpu_metrics.yml --check --diff

# Deploy to one host only
ansible-playbook -i inventory/hosts.yml playbooks/gpu_metrics.yml --limit gpu-node-001 --ask-vault-pass
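
Before a full run, it can help to confirm Ansible can reach every host and to preview which hosts a run would touch. These are standard Ansible commands, not project-specific ones.

# Check SSH connectivity to all hosts in the inventory
ansible -i inventory/hosts.yml all -m ping

# List the hosts a run would target (handy together with --limit)
ansible-playbook -i inventory/hosts.yml playbooks/gpu_metrics.yml --list-hosts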

Simulation 3: Exporter Selection Logic

                         ┌─────────────────────────┐
                         │ systemd-detect-virt     │
                         │ Check DMI, /.dockerenv  │
                         └────────────┬────────────┘
                                      │
              ┌───────────────────────┴───────────────────────┐
              │                                               │
              ▼                                               ▼
    ┌─────────────────┐                             ┌─────────────────┐
    │ CONTAINER HOST  │                             │ BARE-METAL / VM │
    │ docker, lxc,    │                             │ (none, kvm,     │
    │ podman, etc.    │                             │  vmware, etc.)  │
    └────────┬────────┘                             └────────┬────────┘
             │                                               │
             ▼                                               ▼
    ┌─────────────────┐                             ┌─────────────────┐
    │ nvidia_gpu_     │                             │ Docker installed│
    │ exporter        │                             │ and running?    │
    │ (binary)        │                             └────────┬────────┘
    └─────────────────┘                          ┌──────────┴──────────┐
                                                 ▼                     ▼
                                         ┌──────────────┐      ┌──────────────┐
                                         │ Yes          │      │ No → install │
                                         └──────┬───────┘      │ Docker+CTK   │
                                                │              └──────┬───────┘
                                                └──────────┬──────────┘
                                                           ▼
                                                  ┌─────────────────┐
                                                  │ dcgm-exporter   │
                                                  │ (container)     │
                                                  └─────────────────┘
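
Conceptually, the selection boils down to a container check you can reproduce by hand. The sketch below is a simplified manual equivalent, not the role's actual task code.

# Rough manual equivalent of the exporter selection (illustrative only)
if systemd-detect-virt --container --quiet || [ -f /.dockerenv ]; then
  echo "container host -> nvidia_gpu_exporter (binary)"
else
  echo "bare-metal/VM -> dcgm-exporter (container; needs Docker + NVIDIA Container Toolkit)"
fi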

Example: Inventory for different host types

# inventory/hosts.yml — bare-metal/VM (gets dcgm-exporter)
training_cluster:
  vars:
    datacenter: us-west
    cluster_name: training
  hosts:
    gpu-node-001:
      ansible_host: 10.0.1.1

# Container host (gets nvidia_gpu_exporter) — same playbook, auto-detects
container_hosts:
  hosts:
    nested-gpu-node:
      ansible_host: 192.168.1.10
      # Playbook detects container env → uses nvidia_gpu_exporter

Deployment Modes

Mode   How it runs                                              Best for
Push   You run ansible-playbook from a control machine          Centralized management, CI/CD, AWX
Pull   Each node runs ansible-pull (script or systemd timer)    Self-provisioning, air-gapped
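
In pull mode, each node fetches the repo and applies the playbook to itself with ansible-pull; the install script in the Quick Reference presumably wraps an invocation like the one below plus a schedule. The extra-vars reuse the datacenter and cluster_name variables from the inventory example; whether the playbook expects them this way is an assumption.

# Manual pull-mode run on a node (the install script presumably automates this)
ansible-pull -U https://github.com/YOUR_ORG/gpu-metrics-exporter.git \
  -i localhost, playbooks/gpu_metrics.yml \
  -e datacenter=us-west -e cluster_name=training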

Use Cases

Use case                   Description
AI/ML training fleet       Track GPU utilization, memory, power across clusters
Cost / capacity planning   Use utilization trends to add or decommission nodes
GPU fault detection        XID errors in Loki + alerts when GPUs fail
Inference / serving        Monitor TTS, STT, model-serving; combine GPU metrics + Docker logs
Multi-datacenter           Compare by datacenter and cluster labels
Onboarding new nodes       Add host to inventory, run playbook — done

Quick Reference

# Push: full deploy
ansible-playbook -i inventory/hosts.yml playbooks/gpu_metrics.yml --ask-vault-pass

# Pull: one-liner (replace YOUR_ORG, URLs, credentials)
curl -sSL https://raw.githubusercontent.com/YOUR_ORG/gpu-metrics-exporter/main/scripts/install-ansible-pull.sh | bash -s -- \
  --repo https://github.com/YOUR_ORG/gpu-metrics-exporter.git \
  --datacenter us-west --cluster training \
  --remote-write https://mimir.example.com/api/v1/push \
  --username writer --password secret

# Local testing (no real GPUs)
cd docker && docker compose up -d
# Grafana: http://localhost:3000 (admin/admin)
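
To confirm the local stack is actually up, standard Docker Compose and Grafana checks are enough; nothing below is specific to this repo.

# List the local containers and check that Grafana answers
docker compose ps
curl -s http://localhost:3000/api/health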

GPU Metrics Exporter — Ansible-based deployment for NVIDIA GPU monitoring.