A plain-English guide to what this project does, how data flows, and when to use it.
Ansible is an automation tool that configures remote servers (or "hosts") by connecting over SSH and running tasks defined in YAML. No agent is installed on targets — Ansible pushes modules, runs them, and collects results.
This repo is Ansible-based: a playbook runs four roles in sequence on each GPU node. Ansible deploys exporters, Alloy config, and systemd services — no manual SSH commands needed. You run one command; every host gets the full monitoring stack. Hosts are defined in an inventory file (inventory/hosts.yml) with IPs, groups, and variables; variables such as datacenter and alloy_remote_write_url can come from the inventory, group_vars, or the command line.
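For example, group-level variables could live in a group_vars file like this (a minimal sketch: the variable names come from this repo, while the file name and values are placeholders mirroring the inventory example further below):

```yaml
# group_vars/training_cluster.yml (hypothetical file; values are placeholders)
datacenter: us-west
cluster_name: training
alloy_remote_write_url: https://mimir.example.com/api/v1/push
```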
GPU Metrics Exporter is an Ansible-based deployment system that:

- checks that NVIDIA drivers are present on each node,
- installs the right GPU exporter for the environment (dcgm-exporter or nvidia_gpu_exporter),
- installs node_exporter for host-level metrics, and
- configures Grafana Alloy to scrape both exporters, label the metrics, and remote-write them to Mimir (with logs going to Loki).
Think of it as: automated deployment + monitoring for GPU fleets.
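To make that concrete, here is a minimal sketch of such a playbook. The role names match the four roles described below; the play structure itself is illustrative, not copied verbatim from the repo:

```yaml
# playbooks/gpu_metrics.yml (illustrative sketch, not the repo's exact file)
- name: Deploy GPU monitoring stack
  hosts: all
  become: true
  roles:
    - nvidia_driver_check   # fail early when no NVIDIA driver is present
    - gpu_exporter          # auto-selects dcgm-exporter or nvidia_gpu_exporter
    - node_exporter         # host CPU, memory, disk, network metrics
    - alloy_agent           # scrape, add labels, remote-write
```
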
```
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│ GPU NODE (each server with GPUs) │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────────────┐ │
│ │ nvidia-smi │ │ NVIDIA │ │ Host │ │ │ │
│ │ (drivers) │──▶│ GPU Info │ │ Metrics │ │ Grafana Alloy Agent │ │
│ └─────────────┘ └──────┬──────┘ │ CPU,RAM, │ │ (scrapes, adds labels, │ │
│ │ │ disk, net │ │ remote writes) │ │
│ ┌────────────────────────┼──────────┼──────┬──────┼───┤ │ │
│ │ dcgm-exporter OR │ │ │ │ │ instance, datacenter, │ │
│ │ nvidia_gpu_exporter │◀─────────┘ │ │ │ cluster, exporter_mode │ │
│ │ (port 9400) │ │ │ │ │ │
│ │ node_exporter │◀────────────────┘ │ │ │ │
│ │ (port 9100) │ │ │ │ │
│ └───────────────────────┼─────────────────────────┘ └─────────────────────────────────┘ │
└───────────────────────────┼───────────────────────────────────────────────────────────────────┘
│ Remote Write (HTTPS)
▼
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│ Grafana Alloy ──▶ Mimir (metrics) │ Loki (logs: XID errors, Docker) │ Grafana │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
```

```bash
# On a deployed GPU node — check GPU metrics endpoint
curl http://127.0.0.1:9400/metrics | head -30

# Check node (host) metrics
curl http://127.0.0.1:9100/metrics | head -20
```
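The same probes can be run from the control machine across every node with a small ad-hoc play; this is a hypothetical helper, not something shipped in the repo:

```yaml
# smoke_test.yml (hypothetical): verify both exporters answer on each node.
# The uri module runs on each target host, so 127.0.0.1 is the node itself.
- name: Smoke-test exporter endpoints
  hosts: all
  gather_facts: false
  tasks:
    - name: Probe GPU (9400) and node (9100) exporter /metrics
      ansible.builtin.uri:
        url: "http://127.0.0.1:{{ item }}/metrics"
        status_code: 200
      loop: [9400, 9100]
```
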
```
┌─────────────────────┐
│ You run Ansible │
│ playbook (push │
│ or pull mode) │
└──────────┬──────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────┐
│ ROLE 1: nvidia_driver_check │
│ • Checks nvidia-smi exists • Counts GPUs • Fails early if no driver │
└──────────────────────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────┐
│ ROLE 2: gpu_exporter (auto-selects) │
│ Container? → nvidia_gpu_exporter │ Bare-metal/VM? → dcgm-exporter │
└──────────────────────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────┐
│ ROLE 3: node_exporter │
│ • CPU, memory, disk, network metrics │
└──────────────────────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────┐
│ ROLE 4: alloy_agent │
│ • Scrape GPU + node • Add labels • Remote write to Mimir/Loki │
└──────────────────────────┬───────────────────────────────────────────┘
│
▼
┌──────────────┐
│ All done! │
└──────────────┘
```

```bash
# Full deployment (push mode)
ansible-playbook -i inventory/hosts.yml playbooks/gpu_metrics.yml --ask-vault-pass
# Dry run — preview changes only
ansible-playbook -i inventory/hosts.yml playbooks/gpu_metrics.yml --check --diff
# Deploy to one host only
ansible-playbook -i inventory/hosts.yml playbooks/gpu_metrics.yml --limit gpu-node-001 --ask-vault-pass
```
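Role 4 is the glue. Alloy's real configuration uses its own River syntax, so the YAML below is not the actual config; it only illustrates, in Prometheus-style terms, what the agent is set up to do. The ports, label names, URL, and credentials are the example values used elsewhere in this guide:

```yaml
# NOT the actual Alloy config (Alloy uses River syntax); a conceptual
# Prometheus-style equivalent of what the alloy_agent role configures
scrape_configs:
  - job_name: gpu
    static_configs:
      - targets: ["127.0.0.1:9400"]   # dcgm-exporter or nvidia_gpu_exporter
        labels:
          datacenter: us-west
          cluster: training
          exporter_mode: dcgm          # illustrative value
  - job_name: node
    static_configs:
      - targets: ["127.0.0.1:9100"]   # node_exporter
remote_write:
  - url: https://mimir.example.com/api/v1/push
    basic_auth:
      username: writer
      password: secret                 # placeholder; keep real creds in Vault
```
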
```
┌─────────────────────────┐
│ systemd-detect-virt │
│ Check DMI, /.dockerenv │
└────────────┬────────────┘
│
┌───────────────────────┴───────────────────────┐
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ CONTAINER HOST │ │ BARE-METAL / VM │
│ docker, lxc, │ │ (none, kvm, │
│ podman, etc. │ │ vmware, etc.) │
└────────┬────────┘ └────────┬────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ nvidia_gpu_ │ │ Docker installed│
│ exporter │ │ and running? │
│ (binary) │ └────────┬────────┘
└─────────────────┘ ┌──────────┴──────────┐
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Yes │ │ No → install │
└──────┬───────┘ │ Docker+CTK │
│ └──────┬───────┘
└──────────┬──────────┘
▼
┌─────────────────┐
│ dcgm-exporter │
│ (container) │
└─────────────────┘
```

```yaml
# inventory/hosts.yml — bare-metal/VM (gets dcgm-exporter)
training_cluster:
  vars:
    datacenter: us-west
    cluster_name: training
  hosts:
    gpu-node-001:
      ansible_host: 10.0.1.1

# Container host (gets nvidia_gpu_exporter) — same playbook, auto-detects
container_hosts:
  hosts:
    nested-gpu-node:
      ansible_host: 192.168.1.10
# Playbook detects container env → uses nvidia_gpu_exporter
```
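Under the hood, the auto-selection in ROLE 2 can be pictured as a pair of tasks like these; this is an illustrative sketch of the decision, not the role's literal source:

```yaml
# Illustrative: detect the environment, then pick an exporter accordingly
- name: Detect virtualization/container environment
  ansible.builtin.command: systemd-detect-virt
  register: virt_type
  changed_when: false
  failed_when: false          # exits non-zero on bare metal ("none")

- name: Choose the GPU exporter for this environment
  ansible.builtin.set_fact:
    gpu_exporter_kind: >-
      {{ 'nvidia_gpu_exporter'
         if virt_type.stdout in ['docker', 'lxc', 'podman']
         else 'dcgm_exporter' }}
```
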
| Mode | How it runs | Best for |
|---|---|---|
| Push | You run ansible-playbook from a control machine | Centralized management, CI/CD, AWX |
| Pull | Each node runs ansible-pull (script or systemd timer) | Self-provisioning, air-gapped |
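
In pull mode, each node schedules its own runs. The repo's install script uses a script or systemd timer for this; a cron entry is shown here only as a compact, hypothetical illustration of the same idea:

```yaml
# Hypothetical: let each node re-apply the playbook on a schedule
- name: Schedule ansible-pull every 30 minutes
  ansible.builtin.cron:
    name: gpu-metrics-ansible-pull
    minute: "*/30"
    job: >-
      ansible-pull -U https://github.com/YOUR_ORG/gpu-metrics-exporter.git
      playbooks/gpu_metrics.yml
```
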
| Use case | Description |
|---|---|
| AI/ML training fleet | Track GPU utilization, memory, power across clusters |
| Cost / capacity planning | Use utilization trends to add or decommission nodes |
| GPU fault detection | XID errors in Loki + alerts when GPUs fail |
| Inference / serving | Monitor TTS, STT, model-serving; combine GPU metrics + Docker logs |
| Multi-datacenter | Compare by datacenter and cluster labels |
| Onboarding new nodes | Add host to inventory, run playbook — done |
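
To make the fault-detection row concrete: dcgm-exporter exposes XID errors as the DCGM_FI_DEV_XID_ERRORS metric, so a Prometheus-style alert rule loaded into Mimir could look roughly like this (the metric name is standard dcgm-exporter; the rule name, labels, and annotations are placeholders):

```yaml
# Hypothetical alert rule; only the metric name comes from dcgm-exporter
groups:
  - name: gpu-health
    rules:
      - alert: GpuXidError
        expr: DCGM_FI_DEV_XID_ERRORS > 0
        labels:
          severity: critical
        annotations:
          summary: "XID error on {{ $labels.instance }}, GPU {{ $labels.gpu }}"
```
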
```bash
# Push: full deploy
ansible-playbook -i inventory/hosts.yml playbooks/gpu_metrics.yml --ask-vault-pass
# Pull: one-liner (replace YOUR_ORG, URLs, credentials)
curl -sSL https://raw.githubusercontent.com/YOUR_ORG/gpu-metrics-exporter/main/scripts/install-ansible-pull.sh | bash -s -- \
--repo https://github.com/YOUR_ORG/gpu-metrics-exporter.git \
--datacenter us-west --cluster training \
--remote-write https://mimir.example.com/api/v1/push \
--username writer --password secret
# Local testing (no real GPUs)
cd docker && docker compose up -d
# Grafana: http://localhost:3000 (admin/admin)
```
GPU Metrics Exporter — Ansible-based deployment for NVIDIA GPU monitoring.