Deploying Prometheus, Alertmanager, and Grafana with Docker

Introduction

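Prometheus collects and stores time series data. It pulls metrics from exporters and other endpoints according to your configuration. The --web.enable-lifecycle flag lets you trigger a configuration reload without restarting the container.
Node Exporter collects low-level system metrics from the host, such as CPU usage, memory, and disk statistics. We mount /proc and /sys read-only so that Prometheus can scrape accurate host metrics without affecting the system.
Grafana sits on top of Prometheus and provides a user-friendly interface for visualizing your data. The provisioning folders (datasources and dashboards) ensure that everything is set up automatically on first run.
Alertmanager receives alerts from Prometheus and routes them to the right place: Slack, PagerDuty, email, and so on. Mounting the configuration from a local folder makes it easy to adjust as your alerting needs change.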
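
As a minimal sketch of what the lifecycle flag enables: with --web.enable-lifecycle set, Prometheus accepts an HTTP POST on /-/reload. The helper names below are hypothetical, and the base URL assumes Prometheus is published on localhost:9090 as in this guide.

```python
import urllib.request


def build_reload_request(base_url: str = "http://localhost:9090") -> urllib.request.Request:
    """Build the HTTP POST that asks Prometheus to re-read its configuration."""
    url = base_url.rstrip("/") + "/-/reload"
    # The endpoint takes no payload; an empty body with method POST is enough.
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url: str = "http://localhost:9090") -> int:
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=5) as resp:
        return resp.status


# Usage, once the stack is running (requires --web.enable-lifecycle):
# reload_prometheus()
```

Splitting the request builder from the sender keeps the URL logic easy to check without a running container.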

Installing Docker

# Install Docker with the convenience script
curl -fsSL https://get.docker.com | sh

# Check the Docker version
docker --version
Docker version 29.2.1, build a5c7197

# Check the Docker Compose version
docker compose version
Docker Compose version v5.0.2

Setup Guide

Create the project structure

mkdir monitoring
cd monitoring

mkdir -p prometheus/rules alertmanager grafana/provisioning/{datasources,dashboards}

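prometheus/rules/ stores custom alerting and recording rules.
alertmanager/ holds the Alertmanager configuration file, including routing and notification settings.
grafana/provisioning/ is split into datasources/ and dashboards/ to support automated Grafana setup, so your data sources and dashboards are loaded automatically at startup.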
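
If you prefer to script the layout, the same tree can be created idempotently. This is only a sketch: the function name is hypothetical, and the directory list simply mirrors the mkdir -p command above.

```python
from pathlib import Path

# Sub-directories relative to the project root, mirroring the mkdir -p command.
SUBDIRS = [
    "prometheus/rules",
    "alertmanager",
    "grafana/provisioning/datasources",
    "grafana/provisioning/dashboards",
]


def create_layout(root: str) -> list:
    """Create the monitoring directory tree under root; safe to run repeatedly."""
    created = []
    for sub in SUBDIRS:
        path = Path(root) / sub
        # parents=True behaves like mkdir -p; exist_ok=True avoids errors on reruns.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Usage:
# create_layout("monitoring")
```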

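Create the Prometheus configuration file

# prometheus/ already exists from the previous step; -p makes this a no-op
mkdir -p prometheus
nano prometheus/prometheus.yml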

Configure Prometheus

# Add the following to prometheus/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

# Load rules once and periodically evaluate them
rule_files:
  - "rules/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself changes slowly, scrape less frequently
  - job_name: 'prometheus'
    scrape_interval: 30s
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'prometheus-server'
          service: 'self-monitoring'

  # System metrics change faster than Prometheus' own, so scrape them more often than the job above
  - job_name: 'node-exporter'
    scrape_interval: 20s
    static_configs:
      - targets: ['node-exporter:9100'] # localhost
        labels:
          instance: 'localhost'
          location: 'https://localhost.com/'
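
A per-job scrape_interval overrides the global default. The sketch below illustrates that rule on a plain dictionary that mirrors the configuration above; it does not parse YAML, and the third job is a hypothetical entry added only to show the fallback.

```python
def effective_scrape_intervals(config: dict) -> dict:
    """Return the interval each job is scraped at: job override, else global default."""
    default = config["global"]["scrape_interval"]
    return {
        job["job_name"]: job.get("scrape_interval", default)
        for job in config["scrape_configs"]
    }


# Mirrors prometheus.yml; "example-without-override" is a hypothetical job.
config = {
    "global": {"scrape_interval": "15s"},
    "scrape_configs": [
        {"job_name": "prometheus", "scrape_interval": "30s"},
        {"job_name": "node-exporter", "scrape_interval": "20s"},
        {"job_name": "example-without-override"},
    ],
}

intervals = effective_scrape_intervals(config)
```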

Set up alert rules

{% raw %}

# Create the alert rules file prometheus/rules/node_alerts.yml
groups:
- name: node_alerts
  rules:
  - alert: HighCPULoad
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU load (instance {{ $labels.instance }})"
      description: "CPU load is > 80%\n  VALUE = {{ $value }}%\n  LABELS: {{ $labels }}"
      
  - alert: HighMemoryLoad
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory load (instance {{ $labels.instance }})"
      description: "Memory load is > 80%\n  VALUE = {{ $value }}%\n  LABELS: {{ $labels }}"
      
  - alert: HighDiskUsage
    expr: (node_filesystem_size_bytes{fstype=~"ext4|xfs"} - node_filesystem_free_bytes{fstype=~"ext4|xfs"}) / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High disk usage (instance {{ $labels.instance }})"
      description: "Disk usage is > 85%\n  VALUE = {{ $value }}%\n  LABELS: {{ $labels }}"

  # Add this to your alert rules file
  - alert: UnusualMemoryGrowth
    # deriv() returns bytes per second, so 10 MiB/min is 10 * 1024 * 1024 / 60 B/s
    expr: deriv(node_memory_MemAvailable_bytes[30m]) < -10 * 1024 * 1024 / 60
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Unusual memory consumption rate (instance {{ $labels.instance }})"
      description: "Memory is being consumed at a rate of more than 10MB/min\n  VALUE = {{ $value | humanize }}B/s"

{% endraw %}
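
The UnusualMemoryGrowth rule mixes two time units: deriv() reports a per-second slope, while the alert description talks about megabytes per minute. The sketch below shows the unit conversion the threshold needs; the function names are hypothetical and 1 MB is taken as 1024 * 1024 bytes, matching the rule.

```python
MIB = 1024 * 1024


def per_minute_to_per_second(bytes_per_minute: float) -> float:
    """Convert a bytes-per-minute rate into bytes per second."""
    return bytes_per_minute / 60


def is_unusual_growth(deriv_bytes_per_second: float,
                      limit_bytes_per_minute: float = 10 * MIB) -> bool:
    """True when available memory shrinks faster than the per-minute limit.

    deriv_bytes_per_second is negative while available memory is falling.
    """
    return deriv_bytes_per_second < -per_minute_to_per_second(limit_bytes_per_minute)


threshold = per_minute_to_per_second(10 * MIB)  # roughly 174763 bytes per second
```

A threshold of -10 * 1024 * 1024 without the division by 60 would only fire at 10 MiB per second, which is 600 MiB per minute.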

Configure Alertmanager

{% raw %}

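# Create a basic Alertmanager configuration in alertmanager/config.yml
* Alerts are sent to Telegram by default; you can switch to email or another receiver
* Remember to replace chat_id (a numeric ID) and bot_token with your own values
* Adjust the text under the message parameter as needed
* Edit config.yml as needed to control alert routing, deduplication, grouping, and notifications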

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'telegram'

receivers:
  - name: 'telegram'
    telegram_configs:
      - chat_id: xxx
        bot_token: 'xxx'
        message: |
          {{ if eq .Status "firing" }}
          🚨 <b>Critical alert</b> 🚨
          ⚠️ <b>Alert name:</b> {{ .CommonLabels.alertname | html }} ({{ .Alerts | len }} instance(s))
          {{ range .Alerts }}
          ----------------------------------------
          ⚠️ <b>Instance:</b> {{ .Labels.instance | html }}
          ⚠️ <b>Data center:</b> {{ .Labels.location | html }}
          ⚠️ <b>Description:</b> {{ .Annotations.description | html }}
          ⚠️ <b>Started at:</b> {{ .StartsAt.Format "2006-01-02 15:04:05" }}
          ----------------------------------------
          {{ end }}
          {{ else }}
          ✅ <b>Critical alert resolved</b> ✅
          <b>Alert name:</b> {{ .CommonLabels.alertname | html }} ({{ .Alerts | len }} instance(s))
          {{ range .Alerts }}
          ----------------------------------------
          <b>Instance:</b> {{ .Labels.instance | html }}
          <b>Data center:</b> {{ .Labels.location | html }}
          <b>Description:</b> {{ .Annotations.description | html }}
          <b>Resolved at:</b> {{ .EndsAt.Format "2006-01-02 15:04:05" }}
          {{ end }}
          {{ end }}
        parse_mode: 'HTML'
        send_resolved: true

{% endraw %}
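
To check the receiver without waiting for a real incident, you can push a hand-made alert to Alertmanager's v2 API (POST /api/v2/alerts). The following is a sketch under assumptions: the helper names and the TestAlert label values are made up for illustration, and the URL assumes Alertmanager is published on localhost:9093 as in this guide.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_alert(alertname: str = "TestAlert") -> list:
    """Return a one-element alert list in the shape the v2 API expects."""
    return [{
        "labels": {
            "alertname": alertname,
            "severity": "warning",
            "instance": "localhost",
            "location": "https://localhost.com/",
        },
        "annotations": {"description": "Manual test alert"},
        # RFC 3339 timestamp; Alertmanager applies its own end time if none is given.
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }]


def build_alert_request(base_url: str = "http://localhost:9093") -> urllib.request.Request:
    """Build the POST request that submits the test alert."""
    body = json.dumps(build_test_alert()).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage, once the stack is running:
# urllib.request.urlopen(build_alert_request(), timeout=5)
```

After group_wait has elapsed, the configured receiver should get a message rendered from the template above.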

Team-based alert routing (optional)

You can route alerts to different teams based on service and severity.

route:
  # Default receiver
  receiver: 'operations-team'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  
  # Specific routing rules
  routes:
  - match:
      severity: critical
    receiver: 'pager-duty'
    repeat_interval: 1h
    continue: true
    
  - match_re:
      service: database|redis|elasticsearch
    receiver: 'database-team'
    
  - match_re:
      service: frontend|api
    receiver: 'application-team'

receivers:
- name: 'operations-team'
  email_configs:
  - to: 'ops@example.com'
    
- name: 'pager-duty'
  pagerduty_configs:
  - service_key: 'your-pagerduty-key'
    
- name: 'database-team'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/YOUR_KEY'
    channel: '#db-alerts'
    
- name: 'application-team'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/YOUR_KEY'
    channel: '#app-alerts'
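
The effect of continue: true is easier to see with a small simulation. The code below is a simplified model of how the routing tree above picks receivers, not Alertmanager's actual implementation: child routes are tried in order, match_re patterns are treated as fully anchored, a missing label counts as an empty string, and the default receiver is used only when no child route matches.

```python
import re

# The child routes from the configuration above, in order.
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "pager-duty", "continue": True},
    {"match_re": {"service": "database|redis|elasticsearch"}, "receiver": "database-team"},
    {"match_re": {"service": "frontend|api"}, "receiver": "application-team"},
]
DEFAULT_RECEIVER = "operations-team"


def route_matches(route: dict, labels: dict) -> bool:
    """Check every equality and regex matcher of one route against the labels."""
    for name, value in route.get("match", {}).items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in route.get("match_re", {}).items():
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def receivers_for(labels: dict) -> list:
    """Return the receivers an alert with these labels would be sent to."""
    selected = []
    for route in ROUTES:
        if route_matches(route, labels):
            selected.append(route["receiver"])
            # Without continue: true, evaluation stops at the first match.
            if not route.get("continue", False):
                break
    return selected or [DEFAULT_RECEIVER]
```

For example, a critical Redis alert reaches both pager-duty and database-team, while a warning for an unlisted service falls back to operations-team.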

Set up Grafana dashboards

Configure Grafana to connect to Prometheus automatically by creating
grafana/provisioning/datasources/datasource.yml

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

Create the Docker Compose file

# docker-compose.yml

version: '3.8'

volumes:
  prometheus_data: {}
  grafana_data: {}

networks:
  monitoring:
    driver: bridge

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    networks:
      - monitoring
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
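      # Point the filesystem collector at the host root mounted on /rootfs
      - '--path.rootfs=/rootfs'
      # Current flag name; $$ escapes the literal $ for Compose interpolation
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'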
    ports:
      - "9100:9100"
    networks:
      - monitoring
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    networks:
      - monitoring
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/config.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    networks:
      - monitoring
    restart: unless-stopped

Start the monitoring stack

docker compose up -d
docker ps

Access the monitoring interfaces

Prometheus: http://localhost:9090
Grafana: http://localhost:3000 (log in with admin/admin)
Alertmanager: http://localhost:9093
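
Instead of opening each page by hand, you can probe the health endpoints the three services expose: /-/healthy for Prometheus and Alertmanager, /api/health for Grafana. This is a minimal sketch; the function names are hypothetical and the ports assume the mappings from the Compose file above.

```python
import urllib.error
import urllib.request

HEALTH_ENDPOINTS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "alertmanager": "http://localhost:9093/-/healthy",
    "grafana": "http://localhost:3000/api/health",
}


def http_status(url: str, timeout: float = 3.0):
    """Return the HTTP status code for url, or None if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return exc.code
    except (urllib.error.URLError, OSError):
        return None


def check_stack(fetch=http_status) -> dict:
    """Map each service name to True when its health endpoint answers 200."""
    return {name: fetch(url) == 200 for name, url in HEALTH_ENDPOINTS.items()}


# Usage, once the stack is running:
# print(check_stack())
```

The fetch parameter exists so the check can be exercised with a stub when the containers are not running.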

Add your favorite dashboards

For example, Grafana dashboard ID 22969
