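Introduction
- Prometheus collects and stores time-series data. It pulls metrics from exporters and other endpoints according to your configuration. The --web.enable-lifecycle flag lets you trigger a configuration reload without restarting the container.
- Node Exporter collects low-level system metrics from the host, such as CPU usage, memory, and disk statistics. /proc and /sys are mounted read-only so that Prometheus can scrape accurate host metrics without affecting the system.
- Grafana sits on top of Prometheus and provides a user-friendly interface for visualizing your data. The provisioning folders (datasources and dashboards) ensure everything is set up automatically on first run.
- Alertmanager receives alerts from Prometheus and routes them to the right place: Slack, PagerDuty, email, and so on. Mounting the configuration from a local folder makes it easy to adjust as your alerting needs change.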
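The reload enabled by --web.enable-lifecycle is an HTTP POST to the Prometheus /-/reload endpoint. A minimal sketch, assuming Prometheus is published on localhost:9090 as in the Compose file used in this guide; PROM_URL and reload_prometheus are illustrative names, not part of Prometheus:

```shell
# Assumed base URL; adjust if Prometheus is published elsewhere.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Ask Prometheus to re-read its configuration and rule files.
# This only works when Prometheus runs with --web.enable-lifecycle.
reload_prometheus() {
  curl -fsS -X POST "$PROM_URL/-/reload"
}
```

Calling reload_prometheus after editing prometheus.yml or a rules file should apply the change without a container restart.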
Install Docker
# Install Docker with the convenience script
curl -fsSL https://get.docker.com | sh
# Check the Docker version
docker --version
Docker version 29.2.1, build a5c7197
# Check the Docker Compose version
docker compose version
Docker Compose version v5.0.2
Installation guide
Create the project structure
mkdir monitoring
cd monitoring
mkdir -p prometheus/rules alertmanager grafana/provisioning/{datasources,dashboards}
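- prometheus/rules/ stores custom alerting and recording rules.
- alertmanager/ holds the Alertmanager configuration file, including routing and notification settings.
- grafana/provisioning/ is split into datasources/ and dashboards/ to support automated Grafana setup, so your dashboards and data sources are loaded automatically at startup.
Create the Prometheus configuration file
# -p avoids an error because prometheus/ already exists from the previous step
mkdir -p prometheus
nano prometheus/prometheus.yml
Configure Prometheus
# Add the following to prometheus/prometheus.yml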
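The layout described above can be checked before any configuration files are written. A small sketch; check_layout is a hypothetical helper that only tests for the four directories named in this guide:

```shell
# Report which of the expected directories exist under a project root
# (defaults to the current directory). Returns non-zero if any is missing.
check_layout() {
  root="${1:-.}"
  missing=0
  for dir in prometheus/rules alertmanager \
             grafana/provisioning/datasources grafana/provisioning/dashboards; do
    if [ -d "$root/$dir" ]; then
      echo "ok       $dir"
    else
      echo "missing  $dir"
      missing=1
    fi
  done
  return "$missing"
}
```

Run check_layout from inside the monitoring directory, or pass the project path as the first argument.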
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules once and periodically evaluate them
rule_files:
  - "rules/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself changes slowly, scrape less frequently
  - job_name: 'prometheus'
    scrape_interval: 30s
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'prometheus-server'
          service: 'self-monitoring'

  # System metrics change frequently, scrape more often
  - job_name: 'node-exporter'
    scrape_interval: 20s
    static_configs:
      - targets: ['node-exporter:9100'] # localhost
        labels:
          instance: 'localhost'
          location: 'https://localhost.com/'
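Before the stack is started, the configuration can be validated with promtool, which ships inside the prom/prometheus image. A sketch, assuming it is run from the monitoring directory; check_prometheus_config is an illustrative name:

```shell
# Validate prometheus.yml (and the rule files it references) without
# starting Prometheus, using the promtool binary inside the image.
check_prometheus_config() {
  docker run --rm \
    -v "$PWD/prometheus:/etc/prometheus:ro" \
    --entrypoint promtool \
    prom/prometheus:latest \
    check config /etc/prometheus/prometheus.yml
}
```

A YAML indentation mistake or an unknown field should show up here as an error rather than as a container that keeps restarting.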
Set up alert rules
{% raw %}
# Create the alert rule file prometheus/rules/node_alerts.yml
groups:
  - name: node_alerts
    rules:
      # rate() is used instead of irate(): the Prometheus documentation recommends
      # rate() for alerts, because irate() only looks at the last two samples
      - alert: HighCPULoad
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load (instance {{ $labels.instance }})"
          description: "CPU load is > 80%\n VALUE = {{ $value }}%\n LABELS: {{ $labels }}"
      - alert: HighMemoryLoad
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory load (instance {{ $labels.instance }})"
          description: "Memory load is > 80%\n VALUE = {{ $value }}%\n LABELS: {{ $labels }}"
      - alert: HighDiskUsage
        expr: (node_filesystem_size_bytes{fstype=~"ext4|xfs"} - node_filesystem_free_bytes{fstype=~"ext4|xfs"}) / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk usage (instance {{ $labels.instance }})"
          description: "Disk usage is > 85%\n VALUE = {{ $value }}%\n LABELS: {{ $labels }}"
      # Add this to your alert rules file
      # deriv() returns a per-second slope, so 10 MiB/min is 10 * 1024 * 1024 / 60 bytes per second
      - alert: UnusualMemoryGrowth
        expr: deriv(node_memory_MemAvailable_bytes[30m]) < -10 * 1024 * 1024 / 60
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Unusual memory consumption rate (instance {{ $labels.instance }})"
          description: "Memory is being consumed at a rate of more than 10MB/min\n VALUE = {{ $value | humanize }}B/s"
{% endraw %}
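The UnusualMemoryGrowth rule mixes two time units: its description talks about megabytes per minute, while deriv() reports a slope per second. A quick arithmetic check of the conversion; mib_per_min_to_bytes_per_sec is a hypothetical helper:

```shell
# Convert a rate in MiB per minute to bytes per second (integer division).
mib_per_min_to_bytes_per_sec() {
  echo $(( $1 * 1024 * 1024 / 60 ))
}

mib_per_min_to_bytes_per_sec 10   # prints 174762
```

By comparison, 10 * 1024 * 1024 on its own is 10485760, which as a deriv() threshold means 10 MiB per second.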
Configure Alertmanager
{% raw %}
# Create a basic Alertmanager configuration in alertmanager/config.yml
* Telegram is the default receiver for alert notifications; you can switch to email or another receiver
* Remember to replace chat_id and bot_token with your own values
* Adjust the text under the message parameter as needed
* Adjust config.yml as needed, for example the routing, deduplication, grouping, and notification settings
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'telegram'
receivers:
  - name: 'telegram'
    telegram_configs:
      - chat_id: xxx
        bot_token: 'xxx'
        message: |
          {{ if eq .Status "firing" }}
          🚨 <b>Critical alert</b> 🚨
          ⚠️ <b>Alert name:</b> {{ .CommonLabels.alertname | html }} ({{ .Alerts | len }} instances)
          {{ range .Alerts }}
          ----------------------------------------
          ⚠️ <b>Instance:</b> {{ .Labels.instance | html }}
          ⚠️ <b>Data center:</b> {{ .Labels.location | html }}
          ⚠️ <b>Description:</b> {{ .Annotations.description | html }}
          ⚠️ <b>Started at:</b> {{ .StartsAt.Format "2006-01-02 15:04:05" }}
          ----------------------------------------
          {{ end }}
          {{ else }}
          ✅ <b>Critical alert resolved</b> ✅
          <b>Alert name:</b> {{ .CommonLabels.alertname | html }} ({{ .Alerts | len }} instances)
          {{ range .Alerts }}
          ----------------------------------------
          <b>Instance:</b> {{ .Labels.instance | html }}
          <b>Data center:</b> {{ .Labels.location | html }}
          <b>Description:</b> {{ .Annotations.description | html }}
          <b>Resolved at:</b> {{ .EndsAt.Format "2006-01-02 15:04:05" }}
          {{ end }}
          {{ end }}
        parse_mode: 'HTML'
        send_resolved: true
{% endraw %}
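The receiver can be exercised without waiting for a real alert. The sketch below first validates the file with amtool from the prom/alertmanager image, then posts a hand-made alert to the Alertmanager v2 API. It assumes the commands are run from the monitoring directory and that Alertmanager is published on localhost:9093; AM_URL, the function names, and the TestAlert labels are illustrative:

```shell
# Assumed base URL; adjust if Alertmanager is published elsewhere.
AM_URL="${AM_URL:-http://localhost:9093}"

# Validate alertmanager/config.yml with amtool from the image.
check_alertmanager_config() {
  docker run --rm \
    -v "$PWD/alertmanager:/etc/alertmanager:ro" \
    --entrypoint amtool \
    prom/alertmanager:latest \
    check-config /etc/alertmanager/config.yml
}

# Push a hand-made alert to the v2 API to exercise routing and the receiver.
send_test_alert() {
  curl -fsS -X POST "$AM_URL/api/v2/alerts" \
    -H 'Content-Type: application/json' \
    -d '[
      {
        "labels": { "alertname": "TestAlert", "severity": "warning", "instance": "localhost" },
        "annotations": { "description": "Manual test alert" }
      }
    ]'
}
```

If the Telegram settings are correct, send_test_alert should produce a notification after group_wait has elapsed; because the alert is not re-sent, it should resolve on its own once resolve_timeout passes.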
Team-based alert routing (optional)
- Alerts can be routed to different teams based on service and severity
route:
  # Default receiver
  receiver: 'operations-team'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  # Specific routing rules
  # (match and match_re still work, but are deprecated since Alertmanager 0.22 in favor of matchers)
  routes:
    - match:
        severity: critical
      receiver: 'pager-duty'
      repeat_interval: 1h
      continue: true
    - match_re:
        service: database|redis|elasticsearch
      receiver: 'database-team'
    - match_re:
        service: frontend|api
      receiver: 'application-team'
receivers:
  - name: 'operations-team'
    # email_configs also needs SMTP settings, for example smtp_smarthost and smtp_from under global
    email_configs:
      - to: 'ops@example.com'
  - name: 'pager-duty'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
  - name: 'database-team'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR_KEY'
        channel: '#db-alerts'
  - name: 'application-team'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR_KEY'
        channel: '#app-alerts'
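A routing tree like this can be tried out offline with amtool's route tester, which prints the receiver or receivers a given label set would reach. A sketch, assuming the routing section has been saved into a complete configuration file that loads cleanly (including the SMTP settings the email receiver needs); test_route is a hypothetical wrapper:

```shell
# Show which receiver(s) a label set would be routed to.
# Usage: test_route <config-file> label=value [label=value ...]
test_route() {
  config_file="$1"
  shift
  docker run --rm \
    -v "$(cd "$(dirname "$config_file")" && pwd):/cfg:ro" \
    --entrypoint amtool \
    prom/alertmanager:latest \
    config routes test --config.file="/cfg/$(basename "$config_file")" "$@"
}
```

For example, test_route alertmanager/config.yml severity=critical service=database would be expected to report both pager-duty and database-team, because continue: true lets matching carry on past the first route.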
Set up Grafana dashboards
- Configure Grafana to connect to Prometheus automatically by creating grafana/provisioning/datasources/datasource.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
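Once Grafana is running, the provisioned data source can be confirmed through Grafana's HTTP API. A sketch, assuming Grafana is published on localhost:3000 with the admin/admin credentials used in this guide; GRAFANA_URL, GRAFANA_AUTH, and list_datasources are illustrative names:

```shell
# Assumed base URL; adjust if Grafana is published elsewhere.
GRAFANA_URL="${GRAFANA_URL:-http://localhost:3000}"

# List data sources through the HTTP API using basic auth.
# admin:admin matches the credentials set in this guide's Compose file.
list_datasources() {
  curl -fsS -u "${GRAFANA_AUTH:-admin:admin}" "$GRAFANA_URL/api/datasources"
}
```

The JSON response should include an entry named Prometheus with the URL http://prometheus:9090 if provisioning worked.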
Create the Docker Compose file
# Configuration file: docker-compose.yml
# The version key is obsolete in Compose v2 and later; it is ignored with a warning and can be removed
version: '3.8'
volumes:
  prometheus_data: {}
  grafana_data: {}
networks:
  monitoring:
    driver: bridge
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    networks:
      - monitoring
    restart: unless-stopped
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      # mount-points-exclude replaces the deprecated ignored-mount-points flag;
      # $$ is how Compose escapes a literal $ in the regular expression
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    networks:
      - monitoring
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    networks:
      - monitoring
    restart: unless-stopped
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/config.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    networks:
      - monitoring
    restart: unless-stopped
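Indentation mistakes in a Compose file are easy to make, so it can help to let Compose parse the file before starting anything. A sketch; check_compose_file is an illustrative wrapper around docker compose config:

```shell
# Parse and validate a Compose file; with --quiet nothing is printed
# when the file is valid, and errors are reported otherwise.
check_compose_file() {
  docker compose -f "${1:-docker-compose.yml}" config --quiet
}
```

Run check_compose_file in the monitoring directory; dropping --quiet would print the fully resolved configuration instead.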
Start the monitoring stack
docker compose up -d
docker ps
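docker ps only shows that the containers exist, not that the services inside them are answering. The sketch below polls each service's readiness or health endpoint, assuming the default ports published in this guide; wait_for and check_stack are hypothetical helpers:

```shell
# Poll a URL until it answers with a successful status or the attempts run out.
# Usage: wait_for <url> [attempts] [delay-seconds]
wait_for() {
  url="$1"; attempts="${2:-30}"; delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -fsS -o /dev/null "$url"; then
      echo "up: $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timeout: $url" >&2
  return 1
}

# Readiness and health endpoints of the four services in this stack.
check_stack() {
  wait_for http://localhost:9090/-/ready &&
  wait_for http://localhost:9093/-/ready &&
  wait_for http://localhost:9100/metrics &&
  wait_for http://localhost:3000/api/health
}
```

check_stack stops at the first service that does not come up, which narrows down where to look with docker compose logs.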
Access the web interfaces
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (log in with admin/admin)
- Alertmanager: http://localhost:9093
Add dashboards you like
- For example: 22969 (a dashboard ID to import in Grafana)