1 Install the agent

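Metric descriptions

Name                                     Exposed information
rocketmq_producer_tps                    Messages produced per second, per topic
rocketmq_producer_message_size           Size of messages produced per second, per topic (bytes)
rocketmq_producer_offset                 Produce offset (progress) of a topic
rocketmq_consumer_tps                    Messages consumed per second by a consumer group
rocketmq_consumer_message_size           Size of messages consumed per second by a consumer group (bytes)
rocketmq_consumer_offset                 Consume offset (progress) of a consumer group
rocketmq_group_get_latency               Consumer latency on one queue of a given topic
rocketmq_group_get_latency_by_storetime  Consumption latency of a consumer group, measured against message store time
rocketmq_message_accumulation            How far the consumer offset lags behind (message backlog)
rocketmq_client_consume_fail_msg_count   Messages that failed to be consumed within the last hour
rocketmq_client_consume_fail_msg_tps     Messages that failed to be consumed per second
rocketmq_client_consume_ok_msg_tps       Messages consumed successfully per second
rocketmq_client_consume_rt               Average time spent consuming each message
rocketmq_client_consumer_pull_rt         Average time spent pulling each message
rocketmq_client_consumer_pull_tps        Messages pulled per second by the client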
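
To see which of these metrics a running exporter actually exposes, the text output of its metrics endpoint can be filtered by name. A minimal sketch, assuming the exporter listens on port 5557 and serves /metrics as configured later in this post; the helper name rocketmq_metric_names and the sample values are made up for illustration:

```shell
# List the distinct rocketmq_* metric names in Prometheus text-format input.
# Reads stdin; lines starting with '#' (HELP/TYPE comments) are skipped.
rocketmq_metric_names() {
  grep -v '^#' | grep -o '^rocketmq_[A-Za-z0-9_]*' | sort -u
}

# Demo on a hand-written sample (values are made up):
printf '%s\n' \
  '# TYPE rocketmq_producer_tps gauge' \
  'rocketmq_producer_tps{topic="demo"} 12.0' \
  'rocketmq_consumer_tps{group="g1",topic="demo"} 11.0' \
  'jvm_threads_current 42' | rocketmq_metric_names

# Against a live exporter (address is an assumption):
#   curl -s http://127.0.0.1:5557/metrics | rocketmq_metric_names
```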

1.1 Download the plugin

Apache RocketMQ Prometheus Exporter

1.2 Modify the configuration

vim src/main/resources/application.yml

rocketmq:
  config:
    webTelemetryPath: /metrics
    rocketmqVersion: 4_8_0
    namesrvAddr: 127.0.0.1:9876 # NameServer address
    enableCollect: true
    enableACL: false # if >=4.4.0
    accessKey: # if >=4.4.0
    secretKey: # if >=4.4.0
# collection frequency
task:
  count: 5 # num of scheduled-tasks
  collectTopicOffset:
    cron: 30 0/1 * * * ?
  collectConsumerOffset:
    cron: 30 0/1 * * * ?
  collectBrokerStatsTopic:
    cron: 30 0/1 * * * ?
  collectBrokerStats:
    cron: 30 0/1 * * * ?
  collectBrokerRuntimeStats:
    cron: 30 0/1 * * * ?
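
The cron values are easier to read once split into fields. A small sketch, assuming the exporter hands these strings to Spring's scheduler, whose cron format has six fields starting with seconds; under that assumption `30 0/1 * * * ?` fires at second 30 of every minute. describe_cron is a hypothetical helper, not part of the exporter:

```shell
# Label the six fields of a Spring-style cron expression.
describe_cron() {
  # read splits on whitespace without expanding the * and ? characters.
  read -r sec min hour dom mon dow <<EOF
$1
EOF
  printf 'second=%s minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
    "$sec" "$min" "$hour" "$dom" "$mon" "$dow"
}

describe_cron '30 0/1 * * * ?'
```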

1.3 Build

mvn clean install

1.4 Start

nohup /opt/java8/bin/java -jar rocketmq-exporter-0.0.2-SNAPSHOT.jar &
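
Because the process is started in the background, it helps to confirm that it is really serving metrics before moving on. A rough sketch, assuming curl is installed and the exporter uses port 5557 with the /metrics path from the configuration above; wait_for_exporter and its retry defaults are illustrative choices:

```shell
# Poll a metrics URL until it returns at least one rocketmq_* sample.
# Usage: wait_for_exporter URL [TRIES] [DELAY_SECONDS]
wait_for_exporter() {
  url=$1
  tries=${2:-10}
  delay=${3:-3}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # Pause between attempts, but not before the first one.
    [ "$i" -gt 0 ] && sleep "$delay"
    if curl -fsS --max-time 5 "$url" 2>/dev/null | grep -q '^rocketmq_'; then
      return 0
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (address is an assumption):
#   wait_for_exporter http://127.0.0.1:5557/metrics && echo "exporter is up"
```

The first rocketmq_* samples may only appear after the first scheduled collection has run, so a few retries are expected right after startup.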

2 Collect data and monitor

kubectl edit configmap prometheus-server -n ops

2.1 Configure the Prometheus scrape job

scrape_configs:
  - job_name: rocketmq
    static_configs:
      - targets:
          - 172.16.3.17:5557 # rocketmq-exporter address

2.2 Configure alerting rules

rules: |-
  groups:
  - name: rocketmq
    rules:
    - alert: RocketMQ Exporter is Down
      expr: up{job="rocketmq"} == 0
      for: 20s
      labels:
        severity: 'disaster'
      annotations:
        summary: RocketMQ {{ $labels.instance }} is down
    - alert: RocketMQ message backlog
      expr: (sum(irate(rocketmq_producer_offset[1m])) by (topic) - on(topic) group_right sum(irate(rocketmq_consumer_offset[1m])) by (group,topic)) > 5
      for: 5m
      labels:
        severity: 'warning'
      annotations:
        summary: RocketMQ (group={{ $labels.group }} topic={{ $labels.topic }}) backlog = {{ .Value }}
    - alert: GroupGetLatencyByStoretime consumer group consumption latency too high
      expr: rocketmq_group_get_latency_by_storetime/1000 > 5 and rate(rocketmq_group_get_latency_by_storetime[5m]) >0
      for: 3m
      labels:
        severity: warning
      annotations:
        description: 'consumer {{$labels.group}} on {{$labels.broker}}, {{$labels.topic}} consume time lag behind message store time and (behind value is {{$value}}).'
        summary: consumer group consumption latency too high
    - alert: RocketMQClusterProduceHigh cluster TPS > 20
      expr: sum(rocketmq_producer_tps) by (cluster) >= 20
      for: 3m
      labels:
        severity: warning
      annotations:
        description: '{{$labels.cluster}} Sending tps too high. now TPS = {{ .Value }}'
        summary: cluster send tps too high
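
Note that the backlog rule subtracts two irate() results, so the value it compares with 5 is how fast the backlog is growing (messages per second), not the absolute number of unconsumed messages. A worked example with made-up offsets, using the same "difference of the last two samples divided by the elapsed time" arithmetic that irate applies; rate_between is a hypothetical helper:

```shell
# Per-second rate between two samples: (v2 - v1) / (t2 - t1).
rate_between() {
  awk -v v1="$1" -v t1="$2" -v v2="$3" -v t2="$4" \
    'BEGIN { printf "%.2f\n", (v2 - v1) / (t2 - t1) }'
}

# Hypothetical offsets sampled 60 seconds apart.
produce_rate=$(rate_between 1000 0 1600 60)
consume_rate=$(rate_between 900 0 1080 60)
growth=$(awk -v p="$produce_rate" -v c="$consume_rate" \
  'BEGIN { printf "%.2f\n", p - c }')
echo "produce=$produce_rate/s consume=$consume_rate/s backlog growth=$growth/s"
```

With these numbers the producer advances 10 messages per second and the consumer 3, so the difference of 7 exceeds the threshold; the alert would fire only if that stays true for the full 5 minutes.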

2.3 Configure Grafana

Import dashboard template 10477

[Screenshot omitted: image-20211008150201515]

3 Configure alerting

3.1 Download the plugin

prometheus-webhook-dingtalk

3.2 DingTalk plugin configuration

cat config.yml

templates:
  - /opt/prometheus-webhook-dingtalk/template.tmpl # template location
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=ac5f4916af10804b1aeffe9f5f45574a9af8e7cdd8436bcf1dc2448a85116fba # DingTalk webhook URL
    secret: SEC6f3e3e736f33a8f8692e3f4f9e1c0828ac41fc514c99c5215fd21659bxxxx # DingTalk signing secret
    mention:
      mobiles: ['1810133xxxx', '1871712xxxx']

cat template.tmpl

{{ define "ding.link.title" }}{{ template "legacy.title" . }}{{ end }}
{{ define "ding.link.content" }}
{{ if gt (len .Alerts.Firing) 0 -}}
Firing alerts:
{{ template "__text_alert_list" .Alerts.Firing }}
{{- end }}
{{ if gt (len .Alerts.Resolved) 0 -}}
Resolved alerts:
{{ template "__text_resolve_list" .Alerts.Resolved }}
{{- end }}
{{- end }}

3.3 Start prometheus-webhook-dingtalk

/opt/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --log.level=info > dingding.log 2>&1 &

3.3.1 Check the DingTalk plugin endpoint

[Screenshot omitted: image-20211008150758352]
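
The endpoint can also be exercised by hand before Alertmanager is wired up, by posting a payload shaped like an Alertmanager webhook notification. A sketch under assumptions: the alert name, labels and timestamps are invented test data, alert_payload is a hypothetical helper, and the target URL is the webhook1 endpoint used in section 3.4 (its host is left elided exactly as it appears there):

```shell
# Print a minimal Alertmanager-style webhook payload containing one firing alert.
alert_payload() {
  cat <<'EOF'
{
  "version": "4",
  "status": "firing",
  "receiver": "webhook",
  "groupLabels": {"alertname": "TestAlert"},
  "commonLabels": {"alertname": "TestAlert", "severity": "warning"},
  "commonAnnotations": {"summary": "manual test message"},
  "externalURL": "",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "TestAlert", "severity": "warning"},
      "annotations": {"summary": "manual test message"},
      "startsAt": "2021-10-08T07:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": ""
    }
  ]
}
EOF
}

# Send it to the plugin (fill in the real host):
#   alert_payload | curl -s -H 'Content-Type: application/json' \
#     -d @- http://172.16.3.1x:8060/dingtalk/webhook1/send
```

If the plugin and the robot are set up correctly, the test message should show up in the DingTalk group; otherwise dingding.log is the first place to look.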

3.4 Prometheus alerting configuration

kubectl edit configmap prometheus-alertmanager -n ops

alertmanager.yml: |-
  global:
    resolve_timeout: 5m
  route:
    receiver: webhook
    group_wait: 30s
    group_interval: 1m
    repeat_interval: 4h
    group_by: [alertname]
    routes:
    - receiver: webhook
      group_wait: 10s
  receivers:
  - name: webhook
    webhook_configs:
    - url: http://172.16.3.1x:8060/dingtalk/webhook1/send
      send_resolved: true
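
As a rough guide to how quickly a notification arrives, the rule's `for` duration and the route's group_wait can be added together. This is only an estimate: it ignores the scrape and rule evaluation intervals, and it assumes the child route, which has no matchers, matches every alert so that its 10s group_wait applies. The helpers to_seconds and notify_delay are hypothetical and only understand the s, m and h suffixes:

```shell
# Convert a duration such as 20s, 5m or 4h to seconds.
to_seconds() {
  case $1 in
    *s) echo "${1%s}" ;;
    *m) echo $(( ${1%m} * 60 )) ;;
    *h) echo $(( ${1%h} * 3600 )) ;;
    *) echo "unsupported duration: $1" >&2; return 1 ;;
  esac
}

# Estimated seconds from "condition becomes true" to the first notification.
# Usage: notify_delay FOR_DURATION GROUP_WAIT
notify_delay() {
  echo $(( $(to_seconds "$1") + $(to_seconds "$2") ))
}

echo "exporter down: $(notify_delay 20s 10s)s"
echo "message backlog: $(notify_delay 5m 10s)s"
echo "latency / cluster TPS: $(notify_delay 3m 10s)s"
```

While an alert keeps firing, the same group is re-sent every repeat_interval, which is 4h in this configuration.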