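When a cluster has many nodes, connectivity problems on a single node are common, yet they often go unnoticed for too long. goldpinger was created to solve this problem, and it is more lightweight than comparable tools.
This document manages goldpinger through Argo CD.
I also tried Weave Scope but found that it consumed too many resources; see "Kubernetes Topology Visualization and Monitoring with Scope".

Reference: goldpinger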

I. Deployment

1 Prepare the manifests

If you use Argo CD, the manifests must be stored in a Git repository; see [[../DevOps/ArgoCD#2 argocd-server 配置与使用|ArgoCD]]

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: goldpinger-serviceaccount
  namespace: goldpinger

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: goldpinger
  namespace: goldpinger
  annotations:
    link.argocd.argoproj.io/external-link: http://10.122.167.173:30080 # link that the Argo CD UI jumps to
  labels:
    app: goldpinger
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: goldpinger
  template:
    metadata:
      labels:
        app: goldpinger
    spec:
      serviceAccount: "goldpinger-serviceaccount"
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 2000
      containers:
        - name: goldpinger
          env:
            - name: HOST
              value: "0.0.0.0"
            - name: PORT
              value: "8080"
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: HOSTS_TO_RESOLVE
              value: "www.baidu.com ccops.cc kubernetes.default" # domain names whose DNS resolution is checked
            - name: HTTP_TARGETS
              value: http://ccops.cc # URL whose HTTP reachability is checked
            - name: TCP_TARGETS
              value: 8.8.8.8:53 # TCP endpoint to check (here, a DNS server)
          image: "docker.io/bloomberg/goldpinger:v3.7.0"
          imagePullPolicy: Always
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
          resources:
            limits:
              memory: 80Mi
            requests:
              cpu: 1m
              memory: 40Mi
          ports:
            - containerPort: 8080
              name: http
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: goldpinger
  namespace: goldpinger
  annotations:
    link.argocd.argoproj.io/external-link: http://10.122.167.173:30080
  labels:
    app: goldpinger
spec:
  type: NodePort
  ports:
    - port: 8080
      name: http
      nodePort: 30080 # port exposed on every node
  selector:
    app: goldpinger
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: goldpinger-clusterrole
rules:
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - list # grant only the permission to list pods

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: goldpinger-clusterrolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: goldpinger-clusterrole
subjects:
  - kind: ServiceAccount
    name: goldpinger-serviceaccount
    namespace: goldpinger
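
The DaemonSet above carries no tolerations, so on clusters whose control-plane nodes are tainted, those nodes would probably get no goldpinger pod and would be missing from the topology. The fragment below is a minimal sketch of how that could be addressed, assuming the standard `node-role.kubernetes.io/control-plane` taint is in use; it is not part of the original manifests and the taint key may differ in your cluster.

```yaml
# Hypothetical addition to the goldpinger DaemonSet (merge into the existing spec).
# Assumption: control-plane nodes carry the standard NoSchedule taint.
spec:
  template:
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists      # match the taint regardless of its value
          effect: NoSchedule
```

Whether monitoring pods should run on control-plane nodes is a policy decision, so treat this as optional.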

2 Sync with Argo CD

The sync steps are skipped here; this is the resulting status.
image.png
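
For readers who have not yet created the Argo CD application, the sync step could be driven by an Application manifest along the following lines. This is a sketch under assumptions: the repository URL, path, and application name are hypothetical placeholders, and Argo CD is assumed to be installed in the `argocd` namespace and deploying to the cluster it runs in.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: goldpinger                # hypothetical application name
  namespace: argocd               # assumption: Argo CD lives in this namespace
spec:
  project: default
  source:
    repoURL: https://git.example.com/ops/k8s-manifests.git  # hypothetical repository
    targetRevision: HEAD
    path: goldpinger              # hypothetical directory holding the manifests above
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD itself runs in
    namespace: goldpinger
  syncPolicy:
    automated:
      prune: true                 # delete resources that were removed from Git
      selfHeal: true              # revert manual changes made in the cluster
    syncOptions:
      - CreateNamespace=true      # the manifests above do not create the namespace
```

The `CreateNamespace=true` option is included because the manifests reference the `goldpinger` namespace without defining it; it has no effect if the namespace already exists.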

3 View goldpinger

Clicking here jumps to the address configured in the annotations.
image.png
The topology is already visible.
image.png

II. Monitoring and alerting

1 Prometheus scraping

- job_name: 'goldpinger'
  metrics_path: '/metrics'
  static_configs:
    - targets: ["10.1.1.1:30080"]
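
A single NodePort target goes through the Service, so each scrape may land on a different goldpinger pod and the per-instance series can end up with gaps. If Prometheus runs inside the cluster and is allowed to list pods, an alternative is to discover every pod directly. The job below is a sketch under those assumptions; the job name and the extra `node` label are illustrative choices, not part of the original setup.

```yaml
- job_name: 'goldpinger-pods'        # hypothetical job name
  metrics_path: '/metrics'
  kubernetes_sd_configs:
    - role: pod                      # one target per pod container port
      namespaces:
        names: ['goldpinger']
  relabel_configs:
    # Keep only pods labelled app=goldpinger.
    - source_labels: [__meta_kubernetes_pod_label_app]
      regex: goldpinger
      action: keep
    # Keep only the container port named "http" (8080 in the DaemonSet).
    - source_labels: [__meta_kubernetes_pod_container_port_name]
      regex: http
      action: keep
    # Record the node each pod runs on as a "node" label.
    - source_labels: [__meta_kubernetes_pod_node_name]
      target_label: node
```

With this job the `instance` label becomes the pod IP and port rather than the NodePort address, so alert summaries that print `instance` will read differently.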

2 Alert rules

rules]# cat goldpinger.yml
groups:
  - name: goldpinger
    rules:
      - alert: goldpinger_nodes_unhealthy
        expr: sum(goldpinger_nodes_health_total{status="unhealthy"}) BY (instance, goldpinger_instance) > 0
        for: 5m
        annotations:
          description: |
            Goldpinger instance {{ $labels.goldpinger_instance }} has been reporting unhealthy nodes for at least 5 minutes.
          summary: 'Instance {{ $labels.instance }} down'
      - alert: goldpinger_http_unhealthy
        expr: sum(goldpinger_http_errors_total) BY (instance, goldpinger_instance, host) > 0
        for: 1m
        annotations:
          description: |
            Goldpinger instance {{ $labels.goldpinger_instance }} unable to access {{ $labels.host }} for at least 1 minute.
          summary: '{{ $labels.goldpinger_instance }} unable to access {{ $labels.host }}'
      - alert: goldpinger_tcp_unhealthy
        expr: sum(goldpinger_tcp_errors_total) BY (instance, goldpinger_instance, host) > 1
        for: 1m
        annotations:
          description: |
            Goldpinger instance {{ $labels.goldpinger_instance }} unable to access {{ $labels.host }} for at least 1 minute.
          summary: '{{ $labels.goldpinger_instance }} unable to access {{ $labels.host }}'
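
Before loading the rules into Prometheus, it can help to check that they behave as intended with a promtool unit test. The file below is a sketch: the file name `goldpinger_test.yml`, the `node-a` instance name, and the sample values are made up for illustration, and it assumes the test file sits next to `goldpinger.yml`. It feeds a constant unhealthy count into the first rule and expects the alert to be firing after the 5-minute `for` window.

```yaml
# goldpinger_test.yml (hypothetical file name)
rule_files:
  - goldpinger.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # One unhealthy node reported continuously; label values are illustrative.
      - series: 'goldpinger_nodes_health_total{status="unhealthy", instance="10.1.1.1:30080", goldpinger_instance="node-a"}'
        values: '1+0x10'   # the value 1 at every minute from 0m to 10m
    alert_rule_test:
      - eval_time: 6m      # past the 5m "for" duration
        alertname: goldpinger_nodes_unhealthy
        exp_alerts:
          - exp_labels:
              instance: 10.1.1.1:30080
              goldpinger_instance: node-a
            exp_annotations:
              description: |
                Goldpinger instance node-a has been reporting unhealthy nodes for at least 5 minutes.
              summary: 'Instance 10.1.1.1:30080 down'
```

It can be run with `promtool test rules goldpinger_test.yml`. The expected annotations must match the rendered templates exactly, so they need updating whenever the rule text changes.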

3 Grafana

Import this template

3.1 View the dashboard

image.png