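When a cluster has many nodes, connectivity problems on individual nodes are common yet often go unnoticed. goldpinger was created to solve exactly this problem, and it is more lightweight than comparable tools.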
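This document manages goldpinger through Argo CD.
I also tried Weave Scope but found that it uses too many resources; see "Kubernetes Topology Visualization Monitoring with Scope".
Reference: goldpinger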
I. Deployment
1 Prepare the configuration file
If you use Argo CD, the manifests must be stored in Git; see [[../DevOps/ArgoCD#2 argocd-server 配置与使用|ArgoCD]]
```yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: goldpinger-serviceaccount
  namespace: goldpinger
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: goldpinger
  namespace: goldpinger
  annotations:
    link.argocd.argoproj.io/external-link: http://10.122.167.173:30080
  labels:
    app: goldpinger
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: goldpinger
  template:
    metadata:
      labels:
        app: goldpinger
    spec:
      serviceAccount: "goldpinger-serviceaccount"
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 2000
      containers:
        - name: goldpinger
          env:
            - name: HOST
              value: "0.0.0.0"
            - name: PORT
              value: "8080"
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: HOSTS_TO_RESOLVE
              value: "www.baidu.com ccops.cc kubernetes.default"
            - name: HTTP_TARGETS
              value: http://ccops.cc
            - name: TCP_TARGETS
              value: 8.8.8.8:53
          image: "docker.io/bloomberg/goldpinger:v3.7.0"
          imagePullPolicy: Always
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
          resources:
            limits:
              memory: 80Mi
            requests:
              cpu: 1m
              memory: 40Mi
          ports:
            - containerPort: 8080
              name: http
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: goldpinger
  namespace: goldpinger
  annotations:
    link.argocd.argoproj.io/external-link: http://10.122.167.173:30080
  labels:
    app: goldpinger
spec:
  type: NodePort
  ports:
    - port: 8080
      name: http
      nodePort: 30080
  selector:
    app: goldpinger
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: goldpinger-clusterrole
rules:
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: goldpinger-clusterrolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: goldpinger-clusterrole
subjects:
  - kind: ServiceAccount
    name: goldpinger-serviceaccount
    namespace: goldpinger
```
2 Sync with Argo CD
Here we go straight to checking the sync status.
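For reference, the sync target can itself be declared as an Argo CD `Application`. The following is a minimal sketch, not the configuration used for this cluster: the repository URL, the `path`, and the Application name are placeholders, and it assumes Argo CD is installed in the `argocd` namespace and deploys to the cluster it runs in.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: goldpinger            # hypothetical Application name
  namespace: argocd           # assumes Argo CD runs in the argocd namespace
spec:
  project: default
  source:
    repoURL: https://git.example.com/ops/k8s-manifests.git  # placeholder repository
    targetRevision: HEAD
    path: goldpinger          # placeholder directory holding the goldpinger manifests
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD runs in
    namespace: goldpinger
  syncPolicy:
    automated:
      prune: true             # delete resources that were removed from Git
      selfHeal: true          # revert manual changes made in the cluster
    syncOptions:
      - CreateNamespace=true  # the manifests do not define the goldpinger namespace
```

With `CreateNamespace=true`, Argo CD creates the `goldpinger` namespace on the first sync. Without it the namespace would have to exist beforehand, since none of the manifests create it.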
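3 View goldpinger
Clicking the external link shown in Argo CD opens the address configured in the `annotations` field.
The topology graph is already visible.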
II. Monitoring and Alerting
1 Prometheus scraping
```yaml
- job_name: 'goldpinger'
  metrics_path: '/metrics'
  static_configs:
    - targets: ["10.1.1.1:30080"]
```
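One caveat with a static NodePort target: the Service load-balances across all goldpinger pods, so each scrape may be answered by a different instance. If Prometheus runs inside the cluster, a possible alternative is to discover every pod through the Service's endpoints. This is a sketch under assumptions: Prometheus has RBAC permission to list endpoints, services, and pods in the `goldpinger` namespace, and the `node` target label is only an illustrative choice.

```yaml
- job_name: 'goldpinger'
  metrics_path: '/metrics'
  kubernetes_sd_configs:
    - role: endpoints              # one target per pod behind a Service
      namespaces:
        names:
          - goldpinger
  relabel_configs:
    # Keep only the endpoints that belong to the goldpinger Service
    - source_labels: [__meta_kubernetes_service_name]
      action: keep
      regex: goldpinger
    # Record which node each scraped pod runs on (illustrative label name)
    - source_labels: [__meta_kubernetes_pod_node_name]
      target_label: node
```

Each DaemonSet pod then becomes its own scrape target, so the metrics of every node are collected on every scrape interval.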
2 Alerting rules
The rules file, `goldpinger.yml` in the `rules` directory:
```yaml
groups:
- name: goldpinger
  rules:
  - alert: goldpinger_nodes_unhealthy
    expr: sum(goldpinger_nodes_health_total{status="unhealthy"}) BY (instance, goldpinger_instance) > 0
    for: 5m
    annotations:
      description: |
        Goldpinger instance {{ $labels.goldpinger_instance }} has been reporting unhealthy nodes for at least 5 minutes.
      summary: 'Instance {{ $labels.instance }} down'
  - alert: goldpinger_http_unhealthy
    expr: sum(goldpinger_http_errors_total) BY (instance, goldpinger_instance, host) > 0
    for: 1m
    annotations:
      description: |
        Goldpinger instance {{ $labels.goldpinger_instance }} unable to access {{ $labels.host }} for at least 1 minute.
      summary: '{{ $labels.goldpinger_instance }} unable to access {{ $labels.host }}'
  - alert: goldpinger_tcp_unhealthy
    expr: sum(goldpinger_tcp_errors_total) BY (instance, goldpinger_instance, host) > 1
    for: 1m
    annotations:
      description: |
        Goldpinger instance {{ $labels.goldpinger_instance }} unable to access {{ $labels.host }} for at least 1 minute.
      summary: '{{ $labels.goldpinger_instance }} unable to access {{ $labels.host }}'
```
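The HTTP and TCP rules above compare a raw error total with a threshold. If these metrics are cumulative counters, as the `_total` suffix suggests (an assumption worth checking against the `/metrics` output), such an expression keeps firing long after a single past error. A hedged variant that only considers recent errors could look like this; the group name, alert name, and 5-minute window are illustrative choices, not part of the original rules.

```yaml
groups:
  - name: goldpinger-recent-errors             # hypothetical group name
    rules:
      - alert: goldpinger_http_recent_errors   # hypothetical alert name
        # increase() measures growth over the last 5 minutes, assuming a counter
        expr: sum(increase(goldpinger_http_errors_total[5m])) BY (goldpinger_instance, host) > 0
        for: 1m
        annotations:
          summary: '{{ $labels.goldpinger_instance }} unable to access {{ $labels.host }}'
```

The same pattern would apply to `goldpinger_tcp_errors_total` if it turns out to be a counter as well.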
3 Grafana
Import this template.
3.1 View