Note:
Because I previously installed rook-ceph and used Prometheus to monitor the rook-ceph cluster, prometheus-operator was installed along with it. The Prometheus that Rancher installs also uses prometheus-operator, so the existing one has to be deleted first; otherwise the two conflict and Rancher's Prometheus cannot start.
Once Rancher's Prometheus is installed, we will reinstall the Prometheus used by rook-ceph.
root@k8s-master71u:~/rook/deploy/examples/monitoring# kubectl delete -f bundle.yaml
customresourcedefinition.apiextensions.k8s.io "alertmanagerconfigs.monitoring.coreos.com" deleted
customresourcedefinition.apiextensions.k8s.io "alertmanagers.monitoring.coreos.com" deleted
customresourcedefinition.apiextensions.k8s.io "podmonitors.monitoring.coreos.com" deleted
customresourcedefinition.apiextensions.k8s.io "probes.monitoring.coreos.com" deleted
customresourcedefinition.apiextensions.k8s.io "prometheusagents.monitoring.coreos.com" deleted
customresourcedefinition.apiextensions.k8s.io "prometheuses.monitoring.coreos.com" deleted
customresourcedefinition.apiextensions.k8s.io "prometheusrules.monitoring.coreos.com" deleted
customresourcedefinition.apiextensions.k8s.io "scrapeconfigs.monitoring.coreos.com" deleted
customresourcedefinition.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" deleted
customresourcedefinition.apiextensions.k8s.io "thanosrulers.monitoring.coreos.com" deleted
clusterrolebinding.rbac.authorization.k8s.io "prometheus-operator" deleted
clusterrole.rbac.authorization.k8s.io "prometheus-operator" deleted
deployment.apps "prometheus-operator" deleted
serviceaccount "prometheus-operator" deleted
service "prometheus-operator" deleted
Install Prometheus
Click Cluster Tools -> Monitoring in the lower-left corner.
Prometheus: set the data retention period and enable persistent storage.
Grafana: enable persistent storage.
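The same settings can also be made directly in the chart's values YAML instead of the UI form. A minimal sketch based on the kube-prometheus-stack values that the rancher-monitoring chart inherits; the storage class rook-ceph-block, the retention period, and the sizes are placeholders for your environment:

prometheus:
  prometheusSpec:
    retention: 15d                             # how many days of metrics to keep
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: rook-ceph-block    # placeholder: use your StorageClass
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 50Gi
grafana:
  persistence:
    enabled: true
    type: pvc
    storageClassName: rook-ceph-block          # placeholder
    size: 10Gi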
Installation complete
root@k8s-master71u:~/rook/deploy/examples/monitoring# kubectl get pod -n cattle-monitoring-system
NAME                                                     READY   STATUS    RESTARTS      AGE
alertmanager-rancher-monitoring-alertmanager-0           2/2     Running   1 (49s ago)   58s
prometheus-rancher-monitoring-prometheus-0               3/3     Running   0             56s
rancher-monitoring-grafana-568d4fc6d5-zsnq7              3/4     Running   0             88s
rancher-monitoring-kube-state-metrics-56b4477cc-2zgmf    1/1     Running   0             88s
rancher-monitoring-operator-c66c76fd9-cns8h              1/1     Running   0             88s
rancher-monitoring-prometheus-adapter-7494f789f6-x2v6p   1/1     Running   0             88s
rancher-monitoring-prometheus-node-exporter-65n2k        1/1     Running   0             88s
rancher-monitoring-prometheus-node-exporter-7t4nt        1/1     Running   0             88s
rancher-monitoring-prometheus-node-exporter-s9kwh        1/1     Running   0             88s
rancher-monitoring-prometheus-node-exporter-w8tv9        1/1     Running   0             88s
rancher-monitoring-prometheus-node-exporter-zzqpc        1/1     Running   0             88s
Test the Monitoring features
View Prometheus
View Alertmanager
View Grafana
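The three UIs can also be reached from the command line with port-forwards. The Service names and ports below are the defaults I would expect from the rancher-monitoring chart; adjust them if your install differs:

kubectl -n cattle-monitoring-system port-forward svc/rancher-monitoring-prometheus 9090:9090
kubectl -n cattle-monitoring-system port-forward svc/rancher-monitoring-alertmanager 9093:9093
kubectl -n cattle-monitoring-system port-forward svc/rancher-monitoring-grafana 3000:80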
Reinstall the Prometheus used by rook-ceph
When reinstalling the Prometheus for rook-ceph, prometheus-operator does not need to be installed again; it is exactly what the bundle.yaml file deleted at the beginning provided.
root@k8s-master71u:~/rook/deploy/examples/monitoring# kubectl create -f prometheus.yaml
prometheus.monitoring.coreos.com/rook-prometheus created
root@k8s-master71u:~/rook/deploy/examples/monitoring# kubectl create -f prometheus-service.yaml
The Service "rook-prometheus" is invalid: spec.ports[0].nodePort: Invalid value: 30900: provided port is already allocated
root@k8s-master71u:~/rook/deploy/examples/monitoring# kubectl create -f service-monitor.yaml
servicemonitor.monitoring.coreos.com/rook-ceph-mgr created
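Note that the prometheus-service.yaml step above failed because NodePort 30900 is already taken. One possible way out (my assumption; the original transcript does not show the fix) is to find the conflicting Service and give rook-prometheus a free NodePort:

# Find which Service is already using NodePort 30900
kubectl get svc -A | grep 30900
# Edit prometheus-service.yaml, change spec.ports[0].nodePort to a free
# port such as 30901, then create the Service again
kubectl create -f prometheus-service.yaml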
Kubelet's built-in cAdvisor cannot obtain certain labels
Reference article: https://ithelp.ithome.com.tw/m/articles/10331330
After Kubernetes removed its dependency on dockershim in v1.24, the cAdvisor service built into the Kubelet, which collects the various container metrics, can no longer obtain image- and container-related metrics. The problem is still open today:
- https://github.com/google/cadvisor/issues/3162
- https://github.com/prometheus-community/helm-charts/issues/3058
- https://github.com/google/cadvisor/issues/3336
Because of this, many of the resource-usage charts in our Kube-Prometheus-Stack receive no data, and most of them cannot render properly.
The community workaround at the moment is to deploy cAdvisor ourselves in the Kubernetes cluster, in place of the Kubelet's built-in one, so that correct container resource metrics are produced again; this addresses the problem commonly hit after upgrading to v1.24.
First, turn off the scraping of the Kubelet's built-in cAdvisor:
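In the Monitoring chart's values this can be done by disabling the cAdvisor endpoint of the kubelet ServiceMonitor. A minimal sketch, assuming the rancher-monitoring chart exposes kube-prometheus-stack's kubelet.serviceMonitor.cAdvisor switch:

# Cluster Tools -> Monitoring -> Edit YAML
# Stop Prometheus from scraping the Kubelet's broken /metrics/cadvisor endpoint
kubelet:
  serviceMonitor:
    cAdvisor: false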
Next, we install a standalone cAdvisor service and create a ServiceMonitor resource so that Prometheus scrapes container metrics from it. One caveat: the release: prometheus-stack label on the ServiceMonitor below comes from the referenced article; depending on how your Prometheus selects ServiceMonitors, you may need to change it to your own release label (for Rancher's Monitoring that would be release: rancher-monitoring):
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: cadvisor
  name: cadvisor
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app: cadvisor
  name: cadvisor
rules:
- apiGroups:
  - policy
  resourceNames:
  - cadvisor
  resources:
  - podsecuritypolicies
  verbs:
  - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app: cadvisor
  name: cadvisor
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cadvisor
subjects:
- kind: ServiceAccount
  name: cadvisor
  namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: docker/default
  labels:
    app: cadvisor
  name: cadvisor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cadvisor
      name: cadvisor
  template:
    metadata:
      labels:
        app: cadvisor
        name: cadvisor
    spec:
      automountServiceAccountToken: false
      containers:
      - args:
        - --housekeeping_interval=10s
        - --max_housekeeping_interval=15s
        - --event_storage_event_limit=default=0
        - --event_storage_age_limit=default=0
        - --enable_metrics=app,cpu,disk,diskIO,memory,network,process
        - --docker_only
        - --store_container_labels=false
        - --whitelisted_container_labels=io.kubernetes.container.name,io.kubernetes.pod.name,io.kubernetes.pod.namespace
        image: gcr.io/cadvisor/cadvisor:v0.45.0
        name: cadvisor
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        resources:
          limits:
            cpu: 800m
            memory: 2000Mi
          requests:
            cpu: 400m
            memory: 400Mi
        volumeMounts:
        - mountPath: /rootfs
          name: rootfs
          readOnly: true
        - mountPath: /var/run
          name: var-run
          readOnly: true
        - mountPath: /sys
          name: sys
          readOnly: true
        - mountPath: /var/lib/docker
          name: docker
          readOnly: true
        - mountPath: /dev/disk
          name: disk
          readOnly: true
      priorityClassName: system-node-critical
      serviceAccountName: cadvisor
      terminationGracePeriodSeconds: 30
      tolerations:
      - key: node-role.kubernetes.io/controlplane
        value: "true"
        effect: NoSchedule
      - key: node-role.kubernetes.io/etcd
        value: "true"
        effect: NoExecute
      volumes:
      - hostPath:
          path: /
        name: rootfs
      - hostPath:
          path: /var/run
        name: var-run
      - hostPath:
          path: /sys
        name: sys
      - hostPath:
          path: /var/lib/docker
        name: docker
      - hostPath:
          path: /dev/disk
        name: disk
---
apiVersion: v1
kind: Service
metadata:
  name: cadvisor
  labels:
    app: cadvisor
  namespace: kube-system
spec:
  selector:
    app: cadvisor
  ports:
  - name: cadvisor
    port: 8080
    protocol: TCP
    targetPort: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: cadvisor
    release: prometheus-stack
  name: cadvisor
  namespace: kube-system
spec:
  endpoints:
  - metricRelabelings:
    - sourceLabels:
      - container_label_io_kubernetes_pod_name
      targetLabel: pod
    - sourceLabels:
      - container_label_io_kubernetes_container_name
      targetLabel: container
    - sourceLabels:
      - container_label_io_kubernetes_pod_namespace
      targetLabel: namespace
    - action: labeldrop
      regex: container_label_io_kubernetes_pod_name
    - action: labeldrop
      regex: container_label_io_kubernetes_container_name
    - action: labeldrop
      regex: container_label_io_kubernetes_pod_namespace
    port: cadvisor
    relabelings:
    - sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: node
    - sourceLabels:
      - __metrics_path__
      targetLabel: metrics_path
      replacement: /metrics/cadvisor
    - sourceLabels:
      - job
      targetLabel: job
      replacement: kubelet
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app: cadvisor
Ref: https://github.com/rancher/rancher/issues/38934#issuecomment-1294585708
After applying the manifests above, cAdvisor starts collecting metrics for us:
root@k8s-master71u:~/kube-prometheus-stack# kubectl create -f cAdvisor.yaml
serviceaccount/cadvisor created
clusterrole.rbac.authorization.k8s.io/cadvisor created
clusterrolebinding.rbac.authorization.k8s.io/cadvisor created
daemonset.apps/cadvisor created
service/cadvisor created
servicemonitor.monitoring.coreos.com/cadvisor created
With this in place, the charts render properly again:
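A quick way to confirm this from the Prometheus UI itself is to run a container-level query and check that the namespace and pod labels are now populated, for example:

sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace, pod)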
Grant a User permission to view monitoring data
Create a Project named Monitoring
Add the User as a member of this Project
Move the cattle-monitoring-system namespace into this Project
By default, cattle-monitoring-system does not belong to any Project.
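Under the hood, Rancher records a namespace's Project through the field.cattle.io/projectId annotation, so the UI move can also be done with kubectl. A sketch, where c-xxxxx:p-xxxxx stands in for your actual cluster and project IDs:

# Assign the namespace to the Monitoring project (placeholder IDs)
kubectl annotate namespace cattle-monitoring-system \
  field.cattle.io/projectId=c-xxxxx:p-xxxxx --overwrite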