K8S部署Redis集群(ot-redis-operator方式)
myluzh 发布于 阅读:430 Kubernetes
0x01 介绍
github地址:https://github.com/OT-CONTAINER-KIT/redis-operator
通过redis-operater支持以下方式部署:
RedisCluster 适用于需要高可用性和数据分片的场景,适合大规模部署。
RedisReplication 适合需要读写分离的场景,提供主从复制和负载均衡。
RedisSentinel 适合需要高可用性和自动故障转移的场景,提供监控和故障恢复能力。
Redis 适合简单的单实例部署,用于开发、测试或小型应用。
注意:要使用opstree/redis:xxx镜像,不能用redis官方镜像,要不然有问题。https://quay.io/repository/opstree/redis?tab=tags
0x02 部署redis-operator
helm在线部署
helm repo add ot-helm https://ot-container-kit.github.io/helm-charts/
helm upgrade redis-operator ot-helm/redis-operator \
--install --create-namespace --namespace redis-system
kubectl get pod -n redis-system
helm离线部署
下载redis-operator的Helm Chart包:https://github.com/OT-CONTAINER-KIT/helm-charts/releases?q=redis-operator
wget https://github.com/OT-CONTAINER-KIT/helm-charts/releases/download/redis-operator-0.21.2/redis-operator-0.21.2.tgz
helm upgrade redis-operator ./redis-operator-0.21.2.tgz \
--install --create-namespace --namespace redis-system
关于operator 跟对应redis版本支持 可以看支持表:https://github.com/OT-CONTAINER-KIT/redis-operator?tab=readme-ov-file#image-compatibility
# Operator Version: v0.19.x
# Redis Image: > v7.0.12, >= v6.2.14
# Sentinel Image: > v7.0.12, >= v6.2.14
# Exporter Image: v1.44.0
0x03 部署redis集群
需要先创建个secret,定义redis的密码,后面部署要用。
---
apiVersion: v1
kind: Secret
metadata:
name: redis-secret
namespace: my-app
data:
password: TXlsdXpoQDEyMzQlMEE=
type: Opaque
也可以命令直接创建:kubectl create secret generic redis-secret --from-literal=password=Myluzh@1234 -n my-app
部署redis分片集群
github example:https://github.com/OT-CONTAINER-KIT/redis-operator/blob/main/example/v1beta2/password_protected/clusterd.yaml
apiVersion: redis.redis.opstreelabs.in/v1beta2
kind: RedisCluster
metadata:
name: redis-cluster
namespace: my-app
spec:
clusterSize: 3
clusterVersion: v7
podSecurityContext:
runAsUser: 1000
fsGroup: 1000
persistenceEnabled: true
kubernetesConfig:
image: quay.io/opstree/redis:v7.0.15
imagePullPolicy: IfNotPresent
redisSecret:
name: redis-secret
key: password
redisExporter:
enabled: false
image: quay.io/opstree/redis-exporter:v1.44.0
storage:
volumeClaimTemplate:
spec:
# storageClassName: standard
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 1Gi
nodeConfVolumeClaimTemplate:
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 1Gi
查看状态
root@k8s-master01:~# kubectl get Rediscluster -n my-app
NAME CLUSTERSIZE READYLEADERREPLICAS READYFOLLOWERREPLICAS
redis-cluster 3 3 3
root@k8s-master01:~# kubectl get pod -n my-app
NAME READY STATUS RESTARTS AGE
redis-cluster-follower-0 1/1 Running 0 2m51s
redis-cluster-follower-1 1/1 Running 0 2m49s
redis-cluster-follower-2 1/1 Running 0 2m46s
redis-cluster-leader-0 1/1 Running 0 3m2s
redis-cluster-leader-1 1/1 Running 0 2m58s
redis-cluster-leader-2 1/1 Running 0 2m55s
root@k8s-master01:~# kubectl get svc -n my-app
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
redis-cluster-follower ClusterIP 10.43.76.76 <none> 6379/TCP 2m57s
redis-cluster-follower-additional ClusterIP 10.43.27.197 <none> 6379/TCP 2m57s
redis-cluster-follower-headless ClusterIP None <none> 6379/TCP 2m57s
redis-cluster-leader ClusterIP 10.43.141.199 <none> 6379/TCP 3m6s
redis-cluster-leader-additional ClusterIP 10.43.220.170 <none> 6379/TCP 3m5s
redis-cluster-leader-headless ClusterIP None <none> 6379/TCP 3m7s
redis-cluster-master ClusterIP 10.43.59.246 <none> 6379/TCP 3m5s
部署redis主从副本+哨兵
github example:https://github.com/OT-CONTAINER-KIT/redis-operator/blob/main/example/v1beta2/password_protected/sentinel.yaml
apiVersion: redis.redis.opstreelabs.in/v1beta2
kind: RedisSentinel
metadata:
# will append '-sentinel' to the names of StatefulSet and Pods, e.g. 'redis-sentinel-sentinel'
name: redis-sentinel
namespace: my-app
spec:
clusterSize: 3
podSecurityContext:
runAsUser: 1000
fsGroup: 1000
# this is needed for the sentinel to connect to the nodes. `redisSecret` only controls access to sentinel
redisSentinelConfig:
# 这里的redisReplicationName要跟下面的RedisReplication name对应。
redisReplicationName: redis-replication
redisReplicationPassword:
secretKeyRef:
name: redis-secret
key: password
kubernetesConfig:
image: quay.io/opstree/redis-sentinel:v6.2.17
imagePullPolicy: IfNotPresent
# only controls access to sentinel, use `redisReplicationPassword` for node connection
redisSecret:
name: redis-secret
key: password
resources:
requests:
cpu: 101m
memory: 128Mi
limits:
cpu: 101m
memory: 128Mi
---
apiVersion: redis.redis.opstreelabs.in/v1beta2
kind: RedisReplication
metadata:
name: redis-replication
namespace: my-app
spec:
clusterSize: 3
kubernetesConfig:
image: quay.io/opstree/redis:v6.2.17
imagePullPolicy: IfNotPresent
redisSecret:
name: redis-secret
key: password
storage:
volumeClaimTemplate:
spec:
# storageClassName: standard
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 1Gi
redisExporter:
enabled: false
image: quay.io/opstree/redis-exporter:v1.44.0
podSecurityContext:
runAsUser: 1000
fsGroup: 1000
查看状态
root@k8s-master01:~# kubectl get RedisReplication -n my-app
NAME MASTER AGE
redis-replication redis-replication-0 10m
root@k8s-master01:~# kubectl get RedisSentinel -n my-app
NAME AGE
redis-sentinel 10m
root@k8s-master01:~# kubectl get pod -n my-app
NAME READY STATUS RESTARTS AGE
redis-replication-0 1/1 Running 0 103s
redis-replication-1 1/1 Running 0 84s
redis-replication-2 1/1 Running 0 70s
redis-sentinel-sentinel-0 1/1 Running 0 29s
redis-sentinel-sentinel-1 1/1 Running 0 27s
redis-sentinel-sentinel-2 1/1 Running 0 25s
root@k8s-master01:~# kubectl get svc -n my-app
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
redis-replication ClusterIP 10.43.100.80 <none> 6379/TCP 111s
redis-replication-additional ClusterIP 10.43.117.24 <none> 6379/TCP 111s
redis-replication-headless ClusterIP None <none> 6379/TCP 111s
redis-replication-master ClusterIP 10.43.60.17 <none> 6379/TCP 111s
redis-replication-replica ClusterIP 10.43.129.103 <none> 6379/TCP 111s
redis-sentinel-sentinel ClusterIP 10.43.87.203 <none> 26379/TCP 27s
redis-sentinel-sentinel-additional ClusterIP 10.43.230.83 <none> 26379/TCP 27s
redis-sentinel-sentinel-headless ClusterIP None <none> 26379/TCP 28s
0x04 模拟故障
# 首先查看redis主节点在哪
root@k8s-master01:~# kubectl -n my-app exec -it redis-client -- sh
/sentinel-data $ redis-cli -p 26379 -a Myluzh@1234
OK
127.0.0.1:26379> SENTINEL masters
1) 1) "name"
2) "myMaster"
3) "ip"
4) "10.42.1.241"
5) "port"
6) "6379"
# 把reids主节点所在的机器直接关机,模拟故障
root@k8s-master01:~# kubectl get pod -n my-app -o wide | grep 10.42.1.241
redis-replication-0 1/1 Running 0 75m 10.42.1.241 k8s-worker01 <none> <none>
root@k8s-worker01:~# poweroff
# 可以看到已经停止了
root@k8s-master01:~# kubectl get pod -n my-app -o wide | grep worker01
redis-client 1/1 Terminating 15 (19m ago) 15h 10.42.1.131 k8s-worker01 <none> <none>
redis-replication-0 1/1 Terminating 0 79m 10.42.1.241 k8s-worker01 <none> <none>
redis-sentinel-sentinel-0 1/1 Terminating 0 30m 10.42.1.121 k8s-worker01 <none> <none>
# 再次查看redis主节点,可以发现已经改变
root@k8s-master01:~# kubectl -n my-app exec -it rredis-client -- sh
/sentinel-data $ redis-cli -p 26379 -a Myluzh@1234
OK
127.0.0.1:26379> SENTINEL masters
1) 1) "name"
2) "myMaster"
3) "ip"
4) "10.42.2.102"
0x05 监控配置
redis-exporter服务监控配置
让prometheus能够获取到redis的指标。首先需要部署prometheus,这里不再赘述。
监控相关配置可参考文档 https://ot-container-kit.github.io/redis-operator/guide/monitoring.html
看下svc标签
root@k8s-master01:~# kubectl get svc -n pmip-app --show-labels
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE LABELS
redis-replication ClusterIP 10.43.33.33 <none> 6379/TCP,9121/TCP 29h app=redis-replication,redis_setup_type=replication,role=replication
查看9121端口的name字段
root@k8s-master01:~# kubectl -n pmip-app get svc redis-replication -o jsonpath='{.spec.ports[?(@.port==9121)].name}'
redis-exporter
编写yaml
root@k8s-master01:~# cat redis-servermonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: redis-replication-metrics
namespace: monitoring # Prometheus 所在命名空间
labels:
app: redis-replication
spec:
jobLabel: redis-replication
selector:
matchLabels:
app: redis-replication # 匹配 Service 的标签
namespaceSelector:
matchNames:
- pmip-app # 指定 redis 所在命名空间
endpoints:
- port: "redis-exporter" # 这个是 service 的端口名称,即 service yaml的spec.ports.name, 不是填metrics端口号 9121
interval: 15s
path: /metrics
没问题的话prometheus就可以收集到指标了,有问题就看prometheus日志。
grafana配置
官方推荐的grafana:https://ot-container-kit.github.io/redis-operator/guide/grafana.html | Redis Operator Cluster Dashboard for Prometheus
也可以直接到grafana官方搜索redis相关的dashboard,https://grafana.com/grafana/dashboards/?pg=graf&plcmt=dashboard-below-txt&search=redis, 选择中意的dashboard导入到自有的grafana示例中即可。
告警配置
# cat rules.yaml
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
labels:
monitor-svc: redis-exporter
role: alert-rules
name: redis-metrics-rules
namespace: monitoring
spec:
groups:
- name: Redis-监控告警
rules:
- alert: 警报!Redis应用不可用
expr: redis_up == 0
for: 0m
labels:
severity: 严重告警
annotations:
summary: "{{ $labels.instance }} Redis应用不可用"
description: "Redis应用不可达\n 当前值 = {{ $value }}"
- alert: 警报!丢失Master节点
expr: (count(redis_instance_info{role="master"}) ) < 1
for: 0m
labels:
severity: 严重告警
annotations:
summary: "{{ $labels.instance }} 丢失Redis master"
description: "Redis集群当前没有主节点\n 当前值 = {{ $value }}"
- alert: 警报!脑裂,主节点太多
expr: count(redis_instance_info{role="master"}) > 1
for: 0m
labels:
severity: 严重告警
annotations:
summary: "{{ $labels.instance }} Redis脑裂,主节点太多"
description: "{{ $labels.instance }} 主节点太多\n 当前值 = {{ $value }}"
- alert: 警报!Slave连接不可达
expr: count without (instance, job) (redis_connected_slaves) - sum without (instance, job) (redis_connected_slaves) - 1 > 1
for: 0m
labels:
severity: 严重告警
annotations:
summary: "{{ $labels.instance }} Redis丢失slave节点"
description: "Redis slave不可达.请确认主从同步状态\n 当前值 = {{ $value }}"
- alert: 警报!Redis副本不一致
expr: delta(redis_connected_slaves[1m]) < 0
for: 0m
labels:
severity: 严重告警
annotations:
summary: "{{ $labels.instance }} Redis 副本不一致"
description: "Redis集群丢失一个slave节点\n 当前值 = {{ $value }}"
- alert: 警报!Redis集群抖动
expr: changes(redis_connected_slaves[1m]) > 1
for: 2m
labels:
severity: 严重告警
annotations:
summary: "{{ $labels.instance }} Redis集群抖动"
description: "Redis集群抖动,请检查.\n 当前值 = {{ $value }}"
- alert: 警报!持久化失败
expr: (time() - redis_rdb_last_save_timestamp_seconds) / 3600 > 24
for: 0m
labels:
severity: 严重告警
annotations:
summary: "{{ $labels.instance }} Redis持久化失败"
description: "Redis持久化失败(>24小时)\n 当前值 = {{ printf "%.1f" $value }}小时"
- alert: 警报!内存不足
expr: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90
for: 2m
labels:
severity: 一般告警
annotations:
summary: "{{ $labels.instance }}系统内存不足"
description: "Redis占用系统内存(> 90%)\n 当前值 = {{ printf "%.2f" $value }}%"
- alert: 警报!Maxmemory不足
expr: redis_config_maxmemory !=0 and redis_memory_used_bytes / redis_memory_max_bytes * 100 > 80
for: 2m
labels:
severity: 一般告警
annotations:
summary: "{{ $labels.instance }} Maxmemory设置太小"
description: "超出设置最大内存(> 80%)\n 当前值 = {{ printf "%.2f" $value }}%"
- alert: 警报!连接数太多
expr: redis_connected_clients > 200
for: 2m
labels:
severity: 一般告警
annotations:
summary: "{{ $labels.instance }} 实时连接数太多"
description: "连接数太多(>200)\n 当前值 = {{ $value }}"
- alert: 警报!连接数太少
expr: redis_connected_clients < 1
for: 2m
labels:
severity: 一般告警
annotations:
summary: "{{ $labels.instance }} 实时连接数太少"
description: "连接数(<1)\n 当前值 = {{ $value }}"
- alert: 警报!拒绝连接数
expr: increase(redis_rejected_connections_total[1m]) > 0
for: 0m
labels:
severity: 严重告警
annotations:
summary: "{{ $labels.instance }} 拒绝连接"
description: "Redis有拒绝连接,请检查连接数配置\n 当前值 = {{ printf "%.0f" $value }}"
- alert: 警报!执行命令数过多
expr: rate(redis_commands_processed_total[1m]) > 1000
for: 0m
labels:
severity: 严重告警
annotations:
summary: "{{ $labels.instance }} 执行命令次数太多"
description: "Redis执行命令次数太多\n 当前值 = {{ printf "%.0f" $value }}"
参考链接:
K8s采用Operator部署redis-cluster实战指南 https://lbs.wiki/pages/6024f5c5/index.html
https://github.com/OT-CONTAINER-KIT/redis-operator/tree/master/example
https://github.com/OT-CONTAINER-KIT/helm-charts/tree/main/charts/redis-cluster
https://ot-container-kit.github.io/redis-operator/guide/redis-config.html
k8s kubernetes helm redis rediscluster redisreplication redissentinel ot ot-redis-operator