In Kubernetes, the Pod management objects RC, Deployment, DaemonSet, and Job are all designed for stateless services. In practice, however, many services are stateful, especially complex middleware clusters such as MySQL, MongoDB, Akka, and ZooKeeper clusters. These clustered applications share four characteristics.
(1) Each node has a fixed identity ID, through which the members of the cluster discover and communicate with one another.
(2) The size of the cluster is relatively fixed and cannot be changed arbitrarily.
(3) Each node in the cluster is stateful and usually persists its data to durable storage.
(4) If a disk is damaged, some node in the cluster cannot run properly and the cluster's functionality is impaired.
If we tried to build such a stateful cluster by controlling the number of Pod replicas with an RC or Deployment, we would find that point (1) cannot be satisfied: Pod names are generated randomly, and a Pod's IP address is determined only at runtime and may change, so we cannot assign each Pod a unique, immutable ID in advance. In addition, to recover a failed node on another machine, the Pods in such a cluster need to mount some kind of shared storage. To solve these problems, Kubernetes introduced the PetSet resource object in version 1.4 and renamed it StatefulSet in version 1.5. A StatefulSet can essentially be seen as a special variant of Deployment/RC, with the following characteristics.
Each Pod in a StatefulSet has a stable, unique network identity: backed by a headless Service, a Pod's DNS name takes the form $(podname).$(headless service name). For example, for a 3-node Kafka StatefulSet whose headless Service is named kafka and whose StatefulSet is also named kafka, the DNS names of the three Pods are kafka-0.kafka, kafka-1.kafka, and kafka-2.kafka. These DNS names can be fixed directly in the cluster's configuration files.
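Because the naming rule is deterministic, the full set of DNS names can be derived offline, which is why they are safe to hard-code in configuration. A minimal plain-shell sketch (no cluster required; `pod_dns_names` is a hypothetical helper, and the kafka names come from the example above):

```shell
# pod_dns_names STATEFULSET SERVICE REPLICAS
# Prints one stable DNS name per line, following the StatefulSet rule:
#   $(statefulset name)-$(ordinal).$(headless service name), ordinal = 0..replicas-1
pod_dns_names() {
  set_name=$1; svc=$2; replicas=$3
  i=0
  while [ "$i" -lt "$replicas" ]; do
    echo "${set_name}-${i}.${svc}"
    i=$((i + 1))
  done
}

pod_dns_names kafka kafka 3
# kafka-0.kafka
# kafka-1.kafka
# kafka-2.kafka
```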
Take a simple nginx service, web.yaml, as an example:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```

```shell
(base) [root@liuxinyuan-master yaml]$ kubectl create -f web.yaml
service "nginx" created
statefulset "web" created

# Inspect the headless Service and the StatefulSet
(base) [root@liuxinyuan-master yaml]$ kubectl get service nginx
NAME      CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
nginx     None         <none>        80/TCP    1m
(base) [root@liuxinyuan-master yaml]$ kubectl get statefulset web
NAME      DESIRED   CURRENT   AGE
web       2         2         2m

# PVCs are created automatically from volumeClaimTemplates
# (on GCE, kubernetes.io/gce-pd volumes are provisioned automatically)
(base) [root@liuxinyuan-master yaml]$ kubectl get pvc
NAME        STATUS    VOLUME                                     CAPACITY   ACCESSMODES   AGE
www-web-0   Bound     pvc-d064a004-d8d4-11e6-b521-42010a800002   1Gi        RWO           16s
www-web-1   Bound     pvc-d06a3946-d8d4-11e6-b521-42010a800002   1Gi        RWO           16s

# The Pods are created in order
(base) [root@liuxinyuan-master yaml]$ kubectl get pods -l app=nginx
NAME      READY     STATUS    RESTARTS   AGE
web-0     1/1       Running   0          5m
web-1     1/1       Running   0          4m

# Use nslookup to check the Pods' DNS records
(base) [root@liuxinyuan-master yaml]$ kubectl run -i --tty --image busybox dns-test --restart=Never --rm /bin/sh
/ # nslookup web-0.nginx
Server:    10.0.0.10
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local

Name:      web-0.nginx
Address 1: 10.244.2.10
/ # nslookup web-1.nginx
Server:    10.0.0.10
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local

Name:      web-1.nginx
Address 1: 10.244.3.12
/ # nslookup web-0.nginx.default.svc.cluster.local
Server:    10.0.0.10
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local

Name:      web-0.nginx.default.svc.cluster.local
Address 1: 10.244.2.10
```

Other operations are also possible:
```shell
# Scale up
(base) [root@liuxinyuan-master yaml]$ kubectl scale statefulset web --replicas=5

# Scale down
(base) [root@liuxinyuan-master yaml]$ kubectl patch statefulset web -p '{"spec":{"replicas":3}}'

# Update the image (updating image directly is not yet supported; use patch as a workaround)
(base) [root@liuxinyuan-master yaml]$ kubectl patch statefulset web --type='json' -p='[{"op":"replace","path":"/spec/template/spec/containers/0/image","value":"gcr.io/google_containers/nginx-slim:0.7"}]'

# Delete the StatefulSet and the headless Service
(base) [root@liuxinyuan-master yaml]$ kubectl delete statefulset web
(base) [root@liuxinyuan-master yaml]$ kubectl delete service nginx

# PVCs are retained after the StatefulSet is deleted; delete them too once the data is no longer needed
(base) [root@liuxinyuan-master yaml]$ kubectl delete pvc www-web-0 www-web-1
```

Since v1.7, StatefulSets support automatic updates; the update strategy is configured via spec.updateStrategy. Two strategies are currently supported:
- OnDelete: when .spec.template is updated, old Pods are not deleted immediately; the controller waits for the user to delete the old Pods manually and then creates new Pods automatically. This is the default strategy, compatible with the behavior of v1.6.
- RollingUpdate: when .spec.template is updated, old Pods are deleted automatically and new Pods are created to replace them. Pods are updated in reverse ordinal order: each Pod is deleted, recreated, and must reach the Ready state before the next Pod is updated. RollingUpdate also supports partitions, set via .spec.updateStrategy.rollingUpdate.partition. When partition is set, only Pods whose ordinal is greater than or equal to partition are rolled when .spec.template is updated; the remaining Pods stay unchanged (even if deleted, they are recreated from the previous version).
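The strategy can also be declared directly in the manifest rather than patched in afterwards; a minimal fragment for the web StatefulSet above (assuming the same spec layout as web.yaml):

```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 3  # only Pods with ordinal >= 3 roll on .spec.template changes
```

With partition at its default of 0, every Pod is rolled on a template change.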
```shell
# Set partition to 3
(base) [root@liuxinyuan-master yaml]$ kubectl patch statefulset web -p '{"spec":{"updateStrategy":{"type":"RollingUpdate","rollingUpdate":{"partition":3}}}}'
statefulset "web" patched

# Update the StatefulSet
(base) [root@liuxinyuan-master yaml]$ kubectl patch statefulset web --type='json' -p='[{"op":"replace","path":"/spec/template/spec/containers/0/image","value":"gcr.io/google_containers/nginx-slim:0.7"}]'
statefulset "web" patched

# Verify the update
(base) [root@liuxinyuan-master yaml]$ kubectl delete po web-2
pod "web-2" deleted
(base) [root@liuxinyuan-master yaml]$ kubectl get po -lapp=nginx -w
NAME      READY     STATUS              RESTARTS   AGE
web-0     1/1       Running             0          4m
web-1     1/1       Running             0          4m
web-2     0/1       ContainerCreating   0          11s
web-2     1/1       Running             0          18s
```

Since v1.7, the Pod management policy can be set via .spec.podManagementPolicy. Two policies are supported: OrderedReady (the default, in which Pods are created, scaled, and deleted one at a time, in order) and Parallel (in which Pods are launched or terminated in parallel). The following manifest, webp.yaml, sets podManagementPolicy to Parallel:
```yaml
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  podManagementPolicy: "Parallel"
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```

As you can see, all the Pods are created in parallel:
```shell
(base) [root@liuxinyuan-master yaml]$ kubectl create -f webp.yaml
service "nginx" created
statefulset "web" created
(base) [root@liuxinyuan-master yaml]$ kubectl get po -lapp=nginx -w
NAME      READY     STATUS              RESTARTS   AGE
web-0     0/1       Pending             0          0s
web-0     0/1       Pending             0          0s
web-1     0/1       Pending             0          0s
web-1     0/1       Pending             0          0s
web-0     0/1       ContainerCreating   0          0s
web-1     0/1       ContainerCreating   0          0s
web-0     1/1       Running             0          10s
web-1     1/1       Running             0          10s
```

A more illustrative example of the power of StatefulSet is zookeeper.yaml:
```yaml
---
apiVersion: v1
kind: Service
metadata:
  name: zk-headless
  labels:
    app: zk-headless
spec:
  ports:
  - port: 2888
    name: server
  - port: 3888
    name: leader-election
  clusterIP: None
  selector:
    app: zk
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: zk-config
data:
  ensemble: "zk-0;zk-1;zk-2"
  jvm.heap: "2G"
  tick: "2000"
  init: "10"
  sync: "5"
  client.cnxns: "60"
  snap.retain: "3"
  purge.interval: "1"
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-budget
spec:
  selector:
    matchLabels:
      app: zk
  minAvailable: 2
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: zk
spec:
  serviceName: zk-headless
  replicas: 3
  template:
    metadata:
      labels:
        app: zk
      annotations:
        pod.alpha.kubernetes.io/initialized: "true"
        scheduler.alpha.kubernetes.io/affinity: >
          {
            "podAntiAffinity": {
              "requiredDuringSchedulingRequiredDuringExecution": [{
                "labelSelector": {
                  "matchExpressions": [{
                    "key": "app",
                    "operator": "In",
                    "values": ["zk-headless"]
                  }]
                },
                "topologyKey": "kubernetes.io/hostname"
              }]
            }
          }
    spec:
      containers:
      - name: k8szk
        imagePullPolicy: Always
        image: gcr.io/google_samples/k8szk:v1
        resources:
          requests:
            memory: "4Gi"
            cpu: "1"
        ports:
        - containerPort: 2181
          name: client
        - containerPort: 2888
          name: server
        - containerPort: 3888
          name: leader-election
        env:
        - name: ZK_ENSEMBLE
          valueFrom:
            configMapKeyRef:
              name: zk-config
              key: ensemble
        - name: ZK_HEAP_SIZE
          valueFrom:
            configMapKeyRef:
              name: zk-config
              key: jvm.heap
        - name: ZK_TICK_TIME
          valueFrom:
            configMapKeyRef:
              name: zk-config
              key: tick
        - name: ZK_INIT_LIMIT
          valueFrom:
            configMapKeyRef:
              name: zk-config
              key: init
        - name: ZK_SYNC_LIMIT
          valueFrom:
            configMapKeyRef:
              name: zk-config
              key: sync
        - name: ZK_MAX_CLIENT_CNXNS
          valueFrom:
            configMapKeyRef:
              name: zk-config
              key: client.cnxns
        - name: ZK_SNAP_RETAIN_COUNT
          valueFrom:
            configMapKeyRef:
              name: zk-config
              key: snap.retain
        - name: ZK_PURGE_INTERVAL
          valueFrom:
            configMapKeyRef:
              name: zk-config
              key: purge.interval
        - name: ZK_CLIENT_PORT
          value: "2181"
        - name: ZK_SERVER_PORT
          value: "2888"
        - name: ZK_ELECTION_PORT
          value: "3888"
        command:
        - sh
        - -c
        - zkGenConfig.sh && zkServer.sh start-foreground
        readinessProbe:
          exec:
            command:
            - "zkOk.sh"
          initialDelaySeconds: 15
          timeoutSeconds: 5
        livenessProbe:
          exec:
            command:
            - "zkOk.sh"
          initialDelaySeconds: 15
          timeoutSeconds: 5
        volumeMounts:
        - name: datadir
          mountPath: /var/lib/zookeeper
      securityContext:
        runAsUser: 1000
        fsGroup: 1000
  volumeClaimTemplates:
  - metadata:
      name: datadir
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 20Gi
```

```shell
(base) [root@liuxinyuan-master yaml]$ kubectl create -f zookeeper.yaml
```

See the zookeeper stateful application tutorial for detailed usage instructions.
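The container's startup command runs zkGenConfig.sh, which turns the environment variables above into a zoo.cfg before starting the server. As a rough illustration of what that expansion involves, here is a hypothetical plain-shell sketch (not the sample image's actual script) that maps the ";"-separated ZK_ENSEMBLE value to zoo.cfg server.N entries, using the server and election ports defined above:

```shell
# Hypothetical sketch: expand an ensemble list such as "zk-0;zk-1;zk-2"
# into ZooKeeper zoo.cfg entries of the form
#   server.N=<member>.<headless service>:<server port>:<election port>
ensemble_to_servers() {
  ensemble=$1; svc=$2; server_port=$3; election_port=$4
  n=1
  old_ifs=$IFS
  IFS=';'                          # split the ensemble string on ";"
  for member in $ensemble; do
    echo "server.${n}=${member}.${svc}:${server_port}:${election_port}"
    n=$((n + 1))
  done
  IFS=$old_ifs
}

ensemble_to_servers "zk-0;zk-1;zk-2" zk-headless 2888 3888
# server.1=zk-0.zk-headless:2888:3888
# server.2=zk-1.zk-headless:2888:3888
# server.3=zk-2.zk-headless:2888:3888
```

The member names resolve because they are the StatefulSet's stable Pod DNS names under the zk-headless Service.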