Deploying the Distributed Training Operator

Operator Overview

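The operator manages distributed training containers. Its main functions are:

  • Managing the lifecycle of distributed training containers.
  • Injecting the IP addresses of the other containers into each distributed training container.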

Preparing the Environment

Install the following software:

  • Redis: 3.0+
  • OpenJDK: 1.8+
  • Maven: 3.0+
  • Git: 2.0.0+
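
Before building, it can help to confirm that the build tools are on the PATH. The following is a minimal sketch, not part of Dubhe: the function names check_tool and check_all are hypothetical, and the check covers presence only, not the minimum versions listed above. Redis is left out on the assumption that it may run on a different host.

```shell
#!/bin/sh
# Report whether a command is available on PATH.
# Prints "<name>: found" or "<name>: MISSING"; returns non-zero if missing.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: MISSING"
        return 1
    fi
}

# Check every tool passed as an argument; the return value is the
# number of tools that were not found.
check_all() {
    missing=0
    for tool in "$@"; do
        check_tool "$tool" || missing=$((missing + 1))
    done
    return "$missing"
}

# Build-time tools from the list above.
check_all java mvn git || echo "Install the missing tools before building."
```

To compare the installed versions against the list, run java -version, mvn -v and git --version by hand.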

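Building the Image

First pull the latest source code of the Dubhe Git repository to your local machine, then run the following commands from the directory that contains the Dubhe checkout: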

cd Dubhe/distribute-train-operator/
mvn clean compile package
cd target/
# Create the Dockerfile
cat > Dockerfile <<EOF
FROM java:8
ADD distribute-train-operator-1.0.jar /
CMD ["bash", "-c","exec java -jar \$JAR_BALL"]
EOF
# Build the image; replace <your-harbor-url> with your Harbor address
docker build -t <your-harbor-url>/distribute-train/distribute-train-operator:v1 .

Pushing the Image to Harbor

  • Create a project named distribute-train in Harbor.
  • Run the following command, where <your-harbor-url> is the Harbor address:
docker push <your-harbor-url>/distribute-train/distribute-train-operator:v1
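
The build and push commands must use exactly the same image reference, so one option is to derive it in a single place. The sketch below is illustrative only: image_ref and build_and_push are hypothetical helper names, and harbor.example.com is a stand-in for your Harbor address.

```shell
#!/bin/sh
# Compose the image reference shared by `docker build` and `docker push`.
# $1: Harbor address (e.g. the hypothetical harbor.example.com), $2: image tag.
image_ref() {
    printf '%s/distribute-train/distribute-train-operator:%s\n' "$1" "$2"
}

# Build the image from the Dockerfile in the current directory, then push it.
# The push runs only if the build succeeds.
build_and_push() {
    ref=$(image_ref "$1" "$2")
    docker build -t "$ref" . && docker push "$ref"
}
```

Run from the target/ directory, a call such as build_and_push harbor.example.com v1 would build and push the image. If the registry requires authentication, run docker login against it first.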

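Creating the Configuration File

Configuration notes for distribute-train-operator-deploy.yaml:

  • Set the operator node, <your-k8s-host-name>.
  • Set the Harbor address, <your-harbor-url>.
  • Set the Redis connection parameters.
  • Copy the Kubernetes config file to the operator node.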
apiVersion: apps/v1
kind: Deployment
metadata:
  name: distribute-train-operator
  namespace: kube-system
  labels:
    name: distribute-train-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      name: distribute-train-operator
  template:
    metadata:
      labels:
        name: distribute-train-operator
    spec:
      nodeSelector:
        # <your-k8s-host-name> is the hostname of the Kubernetes node that runs the operator
        kubernetes.io/hostname: <your-k8s-host-name>
      containers:
        - name: distribute-train-operator
          # <your-harbor-url> is the Harbor address
          image: <your-harbor-url>/distribute-train/distribute-train-operator:v1
          ports:
            - containerPort: 8080
              protocol: TCP
          volumeMounts:
            - mountPath: /root
              name: config-volume
          env:
            - name: JAR_BALL
              # <redis-ip> is the Redis IP address
              # <redis-password> is the Redis password; remove this parameter if Redis has no password
              # <redis-port> is the Redis port (default 6379)
              value: "distribute-train-operator-1.0.jar --k8s.kubeconfig=/root/config --spring.redis.host=<redis-ip> --spring.redis.password=<redis-password> --spring.redis.port=<redis-port>"
          imagePullPolicy: IfNotPresent
      volumes:
        - name: config-volume
          hostPath:
            # Copy the $HOME/.kube/config file from the Kubernetes master node
            # into the /root/.kube/ directory on the <your-k8s-host-name> node
            path: /root/.kube/
            type:
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      securityContext:
        runAsUser: 0
      schedulerName: default-scheduler
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  revisionHistoryLimit: 7
  progressDeadlineSeconds: 600
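
The JAR_BALL value is the part of this file most likely to be edited by hand, because the password parameter has to be dropped when Redis has no password. As a hedged illustration, the hypothetical helper jar_ball_value below assembles the string from a host, a port and an optional password; the flag names are copied from the manifest, and everything else is an assumption.

```shell
#!/bin/sh
# Assemble the JAR_BALL value used in the Deployment.
# $1: Redis host, $2: Redis port (defaults to 6379), $3: optional Redis password.
jar_ball_value() {
    value="distribute-train-operator-1.0.jar --k8s.kubeconfig=/root/config"
    value="$value --spring.redis.host=$1"
    # Add the password flag only when a non-empty password is supplied.
    if [ -n "${3:-}" ]; then
        value="$value --spring.redis.password=$3"
    fi
    value="$value --spring.redis.port=${2:-6379}"
    printf '%s\n' "$value"
}
```

For example, jar_ball_value 10.0.0.5 6379 produces a value without the password flag, which can then be pasted into the env entry. The address 10.0.0.5 is only a placeholder.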

Applying the Deployment File

kubectl apply -f distribute-train-operator-deploy.yaml

Verification

  • The deployment succeeded if the distribute-train-operator Pod is in the Running state:
kubectl get pod -n kube-system | grep distribute-train-operator
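
For scripted checks, the same listing can be turned into an exit status. The sketch below assumes the default kubectl get pod column layout (NAME, READY, STATUS, RESTARTS, AGE); operator_running is a hypothetical helper name.

```shell
#!/bin/sh
# Read `kubectl get pod` output on stdin and succeed only if a Pod whose
# name starts with distribute-train-operator has STATUS (third column)
# equal to "Running".
operator_running() {
    awk '$1 ~ /^distribute-train-operator/ && $3 == "Running" { ok = 1 }
         END { exit (ok ? 0 : 1) }'
}
```

It could be used as kubectl get pod -n kube-system | operator_running && echo "operator is running". Alternatively, kubectl rollout status deployment/distribute-train-operator -n kube-system waits until the Deployment has finished rolling out.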