Setup
Prerequisite
vSphere with Tanzu Deployment
According to the Kubeflow documentation, the prerequisites for installing Kubeflow 1.4 are:
Kubernetes (tested with version 1.19) with a default StorageClass
kustomize (version 3.2.0)
kubectl
The following example deploys a TKG v1.19 cluster on vSphere with Tanzu.
# Create a new TKG cluster
$ kubectl vsphere login --server=10.117.233.1 \
    --vsphere-username administrator@vsphere.local --insecure-skip-tls-verify
$ kubectl config use-context liuqi
$ cat << EOF | kubectl apply -f -
apiVersion: run.tanzu.vmware.com/v1alpha1
kind: TanzuKubernetesCluster
metadata:
  name: tkgs-cluster-2                        # cluster name, user defined
  namespace: liuqi                            # vSphere namespace
spec:
  distribution:
    version: v1.19                            # resolves to latest TKG 1.19
  topology:
    controlPlane:
      count: 1                                # number of control plane nodes
      class: best-effort-medium               # vmclass for control plane nodes
      storageClass: pacific-storage-policy    # storageclass for control plane
    workers:
      count: 7                                # number of worker nodes
      class: best-effort-medium               # vmclass for worker nodes
      storageClass: pacific-storage-policy    # storageclass for worker nodes
EOF

# Wait for the cluster to be ready
$ kubectl get tanzukubernetesclusters
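Instead of re-running the last command by hand, the wait can be scripted. The following is a minimal sketch, not part of the original procedure. The helper names wait_for and tkc_phase are hypothetical, and it assumes that the cluster reports its state in .status.phase and that the ready value is "running"; check the actual output on your environment first.

```shell
# Poll a command until it prints the expected value or the attempts run out.
# Usage: wait_for <expected> <attempts> <sleep_seconds> <command> [args...]
wait_for() {
  local expected="$1" attempts="$2" delay="$3"
  shift 3
  local i=1 current=""
  while [ "$i" -le "$attempts" ]; do
    current="$("$@" 2>/dev/null)"
    if [ "$current" = "$expected" ]; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out; last value: ${current:-<none>}" >&2
  return 1
}

# Assumption: the phase is exposed at .status.phase of the cluster object.
# Usage: tkc_phase <cluster-name> <vsphere-namespace>
tkc_phase() {
  kubectl get tanzukubernetescluster "$1" -n "$2" -o jsonpath='{.status.phase}'
}

# Example (attempt count and delay are arbitrary):
# wait_for running 60 30 tkc_phase tkgs-cluster-2 liuqi
```

wait_for is kept independent of kubectl so that the same loop can poll any command that prints a single status value.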
Note
Refer to the following document to synchronize the local content library for TKG v1.19
Create, Secure, and Synchronize a Local Content Library for Tanzu Kubernetes releases
(Optional, XXX) You may need to patch the API server and set Docker Hub credentials.
A script is also provided to perform the above tasks.
Project Thunder Deployment
Deploy other Kubernetes Platforms
Follow "Installing a cluster on vSphere" and this page to deploy OpenShift.
Networking requirements
The API address is used to access the cluster API.
api.ocp4-cluster-001.liuqi.io 10.105.136.130
The Ingress address is used for cluster ingress traffic.
*.apps.ocp4-cluster-001.liuqi.io 10.105.136.131
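Before running the installer, it can help to confirm that both DNS records resolve to the expected addresses. The following is a minimal sketch. The function names resolve and check_record are hypothetical, and it assumes a glibc-based Linux host where getent is available.

```shell
# Resolve a host name to its first IPv4 address.
# Assumption: getent is available (glibc-based Linux).
resolve() {
  getent ahostsv4 "$1" | awk 'NR == 1 { print $1 }'
}

# Compare what a name resolves to against the expected address.
# Usage: check_record <name> <expected-ip>
check_record() {
  local name="$1" expected="$2" actual
  actual="$(resolve "$name")"
  if [ "$actual" = "$expected" ]; then
    echo "OK   $name -> $actual"
  else
    echo "FAIL $name -> ${actual:-unresolved} (expected $expected)"
    return 1
  fi
}

# Example with the addresses from this page. Any name under *.apps should
# match the wildcard record; "test" is an arbitrary label.
# check_record api.ocp4-cluster-001.liuqi.io 10.105.136.130
# check_record test.apps.ocp4-cluster-001.liuqi.io 10.105.136.131
```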
Generating install-config.yaml
# Generating install-config.yaml
openshift-install create install-config --dir=ipi
? SSH Public Key <none>
? Platform vsphere
? vCenter sha1-skevin-vc01.eng.vmware.com
? Username administrator@vsphere.local
? Password [? for help] ********
INFO Connecting to vCenter sha1-skevin-vc01.eng.vmware.com
INFO Defaulting to only available datacenter: VCP
INFO Defaulting to only available cluster: WCP-Cluster
? Default Datastore vsanDatastore
? Network VM Network 136
? Virtual IP Address for API [? for help] 10.105.136.130 #API address
? Virtual IP Address for Ingress [? for help] 10.105.136.131 #Ingress address
? Base Domain liuqi.io
? Cluster Name ocp4-cluster-001
? Pull Secret [? for help]
Modify install-config.yaml to add proxy configuration
The following is an example of install-config.yaml:
apiVersion: v1
baseDomain: liuqi.io
proxy: # add proxy configuration
  httpProxy: http://proxy.vmware.com:3128
  httpsProxy: http://proxy.vmware.com:3128
  noProxy: .cluster.local,.svc,10.105.136.0/23,127.0.0.1,172.30.0.0/16,20.128.0.0/14,api-int.ocp4-cluster-001.liuqi.io,liuqi.io,localhost
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
Deploy the cluster
# Deploy the cluster according to install-config.yaml
# --dir must be the one where the install-config.yaml file is located
openshift-install create cluster --dir /home/redcloud/ipi/ipi/
Follow "Creating registry storage" to finish the storage configuration.
Test the cluster
To access the cluster with the OpenShift CLI (oc) as the system:admin user, run:
export KUBECONFIG=<installation_directory>/auth/kubeconfig
# check if all nodes are ready
oc get nodes
# check if all pods are running or completed
oc get pods -A
# check if all clusteroperators are running
oc get co
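The node check can be turned into a scriptable test. The following is a minimal sketch under the assumption that STATUS is the second column of the oc get nodes output; the function name count_not_ready is hypothetical.

```shell
# Count nodes whose STATUS column is not exactly "Ready".
# Reads `oc get nodes --no-headers` output on stdin; STATUS is the 2nd column.
# A node shown as "Ready,SchedulingDisabled" is counted as not ready here.
count_not_ready() {
  awk '$2 != "Ready" { n++ } END { print n + 0 }'
}

# Example:
# oc get nodes --no-headers | count_not_ready
```

A result of 0 means every node reports Ready.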
Access the OpenShift web console here: https://console-openshift-console.apps.ocp4-cluster-001.liuqi.io. The user is kubeadmin, and the password is stored in the file <installation_directory>/auth/kubeadmin-password.
Test proxy
# create a new project
oc new-project zyajing-proj
# create a pod in the new project and pull an image from the Google registry
kubectl create deployment hello-node --image=k8s.gcr.io/serve_hostname -n zyajing-proj
# if the pod is running, the proxy configuration works
oc get pod -n zyajing-proj
NAME                              READY   STATUS    RESTARTS   AGE
pod/hello-node-7999f8f5bb-thswn   1/1     Running   0          11s
How to SSH to a cluster node once the cluster has been deployed successfully.
# ssh -i ssh-key/id_rsa core@<OC-NODE>
ssh -i /root/.ssh/test_rsa core@10.105.137.224
Deploy on vSphere with Tanzu
Use the following commands to set the default storage class. Skip this step if the default storage class has been set.
# https://anthonyspiteri.net/tanzu-no-default-storageclass/
$ kubectl config use-context liuqi
$ kubectl edit tanzukubernetescluster tkgs-cluster-16
# add the following content under spec/settings (same level as network setting)
...
storage:
  defaultClass: pacific-storage-policy
...
$ kubectl config use-context tkgs-cluster-16
$ kubectl get sc
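To decide whether this step can be skipped, the output of kubectl get sc can be filtered for the "(default)" marker that kubectl prints after the name of the default class. The following is a minimal sketch; the function name default_storage_class is hypothetical.

```shell
# Print the name of the default StorageClass, if any.
# Reads `kubectl get sc --no-headers` on stdin; kubectl appends "(default)"
# as a separate word after the name of the default class.
default_storage_class() {
  awk '$2 == "(default)" { print $1 }'
}

# Example:
# kubectl get sc --no-headers | default_storage_class
```

Empty output means no default storage class is set and the step above is still needed.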
Use the following commands to add the fstype parameter to work around a PVC issue. Skip this step if this has already been done.
# https://bugzilla.eng.vmware.com/show_bug.cgi?id=2764622
$ kubectl vsphere login --server=10.117.233.1 \
    --vsphere-username administrator@vsphere.local --insecure-skip-tls-verify \
    --tanzu-kubernetes-cluster-namespace=liuqi --tanzu-kubernetes-cluster-name=tkgs-cluster-33
$ kubectl get sc pacific-storage-policy -o yaml > tmp-sc.yaml
$ sed '/^parameters:.*/a\ \ csi.storage.k8s.io/fstype: "ext4"' -i tmp-sc.yaml
$ kubectl replace -f tmp-sc.yaml --force
Patch PSP
$ cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: auth
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rb-all-sa_ns-auth
  namespace: auth
roleRef:
  kind: ClusterRole
  name: psp:vmware-system-privileged
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:serviceaccounts:auth
---
apiVersion: v1
kind: Namespace
metadata:
  name: cert-manager
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rb-all-sa_ns-cert-manager
  namespace: cert-manager
roleRef:
  kind: ClusterRole
  name: psp:vmware-system-privileged
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:serviceaccounts:cert-manager
---
apiVersion: v1
kind: Namespace
metadata:
  name: istio-system
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rb-all-sa_ns-istio-system
  namespace: istio-system
roleRef:
  kind: ClusterRole
  name: psp:vmware-system-privileged
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:serviceaccounts:istio-system
---
apiVersion: v1
kind: Namespace
metadata:
  name: knative-serving
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rb-all-sa_ns-knative-serving
  namespace: knative-serving
roleRef:
  kind: ClusterRole
  name: psp:vmware-system-privileged
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:serviceaccounts:knative-serving
---
apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow
  labels:
    control-plane: kubeflow
    istio-injection: enabled
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rb-all-sa_ns-kubeflow
  namespace: kubeflow
roleRef:
  kind: ClusterRole
  name: psp:vmware-system-privileged
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:serviceaccounts:kubeflow
EOF
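The five RoleBindings in the PSP patch differ only in the namespace name, so they could also be generated in a loop. The following is a minimal sketch of that idea, not the procedure used on this page. The function names psp_rolebinding and psp_rolebindings are hypothetical, and the sketch only emits the RoleBindings, so the namespaces (including the labels on kubeflow) still have to be created separately.

```shell
# Print a RoleBinding that grants every service account in namespace $1
# the psp:vmware-system-privileged ClusterRole.
psp_rolebinding() {
  cat << EOF
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rb-all-sa_ns-$1
  namespace: $1
roleRef:
  kind: ClusterRole
  name: psp:vmware-system-privileged
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:serviceaccounts:$1
EOF
}

# Emit one binding per namespace given on the command line.
psp_rolebindings() {
  for ns in "$@"; do
    psp_rolebinding "$ns"
  done
}

# Example (the namespaces must already exist):
# psp_rolebindings auth cert-manager istio-system knative-serving kubeflow | kubectl apply -f -
```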
Deploy Kubeflow step by step using the note here
Fix PSP issues for example namespace
$ cat << EOF | kubectl apply -f -
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rb-all-sa_ns-kubeflow-user-example-com
  namespace: kubeflow-user-example-com
roleRef:
  kind: ClusterRole
  name: psp:vmware-system-privileged
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:serviceaccounts:kubeflow-user-example-com
EOF
Deploy with Kubernetes Operator
Deploy with Supervisor Services on vSphere with Tanzu
Deploy on other Kubernetes Platform
See also
Check Kubeflow requirements
Code Ready Containers Resources: If you are using Code Ready Containers, you need to make sure you have enough resources configured for the VM:
# Recommended (check the resources of every OpenShift node):
16 GB memory
6 CPU
45 GB disk space

# Minimal:
10 GB memory
6 CPU
30 GB disk space (default for CRC)
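The memory and CPU figures can be checked on a Linux machine with a few lines of shell. The following is a minimal sketch; the function names meets_minimum and host_mem_gb are hypothetical, and disk space is not checked.

```shell
# Succeed if the given memory (GB) and CPU count meet the minimums.
# Usage: meets_minimum <mem_gb> <cpus> <min_mem_gb> <min_cpus>
meets_minimum() {
  [ "$1" -ge "$3" ] && [ "$2" -ge "$4" ]
}

# Total memory of this machine in whole GiB, rounded down (Linux only).
host_mem_gb() {
  awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
}

# Example against the minimal figures above (10 GB memory, 6 CPU):
# if meets_minimum "$(host_mem_gb)" "$(nproc)" 10 6; then
#   echo "enough resources"
# else
#   echo "not enough resources"
# fi
```

Because host_mem_gb rounds down, a machine sold as 16 GB typically reports 15, so compare against the minimal values with some margin.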
Workflow to deploy Kubeflow on OpenShift
Read "Installing on OpenShift" on the Kubeflow website and this page to deploy OpenShift.
Clone the opendatahub/manifests repository. This repository defaults to the v1.3-branch-openshift branch. However, we need to deploy Kubeflow 1.4 and there is no v1.4 branch, so you need to use your own Kubeflow 1.4 repository.
git clone https://github.com/AmyHoney/kubeflow-1.4
cd manifests
Build the deployment configuration using the OpenShift KFDef file and the locally downloaded manifests.
# update the manifest repo URI
sed -i 's#uri: .*#uri: '$PWD'#' ./kfdef/kfctl_openshift.yaml

# set the Kubeflow application directory for this deployment, for example /opt/openshift-kfdef
export KF_DIR=<path-to-kfdef>
mkdir -p ${KF_DIR}
cp ./kfdef/kfctl_openshift.yaml ${KF_DIR}

# build deployment configuration
cd ${KF_DIR}

[vcp@mlops-oss openshift-kfdef]$ kfctl build --file=kfctl_openshift.yaml
[vcp@mlops-oss openshift-kfdef]$ ls
kfctl_openshift.yaml kustomize
Apply the generated deployment configuration.
kfctl apply --file=kfctl_openshift.yaml
Wait until all the pods in the kubeflow namespace are running.
oc get pods -n kubeflow
NAME                                          READY   STATUS    RESTARTS   AGE
argo-ui-7f79c9ccbc-vxqgx                      1/1     Running   0          7m55s
centraldashboard-65d87fb769-d8l5g             1/1     Running   0          7m55s
jupyter-web-app-deployment-6748fc47cc-78hr4   1/1     Running   0          7m
katib-controller-7dd757bdf-wmg2t              1/1     Running   1          6m57s
.......
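With many pods in the namespace, it is easier to list only the ones that are still starting or failing. The following is a minimal sketch that filters on the STATUS column (the third column when a single namespace is queried). The function name pods_not_ready is hypothetical, and a Running status does not by itself guarantee that every container is ready.

```shell
# List pods that are neither Running nor Completed, with their status.
# Reads `oc get pods -n kubeflow --no-headers` on stdin; STATUS is column 3.
pods_not_ready() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 " (" $3 ")" }'
}

# Example:
# oc get pods -n kubeflow --no-headers | pods_not_ready
```

Empty output means every pod in the namespace is Running or Completed.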
The command below looks up the URL of the Kubeflow user interface assigned by the OpenShift cluster. You can open the printed URL in your browser to access the Kubeflow user interface.
# get the URL of the Kubeflow UI
oc get routes -n istio-system istio-ingressgateway -o jsonpath='http://{.spec.host}/'
http://istio-ingressgateway-istio-system.apps.ocp4-cluster-001.liuqi.io/