Setup

Prerequisite

vSphere with Tanzu Deployment

According to the Kubeflow documentation, the prerequisites for installing Kubeflow 1.4 are:

  • Kubernetes (tested with version 1.19) with a default StorageClass

  • kustomize (version 3.2.0)

  • kubectl

The following example deploys a TKG v1.19 cluster on vSphere with Tanzu.

    # Create a new TKG cluster
    $ kubectl vsphere login --server=10.117.233.1 \
       --vsphere-username administrator@vsphere.local --insecure-skip-tls-verify
    $ kubectl config use-context liuqi
    $ cat << EOF | kubectl apply -f -
    apiVersion: run.tanzu.vmware.com/v1alpha1
    kind: TanzuKubernetesCluster
    metadata:
      name: tkgs-cluster-2                     # cluster name, user defined
      namespace: liuqi                         # vsphere namespace
    spec:
      distribution:
        version: v1.19                         # resolves to latest TKG 1.19
      topology:
        controlPlane:
          count: 1                             # number of control plane nodes
          class: best-effort-medium            # vmclass for control plane nodes
          storageClass: pacific-storage-policy # storageclass for control plane
        workers:
          count: 7                             # number of worker nodes
          class: best-effort-medium            # vmclass for worker nodes
          storageClass: pacific-storage-policy # storageclass for worker nodes
    EOF

    # Wait for the cluster to become ready
    $ kubectl get tanzukubernetesclusters
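
The wait step can be scripted instead of re-running the command by hand. The following is a minimal sketch, assuming the TanzuKubernetesCluster resource reports its state in .status.phase and that the value is "running" once the cluster is ready. The helper name wait_for_tkc, the retry count, and the delay are hypothetical choices, not part of the original procedure.

```shell
# Hypothetical helper: poll a TanzuKubernetesCluster until its phase is "running".
# Assumption: the resource exposes .status.phase and reports "running" when ready.
# Usage: wait_for_tkc <cluster-name> <vsphere-namespace> [retries] [delay-seconds]
wait_for_tkc() {
  tkc_name=$1
  tkc_ns=$2
  tkc_retries=${3:-60}
  tkc_delay=${4:-30}
  tkc_try=0
  tkc_phase=""
  while [ "$tkc_try" -lt "$tkc_retries" ]; do
    # A failed kubectl call is treated as "phase not known yet".
    tkc_phase=$(kubectl get tanzukubernetescluster "$tkc_name" -n "$tkc_ns" \
      -o jsonpath='{.status.phase}' 2>/dev/null) || tkc_phase=""
    if [ "$tkc_phase" = "running" ]; then
      echo "cluster $tkc_name is running"
      return 0
    fi
    tkc_try=$((tkc_try + 1))
    sleep "$tkc_delay"
  done
  echo "timed out waiting for cluster $tkc_name (last phase: ${tkc_phase:-unknown})" >&2
  return 1
}
```

For the cluster defined above, the call would be wait_for_tkc tkgs-cluster-2 liuqi, run from the liuqi context.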

Note

Refer to the following document to synchronize the local content library for TKG v1.19:

Create, Secure, and Synchronize a Local Content Library for Tanzu Kubernetes releases

(Optional, XXX) You may need to patch the API server and set Docker Hub credentials.

A script is also provided to perform the steps above.

Project Thunder Deployment

Deploy other Kubernetes Platforms

Follow "Installing a cluster on vSphere" and this page to deploy OpenShift.

  1. Networking requirements

    • The API address is used to access the cluster API: api.ocp4-cluster-001.liuqi.io at 10.105.136.130

    • The Ingress address is used for cluster ingress traffic: *.apps.ocp4-cluster-001.liuqi.io at 10.105.136.131

  2. Generating install-config.yaml

     # Generate install-config.yaml
     openshift-install create install-config --dir=ipi
     ? SSH Public Key <none>
     ? Platform vsphere
     ? vCenter sha1-skevin-vc01.eng.vmware.com
     ? Username administrator@vsphere.local
     ? Password [? for help] ********
     INFO Connecting to vCenter sha1-skevin-vc01.eng.vmware.com
     INFO Defaulting to only available datacenter: VCP
     INFO Defaulting to only available cluster: WCP-Cluster
     ? Default Datastore vsanDatastore
     ? Network VM Network 136
     ? Virtual IP Address for API [? for help] 10.105.136.130      #API address
     ? Virtual IP Address for Ingress [? for help] 10.105.136.131  #Ingress address
     ? Base Domain liuqi.io
     ? Cluster Name ocp4-cluster-001
     ? Pull Secret [? for help]
    
  3. Modify install-config.yaml to add proxy configuration

    The following is an example of install-config.yaml:

     apiVersion: v1
     baseDomain: liuqi.io
     proxy:  # add proxy configuration
       httpProxy: http://proxy.vmware.com:3128
       httpsProxy: http://proxy.vmware.com:3128
       noProxy: .cluster.local,.svc,10.105.136.0/23,127.0.0.1,172.30.0.0/16,20.128.0.0/14,api-int.ocp4-cluster-001.liuqi.io,liuqi.io,localhost
     compute:
     - architecture: amd64
       hyperthreading: Enabled
       name: worker
    
  4. Deploy the cluster

    # Deploy the cluster according to install-config.yaml
    # --dir must be the directory that contains the install-config.yaml file
    openshift-install create cluster --dir /home/redcloud/ipi/ipi/
    
  5. Follow "Creating registry storage" to finish the storage configuration.

  6. Test the cluster

    • To access the cluster as the system:admin user with the OpenShift CLI (oc), run export KUBECONFIG=<installation_directory>/auth/kubeconfig

    # Check that all nodes are ready
    oc get nodes
    # Check that all pods are running or completed
    oc get pods -A
    # Check that all cluster operators are available
    oc get co
    
  7. Test proxy

    # Create a new project
    oc new-project zyajing-proj
    # Create a deployment in the new project that pulls an image from the Google registry
    kubectl create deployment hello-node --image=k8s.gcr.io/serve_hostname -n zyajing-proj
    # If the pod is running, the proxy configuration works
    oc get pod -n zyajing-proj
    NAME                              READY   STATUS    RESTARTS   AGE
    pod/hello-node-7999f8f5bb-thswn   1/1     Running   0          11s
    
  8. SSH to a node once the cluster has been deployed successfully.

    # ssh -i ssh-key/id_rsa core@<OC-NODE>
    ssh -i /root/.ssh/test_rsa core@10.105.137.224
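
The two DNS records listed under the networking requirements (step 1) can be checked before running the installer. The following is a minimal sketch, assuming a Linux host where getent is available; check_dns_record is a hypothetical helper name, and the "test" label used for the wildcard record is arbitrary.

```shell
# Hypothetical helper: check that a host name resolves to the expected IPv4 address.
# Uses getent (glibc); on other systems a tool such as dig could be substituted.
# Usage: check_dns_record <host> <expected-ip>
check_dns_record() {
  dns_host=$1
  dns_expected=$2
  # Take the first IPv4 address returned for the host, if any.
  dns_actual=$(getent ahostsv4 "$dns_host" | awk 'NR == 1 { print $1 }') || dns_actual=""
  if [ "$dns_actual" = "$dns_expected" ]; then
    echo "OK   $dns_host -> $dns_actual"
    return 0
  fi
  echo "FAIL $dns_host -> ${dns_actual:-unresolved} (expected $dns_expected)"
  return 1
}

# Example calls for the records in step 1 (any label under *.apps should resolve):
#   check_dns_record api.ocp4-cluster-001.liuqi.io 10.105.136.130
#   check_dns_record test.apps.ocp4-cluster-001.liuqi.io 10.105.136.131
```

A FAIL line for either record suggests fixing DNS first, since the installer and the cluster routes both depend on these names.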
    

Deploy on vSphere with Tanzu

  1. Use the following commands to set the default storage class. Skip this step if the default storage class has been set.

     # https://anthonyspiteri.net/tanzu-no-default-storageclass/
     $ kubectl config use-context liuqi
     $ kubectl edit tanzukubernetescluster tkgs-cluster-16
     # add the following content under spec/settings (same level as network setting)
     ...
     storage:
       defaultClass: pacific-storage-policy
     ...
     $ kubectl config use-context tkgs-cluster-16
     $ kubectl get sc
    
  2. Use the following commands to add the fstype parameter as a workaround for a PVC issue. Skip this step if it has already been done.

    # https://bugzilla.eng.vmware.com/show_bug.cgi?id=2764622
    $ kubectl vsphere login --server=10.117.233.1 --vsphere-username administrator@vsphere.local --insecure-skip-tls-verify --tanzu-kubernetes-cluster-namespace=liuqi --tanzu-kubernetes-cluster-name=tkgs-cluster-33
    $ kubectl get sc pacific-storage-policy -o yaml > tmp-sc.yaml
    $ sed '/^parameters:.*/a\ \ csi.storage.k8s.io/fstype: "ext4"' -i tmp-sc.yaml
    $ kubectl replace -f tmp-sc.yaml --force
    
  3. Patch PSP

     $ cat << EOF | kubectl apply -f -
     apiVersion: v1
     kind: Namespace
     metadata:
       name: auth
     ---
     kind: RoleBinding
     apiVersion: rbac.authorization.k8s.io/v1
     metadata:
       name: rb-all-sa_ns-auth
       namespace: auth
     roleRef:
       kind: ClusterRole
       name: psp:vmware-system-privileged
       apiGroup: rbac.authorization.k8s.io
     subjects:
     - kind: Group
       apiGroup: rbac.authorization.k8s.io
       name: system:serviceaccounts:auth
     ---
     apiVersion: v1
     kind: Namespace
     metadata:
       name: cert-manager
     ---
     kind: RoleBinding
     apiVersion: rbac.authorization.k8s.io/v1
     metadata:
       name: rb-all-sa_ns-cert-manager
       namespace: cert-manager
     roleRef:
       kind: ClusterRole
       name: psp:vmware-system-privileged
       apiGroup: rbac.authorization.k8s.io
     subjects:
     - kind: Group
       apiGroup: rbac.authorization.k8s.io
       name: system:serviceaccounts:cert-manager
     ---
     apiVersion: v1
     kind: Namespace
     metadata:
       name: istio-system
     ---
     kind: RoleBinding
     apiVersion: rbac.authorization.k8s.io/v1
     metadata:
       name: rb-all-sa_ns-istio-system
       namespace: istio-system
     roleRef:
       kind: ClusterRole
       name: psp:vmware-system-privileged
       apiGroup: rbac.authorization.k8s.io
     subjects:
     - kind: Group
       apiGroup: rbac.authorization.k8s.io
       name: system:serviceaccounts:istio-system
     ---
     apiVersion: v1
     kind: Namespace
     metadata:
       name: knative-serving
     ---
     kind: RoleBinding
     apiVersion: rbac.authorization.k8s.io/v1
     metadata:
       name: rb-all-sa_ns-knative-serving
       namespace: knative-serving
     roleRef:
       kind: ClusterRole
       name: psp:vmware-system-privileged
       apiGroup: rbac.authorization.k8s.io
     subjects:
     - kind: Group
       apiGroup: rbac.authorization.k8s.io
       name: system:serviceaccounts:knative-serving
     ---
     apiVersion: v1
     kind: Namespace
     metadata:
       name: kubeflow
       labels:
         control-plane: kubeflow
         istio-injection: enabled
     ---
     kind: RoleBinding
     apiVersion: rbac.authorization.k8s.io/v1
     metadata:
       name: rb-all-sa_ns-kubeflow
       namespace: kubeflow
     roleRef:
       kind: ClusterRole
       name: psp:vmware-system-privileged
       apiGroup: rbac.authorization.k8s.io
     subjects:
     - kind: Group
       apiGroup: rbac.authorization.k8s.io
       name: system:serviceaccounts:kubeflow
     EOF
    
  4. Deploy Kubeflow step by step, following the note here.

  5. Fix PSP issues for the example namespace

     $ cat << EOF | kubectl apply -f -
     kind: RoleBinding
     apiVersion: rbac.authorization.k8s.io/v1
     metadata:
       name: rb-all-sa_ns-kubeflow-user-example-com
       namespace: kubeflow-user-example-com
     roleRef:
       kind: ClusterRole
       name: psp:vmware-system-privileged
       apiGroup: rbac.authorization.k8s.io
     subjects:
     - kind: Group
       apiGroup: rbac.authorization.k8s.io
       name: system:serviceaccounts:kubeflow-user-example-com
     EOF
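
The RoleBinding manifests in steps 3 and 5 differ only in the namespace name, so they can be generated from a single template. The following is a minimal sketch; psp_rolebinding and psp_rolebindings are hypothetical helper names, and the sketch assumes the namespaces already exist (it does not create them or set the kubeflow namespace labels).

```shell
# Hypothetical helper: print the PSP RoleBinding manifest for one namespace.
# The manifest mirrors the RoleBindings shown in steps 3 and 5.
psp_rolebinding() {
  psp_ns=$1
  cat << EOF
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rb-all-sa_ns-${psp_ns}
  namespace: ${psp_ns}
roleRef:
  kind: ClusterRole
  name: psp:vmware-system-privileged
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:serviceaccounts:${psp_ns}
EOF
}

# Print the manifests for every namespace given as an argument.
psp_rolebindings() {
  for psp_target in "$@"; do
    psp_rolebinding "$psp_target"
  done
}

# Example (not executed here): review the output, then pipe it to kubectl.
#   psp_rolebindings auth cert-manager istio-system knative-serving kubeflow | kubectl apply -f -
```

Printing the manifests instead of applying them directly keeps the helper easy to inspect, and the same function covers the kubeflow-user-example-com namespace from step 5.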
    

Deploy with Kubernetes Operator

Deploy with Supervisor Services on vSphere with Tanzu

Deploy on other Kubernetes Platform

Check kubeflow requirements

CodeReady Containers resources: if you are using CodeReady Containers (CRC), make sure that enough resources are configured for the VM:

    # Recommended (check the resources of every OpenShift node):
    16 GB memory
    6 CPU
    45 GB disk space

    # Minimal:
    10 GB memory
    6 CPU
    30 GB disk space (default for CRC)
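
The sizes above can be turned into CRC configuration commands. The following is a minimal sketch, assuming that crc config set accepts the properties memory (in MiB), cpus, and disk-size (in GiB); verify these property names against your CRC version before use. The helper name crc_sizing_commands is hypothetical, and it only prints the commands rather than running them.

```shell
# Hypothetical helper: print the crc commands that size the CRC VM.
# Assumption: crc accepts "memory" in MiB, "cpus" as a count, and "disk-size" in GiB.
# Usage: crc_sizing_commands <memory-GB> <cpus> <disk-GB>
crc_sizing_commands() {
  crc_mem_gb=$1
  crc_cpus=$2
  crc_disk_gb=$3
  echo "crc config set memory $((crc_mem_gb * 1024))"
  echo "crc config set cpus $crc_cpus"
  echo "crc config set disk-size $crc_disk_gb"
}

# Recommended sizing from the list above; review the output before running it.
crc_sizing_commands 16 6 45
```

Configuration changes of this kind generally apply only to a newly created CRC VM, so the printed commands would be run before crc start.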

Workflow to deploy Kubeflow on OpenShift

Please read the Kubeflow "Installing on OpenShift" page and this page to deploy OpenShift.

  1. Clone the opendatahub/manifests repository. This repository defaults to the v1.3-branch-openshift branch. However, we need to deploy Kubeflow 1.4 and there is no v1.4-branch branch for Kubeflow, so you need to use your own Kubeflow 1.4 repository.

    git clone https://github.com/AmyHoney/kubeflow-1.4
    cd manifests
    
  2. Build the deployment configuration using the OpenShift KFDef file and the locally downloaded manifests

     # update the manifest repo URI
     sed -i 's#uri: .*#uri: '$PWD'#' ./kfdef/kfctl_openshift.yaml

     # set the Kubeflow application directory for this deployment, for example /opt/openshift-kfdef
     export KF_DIR=<path-to-kfdef>
     mkdir -p ${KF_DIR}
     cp ./kfdef/kfctl_openshift.yaml ${KF_DIR}

     # build deployment configuration
     cd ${KF_DIR}

     [vcp@mlops-oss openshift-kfdef]$ kfctl build --file=kfctl_openshift.yaml
     [vcp@mlops-oss openshift-kfdef]$ ls
     kfctl_openshift.yaml  kustomize
    
  3. Apply the generated deployment configuration.

    kfctl apply --file=kfctl_openshift.yaml
    
  4. Wait until all the pods are running in the kubeflow namespace.

    oc get pods -n kubeflow
    NAME                                                           READY   STATUS    RESTARTS   AGE
    argo-ui-7f79c9ccbc-vxqgx                                       1/1     Running   0          7m55s
    centraldashboard-65d87fb769-d8l5g                              1/1     Running   0          7m55s
    jupyter-web-app-deployment-6748fc47cc-78hr4                    1/1     Running   0          7m
    katib-controller-7dd757bdf-wmg2t                               1/1     Running   1          6m57s
    .......
    
  5. The command below looks up the URL of the Kubeflow user interface assigned by the OpenShift cluster. You can open the printed URL in your browser to access the Kubeflow user interface.

    # Get the URL of the Kubeflow UI
    oc get routes -n istio-system istio-ingressgateway -o jsonpath='http://{.spec.host}/'
    http://istio-ingressgateway-istio-system.apps.ocp4-cluster-001.liuqi.io/
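
The wait in step 4 can be automated by polling the pod list. The following is a minimal sketch, assuming the default oc get pods column layout (NAME, READY, STATUS, RESTARTS, AGE); count_pending_pods and wait_for_pods are hypothetical helper names, and the retry count and delay are arbitrary.

```shell
# Hypothetical helper: count pods in a namespace that are neither Running nor Completed.
# Assumes the default "oc get pods" column layout (STATUS is the third column).
count_pending_pods() {
  pods_list=$(oc get pods -n "$1" --no-headers) || return 1
  printf '%s\n' "$pods_list" |
    awk 'NF && $3 != "Running" && $3 != "Completed" { n++ } END { print n + 0 }'
}

# Poll until no pending pods remain, or give up after the given number of tries.
# Note: a namespace with no pods at all also counts as "nothing pending".
# Usage: wait_for_pods <namespace> [retries] [delay-seconds]
wait_for_pods() {
  pods_ns=$1
  pods_retries=${2:-60}
  pods_delay=${3:-10}
  pods_try=0
  while [ "$pods_try" -lt "$pods_retries" ]; do
    if pods_pending=$(count_pending_pods "$pods_ns") && [ "$pods_pending" -eq 0 ]; then
      return 0
    fi
    pods_try=$((pods_try + 1))
    sleep "$pods_delay"
  done
  echo "timed out waiting for pods in namespace $pods_ns" >&2
  return 1
}
```

Calling wait_for_pods kubeflow before looking up the route in step 5 avoids opening the UI while components are still starting.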
    

Security

Storage

Network