Get going with JupyterLab on OpenShift

JupyterLab is one of the most widely used IDEs for data science and machine learning. Deploying it on OpenShift / Kubernetes adds flexibility in terms of convenient access, resource allocation and horizontal scaling across user groups.

Author: David Poole, 16 June 2020

Introduction

In this blog post I will outline how to get your data science / machine learning applications running on an OpenShift-hosted instance of JupyterLab.

JupyterLab allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization and machine learning.

OpenShift is a family of containerization software from Red Hat built around Docker containers orchestrated and managed by Kubernetes.

Installing JupyterLab on OpenShift

For installation, I referred to the Red Hat training guides and blog posts in the “References and Further Reading” section below. Based on this information, I was able to quickly build and deploy a JupyterLab container called “custom-notebook” on Safe Swiss Cloud’s OpenShift environment. Once deployed, you end up with a single-user instance of JupyterLab with password authentication. I added a secured route so that the container, and thus JupyterLab, could be accessed externally from the browser with a convenient URL.

As an alternative to single-user JupyterLab, you can opt to deploy the multi-user option, JupyterHub. The choice is yours. I went for the single-user option because each user can then be assigned a separate OpenShift project (namespace), which provides built-in user sandboxing.

Installing TensorFlow on JupyterLab

Since Google’s TensorFlow is one of the most widely used machine learning toolkits, I decided to start with that. Installation in JupyterLab could not be easier: simply create a new notebook with a Python 3 kernel and enter “pip install tensorflow” in a cell. After a few minutes, the latest version (2.2.0 at the time of writing) will be installed together with Keras, the high-level API built on top of TensorFlow.
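If the container image ships more than one Python environment, it can help to run pip through the interpreter that the notebook kernel itself is using. The following is a minimal sketch of that idea, assuming pip is available in the image; the helper names pip_install_command and pip_install are hypothetical and only serve the illustration.

```python
import subprocess
import sys


def pip_install_command(package, version=None):
    """Build a pip command that targets the interpreter running this kernel."""
    requirement = f"{package}=={version}" if version else package
    return [sys.executable, "-m", "pip", "install", requirement]


def pip_install(package, version=None):
    """Run the install; raises CalledProcessError if pip exits with an error."""
    subprocess.check_call(pip_install_command(package, version))


# Only print the command here; call pip_install("tensorflow") to actually run it.
print(" ".join(pip_install_command("tensorflow", "2.2.0")))
```

Pinning the version, as in the printed command, is optional; it simply makes a rebuilt pod reinstall the same release.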

Installing alternative Machine Learning Toolkits

As well as TensorFlow, I could successfully “pip install” the following toolkits and packages: Facebook Prophet and PyTorch, Microsoft CNTK, Apache MXNet, Intel OpenCV and AWS SageMaker.

Visualisation tools

The visualisation and tooling packages scikit-learn (imported as sklearn), matplotlib, plotly and folium were also installed as part of various projects.
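After installing several toolkits, a quick way to see what actually ended up in the notebook environment is to query the installed package metadata. The sketch below uses only the Python standard library (Python 3.8 or later). The distribution names in TOOLKITS are assumptions based on the current PyPI names and may differ from what you installed (Prophet, for instance, was published as fbprophet in 2020).

```python
from importlib import metadata

# Distribution names as published on PyPI; these are assumptions and can differ
# from the import name (for example, scikit-learn is imported as sklearn).
TOOLKITS = [
    "tensorflow", "torch", "prophet", "cntk", "mxnet", "opencv-python",
    "sagemaker", "scikit-learn", "matplotlib", "plotly", "folium",
]


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(TOOLKITS).items():
    print(f"{name:15} {version or 'not installed'}")
```

Running this in a fresh pod shows at a glance which packages need to be reinstalled if the environment was not kept on persistent storage.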

OpenShift container sizing considerations

In container (pod) terms, most machine learning toolkits are RAM hungry, so you will need to raise your container resource parameters beyond the defaults. For example, CNTK needs several GB of RAM just to get past the install phase. You can do this by editing the pod YAML and setting the “memory” resource to e.g. “6Gi”. CPU usage matters little during installation but becomes relevant as soon as you start, say, training a model. Whenever you increase resources, the existing pod is destroyed and a new pod deployed, so make sure that you have mounted the application directory on persistent storage; otherwise your work will be lost (see next section).
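To make the memory setting concrete, here is a small sketch that builds the “resources” fragment which sits under the notebook container's entry in the pod specification. It is written in Python so that it can be run from a notebook; the function name container_resources is hypothetical, and setting the request equal to the limit is an assumption made for simplicity, not a recommendation from the guides. The fragment is printed as JSON, which YAML parsers also accept.

```python
import json


def container_resources(memory_limit, memory_request=None):
    """Build the 'resources' fragment for one container in a pod spec."""
    # Assumption: if no separate request is given, request the full limit.
    request = memory_request or memory_limit
    return {
        "resources": {
            "requests": {"memory": request},
            "limits": {"memory": memory_limit},
        }
    }


# "6Gi" is the example value used in the text above.
print(json.dumps(container_resources("6Gi"), indent=2))
```

A lower request with a higher limit, e.g. container_resources("6Gi", "2Gi"), lets the scheduler place the pod more easily while still allowing peaks during installation.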

Adding Persistent Storage

Since pods are ephemeral, data stored inside the container disappears whenever a pod is rebuilt. To avoid this, you need to mount your application directory on persistent storage. For example, you can create a persistent volume claim (PVC) via the OpenShift GUI and modify your pod YAML to associate your application directory with that PVC.
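As a rough sketch of what these two pieces look like, the Python snippet below builds a PersistentVolumeClaim manifest and the matching volume entries for the pod specification, then prints them as JSON (which YAML parsers also accept). The claim name “custom-notebook-data”, the size “5Gi”, the volume name “notebook-data” and the mount path “/opt/app-root/src” are all assumptions for illustration; use the values that match your own image and project.

```python
import json


def pvc_manifest(name, size):
    """Build a PersistentVolumeClaim manifest with single-node read/write access."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


def pod_storage_fragment(claim_name, mount_path, volume_name="notebook-data"):
    """Build the entries that tie a directory in the container to the claim.

    'volumes' belongs at the pod spec level, 'volumeMounts' under the
    notebook container's entry.
    """
    return {
        "volumes": [
            {"name": volume_name, "persistentVolumeClaim": {"claimName": claim_name}}
        ],
        "volumeMounts": [{"name": volume_name, "mountPath": mount_path}],
    }


# All names, sizes and paths below are illustrative assumptions.
print(json.dumps(pvc_manifest("custom-notebook-data", "5Gi"), indent=2))
print(json.dumps(pod_storage_fragment("custom-notebook-data", "/opt/app-root/src"), indent=2))
```

The volume name is what links the two entries: the mount refers to the volume, and the volume refers to the claim.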

Any Questions?

If you have questions or suggestions, please leave a comment.

References and Further Reading

Comments [1]

Fernando
July 7th, 2020

Is openshift compatible with gpu resources on cloud?
