Remote Data Analysis with Jupyter and ngrok
In my previous post, I mentioned that we’re using Jupyter notebooks for a lot of our data analysis, even in R.
This post is a quick ‘howto’ for doing such analyses on remote compute servers (in the university data center, Amazon or Azure clouds, or whatever).
The Pieces
- Jupyter, with its Notebook system
- IRKernel, if you are using R
- ngrok to make the tunnel
- tmux to split the terminal screen & keep the session alive
I’m doing all of this on a Linux server; our current compute server runs Red Hat Enterprise Linux.
I install tmux from my distribution repositories. I download the ngrok binary from the ngrok web site, and put it in ~/bin (for myself) or /usr/local/bin (so it’s available for my students too).
For Jupyter and IRKernel, there are many ways to get them! I most often use Anaconda or Miniconda to install it, along with my Python and R:
$ conda install notebook
$ conda install -c r r r-irkernel
You also need to edit your Jupyter configuration to allow remote connections; edit ~/.jupyter/jupyter_notebook_config.py to contain the following line:
c.NotebookApp.allow_remote_access = True
If this file doesn’t yet exist, you can ask Jupyter to generate it first:
jupyter notebook --generate-config
Finally, you will need to create an ngrok account and set up your ngrok installation to connect to it.
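For reference, connecting your installation to your account is a single command; you paste in the auth token shown on the ngrok dashboard (the token below is a placeholder, and newer ngrok releases spell this subcommand ngrok config add-authtoken):
ngrok authtoken <your-auth-token>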
Setting Up a Session
First, I SSH in to the compute server, and go to my project directory:
localhost:~$ ssh big-data-monster.cs.txstate.edu
big-data-monster:~$ cd /data/my-project
I then launch a tmux session to contain my analysis & allow me to split my session into multiple terminals:
big-data-monster:~$ tmux
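Because everything runs inside tmux, the session also survives a dropped SSH connection; the next time I log in, I can pick it back up with:
big-data-monster:~$ tmux attach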
I typically split my tmux into two panes with Ctrl+B ". Once this is done, Ctrl+B o will jump between panes.
In one pane, I start the notebook server:
big-data-monster:~$ jupyter notebook
In the other pane, I launch ngrok; watch the notebook server’s output to see the port to use:
big-data-monster:~$ ngrok http 8888
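If port 8888 was already taken, Jupyter will have picked the next free port (8889, 8890, and so on) and reported it in its startup output; tunnel whichever port it actually chose, for example:
big-data-monster:~$ ngrok http 8889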
Once this is done, ngrok will give you a URL to connect to; connect to the HTTPS version, and you have your notebook! Jupyter will prompt you for a token; enter the token (the part after ?token= in your Jupyter console output). If you don’t have the token handy, start another terminal pane and run jupyter notebook list.
Benefits and Rationale
One of the big benefits of this setup is that it doesn’t require administering additional server software. Access control is handled entirely by the system’s SSH daemon, authentication, and file system permissions. Users with accounts on the compute server can spin up their own notebooks, with their own Python and/or R environments, and perform their analysis with minimal need for sysadmin support.
Using ngrok instead of plain SSH tunnels allows me to build the tunnel after the fact, without having to anticipate what port will be available for my network server. When several users share a compute server, Jupyter will automatically pick an available port. I set up the ngrok tunnel after the notebook server is running, so I can tunnel whatever port it found.
Also, ngrok is easier to set up than multi-hop tunnels when connecting through a bastion host.
This setup is secure enough for most of our work. For an analysis with sensitive human subjects data, however, I may drop ngrok and do the extra work to set up a plain SSH tunnel for the notebook server.
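For completeness, a minimal sketch of that plain-tunnel alternative, assuming the notebook ended up on port 8888 (with OpenSSH 7.3 or later, adding -J bastion-host covers the multi-hop case):
localhost:~$ ssh -N -L 8888:localhost:8888 big-data-monster.cs.txstate.edu
With the tunnel up, the notebook is reachable at http://localhost:8888 on your own machine, and Jupyter prompts for the token as before.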
Alternatives to ngrok
ngrok is not the only source of this service; if you want something that does not require an account, try serveo:
ssh -R 443:localhost:8888 serveo.net
It will print out the URL you can connect to, and the rest proceeds just like logging in with ngrok.
Advanced Setup: Per-Project Environments
Conda, the Anaconda/Miniconda package manager, has a useful concept of environments that allows you to have multiple installations of Python, R, and libraries and switch between them on a per-project basis. If each environment has notebook installed, then this setup method starts a notebook server that’s using the software environment set up for that project.
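As a sketch, creating such an environment might look like this (doascience is the environment name I use below; adjust the package list to the project):
$ conda create -n doascience -c r r r-irkernel notebook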
I combine this with the wonderful direnv, which I’ve been using for a while to improve my shell environment setup.
I have the following in ~/.direnvrc:
# activate the named Conda environment ("source activate" is the
# activation command used by older Conda releases)
use_conda() {
  source activate "$1"
}
I then put the following in the .envrc file for a particular project:
use conda doascience
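One detail worth remembering: direnv refuses to load a new or modified .envrc until you approve it:
big-data-monster:/data/my-project$ direnv allow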
When I cd into the experiment directory, the direnv shell hooks will automatically activate the proper Conda environment, and running Jupyter will spin up a notebook server with that project’s requirements.