Noah Trenaman

Colab + S3: Using Google Colab With A Local IDE

June 27, 2020

I've never been a huge fan of computational notebooks for deep learning experiments, but Google Colab is an exception. Colab is a hosted Jupyter notebook that, most notably, provides a ridiculous amount of compute for free. On a good day, you can get a session with a Tesla P100 GPU (16 GB of VRAM) or Google's v2-8 TPU (64 GB of memory). You can also get more reliable access to the best tier of hardware with Colab Pro.

While Colab provides access to some serious computing resources, it can be difficult to use compared to a dedicated IDE such as VS Code or PyCharm. When your codebase spans multiple Python modules (files), Colab gets unwieldy. For example, I was working on a project with 12 code files and over 2,000 lines of code, which would be a major headache to debug in a Colab session, and a lot of research code seems to be this way. On the other hand, Colab provides far more compute than I have locally, which is useful for running experiments. Luckily, there's a way to get the best of both worlds by automating the code sync between your local IDE and a Colab session. This helps solve the common problem of switching back and forth between an IDE and a cloud-powered notebook:

Notebooks do not have all of the features of an IDE, like integrated documentation or sophisticated autocomplete, so participants often switch back and forth between an IDE (e.g., VS Code) and their notebook. One participant we observed kept both windows side-by-side and copy and pasted code between the two windows rapidly as they worked. — What's wrong with computational notebooks?

This seems to be a common problem for people hoping to use Colab, so I thought I'd share my workflow. While it's not ideal, I find it much smoother and more productive than copy-pasting or using the Colab filesystem UI. I think the real value of Jupyter notebooks is in presenting polished work, less so in building out and testing complex deep learning systems. This is a way to get the compute power of Colab without sacrificing your IDE.

Ingredients

  • A local IDE or text editor (VS Code, PyCharm, Vim, etc.)
  • An Amazon Web Services account and an S3 bucket (it's free!)
  • upload.py, a script that uploads local code to your S3 bucket
  • Some scripts to run at the top of your Colab notebook to give it access to your S3 bucket

Creating your local project

By developing locally, you can use your favorite IDE and also break your project into multiple files. I tend to have dedicated Python modules for organizing dataset pipelines and for defining different models. Your project should just be a folder full of Python files and any other code files, like shell scripts or JSON/YAML for config.
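
As a concrete example, the layout I have in mind looks something like this (the file names are illustrative and match the upload output shown later):

```
project/
├── main.py
├── upload.py
├── config.yaml
└── models/
    ├── transformer.py
    └── vision_backbone.py
```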

Creating an S3 Bucket

To share code from your local codebase with your Colab notebook, we'll use an S3 bucket. This is a simple cloud storage service from AWS. S3's free tier includes 5 GB of storage, which is more than enough for a few files of code.

Once you've created an AWS account, you can create an S3 bucket following these steps. Remember the name of your bucket as it will be used later.
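
If you prefer, you can also create the bucket programmatically. Here's a minimal sketch, assuming boto3 is installed and your credentials are configured; the bucket name is a placeholder (bucket names are globally unique):

```python
# Create an S3 bucket from Python instead of the AWS console (sketch).
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket="my-colab-code-bucket")  # placeholder name
# Note: outside us-east-1, create_bucket also needs
#   CreateBucketConfiguration={"LocationConstraint": "<your-region>"}
```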

You'll also need to note your AWS credentials: an "access key ID" and a "secret access key." These need to be set as local environment variables, and they will also be used in your Colab notebook.
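
These live in the standard AWS environment variable names, which boto3 picks up automatically. A quick sanity check before running the upload script (just a sketch, nothing Colab-specific) looks like:

```python
# Verify the standard AWS credential environment variables are set.
import os

for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
    if not os.environ.get(var):
        raise RuntimeError(f"{var} is not set")
```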

Running upload.py

Add upload.py to your local project, update the BUCKET_NAME constant, and make sure your AWS credentials are set. Then running upload.py will add all of your code files to your S3 bucket. For example:

Uploading models/transformer.py
Uploading models/vision_backbone.py
Uploading upload.py
Uploading main.py
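
If you'd rather roll your own than use the linked script, a minimal sketch with boto3 might look like this; BUCKET_NAME and the extension filter are placeholders to adapt:

```python
# A minimal upload script in the spirit of upload.py (sketch only; the
# linked script is the authoritative version). Assumes boto3 is installed
# and the AWS credentials described above are set.
from pathlib import Path

import boto3

BUCKET_NAME = "my-colab-code-bucket"  # placeholder: use your bucket's name
EXTENSIONS = {".py", ".sh", ".json", ".yaml", ".yml"}

s3 = boto3.client("s3")

for path in Path(".").rglob("*"):
    if path.is_file() and path.suffix in EXTENSIONS:
        key = path.as_posix()  # e.g. "models/transformer.py"
        print(f"Uploading {key}")
        s3.upload_file(str(path), BUCKET_NAME, key)
```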

Accessing your code on Colab

s3fs is a FUSE-based program that lets you treat an S3 bucket as an extension of your filesystem: once set up, the bucket appears as just another directory in Colab's filesystem. You can set your AWS credentials in a Colab notebook, install s3fs, and point it at your bucket. The full set of scripts to add to the top of your Colab notebook is here.
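
For reference, the top-of-notebook cells look roughly like the following; the linked scripts are the full version, and the bucket name and mount point here are placeholders:

```python
# Colab setup, approximately (the linked scripts are authoritative).
import os

os.environ["AWS_ACCESS_KEY_ID"] = "..."      # paste your access key ID
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."  # paste your secret access key

# s3fs reads credentials from a key:secret file with 600 permissions.
with open("/root/.passwd-s3fs", "w") as f:
    f.write(os.environ["AWS_ACCESS_KEY_ID"] + ":"
            + os.environ["AWS_SECRET_ACCESS_KEY"] + "\n")

!chmod 600 /root/.passwd-s3fs
!apt-get install -y -qq s3fs
!mkdir -p /content/s3
!s3fs my-colab-code-bucket /content/s3 -o passwd_file=/root/.passwd-s3fs
```

The passwd file is written and chmod'd before mounting because s3fs refuses credentials files that other users can read.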

Profit

Now you should be able to develop and debug code with your local IDE and access it in a Colab session after running upload.py. Whenever you update your local code, re-run upload.py and restart your Colab session to pick up the changes. A hotkey I use often is Ctrl+M followed by "." to quickly restart the Colab runtime. Another useful pattern is to run multiple experiments by uploading a separate config to each Colab session.
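
One detail worth noting: to import the synced modules inside the notebook, I add the mount point to the Python path (the path matches the mount sketch above; main is the hypothetical entry point from the example project):

```python
# Make the mounted bucket importable in the Colab session.
import sys
sys.path.insert(0, "/content/s3")

import main  # hypothetical entry point uploaded by upload.py
```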

Other Approaches

  • Using ngrok and Google Drive: Transform Google Colab to a GPU instance with full SSH access