Running ML workloads

Overview

Notebooks on Redivis offer a performant and highly flexible environment for data analysis. This includes running state-of-the-art machine learning (ML) models: you can train new models, fine-tune existing ones, and use them to perform inference and generate novel outputs.

This guide covers a number of common use cases for running ML workloads on Redivis. It focuses on the Hugging Face + PyTorch ecosystem in Python, though the examples are broadly applicable to other ML libraries and languages.

For a detailed example of using Redivis to fine-tune a large language model, see the companion example:

Fine tuning a Large Language Model (LLM)

1. Create a notebook with appropriate computation capacity

Training ML models and running inference can require a substantial amount of compute capacity, depending on various factors such as your model and dataset size, usage parameters, and performance goals.

The default, free notebook on Redivis offers 2 CPUs and 32 GB of RAM. While this may work for initial exploration, practical machine learning workflows typically require a GPU. When creating your notebook, you can choose a custom compute configuration to match your needs.

Redivis offers a number of custom compute configurations, mapping to the various machine types available on Google Cloud. We recommend starting with a more modest GPU for initial exploration, and then upgrading as needed when you hit computational or performance bottlenecks. For this example, we'll use the NVIDIA L4 GPU, which offers solid performance at a moderate cost.
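
Once the notebook has started, it can be worth confirming that the GPU you selected is actually visible to PyTorch. The following is a minimal sketch: pick_device is a hypothetical helper name, and the ImportError fallback is an assumption added only so the snippet also runs in environments without PyTorch.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # pre-installed in the Redivis Python notebook image
    except ImportError:
        # Assumption: fall back to CPU when PyTorch is not available
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")

if device == "cuda":
    import torch

    # Report the name of the first visible GPU
    print(torch.cuda.get_device_name(0))
```

If this reports "cpu" on a GPU configuration, check the notebook's compute settings before starting a long training run.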

2. Define dependencies

The Redivis Python notebook is based on the jupyter-pytorch notebook image, with PyTorch, CUDA bindings, and various common data science libraries pre-installed. If you require additional dependencies for your work, you can specify them under the "dependencies" section of your notebook.
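
To decide what needs to be added under "dependencies", one option is to check which packages are already installed in the running notebook. This is a sketch using only the Python standard library; installed_versions is a hypothetical helper, and the package list is just an illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Hypothetical list of packages a project might need
required = ["torch", "transformers", "sentence-transformers"]
for name, installed in installed_versions(required).items():
    print(f"{name}: {installed or 'not installed; add it to dependencies'}")
```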

2a. [Optional]: Pre-load external models when internet is disabled

If your notebook references export-restricted data, internet access will be disabled while the notebook is running, for security reasons. This can present a challenge for some common approaches to ML in Python, such as downloading a model or dataset from Hugging Face. For example, we might reference a model as follows:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")

This code typically downloads the model from Hugging Face and caches it to our local disk. However, if internet is disabled, this command will hang and ultimately fail. Instead, we need to download the model during notebook startup, before the internet is disabled, as part of the post_install.sh script under the notebook's dependencies:

python -c '
from huggingface_hub import snapshot_download
snapshot_download(repo_id="sentence-transformers/all-MiniLM-L6-v2")
'

This will download the model weights and other files to the default Hugging Face cache directory, ~/.cache/huggingface/hub.
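
To confirm from within the notebook that the model was cached, we can look for its folder in the cache directory. The sketch below assumes the cache layout models--<org>--<name>/snapshots/<revision> and the default cache location; cache_folder_name and is_cached are hypothetical helpers, not part of huggingface_hub.

```python
from pathlib import Path


def cache_folder_name(repo_id: str) -> str:
    """Folder name the Hugging Face hub cache uses for a model repo."""
    return "models--" + repo_id.replace("/", "--")


def is_cached(repo_id: str, cache_dir: Path) -> bool:
    """True if at least one snapshot of the repo exists in cache_dir."""
    snapshots = cache_dir / cache_folder_name(repo_id) / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


# Default cache location; it differs if HF_HOME or HF_HUB_CACHE is set
default_cache = Path.home() / ".cache" / "huggingface" / "hub"
print(is_cached("sentence-transformers/all-MiniLM-L6-v2", default_cache))
```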

Now, within our notebook, we can load the cached model. Make sure to set local_files_only=True, so that Hugging Face doesn't try to connect to the internet to check for a newer version of the model.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2", local_files_only=True)
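
As an additional safeguard, the Hugging Face libraries can be told to stay offline globally through the HF_HUB_OFFLINE environment variable. This is a sketch of one way to set it from Python; it is not required if local_files_only=True is passed everywhere.

```python
import os

# Ask Hugging Face libraries to skip network calls and rely on the local cache.
# Set this before importing huggingface_hub, transformers, or
# sentence_transformers, since the value is read when they are first imported.
os.environ["HF_HUB_OFFLINE"] = "1"

print(os.environ["HF_HUB_OFFLINE"])
```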

3. Load a model

3a. Load a model from Redivis

Machine learning models can be stored directly within a Redivis dataset as unstructured files. For example, this dataset contains the various files that make up the bert-base-cased model on Hugging Face. We can then download the model to our notebook's local filesystem:

import redivis
table = redivis.organization("demo").dataset("huggingface_models").table("bert_base_cased")
table.download_files("/scratch/bert-base-cased")

We can then reference this directory as a local model:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("/scratch/bert-base-cased", num_labels=5)
tokenizer = AutoTokenizer.from_pretrained("/scratch/bert-base-cased")

When using models stored on Redivis, we don't have to worry about whether our notebook has internet access, nor do we need to rely on the future availability of that particular model on Hugging Face.
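
Before loading a downloaded model, it can help to confirm that the directory holds the files we expect. The following is a rough sketch, not an exhaustive validation: check_model_dir is a hypothetical helper, and the weight file names are an assumption covering common single-file checkpoints (sharded checkpoints use different names).

```python
from pathlib import Path

# Assumption: weight file names commonly used by single-file checkpoints
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")


def check_model_dir(model_dir):
    """Return a list of problems found in a local model directory."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / "config.json").is_file():
        problems.append("missing config.json")
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        problems.append("no weight file found (" + " or ".join(WEIGHT_FILES) + ")")
    return problems


for problem in check_model_dir("/scratch/bert-base-cased"):
    print("Warning:", problem)
```

An empty result means the basic files are present; it does not guarantee that the model will load.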

To save a Hugging Face model to a Redivis dataset, you can either download the files from Hugging Face and re-upload them to Redivis manually, or use a notebook to upload the files programmatically. For example:

from huggingface_hub import snapshot_download
import redivis

# Download the model files from Hugging Face.
# snapshot_download returns the local path of the downloaded snapshot.
snapshot_path = snapshot_download(repo_id="google-bert/bert-base-cased")

# Specify an existing dataset and table on Redivis.
# Consult the Python docs for how to programmatically create datasets (apidocs.redivis.com)
table = redivis.organization("demo").dataset("huggingface_models").table("bert_base_cased")

# Add the downloaded model files to the table
table.add_files(directory=snapshot_path)

3b. Load a model from an external source

If your notebook has internet access, you can also use any other models that may be available on the internet. For example, we can load the same bert-base-cased model directly from Hugging Face:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

# Do cool things!

Note that if you are working in a notebook with disabled internet, this approach won't work, and you'll need to use the methods mentioned in either 2a or 3a above.
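
If the same notebook code needs to run both with and without internet access, one option is to check connectivity first and pick the model source accordingly. This is a sketch under assumptions: has_internet is a hypothetical helper, the host and timeout are illustrative, and DNS lookups may take longer than the timeout in some network setups.

```python
import socket


def has_internet(host="huggingface.co", port=443, timeout=3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts
        return False


# Hypothetical choice between the Hugging Face repo and a local copy (see 3a)
model_source = "google-bert/bert-base-cased" if has_internet() else "/scratch/bert-base-cased"
print(model_source)
```

The resulting model_source can then be passed to from_pretrained in place of a hard-coded name.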

4. Load data

As a final step, you'll likely want to load data into your notebook to fine-tune the model, perform inference, or otherwise experiment. There are thousands of datasets on Redivis, and you can upload your own data as well. You can learn more about loading data into a notebook in the Python notebooks documentation. As a quick example:

import redivis

# Load tabular data as a pandas DataFrame, or in a number of other formats (arrow, polars, iterator, etc.)
table = redivis.organization("demo").dataset("ghcn_daily_weather_data").table("stations")
df = table.to_pandas_dataframe()

# Download unstructured data as files to your local disk
files_table = redivis.organization("demo").dataset("chest_x_ray_8").table("images")
files_table.download_files("/scratch/xray_images")
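
After downloading unstructured files, a common next step is to collect their paths so they can be fed into a data pipeline. The sketch below uses only the standard library; list_files is a hypothetical helper, and the image extensions are an assumption about what the table contains.

```python
from pathlib import Path


def list_files(directory, extensions=(".png", ".jpg", ".jpeg")):
    """Return sorted file paths under directory whose suffix matches extensions."""
    root = Path(directory)
    if not root.is_dir():
        return []
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )


image_paths = list_files("/scratch/xray_images")
print(f"Found {len(image_paths)} image files")
```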

If your notebook has internet access, you can also call various APIs to load external data sources. For example:

from datasets import load_dataset

# Load a dataset from Hugging Face
hf_dataset = load_dataset("Yelp/yelp_review_full")

Next steps

At this point, you have all the tools you need to perform cutting-edge ML research. What you do next is up to you. We recommend further familiarizing yourself with the examples and detailed documentation to take full advantage of the capabilities of Redivis notebooks.
