Fine-tuning a Large Language Model (LLM)
This guide demonstrates how to use a Redivis workflow to import an existing LLM, fine-tune it on relevant data, and then run it on another, similar set of data we are interested in.
Here, we want to fine-tune a pre-trained "foundational" LLM so that it can be used to score reviews. We will leverage an existing dataset that contains a collection of Yelp reviews and their scores to perform the fine-tuning, and then apply this classification model to other reviews (from Reddit) that do not contain an accompanying score. The goal here is to demonstrate how Redivis can be used to leverage, modify, and ultimately apply state-of-the-art LLMs to novel data.
This workflow is available on Redivis! We suggest recreating it as you follow along to best learn the process.
For this workflow we'll need our initial data to train the model on (in this case Yelp reviews) and the data we want to apply the model to (Reddit posts). These data are already on Redivis, split across two datasets uploaded to the Redivis Demo organization: Yelp Reviews (Hugging Face) and Reddit.
To get started we want to understand this dataset and what information is in each table. We can look at the dataset page to learn more about it, including its overview information, metadata, and variable summary statistics. Since this dataset is public we can also look directly at the data to confirm it has the information we need.
It looks like there are two tables, one with reviews for testing a model and another with reviews for training a model. Clicking on each table in this interface shows that they both have two variables (label and text), and that the Train table has 650,000 records while the Test table has 50,000 records.
This data seems to be formatted exactly how we'll want to use it, so we don't need to do any additional cleaning.
This dataset contains over 150 million Reddit posts along with subreddit information, split across two tables. We can look more closely at the 33 variables in the Posts table, including univariate statistics.
For this workflow, we just want to look at reviews from one specific subreddit dedicated to reviews of computer mice: MouseReview. If we click on the Subreddit variable name, we can see a searchable frequency table with all of this variable's values. If we search for MouseReview, we can see that this dataset contains 26,801 posts from that subreddit.
To move forward with this workflow, we'll want to train a model on the Yelp dataset, and filter and clean the Reddit table to make it more usable with our model. In order to clean or transform data and do our analysis, we'll need to create a workflow.
We want to leverage an existing model that understands language and can generally be used for language classification. There are many open-source models that might meet our needs here; in this example, we'll use Google's BERT-base-cased model.
This model is hosted on Hugging Face, so we could load it directly into our notebook at runtime. However, if our notebook uses restricted data, it might not have access to the external internet, in which case we'll need to load the model into a dataset on Redivis.
The Redivis dataset for this model can be found here. You can also learn more about loading ML models into Redivis datasets in our accompanying guide.
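Whichever source we use, loading the model in a notebook follows the standard Hugging Face pattern. Here is a minimal sketch; the local path is purely illustrative and depends on where you save the model files from the Redivis dataset:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Option 1: load directly from the Hugging Face Hub (requires internet access)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=5  # 5 labels = Yelp scores 0-4
)

# Option 2: load from model files stored in a Redivis dataset
# (path is illustrative; point it at wherever you downloaded the files)
# tokenizer = AutoTokenizer.from_pretrained("/path/to/bert-base-cased")
# model = AutoModelForSequenceClassification.from_pretrained(
#     "/path/to/bert-base-cased", num_labels=5
# )
```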
At the top of any dataset page, we can click the Analyze in workflow button to get started working with this data.
You can add this dataset to an existing workflow you already have access to, or create a new workflow to start from scratch.
Add the additional datasets by clicking the + Add data button in the top left corner of the workflow and searching for the dataset by name.
Once we've added all our datasets to the workflow, we can get started. To begin, we'll create a Python notebook based on the Yelp reviews training data, by selecting that table and clicking the + Notebook button.
To enable GPU acceleration, before starting the notebook, we'll choose a custom compute configuration with an NVIDIA-L4 GPU, which costs about $0.75 per hour to run (we could use the default, free notebook for this analysis, but it would take substantially longer to execute).
We'll also need to install a few additional dependencies to perform training and inference via Hugging Face Python packages:
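The exact package list and versions used in the notebook on Redivis may differ slightly, but the install cell will look something like this:

```python
# Run in a notebook cell; exact packages and versions are illustrative
!pip install transformers datasets evaluate accelerate
```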
With that, we can start our notebook! The full annotated notebook is embedded below, and also viewable on Redivis.
The general steps here are as follows:
Load the training and test data from the Yelp reviews dataset
Load the pretrained BERT model
Train this base model on the Yelp data to create a fine-tuned model that can classify text reviews with a score from 0 to 4 (these steps are sketched below).
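A condensed sketch of those steps is shown here. It assumes the Yelp training table is referenced through the redivis Python client (the exact table reference and sample size are illustrative and depend on your workflow), and it follows the standard Hugging Face Trainer pattern rather than reproducing the full annotated notebook:

```python
import redivis
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# 1. Load the Yelp training data from the workflow's source table
#    (table reference is illustrative; use the table your notebook is built on)
train_df = redivis.table("yelp_train").to_pandas_dataframe()
train_ds = (Dataset.from_pandas(train_df[["text", "label"]])
            .shuffle(seed=42)
            .select(range(5000)))  # subsample to keep the sketch fast; adjust as needed

# 2. Load the pretrained BERT model and tokenizer (5 labels = Yelp scores 0-4)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

train_ds = train_ds.map(tokenize, batched=True)

# 3. Fine-tune on the Yelp reviews and save the result for the inference step
args = TrainingArguments(output_dir="bert_yelp",
                         num_train_epochs=1,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()

trainer.save_model("bert_yelp_finetuned")
tokenizer.save_pretrained("bert_yelp_finetuned")
```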
The Yelp data was ready to go as-is, with a simple text field for the review and an integer value for the score. For the Reddit data, we just need to run a quick filter to choose posts from the appropriate subreddit.
We will use a transform to clean the data, as transforms are best suited for reshaping data at scale. Even though we might be more comfortable with Python or R, this dataset table is 83GB, and it will be much easier and faster to filter it in a transform than in a notebook.
Create a transform
Click on the Posts table in the Reddit dataset and press the + Transform button. This is the interface we will use to build our query.
Add a Filter step. Conceptually we want to keep records that are part of the subreddit we are interested in, and are not empty, deleted, or removed.
The final step in a transform is selecting which variables we would like to populate the resulting output table. In this case we just need the variables title and selftext.
With everything in place we will run this transform to create a new table, by pressing the Run button in the top right corner.
Finally, we can apply our fine-tuned model to the subset of Reddit posts that we want to analyze. Ultimately, we produce a single output table from the notebook, containing each Reddit post and the associated score generated by our model.
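For reference, the inference step looks roughly like the sketch below. It assumes the filtered Reddit table is available to the notebook through the redivis client and that the result is written with the client's notebook output-table helper; the table reference is illustrative, so check the annotated notebook on Redivis for the exact calls used there.

```python
import redivis
from transformers import pipeline

# Filtered Reddit posts produced by the transform
# (table reference is illustrative; use the table your notebook is sourced from)
posts_df = redivis.table("mousereview_posts").to_pandas_dataframe()

# Score each post with the fine-tuned model saved during training
# (device=0 assumes the GPU notebook configuration)
classifier = pipeline("text-classification", model="bert_yelp_finetuned", device=0)

texts = (posts_df["title"].fillna("") + " " + posts_df["selftext"].fillna("")).tolist()
predictions = classifier(texts, batch_size=32, truncation=True)

# Predicted labels come back as "LABEL_0" ... "LABEL_4", matching the 0-4 Yelp scores
posts_df["score"] = [int(p["label"].split("_")[-1]) for p in predictions]

# Write the scored posts out as the notebook's output table
redivis.current_notebook().create_output_table(posts_df)
```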
Workflows are iterative: at any point you can go back and change the source data, transform configurations, or notebooks and rerun them. Perhaps we want to look at other subreddits, or run the model on a larger sample of the Yelp data.
You can also fork this workflow to work on a similar analysis, or export any table in this workflow to analyze elsewhere.
We do recommend further familiarizing yourself with the examples and detailed documentation to take full advantage of the capabilities of Redivis notebooks: