Fine tuning a Large Language Model (LLM)


This guide demonstrates how to use a Redivis workflow to import an existing LLM, fine-tune it on relevant data, and then run it on another, similar set of data we are interested in.

Workflow objective

Here, we want to fine-tune a pre-trained "foundational" LLM so that it can be used to score reviews. We will leverage an existing dataset that contains a collection of Yelp reviews and their scores to perform the fine-tuning, and then apply this classification model to other reviews (from Reddit) that do not contain an accompanying score. The goal here is to demonstrate how Redivis can be used to leverage, modify, and ultimately apply state-of-the-art LLMs to novel data.

This workflow is on Redivis. We also suggest you recreate it as we go to best learn the process.

1. Choose and explore data

For this workflow we'll need our initial data to train the model on (in this case Yelp reviews) and the data we want to apply the model to (Reddit posts). These data are already on Redivis, split across two datasets uploaded to the Redivis Demo organization: Yelp Reviews (Hugging Face) and Reddit.

Yelp reviews

To get started we want to understand this dataset and what information is in each table. We can look at the dataset page to learn more about it, including its overview information, metadata, and variable summary statistics. Since this dataset is public we can also look directly at the data to confirm it has the information we need.

It looks like there are two tables, one with reviews for testing a model and another with reviews for training a model. Clicking on each table in this interface shows that they both have two variables (label and text) and that the Train table has 650,000 records while the Test table has 50,000 records.

This data is formatted exactly how we'll want to use it, so we don't need to do any additional cleaning.

Reddit

This dataset contains over 150 million Reddit posts along with subreddit information, split across two tables. We can look more closely at the 33 variables in the Reddit posts table, including univariate statistics.

For this workflow, we just want to look at reviews from one specific subreddit that reviews mouse traps: MouseReview. If we click on the Subreddit variable name, we can see a searchable frequency table with all of this variable's values. If we search for MouseReview, we can see that this dataset contains 26,801 such posts.

To move forward with this workflow we'll want to train a model on the Yelp dataset, and filter and clean the Reddit table to make it more usable with our model. In order to clean or transform data and do our analysis, we'll need to create a workflow.

2. Identify a base model to fine-tune

We want to leverage an existing model that understands language and can generally be used for language classification. There are many open-source models that might meet our needs here; in this example, we'll use Google's BERT-base-cased model.

This model is hosted on Hugging Face, so we could load it directly into our notebook at runtime. However, if our notebook uses restricted data, it might not have access to the external internet, in which case we'll need to load the model into a dataset on Redivis.

The Redivis dataset for this model can be found here. You can also learn more about loading ML models into Redivis datasets in our accompanying Guide to running ML Workloads on Redivis.

3. Create a workflow

At the top of any dataset page, we can click the Analyze in workflow button to get started working with this data.

You can add this dataset to an existing workflow you already have access to, or create a new workflow to start from scratch.

Add the additional datasets by clicking the + Add data button in the top left corner of the workflow and searching for each dataset by name.

4. Create a notebook and load model + data

Once we've added all our datasets to the workflow, we can get started. To begin, we'll create a Python notebook based on the Yelp reviews training data by selecting that table and clicking the + Notebook button.

To enable GPU acceleration, we'll configure the notebook before starting it by choosing a custom compute configuration with an NVIDIA L4 GPU, which costs about $0.75 per hour to run (we could use the default, free notebook for this analysis, but it would take substantially longer to execute).

We'll also need to install a few additional dependencies to perform training and inference via the Hugging Face Python packages; we can specify these by clicking the "Dependencies" button in the notebook.
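The definitive list is in the notebook's dependency configuration; as a sketch, assuming a typical Hugging Face training setup, it would look something like this:

```
# Assumed dependency list for Hugging Face training and inference;
# the notebook's actual packages and versions may differ.
transformers
datasets
accelerate
```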

With that, we can start our notebook! The full annotated notebook is embedded below and is also viewable on Redivis. The general steps here are as follows:

  1. Load the training and test data from the Yelp reviews dataset

  2. Load the pretrained BERT model

  3. Train this base model on the Yelp data to create a fine-tuned model that can score text reviews from 0-4.
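As a minimal sketch of those three steps, assuming the redivis-python client's table accessor and standard Hugging Face APIs (table names here are illustrative, and the annotated notebook remains the definitive implementation):

```python
# Sketch of the three steps above. The table names and the redivis.table(...)
# accessor are assumptions; see the redivis-python library documentation and
# the annotated notebook for exact usage.
import redivis
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# 1. Load the training and test data from the Yelp reviews dataset
train_df = redivis.table("yelp_train").to_pandas_dataframe()  # columns: label, text
test_df = redivis.table("yelp_test").to_pandas_dataframe()

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

train_ds = Dataset.from_pandas(train_df).map(tokenize, batched=True)
test_ds = Dataset.from_pandas(test_df).map(tokenize, batched=True)

# 2. Load the pretrained BERT model with a 5-class head (review scores 0-4).
# If the notebook has no internet access, point from_pretrained() at the model
# files loaded from the Redivis dataset instead of the Hub name.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=5
)

# 3. Fine-tune the base model on the Yelp reviews
args = TrainingArguments(output_dir="yelp_bert", num_train_epochs=1)
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=test_ds)
trainer.train()
```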

5. Prepare the Reddit data for inference

The Yelp data was ready to go as-is, with a simple text field for the review and an integer value for the score. For the Reddit data, we just need to run a quick filter to choose posts from the appropriate subreddit.

We will use a transform to clean the data, as transforms are best suited for reshaping data at scale. Even though we might be more comfortable with Python or R, this table is 83GB, and it will be much easier and faster to filter it in a transform than in a notebook.

Create a transform

Click on the Posts table in the Reddit dataset and press the + Transform button. This is the interface we will use to build our query.

Add a Filter step. Conceptually, we want to keep records that are part of the subreddit we are interested in and are not empty, deleted, or removed.

The final step in a transform is selecting which variables will populate the resulting output table. In this case we just need the variables title and selftext.

With everything in place, we can run the transform to create a new table by pressing the Run button in the top right corner.

6. Use the fine-tuned model to classify Reddit reviews

Finally, we can apply our fine-tuned model to the subset of Reddit posts that we want to analyze. Ultimately, we produce a single output table from the notebook, containing each Reddit post and the associated score generated by our model.
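Continuing the sketch from step 4 (again, the table name and the redivis-python calls for reading the transform output and writing the notebook's output table are assumptions, not the guide's exact code):

```python
# Apply the fine-tuned classifier to the filtered Reddit posts and write the
# scored posts out as the notebook's output table.
import redivis
from transformers import pipeline

posts_df = redivis.table("posts_filtered").to_pandas_dataframe()  # output of the transform above

# Combine title and selftext into a single piece of text to score
posts_df["review_text"] = (
    posts_df["title"].fillna("") + " " + posts_df["selftext"].fillna("")
).str.strip()

# Reuse the fine-tuned model and tokenizer from the training sketch
classifier = pipeline(
    "text-classification",
    model=trainer.model,
    tokenizer=tokenizer,
    device=0,          # run on the GPU
    truncation=True,
)
results = classifier(posts_df["review_text"].tolist(), batch_size=32)
posts_df["score"] = [int(r["label"].split("_")[-1]) for r in results]  # LABEL_0..LABEL_4 -> 0..4

redivis.current_notebook().create_output_table(posts_df[["title", "selftext", "score"]])
```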

Next steps

Workflows are iterative: at any point you can go back and change your source data, transform configuration, or notebooks, and rerun them. Perhaps we want to look at other subreddits, or run the model on a larger sample of the Yelp data.

You can also fork this workflow to work on a similar analysis, or export any table in this workflow to analyze elsewhere.

We also recommend familiarizing yourself with the examples and detailed documentation to take full advantage of the capabilities of Redivis notebooks:

  • Python notebook reference
  • R notebook reference
  • redivis-python library documentation
  • redivis-R library documentation
