Creating an ML model
This guide demonstrates using a Redivis workflow to train an ML model on a set of images stored in a Redivis dataset.
This is an example workflow demonstrating image classification via Convolutional Neural Networks. It imports an example dataset containing several thousand test and training images of cats and dogs, with which we can train and evaluate our model.
This workflow is heavily adapted from its initial publication at: https://gsurma.medium.com/image-classifier-cats-vs-dogs-with-convolutional-neural-networks-cnns-and-google-colabs-4e9af21ae7a8
This workflow is available on Redivis! We suggest recreating it yourself as you follow along, as the best way to learn the process.
All the image data we need is contained in the Demo organization dataset Example data files.
We can go to this dataset and browse its tables to understand the structure of the data it contains.
We see three tables here, and all of them are file index tables. That means each table contains an index of the files (unstructured data) this dataset contains, organized by the folder the administrator uploaded them into. We can click on the Files tab of the dataset to see each file individually, and click on any file to see a preview.
This dataset has three groupings of files:
Training images (we will use these to build the model)
Test images (images not included in the training set that we can verify the model with)
Example file types (unrelated to this workflow)
If we click on the Tables tab and then on the Training images table, we can see high-level information about this set of files. There are 25,000 files, and when we click the Cells tab, all of the file names end in .jpg. We can hover on these to see a preview of the image, and we can click on the file_id variable to see a preview of the image with more information.
At the top of this dataset page we can click the Analyze in workflow button to get started working with this data.
You can add this dataset to an existing workflow you already have access to, or create a new workflow to start from scratch.
We will use transforms to clean the data, as they are best suited for reshaping data and will quickly output new tables we can continue to work with.
We need to start by defining the training set: conceptually, the set of images we know to be cats or dogs, on which the model will be trained. Whether an image is a cat or a dog is encoded in the file name, so we need to pull that information out into a new variable we can more easily sort on.
Click on the table Training images and create a transform. This interface is where we will define a query which will run against our source table and create an output table. You can choose to write the query in SQL but we will use the interface for this example since it is faster and easier to use.
Add a Create variables step and name the new variable is_cat. The method will be Regexp contains, which allows us to easily identify the presence of the string cat in the file_name variable. The new variable will be a boolean, where true means the image contains a cat and false means it does not.
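For intuition, the same logic expressed in pandas would look something like the sketch below (the transform itself runs as SQL on Redivis; the sample file names are illustrative):

```python
import pandas as pd

# Illustrative file names; the real table has 25,000 rows.
df = pd.DataFrame({"file_name": ["cat.1.jpg", "dog.1.jpg", "cat.2.jpg"]})

# Equivalent of the Regexp contains method: true when "cat" appears in the name.
df["is_cat"] = df["file_name"].str.contains("cat", regex=True)
```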
We want to include only some of our images in the training set used to train the model, since we want to set some aside to validate the model. Here we want exactly 5000 cat images and 5000 dog images, so we will create a new variable rank and filter on it to keep only the first 5000 images of each type.
To do this, click + Add block in the Create variables step and use the Rank method. This is an analytic method, meaning we can partition on the is_cat variable: within each partitioned value (true and false), a rank will be assigned.
Create a new Filter step. Conceptually we will keep records with a rank of up to 5000, which means the output will include 5000 true values and 5000 false values.
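As a rough pandas equivalent (again just a sketch of the concept, not what the transform executes):

```python
# Assign a 1-based rank within each is_cat partition, mirroring the Rank block.
df["rank"] = df.groupby("is_cat").cumcount() + 1

# Filter step: keep the first 5000 images of each type (10,000 records total).
training = df[df["rank"] <= 5000]
```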
The final step in the transform is deciding which variables we want in our output table. We will keep our new boolean variable is_cat to use later, along with file_id and file_name.
With everything in place we can run this transform by clicking the Run button in the top right corner.
Now that we have created a new table, we can inspect it to make sure our steps accomplished what we expected them to.
Click on the output table below the transform to view it. We can see that it contains 10,000 records, which is exactly what we expected.
We can also inspect each variable further. If we click on the is_cat variable, we can see that there are 5000 true values and 5000 false values, which shows that our filtering was successful. We can also validate that the method we used to determine whether an image is a cat or a dog worked by clicking on the Cells tab. Here we can see that records marked true have "cat" in their file name, and when we hover on the file_id value to see a preview, the image clearly contains a cat.
Since this table looks the way we expect, we can move on to the next step! Otherwise we'd need to go back to the initial transform to change our inputs.
We need to create a set of image files separate from our training set where we know if the image contains a cat or dog. This will be used to validate the model training.
Create a new transform and take the same steps as in the previous transform, but change the filter to keep images ranked 5001-7500 rather than 1-5000. We will keep the same variables as in the training transform, and then run it.
When we run this transform and inspect the output table we see what we expect here as well. There are 5000 total files and we can validate a few of them visually on the Cells tab.
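Continuing the earlier pandas sketch, the only conceptual change is the filter range:

```python
# Keep ranks 5001-7500 within each partition: 2500 cats + 2500 dogs = 5000 files.
validation = df[df["rank"].between(5001, 7500)]
```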
Next we want to train and test a model using Python code, with the help of various Python libraries. Transforms are more performant than notebooks, but they are based on SQL and operate linearly, with only a single output table allowed. In order to work in Python, R, Stata, or SAS to generate visuals and other outputs, we will create a Notebook node on our Training data output table.
When you create the notebook for the first time it will start up. Notebooks must be running to execute code.
Redivis notebooks come with many common packages preinstalled, and you can install additional packages by clicking the Dependencies button before importing them in your code. Since this notebook contains only public data, we can install packages at any time; notebooks working with restricted data do not have internet access, and their packages can only be installed while the notebook is stopped.
The main libraries used to create this model are Keras and TensorFlow. You can view their documentation for further details.
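For reference, the imports for this kind of model typically look like the following (Keras ships as part of TensorFlow; exact imports may differ from the published workflow):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
```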
Newly created notebooks come with standard sample code that imports the redivis library and loads the source table into a pandas DataFrame. For this example, we will replace that sample code with code that imports the data in the form our chosen libraries expect.
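A sketch of what that import might look like is below. The table name is hypothetical (use the one shown in your workflow), and while to_pandas_dataframe, list_files, and download are part of the redivis Python library, check the current documentation for exact usage:

```python
import redivis

# Reference the output table of our training transform
# (hypothetical name; use the table name shown in your workflow).
table = redivis.table("training_images_output")

# The file index as a pandas DataFrame: file_id, file_name, is_cat
df = table.to_pandas_dataframe()

# Download the underlying image files so Keras can read them from disk.
for f in table.list_files():
    f.download("images/")
```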
This is where we will heavily rely on our selected libraries to build the model.
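A minimal Keras CNN in the spirit of the original article might look like the sketch below. The image size, layer counts, and filter sizes are illustrative assumptions, not necessarily what the published workflow uses:

```python
IMG_SIZE = 128  # assumed input resolution

model = keras.Sequential([
    layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3)),
    layers.Rescaling(1.0 / 255),              # normalize pixel values to [0, 1]
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                      # guard against overfitting
    layers.Dense(1, activation="sigmoid"),    # probability the image is a cat
])

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",  # binary task: cat vs. dog
    metrics=["accuracy"],
)
```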
This is where we will train the model we just built using the image data we cleaned.
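Training then reduces to a single fit call. Here train_ds and val_ds are assumed to be tf.data.Dataset objects of (image, is_cat) batches built from the downloaded files, for example with keras.utils.image_dataset_from_directory:

```python
history = model.fit(
    train_ds,                # the 10,000 labeled training images
    validation_data=val_ds,  # the 5,000-image held-out split
    epochs=10,               # illustrative; tune for your data
)
```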
Now we will use the validation set to see how well our model works.
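Evaluation is a similarly small amount of code (again assuming the hypothetical val_ds from above):

```python
# Overall accuracy on the held-out validation images
loss, accuracy = model.evaluate(val_ds)
print(f"Validation accuracy: {accuracy:.2%}")

# Per-image probabilities: values near 1 mean the model thinks "cat"
predictions = model.predict(val_ds)
```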
Perhaps we see something in this model we want to tweak, or we want to go back and change some of our underlying data. Workflows are iterative: at any point you can go back and change the source data, transform configurations, or notebooks, and rerun them.
Notebooks can also create output tables, which let you sanity-check the work done in the notebook or produce a table to use in another notebook or transform. You can also fork this workflow to pursue a similar analysis, or export any table in the workflow to work with elsewhere.
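As a sketch, writing results back out might look like this, where results_df is a hypothetical DataFrame of per-file predictions (create_output_table is part of the redivis Python library; check the docs for current usage):

```python
import redivis

# Persist predictions as a workflow table for downstream transforms/notebooks.
redivis.current_notebook().create_output_table(results_df)
```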