Creating an ML model

This guide demonstrates using a Redivis project to train an ML model on a set of images stored in a Redivis dataset.

Project objective

This is an example project demonstrating image classification via Convolutional Neural Networks. It imports an example dataset containing several thousand test and training images of cats and dogs, with which we can train and evaluate our model.

This project is heavily adapted from its initial publication at: https://gsurma.medium.com/image-classifier-cats-vs-dogs-with-convolutional-neural-networks-cnns-and-google-colabs-4e9af21ae7a8

This project is on Redivis! We also suggest you recreate this project as we go to best learn the process.

1. Explore data

All the image data we need is contained in the Demo organization dataset Example data files.

We can go to this dataset and browse its tables to understand the structure of the data it contains.

We see three tables here, and all of them are file index tables. That means that each table contains an index of the files (unstructured data) this dataset contains, sorted by the folder the administrator uploaded them into. We can click on the Files tab of the dataset to see each file individually, and click on it to see a preview.

This dataset has three groupings of files:

  • Training images (we will use these to build the model)

  • Test images (images not included in the training set that we can verify the model with)

  • Example file types (unrelated to this project)

If we click on the Tables tab, and click on the training images table, we can see high level information about this set of files. We can see that there are 25,000 files, and when we click the Cells tab, all of the file names we can see end in .jpg. We can hover on these to see a preview of the image, and we can click on the file_id variable to see a preview of the image with more information.

2. Create a project

At the top of this dataset page we can click the Analyze in project button to get started working with this data.

You can add this dataset to an existing project you already have access to, or create a new project to start from scratch.

3. Define a training set of images

We will use transforms to clean the data, as they are best suited for reshaping data and will quickly output new tables we can continue to work with.

Define training set

We need to start by defining the training set, which conceptually means the set of images we know are cats and know are dogs to train the model on. Information about whether an image is a cat or dog is in the file name, so we need to pull it out into a new variable we can more easily sort on.

Click on the table Training images and create a transform. This interface is where we will define a query which will run against our source table and create an output table. You can choose to write the query in SQL but we will use the interface for this example since it is faster and easier to use.

Add a Create variables step and name the new variable is_cat. The method will be Regexp contains, which lets us easily identify the presence of the string cat in the file_name variable. The new variable will be a boolean where true means the image contains a cat and false means it does not.
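As a rough illustration of what the Regexp contains method computes, the Python sketch below applies the same test to a few file names. The file names are hypothetical and assume the label is embedded in the name; the transform itself runs on Redivis, not in this code.

```python
import re

# Hypothetical file names, assuming the label is embedded in each name.
file_names = ["cat.0.jpg", "dog.0.jpg", "cat.1.jpg", "dog.1.jpg"]


def is_cat(file_name: str) -> bool:
    """Return True when the file name contains the string 'cat'."""
    return re.search(r"cat", file_name) is not None


# Map each file name to its boolean label, mirroring the new is_cat variable.
labels = {name: is_cat(name) for name in file_names}
print(labels)
```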

We want to use only some of the training images to train the model, since we need to set some aside to validate it. Here we will include exactly 5000 cat images and 5000 dog images. To do so, we will create a new variable rank and filter on it so that we only keep the first 5000 images of each type.

To do this, click + Add block in the Create variables step and use the Rank method. This is an analytic method, which means we can partition on the true and false values of is_cat. A rank is then assigned within each partition (true and false).

Create a new Filter step. Conceptually, we will keep records whose rank is at most 5000, which leaves 5000 true values and 5000 false values.
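The rank and filter steps can be pictured with the small Python sketch below. It is only an approximation of the transform: the helper name, the miniature table, and the choice to order by file_name are assumptions made for illustration, and a limit of 2 stands in for 5000.

```python
from collections import defaultdict


def rank_within_partition(rows, partition_key, order_key):
    """Assign a 1-based rank to each row within its partition."""
    counters = defaultdict(int)
    ranked = []
    for row in sorted(rows, key=lambda r: r[order_key]):
        counters[row[partition_key]] += 1
        ranked.append({**row, "rank": counters[row[partition_key]]})
    return ranked


# Hypothetical miniature table: 4 cat files and 4 dog files.
rows = [
    {"file_name": f"{label}.{i}.jpg", "is_cat": label == "cat"}
    for label in ("cat", "dog")
    for i in range(4)
]

LIMIT = 2  # stands in for 5000 in the guide
ranked = rank_within_partition(rows, "is_cat", "file_name")

# The Filter step: keep only rows whose rank is within the limit.
training = [row for row in ranked if row["rank"] <= LIMIT]
print([row["file_name"] for row in training])
```

Because the rank restarts in each partition, the filter keeps the same number of rows for true and for false.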

The final step in the transform is deciding which variables we want in our output table. We will keep our new boolean variable is_cat to use later, along with file_id and file_name.

With everything in place we can run this transform by clicking the Run button in the top right corner.

4. Sanity check the output table

Now that we have created a new table, we can inspect it to make sure our steps accomplished what we expected them to.

Click on the output table below the transform to view it. We can see that it contains 10,000 records, which is exactly what we expected.

We can also inspect each variable further. If we click on the is_cat variable we can see that there are 5000 true values and 5000 false values, which shows that our filtering was successful. We can also validate that the method we used to determine if an image is a cat or a dog worked by clicking on the Cells tab. Here we can see that records marked True have "cat" in their file name, and when we hover on the file_id value to see a preview, the image clearly contains a cat.

Since this table looks like we expect we can move on to the next step! Otherwise we'd need to go back to the initial transform to change our inputs.

5. Define validation set of images

We need to create a set of image files separate from our training set where we know if the image contains a cat or dog. This will be used to validate the model training.

Create a new transform and take all the same steps as we did in the previous transform, but we will change the filter to keep images ranked 5001-7500, rather than 1-5000.

We will keep the same variables as we did in our training set transform, and then run this transform.

When we run this transform and inspect the output table we see what we expect here as well. There are 5000 total files and we can validate a few of them visually on the Cells tab.
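A quick way to convince ourselves that the two rank ranges do not overlap, and that they yield the table sizes we saw, is the arithmetic sketch below. The constant and function names are made up for illustration.

```python
# Rank ranges used by the two transforms (Python ranges exclude the end value).
TRAIN_RANKS = range(1, 5001)          # ranks 1-5000 within each class
VALIDATION_RANKS = range(5001, 7501)  # ranks 5001-7500 within each class
NUM_CLASSES = 2                       # cat and dog


def split_for_rank(rank: int) -> str:
    """Return which set an image with this per-class rank falls into."""
    if rank in TRAIN_RANKS:
        return "training"
    if rank in VALIDATION_RANKS:
        return "validation"
    return "unused"


training_total = len(TRAIN_RANKS) * NUM_CLASSES
validation_total = len(VALIDATION_RANKS) * NUM_CLASSES
print(training_total, validation_total)  # 10000 5000
```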

6. Training the model in a notebook

Next we want to train and test a model using Python code with the help of various Python libraries. Transforms are more powerful than notebooks but are based on SQL and operate linearly, with only a single output table allowed. In order to work in Python, R, Stata, or SAS to generate visuals and other outputs, we will create a Notebook node on our Training data output table.

When you create the notebook for the first time it will start up. Notebooks must be running to execute code.

Install packages

Redivis notebooks come with many common packages preinstalled, and you can install additional packages by clicking the Dependencies button and importing libraries in the code.

Since this notebook contains only public data, we can install packages at any time. Notebooks that use restricted data do not have internet access, so their packages can only be installed while the notebook is stopped.

import keras
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential, Model
from tensorflow.keras.optimizers import RMSprop
from keras.layers import Activation, Dropout, Flatten, Dense, GlobalMaxPooling2D, Conv2D, MaxPooling2D
from keras.callbacks import CSVLogger
from livelossplot.keras import PlotLossesCallback
import efficientnet.keras as efn
import redivis
import os

The main libraries used to create this model are Keras and TensorFlow. You can view their documentation for further details.

Load training and validation sets

Newly created notebooks come with standard code to import the Redivis library and reference the source table in a pandas dataframe within the notebook. For this example we will remove this sample code to import data according to our library's parameters.

TRAINING_LOGS_FILE = "training_logs.csv"
MODEL_SUMMARY_FILE = "model_summary.txt"
MODEL_FILE = "cats_vs_dogs.h5"

# Data
path = f"{os.getcwd()}/cats_and_dogs/"
training_data_dir = path + "training/"
validation_data_dir = path + "validation/" 
test_data_dir = path + "test/" 
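Keras's flow_from_directory, which we use later in this guide, infers class labels from subdirectory names, so each of these directories needs one subfolder per class. The sketch below shows one way to work out the destination path for a file from its name. The helper name and the cat and dog folder names are assumptions, and the step that actually downloads the files from Redivis is not shown here.

```python
import os


def destination_path(base_dir: str, file_name: str) -> str:
    """Place a file in a class subfolder chosen from its file name."""
    # Hypothetical folder names; any two distinct class names would work.
    class_name = "cat" if "cat" in file_name else "dog"
    return os.path.join(base_dir, class_name, file_name)


training_dir = os.path.join(os.getcwd(), "cats_and_dogs", "training")
print(destination_path(training_dir, "cat.0.jpg"))
print(destination_path(training_dir, "dog.0.jpg"))
```

With these folder names, the classes are indexed in alphanumeric order, so cat becomes class 0 and dog becomes class 1.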

Define model parameters

This is where we will heavily rely on our selected libraries to build the model.

# Hyperparams
IMAGE_SIZE = 200
IMAGE_WIDTH, IMAGE_HEIGHT = IMAGE_SIZE, IMAGE_SIZE
EPOCHS = 20
BATCH_SIZE = 32
TEST_SIZE = 30

input_shape = (IMAGE_WIDTH, IMAGE_HEIGHT, 3)
# CNN Model 5 (https://towardsdatascience.com/image-classifier-cats-vs-dogs-with-convolutional-neural-networks-cnns-and-google-colabs-4e9af21ae7a8)
model = Sequential()

# Pass the kernel size as a tuple: in the Keras 2 API a third positional
# argument is interpreted as the stride, not as the kernel height.
model.add(Conv2D(32, (3, 3), padding='same', input_shape=input_shape, activation='relu'))
model.add(Conv2D(32, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Conv2D(128, (3, 3), padding='same', activation='relu'))
model.add(Conv2D(128, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Conv2D(256, (3, 3), padding='same', activation='relu'))
model.add(Conv2D(256, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
            optimizer=RMSprop(learning_rate=0.0001),
            metrics=['accuracy'])

with open(MODEL_SUMMARY_FILE,"w") as fh:
    model.summary(print_fn=lambda line: fh.write(line + "\n"))
# Data augmentation
training_data_generator = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True)
validation_data_generator = ImageDataGenerator(rescale=1./255)
test_data_generator = ImageDataGenerator(rescale=1./255)
# Data preparation
training_generator = training_data_generator.flow_from_directory(
    training_data_dir,
    target_size=(IMAGE_WIDTH, IMAGE_HEIGHT),
    batch_size=BATCH_SIZE,
    class_mode="binary")
validation_generator = validation_data_generator.flow_from_directory(
    validation_data_dir,
    target_size=(IMAGE_WIDTH, IMAGE_HEIGHT),
    batch_size=BATCH_SIZE,
    class_mode="binary")
test_generator = test_data_generator.flow_from_directory(
    test_data_dir,
    target_size=(IMAGE_WIDTH, IMAGE_HEIGHT),
    batch_size=1,
    class_mode="binary", 
    shuffle=False)
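To see how the image shrinks as it passes through the network, the sketch below traces the spatial size through the four convolution and pooling blocks. It assumes each convolution uses a 3x3 kernel with a stride of 1 and 'same' padding, so it leaves the size unchanged, and that each 2x2 max pooling layer with 'same' padding rounds the halved size up. It is arithmetic only and does not build the Keras model.

```python
import math


def pooled_size(size: int, pool: int = 2) -> int:
    """Spatial size after max pooling with padding='same'."""
    return math.ceil(size / pool)


size = 200  # IMAGE_SIZE from the hyperparameters
filters_per_block = [32, 64, 128, 256]

for filters in filters_per_block:
    # Stride-1 'same' convolutions keep the size, so only pooling changes it.
    size = pooled_size(size)
    print(f"after block with {filters} filters: {size} x {size} x {filters}")

# Number of values handed to the first Dense layer by Flatten.
flattened = size * size * filters_per_block[-1]
print("flattened length:", flattened)
```

Under these assumptions the feature map shrinks from 200 to 13 pixels per side, so the Flatten layer produces 13 x 13 x 256 = 43,264 values.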

Model training

This is where we will train the model we just built using the image data we cleaned.

# Training
# fit accepts generators directly; fit_generator is deprecated in TensorFlow 2
model.fit(
    training_generator,
    steps_per_epoch=len(training_generator.filenames) // BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=validation_generator,
    validation_steps=len(validation_generator.filenames) // BATCH_SIZE,
    callbacks=[PlotLossesCallback(), CSVLogger(TRAINING_LOGS_FILE,
                                            append=False,
                                            separator=";")], 
    verbose=1)
model.save_weights(MODEL_FILE)
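The step counts passed to the training call come from integer division of the number of images by the batch size. A quick check, assuming the 10,000 training and 5,000 validation images produced by the transforms earlier in this guide:

```python
BATCH_SIZE = 32
training_images = 10_000    # rows in the training output table
validation_images = 5_000   # rows in the validation output table

# Integer division drops the final partial batch.
steps_per_epoch = training_images // BATCH_SIZE
validation_steps = validation_images // BATCH_SIZE

# Images left over after the last full batch in each pass.
training_leftover = training_images - steps_per_epoch * BATCH_SIZE
validation_leftover = validation_images - validation_steps * BATCH_SIZE

print(steps_per_epoch, validation_steps)        # 312 156
print(training_leftover, validation_leftover)   # 16 8
```

Because neither count divides evenly by 32, a handful of images fall outside the full batches counted for each pass.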

Evaluate model results

Now we will use the test images to see how well our model works.

# Testing
# predict accepts generators directly; predict_generator is deprecated in TensorFlow 2
probabilities = model.predict(test_generator, steps=TEST_SIZE)
for index, probability in enumerate(probabilities):
    image_path = os.path.join(test_data_dir, test_generator.filenames[index])
    img = mpimg.imread(image_path)
    plt.imshow(img)
    # Each prediction is a one-element array holding the sigmoid output
    if probability[0] > 0.5:
        plt.title("%.2f" % (probability[0]*100) + "% dog")
    else:
        plt.title("%.2f" % ((1-probability[0])*100) + "% cat")
    plt.show()
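The title logic above turns a single sigmoid output into a label and a confidence. The sketch below isolates that step so it can be tried without a trained model. The function name is made up for illustration, and it assumes the dog class is index 1, as it is when class folders are named so that cat sorts before dog.

```python
def describe_prediction(probability: float) -> str:
    """Format a sigmoid output as a percentage and a class label."""
    if probability > 0.5:
        # Outputs above 0.5 lean towards class 1 (assumed to be dog).
        return "%.2f%% dog" % (probability * 100)
    # Otherwise report the complementary confidence for class 0 (cat).
    return "%.2f%% cat" % ((1 - probability) * 100)


for p in (0.9, 0.2):
    print(describe_prediction(p))
```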

Next steps

Perhaps we see something in this model we want to tweak, or we want to go back and change some of our underlying data. Projects are iterative, and at any point you can go back and change the source data, transform configuration, or notebooks and rerun them.

Notebooks can also create output tables, which allow you to sanity check the work done in the notebook or to create a table for use in another notebook or transform. You can also fork this project to work on a similar analysis, or export any table in this project for work elsewhere.
