# Create an image classification model

This guide demonstrates using a Redivis workflow to train an ML model on a set of images stored in a Redivis dataset.

## Workflow objective <a href="#starting-your-project" id="starting-your-project"></a>

This is an example workflow demonstrating image classification via Convolutional Neural Networks. It imports an example dataset containing several thousand test and training images of cats and dogs, with which we can train and evaluate our model.

This workflow is heavily adapted from its initial publication at: <https://gsurma.medium.com/image-classifier-cats-vs-dogs-with-convolutional-neural-networks-cnns-and-google-colabs-4e9af21ae7a8>

{% hint style="success" %}
[This workflow is on Redivis](https://redivis.com/projects/21p8-2h79wfgh8/notebooks/1638)! We also suggest you recreate this workflow as we go to best learn the process.
{% endhint %}

## 1. Explore data <a href="#starting-your-project" id="starting-your-project"></a>

All the image data we need is contained in the Demo organization dataset [Example data files](https://redivis.com/datasets/yz1s-d09009dbb).

<figure><img src="https://1672950126-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LVodLwUXgJUGcm5Cvso%2Fuploads%2FFvTlM463hHkCgNJkSJbt%2FScreenshot%202024-12-09%20at%207.01.03%E2%80%AFPM_out.png?alt=media&#x26;token=ebb915ae-1ff5-499e-b77e-9199402ce1f4" alt=""><figcaption></figcaption></figure>

We can open this dataset and browse its tables to understand the structure of the data it contains.

We see three tables here, and all of them are file index tables. That means that each table contains an index of the files (unstructured data) this dataset contains, sorted by the folder the administrator uploaded them into. We can click on the Files tab of the dataset to see each file individually, and click on it to see a preview.

This dataset has three groupings of files:

* Training images (we will use these to build the model)
* Test images (images not included in the training set that we can verify the model with)
* Example file types (unrelated to this workflow)

If we click on the **Tables** tab, and click on the training images table, we can see high level information about this set of files. We can see that there are 25,000 files, and when we click the **Cells** tab, all of the file names we can see end in .jpg. We can hover on these to see a preview of the image, and we can click on the `file_id` variable to see a preview of the image with more information.&#x20;

{% embed url="<https://redivis.com/embed/tables/a62k-fjj4nn6wj>" %}

## 2. Create a workflow

At the top of this dataset page we can click the **Analyze in workflow** button to get started working with this data.&#x20;

You can add this dataset to an existing workflow you already have access to, or create a new workflow to start from scratch.&#x20;

## 3. Define a training set of images

We will use transforms to clean the data, as they are best suited for reshaping data and will quickly output new tables we can continue to work with.&#x20;

#### Define training set

We need to start by defining the training set: the images we already know to be cats or dogs, which the model will learn from. Whether an image shows a cat or a dog is recorded in its file name, so we need to pull that information out into a new variable we can more easily sort on.

Click on the table Training images and create a [transform](https://docs.redivis.com/reference/workflows/transforms). This interface is where we will define a query which will run against our source table and create an output table. You can choose to write the query in SQL but we will use the interface for this example since it is faster and easier to use.

Add a [**Create variables**](https://docs.redivis.com/reference/workflows/transforms/step-create-variables) step and name the new variable `is_cat`. The method will be **Regexp contains**, which lets us easily detect the string `cat` in the `file_name` variable. This new variable will be a boolean where `true` means the image contains a cat and `false` means it does not.

<figure><img src="https://1672950126-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LVodLwUXgJUGcm5Cvso%2Fuploads%2Fvmk2WmhcexfyAVwX4uJW%2FScreenshot%202023-11-30%20at%203.45.18%20PM.png?alt=media&#x26;token=e2fb9cfb-5cb2-44ec-869b-cd6e6273797f" alt=""><figcaption></figcaption></figure>
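
For readers who prefer to see the logic as code, the check performed by this step can be sketched in Python. This only illustrates what **Regexp contains** computes, not how Redivis executes it, and the file names below are hypothetical examples rather than values taken from the dataset.

```python
import re

# Hypothetical file names, for illustration only
file_names = ["cat.0.jpg", "dog.0.jpg", "cat.1.jpg"]


def is_cat(file_name: str) -> bool:
    """Return True when the pattern 'cat' appears anywhere in the file name."""
    return re.search(r"cat", file_name) is not None


# Map each file name to its boolean label, mirroring the new is_cat variable
labels = {name: is_cat(name) for name in file_names}
print(labels)
```

Note that the match is case sensitive, so a name such as `Cat.0.jpg` would need a different pattern.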

We want to use only some of the training images to train the model, since we need to set others aside to validate it. Here we will include exactly 5000 cat images and 5000 dog images. To do so, we will create a new variable `rank` and filter on it so that we keep only the first 5000 images of each type.

To do this, click **+ Add block** in the **Create variables** step and use the **Rank** method. This is an [analytic method](https://docs.redivis.com/reference/workflows/transforms/step-create-variables#analytic-methods), which means we can partition on the `is_cat` variable. Within each partition (`true` and `false`), every record is assigned a rank.

<figure><img src="https://1672950126-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LVodLwUXgJUGcm5Cvso%2Fuploads%2FEbJ1y6lwapUDcXFMfV1A%2FScreenshot%202023-11-30%20at%203.45.44%20PM.png?alt=media&#x26;token=8ec3f0cc-4eed-47d5-bf46-2a6b4d0c7f7e" alt=""><figcaption></figcaption></figure>

Create a new [Filter](https://docs.redivis.com/reference/workflows/transforms/step-filter) step. Conceptually we will keep records up to 5000 in the `rank` variable, which means it will include 5000 true values and 5000 false values.

<figure><img src="https://1672950126-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LVodLwUXgJUGcm5Cvso%2Fuploads%2FcQpZ249jsdXPqWuif1xh%2FScreenshot%202023-11-30%20at%203.45.54%20PM.png?alt=media&#x26;token=4f837105-3a7c-45e3-b073-d40a24a9e961" alt=""><figcaption></figcaption></figure>
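
The rank-then-filter pattern above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the rows are hypothetical, ordering within each partition by file name is an assumption (the guide does not state the ordering used), and `keep_top_ranked` is a made-up helper name.

```python
from itertools import groupby

# Hypothetical rows of (file_name, is_cat), for illustration only
rows = [
    ("cat.0.jpg", True), ("cat.1.jpg", True), ("cat.2.jpg", True),
    ("dog.0.jpg", False), ("dog.1.jpg", False),
]


def keep_top_ranked(rows, limit):
    """Rank rows within each is_cat partition and keep ranks 1..limit."""
    kept = []
    # Sort by partition key first so groupby sees each partition as one run
    ordered = sorted(rows, key=lambda row: (row[1], row[0]))
    for _, group in groupby(ordered, key=lambda row: row[1]):
        for rank, (file_name, label) in enumerate(group, start=1):
            if rank <= limit:
                kept.append((file_name, label, rank))
    return kept


subset = keep_top_ranked(rows, limit=2)
print(subset)
```

With `limit=5000` on the real table, the same idea yields at most 5000 records per partition.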

The final step in the transform is deciding which variables we want in our output table. We will keep our new boolean variable `is_cat` to use later, along with `file_id` and `file_name`.

<figure><img src="https://1672950126-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LVodLwUXgJUGcm5Cvso%2Fuploads%2FewN2ens5Rl5BOkdEb6Eh%2FScreenshot%202023-11-30%20at%203.54.45%20PM.png?alt=media&#x26;token=9749edab-185e-4dfa-ab60-6ca379738292" alt=""><figcaption></figcaption></figure>

With everything in place we can run this transform by clicking the **Run** button in the top right corner.&#x20;

## 4. Sanity check the output table

Now that we have created a new table, we can inspect it to make sure our steps accomplished what we expected them to.&#x20;

Click on the output table below the transform to view it. We can see that it contains 10,000 records, which is exactly what we expected.

We can also inspect each variable further. If we click on the `is_cat` variable we can see that there are 5000 `true` values and 5000 `false` values, which shows that our filtering was successful. We can also validate that the method we used to determine whether an image is a cat or a dog worked by clicking on the **Cells** tab. Here we can see that records marked `true` have "cat" in their file name, and when we hover on the `file_id` value to see a preview, the image clearly contains a cat.

<figure><img src="https://1672950126-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LVodLwUXgJUGcm5Cvso%2Fuploads%2FQjwAuw3qrrZlMekZJ7YH%2FScreenshot%202024-12-09%20at%208.02.56%E2%80%AFPM_out.png?alt=media&#x26;token=cbe375c1-706c-4313-ba38-1b19fbdc9ec1" alt=""><figcaption></figcaption></figure>
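
The same two checks (class balance and label consistency) can be expressed as a short Python sketch. The rows below are a hypothetical stand-in for the output table, and `check_output` is an illustrative helper, not part of any Redivis API.

```python
from collections import Counter

# Hypothetical stand-in for the output table: (file_name, is_cat)
rows = [
    ("cat.0.jpg", True), ("dog.0.jpg", False),
    ("cat.1.jpg", True), ("dog.1.jpg", False),
]


def check_output(rows, expected_per_class):
    """Return True when both classes have the expected count and labels match names."""
    counts = Counter(label for _, label in rows)
    balanced = (counts[True] == expected_per_class
                and counts[False] == expected_per_class)
    # Every record labeled True should have "cat" in its name, and vice versa
    consistent = all(("cat" in name) == label for name, label in rows)
    return balanced and consistent


print(check_output(rows, expected_per_class=2))
```

For the table in this guide, the expected count per class would be 5000.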

Since this table looks like we expect we can move on to the next step! Otherwise we'd need to go back to the initial transform to change our inputs.

## 5. Define validation set of images

We need to create a set of image files separate from our training set where we know if the image contains a cat or dog. This will be used to validate the model training.

Create a new **transform** and take all the same steps as we did in the previous transform, but we will change the filter to keep images ranked 5001-7500, rather than 1-5000.&#x20;

<figure><img src="https://1672950126-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LVodLwUXgJUGcm5Cvso%2Fuploads%2FZEzODTVm8sd8c1KGAn2K%2FScreenshot%202023-11-30%20at%203.53.27%20PM.png?alt=media&#x26;token=d12cbdfd-3371-4236-a107-b49628e5f460" alt=""><figcaption></figcaption></figure>

We will keep the same variables as we did in the training set transform, and then run this transform.

<figure><img src="https://1672950126-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LVodLwUXgJUGcm5Cvso%2Fuploads%2FXIgFNE9kP9c16otBYF4H%2FScreenshot%202023-11-30%20at%203.54.45%20PM.png?alt=media&#x26;token=ec5adf3b-48c0-4910-8382-0228ae11bab5" alt=""><figcaption></figcaption></figure>

When we run this transform and inspect the output table we see what we expect here as well. There are 5000 total files and we can validate a few of them visually on the **Cells** tab.&#x20;
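
As a quick arithmetic check, the sketch below confirms that the two rank windows used in this guide (1-5000 for training, 5001-7500 for validation) do not overlap, and that the validation window yields 5000 files across the two classes. `rank_window` is an illustrative helper name.

```python
def rank_window(first, last):
    """Return the set of ranks kept by a filter on first <= rank <= last."""
    return set(range(first, last + 1))


training_ranks = rank_window(1, 5000)
validation_ranks = rank_window(5001, 7500)

# No rank appears in both windows, so no image is used for both purposes
overlap = training_ranks & validation_ranks

# Each window applies once per partition (cats and dogs)
total_training = 2 * len(training_ranks)
total_validation = 2 * len(validation_ranks)
print(len(overlap), total_training, total_validation)
```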

## 6. Training the model in a notebook

Next we want to train and test a model using Python code with the help of various Python libraries. Transforms are more powerful than notebooks, but they are based on SQL and operate linearly, with only a single output table allowed. To work in Python, R, Stata, or SAS and generate visuals and other outputs, we will create a [**Notebook**](https://docs.redivis.com/reference/workflows/notebooks) node on our Training data output table.

When you create the notebook for the first time it will start up. Notebooks must be running to execute code.&#x20;

#### Install packages

Redivis notebooks come with many common packages preinstalled, and you can install additional packages by clicking the **Dependencies** button and importing libraries in the code.

Since this notebook contains only public data, we can install packages at any time. Notebooks that reference restricted data do not have internet access, so their packages can only be installed while the notebook is stopped.

```python
import keras
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential, Model
from tensorflow.keras.optimizers import RMSprop
from keras.layers import Activation, Dropout, Flatten, Dense, GlobalMaxPooling2D, Conv2D, MaxPooling2D
from keras.callbacks import CSVLogger
from livelossplot.keras import PlotLossesCallback
import efficientnet.keras as efn
import redivis
import os
```

The main libraries used to create this model are [Keras](https://keras.io/api/) and [TensorFlow](https://www.tensorflow.org/api_docs). You can view their documentation for further details.

#### Load training and validation sets

Newly created notebooks come with standard code that imports the Redivis library and loads the source table into a pandas dataframe. For this example we will remove that sample code and instead define the output file names and the image directories that the Keras data generators will read from.

```python
TRAINING_LOGS_FILE = "training_logs.csv"
MODEL_SUMMARY_FILE = "model_summary.txt"
MODEL_FILE = "cats_vs_dogs.h5"

# Data
path = f"{os.getcwd()}/cats_and_dogs/"
training_data_dir = path + "training/"
validation_data_dir = path + "validation/" 
test_data_dir = path + "test/" 
```
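
Keras's `flow_from_directory`, used later in this guide, expects each data directory to contain one subdirectory per class and assigns class indices in alphanumeric order of the subdirectory names. The sketch below shows one way such a layout could be derived from file names. The `cat` and `dog` subdirectory names, the example file names, and the `class_dir_for` helper are assumptions made for illustration; the step that actually downloads the images from Redivis is deliberately left out, and empty placeholder files stand in for them.

```python
import tempfile
from pathlib import Path


def class_dir_for(split_dir: Path, file_name: str) -> Path:
    """Return the class subdirectory a file belongs in, creating it if needed."""
    label = "cat" if "cat" in file_name else "dog"  # assumed naming convention
    target = split_dir / label
    target.mkdir(parents=True, exist_ok=True)
    return target


with tempfile.TemporaryDirectory() as tmp:
    training_dir = Path(tmp) / "cats_and_dogs" / "training"
    for name in ["cat.0.jpg", "dog.0.jpg"]:  # hypothetical file names
        # Create an empty placeholder where the real image would be written
        (class_dir_for(training_dir, name) / name).touch()
    # Sorted subdirectory names correspond to class indices 0, 1, ...
    class_names = sorted(p.name for p in training_dir.iterdir())
    print(class_names)
```

Under this layout, `cat` would receive index 0 and `dog` index 1, which is consistent with the evaluation code treating probabilities above 0.5 as dog.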

#### Define model parameters

This is where we will heavily rely on our selected libraries to build the model.

```python
# Hyperparams
IMAGE_SIZE = 200
IMAGE_WIDTH, IMAGE_HEIGHT = IMAGE_SIZE, IMAGE_SIZE
EPOCHS = 20
BATCH_SIZE = 32
TEST_SIZE = 30

input_shape = (IMAGE_WIDTH, IMAGE_HEIGHT, 3)
```

```python
# CNN Model 5 (https://towardsdatascience.com/image-classifier-cats-vs-dogs-with-convolutional-neural-networks-cnns-and-google-colabs-4e9af21ae7a8)
model = Sequential()

# Kernel size is passed as a tuple; a bare third positional argument
# would be interpreted as the stride rather than the kernel height.
model.add(Conv2D(32, (3, 3), padding='same', input_shape=input_shape, activation='relu'))
model.add(Conv2D(32, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Conv2D(128, (3, 3), padding='same', activation='relu'))
model.add(Conv2D(128, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Conv2D(256, (3, 3), padding='same', activation='relu'))
model.add(Conv2D(256, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
            optimizer=RMSprop(learning_rate=0.0001),
            metrics=['accuracy'])

with open(MODEL_SUMMARY_FILE,"w") as fh:
    model.summary(print_fn=lambda line: fh.write(line + "\n"))
```

```python
# Data augmentation
training_data_generator = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True)
validation_data_generator = ImageDataGenerator(rescale=1./255)
test_data_generator = ImageDataGenerator(rescale=1./255)
```

```python
# Data preparation
training_generator = training_data_generator.flow_from_directory(
    training_data_dir,
    target_size=(IMAGE_WIDTH, IMAGE_HEIGHT),
    batch_size=BATCH_SIZE,
    class_mode="binary")
validation_generator = validation_data_generator.flow_from_directory(
    validation_data_dir,
    target_size=(IMAGE_WIDTH, IMAGE_HEIGHT),
    batch_size=BATCH_SIZE,
    class_mode="binary")
test_generator = test_data_generator.flow_from_directory(
    test_data_dir,
    target_size=(IMAGE_WIDTH, IMAGE_HEIGHT),
    batch_size=1,
    class_mode="binary",
    shuffle=False)
```

#### Model training

This is where we will train the model we just built using the image data we cleaned.

```python
# Training (model.fit accepts generators directly; fit_generator is deprecated)
model.fit(
    training_generator,
    steps_per_epoch=len(training_generator.filenames) // BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=validation_generator,
    validation_steps=len(validation_generator.filenames) // BATCH_SIZE,
    callbacks=[PlotLossesCallback(), CSVLogger(TRAINING_LOGS_FILE,
                                            append=False,
                                            separator=";")], 
    verbose=1)
model.save_weights(MODEL_FILE)
```

<figure><img src="https://1672950126-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LVodLwUXgJUGcm5Cvso%2Fuploads%2Fpq8g2K1IPsdAJVDdadnz%2FScreenshot%202023-11-30%20at%205.00.08%20PM.png?alt=media&#x26;token=99297b68-c403-4bfc-bb59-ad6463cc6e7d" alt=""><figcaption></figcaption></figure>
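
To make the step counts in the training call concrete, here is the arithmetic worked through in Python. It assumes that every file from the two transform output tables (10,000 training and 5000 validation images) is present in the corresponding directories; the actual counts come from `len(training_generator.filenames)` and `len(validation_generator.filenames)` at run time.

```python
BATCH_SIZE = 32
training_files = 10_000   # assumed: 5000 cats + 5000 dogs
validation_files = 5_000  # assumed: 2500 cats + 2500 dogs

# Integer division gives the number of full batches per epoch
steps_per_epoch = training_files // BATCH_SIZE
validation_steps = validation_files // BATCH_SIZE

# Images left over that do not fill a complete training batch
leftover = training_files - steps_per_epoch * BATCH_SIZE
print(steps_per_epoch, validation_steps, leftover)
```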

#### Evaluate model results

Now we will use the test set to see how well our model works.

```python
# Testing (model.predict accepts generators directly; predict_generator is deprecated)
probabilities = model.predict(test_generator, steps=TEST_SIZE)
for index, probability in enumerate(probabilities):
    image_path = os.path.join(test_data_dir, test_generator.filenames[index])
    img = mpimg.imread(image_path)
    plt.imshow(img)
    # The sigmoid output is the predicted probability of class 1 (dog)
    if probability[0] > 0.5:
        plt.title("%.2f" % (probability[0]*100) + "% dog")
    else:
        plt.title("%.2f" % ((1-probability[0])*100) + "% cat")
    plt.show()

<figure><img src="https://1672950126-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LVodLwUXgJUGcm5Cvso%2Fuploads%2FjEJJAtlipmaYVUX4XARA%2FScreenshot%202023-11-30%20at%205.01.42%20PM.png?alt=media&#x26;token=e0e9494d-fae9-4add-ae01-cb92e6f951e2" alt=""><figcaption></figcaption></figure>

## Next steps

Perhaps we see something in this model we want to tweak, or we want to go back and change some of our underlying data. Workflows are iterative, and at any point you can go back and change the source data, the transform configuration, or the notebooks and [rerun](https://docs.redivis.com/reference/workflows/overview#run-all) them.

Notebooks can also [create output tables](https://docs.redivis.com/reference/workflows/notebooks/notebook-concepts#outputting-tables), which let you sanity check the work done in the notebook or create a table to use in another notebook or transform. You can also [fork](https://docs.redivis.com/reference/workflows/overview#fork-the-project) this workflow to work on a similar analysis, or [export](https://docs.redivis.com/reference/tables/exporting-tables) any table in this workflow for work elsewhere.
