Step: Create variables

Overview

A Create variables block uses methods to make new variables based on existing data. The method selected will dictate how the block operates.

Example starting data:

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 | sam     | 74     |
 | pat     | 62     |
 *---------+--------*/

Example output data:

Creating a new variable for what letter grade each student got on their test.

/*---------+--------+-------*
 | student | score  | grade |
 +---------+--------+-------+
 | jane    | 83     | B     |
 | neal    | 35     | F     |
 | sam     | 74     | C     |
 | pat     | 62     | D     |
 *---------+--------+-------*/

Step structure

  • There will be at least one new variable block where you will define and complete a new variable method.

  • When multiple blocks exist, the variables will be created in sequence and can reference each other.

Field definitions

FieldDefinition

Name

The name of the variable being created. This must follow all naming standards

Method

Some methods are only available for certain variable types, so you might need to retype variables before you can use them in the method you've chosen.

Analytic methods

Analytic methods are a special category of method that allow for each row to be computed individually. When using an analytic new variable method new tools become available:

  • A partition segments data based on values in the selected partition variables, and computes the analytic method within those segments separately.

  • A window defines which records are used to compute the analytic method. Usually this would be accompanied with an order clause.

Examples 3 and 4 below go into more detail for analytic methods.

Examples

Example 1: Date extract

A simple new variable format is extracting one part of a variable into a new one. In our data, we have a full date including year, month, and day, but we want to extract the year for use elsewhere.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Name: We name our new variable test_month.

  • Method: Once we chose Date extract as our method, new fields appear

    • Variable to extract: The variable with the data we want to extract from (date).

    • Date part: The part of the variable we want to extract. Since date is a Date type variable the information about date part is stored in the format and can be easily extracted. We choose Month since that is what we want to create.

  • More methods to choose from can be found in the Variable creation methods.

Output data:

/*---------+-------+---------+------------+--------------*
 | test    | score | student | date       | test_month   |
 +---------+-------+---------+------------+--------------+
 | quiz    | 83    | jane    | 2020-04-01 | April        |
 | quiz    | 35    | pat     | 2020-04-01 | April        |
 | quiz    | 89    | sam     | 2020-04-01 | April        |
 | midterm | 74    | jane    | 2020-05-01 | May          |
 | midterm | 62    | pat     | 2020-05-01 | May          |
 | midterm | 93    | sam     | 2020-05-01 | May          |
 | final   | 77    | jane    | 2020-06-01 | June         |
 | final   | 59    | pat     | 2020-06-01 | June         |
 | final   | 92    | sam     | 2020-06-01 | June         |
 *---------+-------+---------+------------+--------------*/

Example 2: Case method

The Case method (if/else) allows us to specify one or more conditions to create the values of the new variable.

For example, we can create a variable capturing the grade of each test in our data.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • When run, each row of the table will be evaluated in each section of the case statement (If, Else if, Else set to) until it matches something and is set to the corresponding value.

    • After the first section where a row meets the criteria, the new variable value will be set and no other sections will be evaluated.

  • Comparison statements used here will operate the same as they do in the Filter step and can be nested in the same way.

Output data:

/*---------+-------+---------+------------+----------------*
 | test    | score | student | date       | letter_grade   |
 +---------+-------+---------+------------+----------------+
 | quiz    | 83    | jane    | 2020-04-01 | B              |
 | quiz    | 35    | pat     | 2020-04-01 | F              |
 | quiz    | 89    | sam     | 2020-04-01 | B              |
 | midterm | 74    | jane    | 2020-05-01 | C              |
 | midterm | 62    | pat     | 2020-05-01 | F              |
 | midterm | 93    | sam     | 2020-05-01 | A              |
 | final   | 77    | jane    | 2020-06-01 | C              |
 | final   | 59    | pat     | 2020-06-01 | F              |
 | final   | 92    | sam     | 2020-06-01 | A              |
 *---------+-------+---------+------------+----------------*/

Example 3: Partitioned analytic methods

We can use an analytic method to compute a value for each row (rather than the entire table). By using a partition we can define groups of rows to calculate across.

For example, in our grades data we can calculate the average score of each test.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Method: We want to calculate the average so select this analytic method.

  • Partition: This is where we define the groups in which the average will be calculated.

    • If one or more variables are selected here, the average for our new score_average variable will be computed across all rows that are the same in the selected variable(s).

    • If no variables are entered here, then the average will be computed across the entire table.

    • We want to average scores from the same test so we select test here.

  • Variable to aggregate: This variable contains the data we want to average.

  • Window: We don't want to define a window in this average so we leave it on All rows in the partition.

Output data:

/*---------+-------+---------+------------+---------------*
 | test    | score | student | date       | score_average |
 +---------+-------+---------+------------+---------------+
 | quiz    | 83    | jane    | 2020-04-01 | 69            |
 | quiz    | 35    | pat     | 2020-04-01 | 69            |
 | quiz    | 89    | sam     | 2020-04-01 | 69            |
 | midterm | 74    | jane    | 2020-05-01 | 76.3333       |
 | midterm | 62    | pat     | 2020-05-01 | 76.3333       |
 | midterm | 93    | sam     | 2020-05-01 | 76.3333       |
 | final   | 77    | jane    | 2020-06-01 | 76            |
 | final   | 59    | pat     | 2020-06-01 | 76            |
 | final   | 92    | sam     | 2020-06-01 | 76            |
 *---------+-------+---------+------------+---------------*/

To create the new variable, for each row with the same value in test, the values in score will be averaged together. So there are 3 rows with the value quiz in test which average to 69. So all rows with a quiz will show the same average of 69 in our new variable.

Example 4: Windowed analytic methods

We can use a window to calculate a moving average. A window will define how many rows before or after the current row to use when calculating the average.

Example data:

/*-------+----------+------------*
 | score | student  | date       |
 +-------+----------+------------+
 | 10    | quiz1    | 2020-01-01 |
 | 10    | quiz1    | 2020-02-01 |
 | 40    | quiz1    | 2020-03-01 |
 | 30    | quiz2    | 2020-04-01 |
 *-------+---------+------------*/

Input fields:

  • Partition: We choose not to use a partition in this example since our data does not need to be segmented.

  • Variable to aggregate: The numbers we want to average are in the score variable so we choose that here.

  • Window: We want to create a moving average based on one entry before and after the current row, so we select Rows here.

  • Order by: Our data is already ordered by date in this table, but if it wasn't we would definitely need to order on the relevant variable here.

  • Rows preceding / Rows following: This is where we define how many rows to include in the average.

Output data:

/*-------+---------+------------*---------------+
 | score | test    | date       | score_average |
 +-------+---------+------------+---------------+
 | 10    | quiz1   | 2020-01-01 | 10            |
 | 10    | quiz1   | 2020-02-01 | 20            |
 | 40    | quiz1   | 2020-03-01 | 26.6667       |
 | 30    | quiz2   | 2020-04-01 | 35            |
 *-------+---------+------------+---------------*/

For each row, one row preceding and following is used to compute the score_average. So for the first row we average 10 and 10 (since no preceding rows exist, it is excluded). For the second row 10, 10, and 40 are averaged. This process repeats until the end of the table is reached.

Note that we could also use Range instead of Rows for our window if our question was time based (e.g. average score over 1 month preceding and 1 month following).

Example 5: Windowed & partitioned analytic methods

Continuing the previous example, if we had included a partition then this same process would be completed separately for the values in each partition. So if we had partitioned on test in this example, our outcome would look different

Output data:

/*-------+---------+------------*---------------+
 | score | test    | date       | score_average |
 +-------+---------+------------+---------------+
 | 10    | quiz1   | 2020-01-01 | 10            |
 | 10    | quiz1   | 2020-02-01 | 20            |
 | 40    | quiz1   | 2020-03-01 | 25            |
 | 30    | quiz2   | 2020-04-01 | 30            |
 *-------+---------+------------+---------------*/

Since quiz2 is in a separate partition, those rows are averaged separately.

Last updated