Step: Create variables
Overview
A Create variables block uses methods to make new variables based on existing data. The method selected will dictate how the block operates.
Example starting data:
Example output data:
Creating a new variable for what letter grade each student got on their test.
Step structure
There will be at least one new variable block where you will define and complete a new variable method.
When multiple blocks exist, the variables will be created in sequence and can reference each other.
Field definitions
Name
The name of the variable being created. This must follow all naming standards
Method
The way that the new variable will be created. Choosing this will bring up additional fields to complete specific to the chosen method. See all methods here.
Some methods are only available for certain variable types, so you might need to retype variables before you can use them in the method you've chosen.
Analytic methods
Analytic methods are a special category of method that allow for each row to be computed individually. When using an analytic new variable method new tools become available:
A partition segments data based on values in the selected partition variables, and computes the analytic method within those segments separately.
A window defines which records are used to compute the analytic method. Usually this would be accompanied with an order clause.
Examples 3 and 4 below go into more detail for analytic methods.
Examples
Example 1: Date extract
A simple new variable format is extracting one part of a variable into a new one. In our data, we have a full date including year, month, and day, but we want to extract the year for use elsewhere.
Starting data:
Input fields:
Name: We name our new variable
test_month
.Method: Once we chose
Date extract
as our method, new fields appearVariable to extract: The variable with the data we want to extract from (
date
).Date part: The part of the variable we want to extract. Since
date
is a Date type variable the information about date part is stored in the format and can be easily extracted. We chooseMonth
since that is what we want to create.
More methods to choose from can be found in the Variable creation methods.
Output data:
Example 2: Case method
The Case
method (if/else) allows us to specify one or more conditions to create the values of the new variable.
For example, we can create a variable capturing the grade of each test in our data.
Starting data:
Input fields:
When run, each row of the table will be evaluated in each section of the case statement (
If
,Else if
,Else set to
) until it matches something and is set to the corresponding value.After the first section where a row meets the criteria, the new variable value will be set and no other sections will be evaluated.
Comparison statements used here will operate the same as they do in the Filter step and can be nested in the same way.
Output data:
Example 3: Partitioned analytic methods
We can use an analytic method to compute a value for each row (rather than the entire table). By using a partition we can define groups of rows to calculate across.
For example, in our grades data we can calculate the average score of each test.
Starting data:
Input fields:
Method: We want to calculate the average so select this analytic method.
Partition: This is where we define the groups in which the average will be calculated.
If one or more variables are selected here, the average for our new
score_average
variable will be computed across all rows that are the same in the selected variable(s).If no variables are entered here, then the average will be computed across the entire table.
We want to average scores from the same test so we select
test
here.
Variable to aggregate: This variable contains the data we want to average.
Window: We don't want to define a window in this average so we leave it on
All rows in the partition
.
Output data:
To create the new variable, for each row with the same value in test
, the values in score
will be averaged together. So there are 3 rows with the value quiz
in test
which average to 69. So all rows with a quiz
will show the same average of 69 in our new variable.
Example 4: Windowed analytic methods
We can use a window to calculate a moving average. A window will define how many rows before or after the current row to use when calculating the average.
Example data:
Input fields:
Partition: We choose not to use a partition in this example since our data does not need to be segmented.
Variable to aggregate: The numbers we want to average are in the
score
variable so we choose that here.Window: We want to create a moving average based on one entry before and after the current row, so we select
Rows
here.Order by: Our data is already ordered by date in this table, but if it wasn't we would definitely need to order on the relevant variable here.
Rows preceding / Rows following: This is where we define how many rows to include in the average.
Output data:
For each row, one row preceding and following is used to compute the score_average
. So for the first row we average 10 and 10 (since no preceding rows exist, it is excluded). For the second row 10, 10, and 40 are averaged. This process repeats until the end of the table is reached.
Note that we could also use Range instead of Rows for our window if our question was time based (e.g. average score over 1 month preceding and 1 month following).
Example 5: Windowed & partitioned analytic methods
Continuing the previous example, if we had included a partition then this same process would be completed separately for the values in each partition. So if we had partitioned on test
in this example, our outcome would look different
Output data:
Since quiz2
is in a separate partition, those rows are averaged separately.
Last updated