Step: Pivot

Overview

The Pivot step rotates data values into new variables using aggregation.

Example starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 *---------+-------+---------+------------*/

Example output data:

Pivot on the test variable.

/*---------+------+---------+-------*
 | student | quiz | midterm | final |
 +---------+------+---------+-------+
 | jane    | 83   | 74      | 77    |
 | pat     | 35   | 62      | 59    |
 | sam     | 89   | 93      | NULL  |
 *---------+------+---------+-------*/

Step structure

  • There is one pivot block where we will define the pivot variable and values.

  • There are one or more aggregation blocks where we will define how the table will reshape and what data populates the newly created variables.

    • There must always be at least one aggregate block.

    • The total number of new variables in your output table will be the number of Pivot values defined * The number of aggregation blocks.

Input field definitions

Pivot block:

FieldDefinition

Pivot variable

The variable containing the values that will become new variables.

Variables to collapse on

All variables you want to include in your output, except for variables defined elsewhere in the pivot operation (pivot variable, variable to aggregate on).

Pivot value

A value from your pivot variable which will become a new variable in your output table. You can add multiple values. You do not need to include every value of this variable.

New variable name

An option to rename any of the new variables created in the pivot value field. Leaving this field blank will persist the value name into the variable name.

Aggregation block:

FieldDefinition

Alias

A value that will be prefixed to your new variable names. If you are aggregating one variable this field is optional, but if you are aggregating multiple then you'll need to specify a value for each one.

Aggregation method

How the variable to aggregate will be summarized (e.g. SUM, COUNT).

Variable to aggregate

Which variable's data will populate the new variables you are creating in the pivot value field.

Examples

Example 1: Basic pivot

Let's say we've recorded data about tests given and student scores and we want to pivot in order to see information sorted by student.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | quiz    | 74    | jane    | 2020-04-15 |
 | quiz    | 83    | sam     | 2020-04-15 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 91    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Pivot variable: Since we want to create new variables based on values in the student variable, we select that as our Pivot variable.

  • Variables to collapse on: We want to include all variables here that will be the final shape of our pivoted table that are not already defined as our pivot variable or variable to aggregate. We include test and date in this example.

  • Pivot value: We need to choose one or more values from our selected pivot variable (student) to become new variables in our output table. This does not have to include every value in this variable, even though we use all of them in this example.

  • New variable name: We don't want to rename any of these new variables, so we can leave this blank.

  • Alias (optional): Since we only have one variable to aggregate, we can leave this blank.

  • Aggregation method: We need to choose how our variable to aggregate (score) will be aggregated. In this example data we don't have duplicated values so what we choose here won't matter. So we choose SUM to validate the process.

  • Variable to aggregate: This is the variable (score) that will be removed from this table, and its values will be redistributed to our newly created variables (jane, pat, and sam) with the aggregation method we selected handling any duplicate values (in this example, SUMming them).

Output data:

/*---------+------------+------+------+------*
 | test    | date       | jane | pat  | sam  |
 +---------+------------+------+------+------+
 | quiz    | 2020-04-01 | 83   | 35   | 89   |
 | quiz    | 2020-04-15 | 74   | NULL | 83   |
 | midterm | 2020-05-01 | 74   | 62   | 93   |
 | final   | 2020-06-01 | 77   | NULL | 91   |
 *---------+------------+------+------+------*/

Example 2: Pivot with collapse

Continuing this example, let's say we want to average scores from the two quizzes. We can collapse the table by not including the date variable. Without that variable there would be two identical values in test (quiz) which will be collapsed into one.

Starting table:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | quiz    | 74    | jane    | 2020-04-15 |
 | quiz    | 83    | sam     | 2020-04-15 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 91    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

In this case it does matter what we select for Aggregation method, since the duplicate values for quiz will get aggregated. We select Average here.

Pivoted output table:

/*---------+------+------+------*
 | test    | jane | pat  | sam  |
 +---------+------+------+------+
 | quiz    | 78.5 | 35   | 86   |
 | midterm | 74   | 62   | 93   |
 | final   | 77   | NULL | 91   |
 *---------+------+------+------*/

Example 3: Multiple aggregation variables

We can add another variable aggregation block to gather more information about the variables being aggregated. Continuing this example, let's say we not only want to average the quiz scores, but count the number of averaged quizzes.

Starting table:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | quiz    | 74    | jane    | 2020-04-15 |
 | quiz    | 83    | sam     | 2020-04-15 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 91    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

Note that the Alias field is no longer optional since we have two aggregation blocks.

We make sure to select score for our Variable to count since we specifically want to know how many scores were included in our average. Since we have some null scores, selecting a different variable here might give us a different result.

Pivoted output table:

/*---------+---------------+--------------------+--------------+----------------*
 | test    | jane_scoreavg | jane_scorecount    | pat_scoreavg | pat_scorecount |
 +---------+---------------+--------------------+--------------+----------------+
 | quiz    | 78.5          | 2                  | 35           | 1              |
 | midterm | 74            | 1                  | 62           | 1              |
 | final   | 77            | 1                  | NULL         | 1              |
 *---------+---------------+--------------------+--------------+----------------*/

The jane_scoreavg variable contains data from the scoreavg aggregation block's method of Average for the values of student Jane, while the jane_scorecount variable contains data from the scorecount aggregation block.

Last updated