Step: Aggregate

Overview

The Aggregate step collapses rows that are identical across a set of variables, optionally creating new aggregate variables in the process.

Example starting data:

/*---------+------------+---------+--------*
 | test    | date       | student | score  |
 +---------+------------+---------+--------+
 | quiz    | 2020-04-01 | jane    | 83     |
 | quiz    | 2020-04-01 | pat     | 35     |
 | midterm | 2020-05-01 | jane    | 74     |
 | midterm | 2020-05-01 | pat     | 62     |
 *---------+------------+---------+--------*/

Example output data

Collapsing on variables test and date, and creating new variable average_score to aggregate data from the score variable.

/*---------+-------------+---------------*
 | test    | date        | average_score |
 +---------+-------------+---------------|
 | quiz    | 2020-04-01  | 59            |
 | midterm | 2020-05-01  | 68            |
 *---------+-------------+---------------*/

Step structure

  • There is one collapse block where we will define how the data will be reshaped. On execution:

    • Your data will be cut to only include the variables chosen in this block.

    • Duplicates records across the collapsed variables will be dropped.

  • There can be one or more aggregation blocks where we can capture aggregate information in a newly created variable.

    • You can successfully collapse your table without creating any new variables in aggregation blocks.

    • Each aggregation block in the step represents one new variable in your output data.

    • Aggregation blocks how you can capture information about records dropped in your collapse block.

Field definitions

Collapse block:

Field
Description

Variables to collapse on

All variables you want to include in your output.

Aggregation block(s):

Field
Description

Name

The name of the new variable being created.

Aggregation method

How the new variable will be summarized (e.g. SUM or COUNT).

[Method fields]

After you select your aggregation method, you will be prompted to input the information your chosen method needs to execute. You can see more specifics in the Variable creation methods section.

Examples

Example 1: Basic collapse

We have test score data recorded per test, student, and date. However we want to know the average score on each test overall.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Variables to collapse on: Of the variables in our table, the only ones we want in our final output are test and date, so we select these here. We leave out student because that information doesn't matter to us anymore, and we leave out score because we are creating a new variable to replace it with aggregated information.

  • Name: We give our new variable a descriptive name average_score.

  • Aggregation method: We want to average all values in the grades variable per test, so we choose Average.

  • Variable to aggregate: Here is where we choose score as the variable containing the data we want to Average. If we had chosen a different Aggregation method, we might have different input fields here to answer.

Execution:

All variables are removed that aren't collapsed on or used in aggregation.

/*---------+-------+------------*
 | test    | score | date       |
 +---------+-------+------------+
 | quiz    | 83    | 2020-04-01 |
 | quiz    | 35    | 2020-04-01 |
 | quiz    | 89    | 2020-04-01 |
 | midterm | 74    | 2020-05-01 |
 | midterm | 62    | 2020-05-01 |
 | midterm | 93    | 2020-05-01 |
 | final   | 77    | 2020-06-01 |
 | final   | 59    | 2020-06-01 |
 | final   | 92    | 2020-06-01 |
 *---------+-------+------------*/

The Average is created of the score where test and date (our collapsed on variables) have the exact same values.

/*---------+-------+------------+---------------*
 | test    | score | date       | average_score |
 +---------+-------+------------+---------------+
 | quiz    | 83    | 2020-04-01 | 69            |
 | quiz    | 35    | 2020-04-01 | 69            |
 | quiz    | 89    | 2020-04-01 | 69            |
 | midterm | 74    | 2020-05-01 | 76.3333       |
 | midterm | 62    | 2020-05-01 | 76.3333       |
 | midterm | 93    | 2020-05-01 | 76.3333       |
 | final   | 77    | 2020-06-01 | 76            |
 | final   | 59    | 2020-06-01 | 76            |
 | final   | 92    | 2020-06-01 | 76            |
 *---------+-------+------------+---------------*/

The score variable is removed, since it was not collapsed on.

/*---------+------------+---------------*
 | test    | date       | average_score |
 +---------+------------+---------------+
 | quiz    | 2020-04-01 | 69            |
 | quiz    | 2020-04-01 | 69            |
 | quiz    | 2020-04-01 | 69            |
 | midterm | 2020-05-01 | 76.3333       |
 | midterm | 2020-05-01 | 76.3333       |
 | midterm | 2020-05-01 | 76.3333       |
 | final   | 2020-06-01 | 76            |
 | final   | 2020-06-01 | 76            |
 | final   | 2020-06-01 | 76            |
 *---------+------------+---------------*/

Then all exact duplicate records are dropped, to create the output.

Output data:

/*---------+-------------+---------------*
 | test    | date        | average_score |
 +---------+-------------+---------------|
 | quiz    | 2020-04-01  | 69            |
 | midterm | 2020-05-01  | 76.3333       |
 | final   | 2020-06-01  | 76            |
 *---------+-------------+---------------*/

Example 2: Multiple aggregation variables

We can build on our previous example, but say we want to know additionally how many students took each test.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

2nd aggregation block

  • Name: We want to give the new variable a descriptive name unique from our other block (and other variables in our table), in this case test_count.

  • Aggregation method: We want to count all values in the test variable per date, so we choose Count.

  • Variable to count: We choose test as the variable containing the data we want to Count. Since none of our variables have null entries, we could choose any variable here and get the same result. If we did have nulls, they would not be included in the Count. Conceptually we do not want to only include distinct values in our count so we leave that off

Output data:

/*---------+-------------+---------------+-------------*
 | test    | date        | average_score | test_count  |
 +---------+-------------+---------------+-------------+
 | quiz    | 2020-04-01  | 69            | 3           |
 | midterm | 2020-05-01  | 76.3333       | 3           |
 | final   | 2020-06-01  | 76            | 3           |
 *---------+-------------+---------------+-------------*/

Example 3: Drop duplicates

This step can be used to drop duplicated records even in cases where no aggregation happens.

Let's say we have data that we know had duplicate records that we don't need.

Starting data:

/*---------+-----------+------------*
 | test    | questions | date       |
 +---------+-----------+------------+
 | quiz    | 10        | 2020-04-01 |
 | quiz    | 35        | 2020-04-01 |
 | quiz    | 10        | 2020-04-01 |
 | midterm | 20        | 2020-05-01 |
 | midterm | 20        | 2020-05-01 |
 | midterm | 20        | 2020-05-01 |
 | final   | 45        | 2020-06-01 |
 | final   | 45        | 2020-06-01 |
 | final   | 45        | 2020-06-01 |
 *---------+-----------+------------*/

Input fields:

Variables to collapse on: To drop all records that are an exact duplicate across all variables, we just need the collapse block with no aggregation blocks. We need to select all variables here and can do so by typing all of them out, or inputting *.

Output data:

/*---------+-----------+------------*
 | test    | questions | date       |
 +---------+-----------+------------+
 | quiz    | 10        | 2020-04-01 |
 | midterm | 20        | 2020-05-01 |
 | final   | 45        | 2020-06-01 |
 *---------+-----------+------------*/

Last updated