Redivis Documentation
API DocumentationRedivis Home
  • Introduction
  • Redivis for open science
    • FAIR data practices
    • Open access
    • Data repository characteristics
    • Data retention policy
    • Citations
  • Guides
    • Getting started
    • Discover & access data
      • Discover datasets
      • Apply to access restricted data
      • Create a study
    • Analyze data in a workflow
      • Reshape data in transforms
      • Work with data in notebooks
      • Running ML workloads
      • Example workflows
        • Analyzing large tabular data
        • Create an image classification model
        • Fine tuning a Large Language Model (LLM)
        • No-code visualization
        • Continuous enrollment
        • Select first/last encounter
    • Export & publish your work
      • Export to other environments
      • Build your own site with Observable
    • Create & manage datasets
      • Create and populate a dataset
      • Upload tabular data as tables
      • Upload unstructured data as files
      • Cleaning tabular data
    • Administer an organization
      • Configure access systems
      • Grant access to data
      • Generate a report
      • Example tasks
        • Emailing subsets of members
    • Video guides
  • Reference
    • Your account
      • Creating an account
      • Managing logins
      • Single Sign-On (SSO)
      • Workspace
      • Studies
      • Compute credits and billing
    • Datasets
      • Documentation
      • Tables
      • Variables
      • Files
      • Creating & editing datasets
      • Uploading data
        • Tabular data
        • Geospatial data
        • Unstructured data
        • Metadata
        • Data sources
        • Programmatic uploads
      • Version control
      • Sampling
      • Exporting data
        • Download
        • Programmatic
        • Google Data Studio
        • Google Cloud Storage
        • Google BigQuery
        • Embedding tables
    • Workflows
      • Workflow concepts
      • Documentation
      • Data sources
      • Tables
      • Transforms
        • Transform concepts
        • Step: Aggregate
        • Step: Create variables
        • Step: Filter
        • Step: Join
        • Step: Limit
        • Step: Stack
        • Step: Order
        • Step: Pivot
        • Step: Rename
        • Step: Retype
        • Step: SQL query
        • Variable selection
        • Value lists
        • Optimization and errors
        • Variable creation methods
          • Common elements
          • Aggregate
          • Case (if/else)
          • Date
          • DateTime
          • Geography
          • JSON
          • Math
          • Navigation
          • Numbering
          • Other
          • Statistical
          • String
          • Time
      • Notebooks
        • Notebook concepts
        • Compute resources
        • Python notebooks
        • R notebooks
        • Stata notebooks
        • SAS notebooks
        • Using the Jupyter interface
      • Access and privacy
    • Data access
      • Access levels
      • Configuring access
      • Requesting access
      • Approving access
      • Usage rules
      • Data access in workflows
    • Organizations
      • Administrator panel
      • Members
      • Studies
      • Workflows
      • Datasets
      • Permission groups
      • Requirements
      • Reports
      • Logs
      • Billing
      • Settings and branding
        • Account
        • Public profile
        • Membership
        • Export environments
        • Advanced: DOI configuration
        • Advanced: Stata & SAS setup
        • Advanced: Data storage locations
        • Advanced: Data egress configuration
    • Institutions
      • Administrator panel
      • Organizations
      • Members
      • Datasets
      • Reports
      • Settings and branding
    • Quotas and limits
    • Glossary
  • Additional Resources
    • Events and press
    • API documentation
    • Redivis Labs
    • Office hours
    • Contact us
    • More information
      • Product updates
      • Roadmap
      • System status
      • Security
      • Feature requests
      • Report a bug
Powered by GitBook
On this page
  • Overview
  • Step structure
  • Field definitions
  • Examples
  • Example 1: Basic collapse
  • Example 2: Multiple aggregation variables
  • Example 3: Drop duplicates

Was this helpful?

Export as PDF
  1. Reference
  2. Workflows
  3. Transforms

Step: Aggregate

Overview

The Aggregate step collapses rows that are identical across a set of variables, optionally creating new aggregate variables in the process.

Example starting data:

/*---------+------------+---------+--------*
 | test    | date       | student | score  |
 +---------+------------+---------+--------+
 | quiz    | 2020-04-01 | jane    | 83     |
 | quiz    | 2020-04-01 | pat     | 35     |
 | midterm | 2020-05-01 | jane    | 74     |
 | midterm | 2020-05-01 | pat     | 62     |
 *---------+------------+---------+--------*/

Example output data

Collapsing on variables test and date, and creating new variable average_score to aggregate data from the score variable.

/*---------+-------------+---------------*
 | test    | date        | average_score |
 +---------+-------------+---------------|
 | quiz    | 2020-04-01  | 59            |
 | midterm | 2020-05-01  | 68            |
 *---------+-------------+---------------*/

Step structure

  • There is one collapse block where we will define how the data will be reshaped. On execution:

    • Your data will be cut to only include the variables chosen in this block.

    • Duplicates records across the collapsed variables will be dropped.

  • There can be one or more aggregation blocks where we can capture aggregate information in a newly created variable.

    • You can successfully collapse your table without creating any new variables in aggregation blocks.

    • Each aggregation block in the step represents one new variable in your output data.

    • Aggregation blocks how you can capture information about records dropped in your collapse block.

Field definitions

Collapse block:

Field
Description

Variables to collapse on

All variables you want to include in your output.

Aggregation block(s):

Field
Description

Name

The name of the new variable being created.

Aggregation method

How the new variable will be summarized (e.g. SUM or COUNT).

[Method fields]

Examples

Example 1: Basic collapse

We have test score data recorded per test, student, and date. However we want to know the average score on each test overall.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Variables to collapse on: Of the variables in our table, the only ones we want in our final output are test and date, so we select these here. We leave out student because that information doesn't matter to us anymore, and we leave out score because we are creating a new variable to replace it with aggregated information.

  • Name: We give our new variable a descriptive name average_score.

  • Aggregation method: We want to average all values in the grades variable per test, so we choose Average.

  • Variable to aggregate: Here is where we choose score as the variable containing the data we want to Average. If we had chosen a different Aggregation method, we might have different input fields here to answer.

Execution:

All variables are removed that aren't collapsed on or used in aggregation.

/*---------+-------+------------*
 | test    | score | date       |
 +---------+-------+------------+
 | quiz    | 83    | 2020-04-01 |
 | quiz    | 35    | 2020-04-01 |
 | quiz    | 89    | 2020-04-01 |
 | midterm | 74    | 2020-05-01 |
 | midterm | 62    | 2020-05-01 |
 | midterm | 93    | 2020-05-01 |
 | final   | 77    | 2020-06-01 |
 | final   | 59    | 2020-06-01 |
 | final   | 92    | 2020-06-01 |
 *---------+-------+------------*/

The Average is created of the score where test and date (our collapsed on variables) have the exact same values.

/*---------+-------+------------+---------------*
 | test    | score | date       | average_score |
 +---------+-------+------------+---------------+
 | quiz    | 83    | 2020-04-01 | 69            |
 | quiz    | 35    | 2020-04-01 | 69            |
 | quiz    | 89    | 2020-04-01 | 69            |
 | midterm | 74    | 2020-05-01 | 76.3333       |
 | midterm | 62    | 2020-05-01 | 76.3333       |
 | midterm | 93    | 2020-05-01 | 76.3333       |
 | final   | 77    | 2020-06-01 | 76            |
 | final   | 59    | 2020-06-01 | 76            |
 | final   | 92    | 2020-06-01 | 76            |
 *---------+-------+------------+---------------*/

The score variable is removed, since it was not collapsed on.

/*---------+------------+---------------*
 | test    | date       | average_score |
 +---------+------------+---------------+
 | quiz    | 2020-04-01 | 69            |
 | quiz    | 2020-04-01 | 69            |
 | quiz    | 2020-04-01 | 69            |
 | midterm | 2020-05-01 | 76.3333       |
 | midterm | 2020-05-01 | 76.3333       |
 | midterm | 2020-05-01 | 76.3333       |
 | final   | 2020-06-01 | 76            |
 | final   | 2020-06-01 | 76            |
 | final   | 2020-06-01 | 76            |
 *---------+------------+---------------*/

Then all exact duplicate records are dropped, to create the output.

Output data:

/*---------+-------------+---------------*
 | test    | date        | average_score |
 +---------+-------------+---------------|
 | quiz    | 2020-04-01  | 69            |
 | midterm | 2020-05-01  | 76.3333       |
 | final   | 2020-06-01  | 76            |
 *---------+-------------+---------------*/

Example 2: Multiple aggregation variables

We can build on our previous example, but say we want to know additionally how many students took each test.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

2nd aggregation block

  • Name: We want to give the new variable a descriptive name unique from our other block (and other variables in our table), in this case test_count.

  • Aggregation method: We want to count all values in the test variable per date, so we choose Count.

  • Variable to count: We choose test as the variable containing the data we want to Count. Since none of our variables have null entries, we could choose any variable here and get the same result. If we did have nulls, they would not be included in the Count. Conceptually we do not want to only include distinct values in our count so we leave that off

Output data:

/*---------+-------------+---------------+-------------*
 | test    | date        | average_score | test_count  |
 +---------+-------------+---------------+-------------+
 | quiz    | 2020-04-01  | 69            | 3           |
 | midterm | 2020-05-01  | 76.3333       | 3           |
 | final   | 2020-06-01  | 76            | 3           |
 *---------+-------------+---------------+-------------*/

Example 3: Drop duplicates

This step can be used to drop duplicated records even in cases where no aggregation happens.

Let's say we have data that we know had duplicate records that we don't need.

Starting data:

/*---------+-----------+------------*
 | test    | questions | date       |
 +---------+-----------+------------+
 | quiz    | 10        | 2020-04-01 |
 | quiz    | 35        | 2020-04-01 |
 | quiz    | 10        | 2020-04-01 |
 | midterm | 20        | 2020-05-01 |
 | midterm | 20        | 2020-05-01 |
 | midterm | 20        | 2020-05-01 |
 | final   | 45        | 2020-06-01 |
 | final   | 45        | 2020-06-01 |
 | final   | 45        | 2020-06-01 |
 *---------+-----------+------------*/

Input fields:

Variables to collapse on: To drop all records that are an exact duplicate across all variables, we just need the collapse block with no aggregation blocks. We need to select all variables here and can do so by typing all of them out, or inputting *.

Output data:

/*---------+-----------+------------*
 | test    | questions | date       |
 +---------+-----------+------------+
 | quiz    | 10        | 2020-04-01 |
 | midterm | 20        | 2020-05-01 |
 | final   | 45        | 2020-06-01 |
 *---------+-----------+------------*/

Last updated 6 months ago

Was this helpful?

After you select your aggregation method, you will be prompted to input the information your chosen method needs to execute. You can see more specifics in the section.

Variable creation methods