Redivis Documentation
API DocumentationRedivis Home
  • Introduction
  • Redivis for open science
    • FAIR data practices
    • Open access
    • Data repository characteristics
    • Data retention policy
    • Citations
  • Guides
    • Getting started
    • Discover & access data
      • Discover datasets
      • Apply to access restricted data
      • Create a study
    • Analyze data in a workflow
      • Reshape data in transforms
      • Work with data in notebooks
      • Running ML workloads
      • Example workflows
        • Analyzing large tabular data
        • Create an image classification model
        • Fine tuning a Large Language Model (LLM)
        • No-code visualization
        • Continuous enrollment
        • Select first/last encounter
    • Export & publish your work
      • Export to other environments
      • Build your own site with Observable
    • Create & manage datasets
      • Create and populate a dataset
      • Upload tabular data as tables
      • Upload unstructured data as files
      • Cleaning tabular data
    • Administer an organization
      • Configure access systems
      • Grant access to data
      • Generate a report
      • Example tasks
        • Emailing subsets of members
    • Video guides
  • Reference
    • Your account
      • Creating an account
      • Managing logins
      • Single Sign-On (SSO)
      • Workspace
      • Studies
      • Compute credits and billing
    • Datasets
      • Documentation
      • Tables
      • Variables
      • Files
      • Creating & editing datasets
      • Uploading data
        • Tabular data
        • Geospatial data
        • Unstructured data
        • Metadata
        • Data sources
        • Programmatic uploads
      • Version control
      • Sampling
      • Exporting data
        • Download
        • Programmatic
        • Google Data Studio
        • Google Cloud Storage
        • Google BigQuery
        • Embedding tables
    • Workflows
      • Workflow concepts
      • Documentation
      • Data sources
      • Tables
      • Transforms
        • Transform concepts
        • Step: Aggregate
        • Step: Create variables
        • Step: Filter
        • Step: Join
        • Step: Limit
        • Step: Stack
        • Step: Order
        • Step: Pivot
        • Step: Rename
        • Step: Retype
        • Step: SQL query
        • Variable selection
        • Value lists
        • Optimization and errors
        • Variable creation methods
          • Common elements
          • Aggregate
          • Case (if/else)
          • Date
          • DateTime
          • Geography
          • JSON
          • Math
          • Navigation
          • Numbering
          • Other
          • Statistical
          • String
          • Time
      • Notebooks
        • Notebook concepts
        • Compute resources
        • Python notebooks
        • R notebooks
        • Stata notebooks
        • SAS notebooks
        • Using the Jupyter interface
      • Access and privacy
    • Data access
      • Access levels
      • Configuring access
      • Requesting access
      • Approving access
      • Usage rules
      • Data access in workflows
    • Organizations
      • Administrator panel
      • Members
      • Studies
      • Workflows
      • Datasets
      • Permission groups
      • Requirements
      • Reports
      • Logs
      • Billing
      • Settings and branding
        • Account
        • Public profile
        • Membership
        • Export environments
        • Advanced: DOI configuration
        • Advanced: Stata & SAS setup
        • Advanced: Data storage locations
        • Advanced: Data egress configuration
    • Institutions
      • Administrator panel
      • Organizations
      • Members
      • Datasets
      • Reports
      • Settings and branding
    • Quotas and limits
    • Glossary
  • Additional Resources
    • Events and press
    • API documentation
    • Redivis Labs
    • Office hours
    • Contact us
    • More information
      • Product updates
      • Roadmap
      • System status
      • Security
      • Feature requests
      • Report a bug
Powered by GitBook
On this page
  • Overview
  • Step structure
  • Field definitions
  • Analytic methods
  • Examples
  • Example 1: Date extract
  • Example 2: Case method
  • Example 3: Partitioned analytic methods
  • Example 4: Windowed analytic methods
  • Example 5: Windowed & partitioned analytic methods

Was this helpful?

Export as PDF
  1. Reference
  2. Workflows
  3. Transforms

Step: Create variables

Overview

A Create variables block uses methods to make new variables based on existing data. The method selected will dictate how the block operates.

Example starting data:

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 | sam     | 74     |
 | pat     | 62     |
 *---------+--------*/

Example output data:

Creating a new variable for what letter grade each student got on their test.

/*---------+--------+-------*
 | student | score  | grade |
 +---------+--------+-------+
 | jane    | 83     | B     |
 | neal    | 35     | F     |
 | sam     | 74     | C     |
 | pat     | 62     | D     |
 *---------+--------+-------*/

Step structure

  • There will be at least one new variable block where you will define and complete a new variable method.

  • When multiple blocks exist, the variables will be created in sequence and can reference each other.

Field definitions

Field
Definition

Name

The name of the variable being created. This must follow all naming standards

Method

Analytic methods

Analytic methods are a special category of method that allow for each row to be computed individually. When using an analytic new variable method new tools become available:

  • A partition segments data based on values in the selected partition variables, and computes the analytic method within those segments separately.

  • A window defines which records are used to compute the analytic method. Usually this would be accompanied with an order clause.

Examples 3 and 4 below go into more detail for analytic methods.

Examples

Example 1: Date extract

A simple new variable format is extracting one part of a variable into a new one. In our data, we have a full date including year, month, and day, but we want to extract the year for use elsewhere.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Name: We name our new variable test_month.

  • Method: Once we chose Date extract as our method, new fields appear

    • Variable to extract: The variable with the data we want to extract from (date).

    • Date part: The part of the variable we want to extract. Since date is a Date type variable the information about date part is stored in the format and can be easily extracted. We choose Month since that is what we want to create.

Output data:

/*---------+-------+---------+------------+--------------*
 | test    | score | student | date       | test_month   |
 +---------+-------+---------+------------+--------------+
 | quiz    | 83    | jane    | 2020-04-01 | April        |
 | quiz    | 35    | pat     | 2020-04-01 | April        |
 | quiz    | 89    | sam     | 2020-04-01 | April        |
 | midterm | 74    | jane    | 2020-05-01 | May          |
 | midterm | 62    | pat     | 2020-05-01 | May          |
 | midterm | 93    | sam     | 2020-05-01 | May          |
 | final   | 77    | jane    | 2020-06-01 | June         |
 | final   | 59    | pat     | 2020-06-01 | June         |
 | final   | 92    | sam     | 2020-06-01 | June         |
 *---------+-------+---------+------------+--------------*/

Example 2: Case method

The Case method (if/else) allows us to specify one or more conditions to create the values of the new variable.

For example, we can create a variable capturing the grade of each test in our data.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • When run, each row of the table will be evaluated in each section of the case statement (If, Else if, Else set to) until it matches something and is set to the corresponding value.

    • After the first section where a row meets the criteria, the new variable value will be set and no other sections will be evaluated.

Output data:

/*---------+-------+---------+------------+----------------*
 | test    | score | student | date       | letter_grade   |
 +---------+-------+---------+------------+----------------+
 | quiz    | 83    | jane    | 2020-04-01 | B              |
 | quiz    | 35    | pat     | 2020-04-01 | F              |
 | quiz    | 89    | sam     | 2020-04-01 | B              |
 | midterm | 74    | jane    | 2020-05-01 | C              |
 | midterm | 62    | pat     | 2020-05-01 | F              |
 | midterm | 93    | sam     | 2020-05-01 | A              |
 | final   | 77    | jane    | 2020-06-01 | C              |
 | final   | 59    | pat     | 2020-06-01 | F              |
 | final   | 92    | sam     | 2020-06-01 | A              |
 *---------+-------+---------+------------+----------------*/

Example 3: Partitioned analytic methods

We can use an analytic method to compute a value for each row (rather than the entire table). By using a partition we can define groups of rows to calculate across.

For example, in our grades data we can calculate the average score of each test.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Method: We want to calculate the average so select this analytic method.

  • Partition: This is where we define the groups in which the average will be calculated.

    • If one or more variables are selected here, the average for our new score_average variable will be computed across all rows that are the same in the selected variable(s).

    • If no variables are entered here, then the average will be computed across the entire table.

    • We want to average scores from the same test so we select test here.

  • Variable to aggregate: This variable contains the data we want to average.

  • Window: We don't want to define a window in this average so we leave it on All rows in the partition.

Output data:

/*---------+-------+---------+------------+---------------*
 | test    | score | student | date       | score_average |
 +---------+-------+---------+------------+---------------+
 | quiz    | 83    | jane    | 2020-04-01 | 69            |
 | quiz    | 35    | pat     | 2020-04-01 | 69            |
 | quiz    | 89    | sam     | 2020-04-01 | 69            |
 | midterm | 74    | jane    | 2020-05-01 | 76.3333       |
 | midterm | 62    | pat     | 2020-05-01 | 76.3333       |
 | midterm | 93    | sam     | 2020-05-01 | 76.3333       |
 | final   | 77    | jane    | 2020-06-01 | 76            |
 | final   | 59    | pat     | 2020-06-01 | 76            |
 | final   | 92    | sam     | 2020-06-01 | 76            |
 *---------+-------+---------+------------+---------------*/

To create the new variable, for each row with the same value in test, the values in score will be averaged together. So there are 3 rows with the value quiz in test which average to 69. So all rows with a quiz will show the same average of 69 in our new variable.

Example 4: Windowed analytic methods

We can use a window to calculate a moving average. A window will define how many rows before or after the current row to use when calculating the average.

Example data:

/*-------+----------+------------*
 | score | student  | date       |
 +-------+----------+------------+
 | 10    | quiz1    | 2020-01-01 |
 | 10    | quiz1    | 2020-02-01 |
 | 40    | quiz1    | 2020-03-01 |
 | 30    | quiz2    | 2020-04-01 |
 *-------+---------+------------*/

Input fields:

  • Partition: We choose not to use a partition in this example since our data does not need to be segmented.

  • Variable to aggregate: The numbers we want to average are in the score variable so we choose that here.

  • Window: We want to create a moving average based on one entry before and after the current row, so we select Rows here.

  • Order by: Our data is already ordered by date in this table, but if it wasn't we would definitely need to order on the relevant variable here.

  • Rows preceding / Rows following: This is where we define how many rows to include in the average.

Output data:

/*-------+---------+------------*---------------+
 | score | test    | date       | score_average |
 +-------+---------+------------+---------------+
 | 10    | quiz1   | 2020-01-01 | 10            |
 | 10    | quiz1   | 2020-02-01 | 20            |
 | 40    | quiz1   | 2020-03-01 | 26.6667       |
 | 30    | quiz2   | 2020-04-01 | 35            |
 *-------+---------+------------+---------------*/

For each row, one row preceding and following is used to compute the score_average. So for the first row we average 10 and 10 (since no preceding rows exist, it is excluded). For the second row 10, 10, and 40 are averaged. This process repeats until the end of the table is reached.

Note that we could also use Range instead of Rows for our window if our question was time based (e.g. average score over 1 month preceding and 1 month following).

Example 5: Windowed & partitioned analytic methods

Continuing the previous example, if we had included a partition then this same process would be completed separately for the values in each partition. So if we had partitioned on test in this example, our outcome would look different

Output data:

/*-------+---------+------------*---------------+
 | score | test    | date       | score_average |
 +-------+---------+------------+---------------+
 | 10    | quiz1   | 2020-01-01 | 10            |
 | 10    | quiz1   | 2020-02-01 | 20            |
 | 40    | quiz1   | 2020-03-01 | 25            |
 | 30    | quiz2   | 2020-04-01 | 30            |
 *-------+---------+------------+---------------*/

Since quiz2 is in a separate partition, those rows are averaged separately.

Last updated 7 months ago

Was this helpful?

The way that the new variable will be created. Choosing this will bring up additional fields to complete specific to the chosen method. .

Some methods are only available for certain variable types, so you might need to variables before you can use them in the method you've chosen.

More methods to choose from can be found in the .

Comparison statements used here will operate the same as they do in the step and can be nested in the same way.

retype
Variable creation methods
Filter
See all methods here