Reshape data in transforms
Last updated
Last updated
Transforming tables in a workflow is a crucial step when working with data on Redivis. Conceptually, transforms execute a query on a source table and results are materialized in a new output table. They are optimized to run on billions of records in seconds, and create a transparent, reproducible record of data transformation steps you've taken in your workflow.
In most cases you'll want to use transforms to create an output table containing all the information you're interested in before analyzing that table in a notebook or exporting it for further use.
Once you've created a workflow and added a dataset to it that you'd like to work with, get started by creating a Transform. You can do this by clicking on any dataset or table.
To build this transform, you will add steps which each take an action to shape the data in the output table. You can choose from many named steps which include a point and click interface that compile to SQL code, or you can add a SQL query step to write code directly. You can always view the code that your step is generating and switch to code if you'd like to edit it.
While SQL might not be a familiar language, it is optimized for data cleaning procedures and allows transforms to execute extremely quickly. It also allows you to write your data transformations in a declarative, reproducible manner.
The first thing you'll want to do is decide if you have all the information you need in this table or if you'd like to join in an additional table or tables. You can reference any other table from either 1) this dataset, 2) another dataset, or 3) an output table from a different transform in this workflow. Any table you want to reference needs to be in this workflow, so you can click the Add dataset button to add a new dataset to the workflow if it's not here already.
To join a table, add a Join step and select the table you'd like to join. You'll then need to select what type of join it will be, and build your join condition. In most cases your join condition will be linking two variables of the same name and type together (e.g. join all records where id
= id
).
Learn more in the Joins reference section.
You might want to generate new columns in this table by creating a new variable. You can do so by adding a Create variables step. Start by giving this variable a name and then choosing what method you want to use to create it.
Some methods are restricted to the type of variable you are working with. For example there are options to add or subtract years from a date variable, or concatenate string variables.
One of the most common new variable methods is Case (if/else). This allows you to set up a statement that looks at the content of each record and evaluates it based on conditions you've set up to generate the value. For example you can say that in the new variable you're creating, if the amount in column A is greater than 1000, the value of this new variable will be set to "high" and if not, then it will be set to "low."
You can create any number of new variables, and they will execute sequentially, allowing you to reference variable's you've created in other variables and subsequent sections.
Learn more in the Create variables reference section.
You'll probably want to reduce the number of records in this table to exclude any that aren't relevant to the final output table you're creating. This will allow you to execute transforms quickly and get a better understanding of the number of relevant records you have.
To filter records, add a Filter step to your transform and start building the conditions that records will need to meet in order to stay in your output table. These statements can be nested and allow you to reference any variables or record values.
If you find yourself working with a standard list of values you're using in multiple places, this might also be a place to save time and enhance reproducibility by creating and referencing a value list.
Learn more in the Filters reference section.
Depending on the structure of your data you might want to aggregate the data to collapse multiple records down to one, while preserving some information about what was dropped. Conceptually this might look like aggregating a table of charges from one record per charge into one record per person including a variable for the total charge per person.
To get started, add an Aggregate step and select the variables that you will to aggregate on. These will be the variables that exist in your output data after the aggregation is finished. All records that are an exact match in these selected variables will be dropped.
You can also capture information about records being dropped by creating a new aggregate variable. For example, maybe you are aggregating a table with multiple test scores per person down to a table with just one record per person. You can create an aggregate variable with the average test score, or the count of the number of tests each person took.
Learn more in the Aggegate reference section.
Finally, before running your transform you'll always need to select which variables you want to keep in your output table. Perhaps you referenced a variable in this transform to create a new one, and now you don't need it anymore. Cutting variables means faster execution, and it is easy to add any variables back later and re-run the transform if you realize you need it, so in general try to keep this list as short as possible.
To propagate variables forward into your output table, they need to be in the right-hand box at the top of the transform labeled Keep. You can select any variable, or set of variables and click the > arrow button to move them over.
Learn more in the Variable selection reference section.
When you're ready to execute the transform click the Run button in the top right of the toolbar.
If this button is disabled, it might be because the transform is invalid for some reason. Hovering on this button will give you more information, or you can look for the alert symbol (!) to see where you'll need to fix the transform in order to make it valid to run.
Successfully running a transform will create a new output table. Click on this table to see your output and verify that the transform executed as you would expect. Does it have the approximate number of records you expect to see? If you click on a variable you can see summary statistics which can help you verify key variable concepts, such as the distinct number of values.
Every time you run a transform you will want to sanity check your output.
If there are any issues with your output table, or if you decide to go in a different direction, or add another step, it is easy to go back to your transform and start making changes.
You'll notice that this transform is now yellow and so is its output table. These edited and stale state indications help you keep track of work you're doing in your workflow, and you can easily revert to the previous successfully executed transform state at any point.
Run this transform again to see changes in the output table. This guide describes all the steps you can take in a single transform but perhaps you want to do one step at a time and run it in between each step to sanity check your output. This system is designed for iteration so make as many changes as you want to experiment and build the output you want.
From here you can continue to create transforms on your output table, tables from the dataset you added, or add any other dataset on Redivis. You might want to fit as many steps as you can into one transform, or make a long chain to more easily track your work and communicate it to others.
As you build more transforms you'll see that sometimes actions you take create stale chains of transforms. You can easily make upstream changes in your workflow (such as upgrading a dataset from the sample to the full dataset, or updating to a new version) and then run all transforms in your workflow with one click (a Run all option is available in the Map button menu).
You can use notebooks in a workflow to analyze data using Python, R, Stata, or SAS. These notebooks run in the browser with no additional configuration and seamless sharing with collaborators.
Learn more in the Work with data in notebooks guide.
If you'd like to export data to a different system, you can download it in various file formats, reference it programmatically in Python / R, or visualize in tools such as Google Data Studio.
Learn more in the Export to other environments guide.
Augment your data analysis in Redivis by uploading your own datasets, with the option to share with your collaborators (or even the broader research community).
Learn more in the Create and populate a dataset guide.
Redivis workflows are built for collaboration and include real-time visuals to see where collaborators with edit access are in the workflow, and a comments interface to discuss changes asynchronously.
Share your workflow to work with collaborators in real time, and make it public so that others can fork off of and build upon your work.