Reshape tables in transforms
Transforming tables in a project is a crucial step when working with data on Redivis. Conceptually, transforms execute a query on source table(s), whose results are materialized in a new output table. They are optimized to run on billions of records in seconds, and create a transparent, reproducible record of data transformation steps you've taken in your project.
In most cases you'll want to use transforms to create an output table containing all information you're interested in before analyzing that table in a notebook or exporting it for further use.
Once you've created a project and added a dataset to it that you'd like to work with, get started by creating a Transform. You can do this by clicking on any dataset or table.
All transforms have one source table, and one output table. Transforms can also join in additional tables to merge and combine with your source. You can see these links on the map as you create transforms.
When building this transform, you can choose whether you want to use the point and click interface or write SQL code. If you choose to use the interface, you can click the Code button at any time to see the code that the interface is generating. You can edit this code directly at any point, but once you do it can't be converted back to the interface. You can choose whether to use the interface or SQL for each transform.
While SQL might not be a familiar language, it is optimized for data cleaning procedures and allows these transforms to execute extremely quickly. It also allows for you to write your data transformations in a declarative, reproducible manner. The rest of this guide will be for the interface (which we recommend for most users) but if you would like to use SQL code you can learn more about Redivis-specific syntax in our reference section.
The first thing you'll want to do is decide if you have all the information you need in this table or if you'd like to join in an additional table or tables. You can reference any other table from either 1) this dataset, 2) another dataset, or 3) an output table from a different transform in this project. Any table you want to reference needs to be in this project, so you can click the Add dataset button to add a new dataset to the project if it's not here already.
To join a table, click + Join table, and select the table you'd like to join. You'll then need to select what type of join it will be. A union join will append tables together and should be used when the schema of the tables are the same, while any of the other joins will link additional variables to records that already exist in your table.
For most join types you'll need to select the variable(s) whose content will be matched on to execute the join. Generally this might be an identifier or common code that will link information in this table to the other table you've selected.
You might want to generate new columns in this table by creating a new variable. You can do so by clicking + Create new variable. Start by giving this variable a name and then choosing what method you want to use to create it.
Some methods are restricted to the type of variable you are working with. For example there are options to add or subtract years from a date variable, or concatenate string variables.
One of the most common new variable methods is Case (if/else). This allows you to set up a statement that looks at the content of each record and evaluates it based on conditions you've set up to generate the value. For example you can say that in the new variable you're creating, if the amount in column A is greater than 1000, the value of this new variable will be set to "high" and if not, then it will be set to "low"
You can create any number of new variables, and they will execute sequentially, allowing you to reference variable's you've created in other variables and subsequent sections.
You'll probably want to reduce the number of records in this table to exclude any that aren't relevant to the final output table you're creating. This will allow you to execute transforms quickly and get a better understanding of the number of relevant records you have.
To filter records, click + Filter rows and start building the conditions that records will need to meet in order to stay in your output table. These statements can be nested and allow you to reference any variables or record values.
If you find yourself working with a standard list of values you're using in multiple places, this might also be a place to save time and enhance reproducibility by creating and referencing a value list.
Depending on the structure of your data you might want to aggregate the data to collapse multiple records down to one, while preserving some information about what was dropped. Conceptually this might look like aggregating a table of charges from one record per charge into one record per person including a variable for the total charge per person.
To get started, click + Create new partitioned variable and name the new variable you want to create. This will be the variable with aggregated information, such as the sum or count of records.
Next, select the method you'll use to aggregate the information (such as SUM, or COUNT). Then you'll need to define the variable (or multiple variables) you want to partition on. This should be a variable (or variable combination) with duplication in your table. Perhaps it is an ID field where there are multiple records per ID. This new variable you're creating will find all records with matching values in the partition variable (or variables) you've specified and then perform the aggregation method you've defined on that set of records. For example, it will find all records with the same value in the ID field and then SUM all values in the CHARGE field to create a new variable.
Once you've created this variable, you can filter on it in this section as well.
By default, creating a partitioned variable does not drop duplicate records, so if you want to discard redundant records you'll need to toggle on the option at the bottom to Drop rows which are exact duplicates across all kept variables. This might mean carefully choosing what variables you keep in your output table (more information on that below).
Finally, before running your transform you'll always need to select which variables you want to keep in your output table. Perhaps you referenced a variable in this transform to create a new one, and now you don't need it anymore. Cutting variables means faster execution, and it is easy to add any variables back later and re-run the transform if you realize you need it, so in general try to keep this list as short as possible.
To propagate variables forward into your output table, they need to be in the right-hand box at the top of the transform labeled Keep. You can select any variable, or set of variables and click the > arrow button to move them over.
This section is also where you can quickly rename or retype a variable by right clicking on it and making a selection. Any changes you make here will be available in the rest of this transform, so you might need to retype a string variable to an integer in order to use math methods on it, or join on it in a table join.
When you're ready to execute the transform click the Run button in the top right of the toolbar.
If this button is disabled, it might be because the transform is invalid for some reason. Hovering on this button will give you more information, or you can look for the alert symbol (!) to see where you'll need to fix the transform in order to make it valid to run.
Successfully running a transform will create a new output table. Click on this table to see your output and verify that the transform executed as you would expect. Does it have the approximate number of records you expect to see? If you click on a variable you can see summary statistics which can help you verify key variable concepts, such as the distinct number of values.
Every time you run a transform you will want to sanity check your output.
If there are any issues with your output table, or if you decide to go in a different direction, or add another step, it is easy to go back to your transform and start making changes. You'll notice that this transform is now yellow and so is its output table. These edited and stale state indications help you keep track of work you're doing in your project, and you can easily revert to the previous successfully executed transform state at any point.
Run this transform again to see changes in the output table. This guide describes all the steps you can take in a single transform but perhaps you want to do one step at a time and run it in between each step to sanity check your output. This system is designed for iteration so make as many changes as you want to experiment and build the output you want.
From here you can continue to create transforms on your output table, tables from the dataset you added, or add any other dataset on Redivis. You might want to fit as many steps as you can into one transform, or make a long chain to more easily track your work and communicate it to others.
As you build more transforms you'll see that sometimes actions you take create stale chains of transforms. You can easily make upstream changes in your project (such as upgrading a dataset from the sample to the full dataset, or updating to a new version) and then run all transforms in your project with one click (a Run all option is available in the Map button menu).
You can use notebooks in a project to analyze data using Python, R, Stata, or SAS. These notebooks run in the browser with no additional configuration and seamless sharing with collaborators.
If you'd like to export data to a different system, you can download it in various file formats, reference it programmatically in Python / R, or visualize in tools such as Google Data Studio.
Augment your data analysis in Redivis by uploading your own datasets, with the option to share with your collaborators (or even the broader research community).
Redivis projects are built for collaboration and include real-time visuals to see where collaborators with edit access are working in the project, and a comments interface to discuss changes asynchronously.
Share your project to work with collaborators in real time, and make it public so that others can fork off of and build upon your work.