Transform concepts
Overview
A transform is a series of steps detailing a data transformation that are executed in sequential order when the transform is run. These steps and a final variable selection create an output table that can be inspected and further transformed.
Transforms vs. notebooks?
There are two mechanisms for working with data in workflows: transforms and notebooks. Understanding when to use each tool is key to taking full advantage of the capabilities of Redivis, particularly when working with big datasets.
Transforms are better for:
Reshaping + combining tabular and geospatial data
Working with large tables, especially at the many GB to TB scale
Preference for a no-code interface, or preference for programming in SQL
Declarative, easily documented data operations
Notebooks are better for:
Interactive exploration of any data type, including unstructured data files
Working with smaller tables (though working with bigger data is possible)
Preference for Python, R, Stata, or SAS
Interactive visualizations and figure generation
Steps
The majority of building a transform is choosing and completing your data transformation steps. Steps represent one specific action (e.g. filter rows, join tables) and are completed in the Redivis transform interface. Every step is a fundamentally a piece of SQL code that can be examined and could be reproduced by running in any environment with the same data.
When running the transform, steps will be executed in order, building off of what directly came before them.
Steps have additional options in the ⋮
menu to help in the building process:
Collapse / expand
Annotate
Disable / enable
Reorder (this might change the outcome)
Output variables
The final task in constructing a transform is to select which variables you want to have in your output table. You will use the variable selector in the bottom pane of the transform to choose which to keep.
Running transforms
Transforms are iterative and are designed to easily check your progress as you work. As long as a transform is valid, you can press the Run
button to generate your output table. Click on this output table node in the workflow tree to see your results once the run is complete.
Invalid state
The transform interface will prevent you from running a transform that is invalid. This will be displayed with an invalid icon on the step or the part of a step where the issue is. Usually when you have an invalid alert it means some crucial information is missing and you can follow the invalid icons to find the locations you'll need to update.
History and revert
Every time you run the transform, a snapshot of the transform at that point in time is saved. You can click on History to view the log of these runs, and click on any snapshot to revert the transform to that point in time.
Your transform will then be in the Edited state and no changes will be made to the output table. Once you run it, this configuration will become a new snapshot at the top of the transform history list.
Checking the output
Output tables don't just exist to store data, you can use the variable statistics generated in the table node to make sure your transform did what you expected it to do.
Common things to check might be total number of rows and variables in your output table, distinct and null counts of a key variable, or even running a query on the Query tab of the table to filter by a particular value.
If you find the table isn't what you want, you can go back to the transform to make changes and rerun it. Transforms were created to be iterative!
Additional tools for working with transforms
Name and rename
By default transforms are named with a number based on the order they were created. We suggest renaming your transforms with descriptive names so you and your collaborators can quickly find them later on. You can rename a transform by clicking on the title of the node, or on the Rename option in the transform ⋮ More
menu.
If you have not specifically renamed the connected output table, renaming a transform will also rename this output table.
Change the source table
All transform nodes have one source table. You can join in multiple tables, but the source table designation will affect how the joins execute. You can change the source table from the ⋮ More
menu in the transform. Note that if the new source table has different variables your transform might become invalid until you update it.
Split and combine transforms
All transforms can be split at the step level into two different transforms by clicking Split in any step's menu. Additionally, two transforms can be combined into one by right clicking on a table to Remove it.
You might want to split a transform above a tricky step to see what the output table would look like at that point in the process. This can be a key tool in troubleshooting any issues and understanding what might be going wrong.
After splitting a transform to check an output table, the next logical step might be to combine these two transforms back into one again. Or perhaps you have a string of transforms which you no longer need the output tables for and want to reduce the size of your workflow.
Copy, paste and insert
As you go, you might find yourself wanting to take certain data cleaning actions multiple times or move them around in your process. You can right click on any transform in the workflow tree to see options for copying it or deleting it. Right click on any table node to paste the copied transform. You can also insert transforms above other transforms.
Steps and parts of steps can also be copied and pasted within the same transform or across multiple transforms.
View SQL code
All transforms generate SQL, which will executed the transform is run. To view the code for a particular step, click on View SQL in the step menu. You can also see the SQL for the entire transform from within the ⋮ More
menu of the transform.
You can convert this code into a SQL query step and run it to achieve the same outcome, or edit the code directly.
SQL Concepts
Redivis uses SQL as the basis for data transformation for many reasons. It is designed for working quickly and efficiently with large tables, and is easy to learn concepts and basic executions, and it is highly reproducible.
While Python, R, Stata, and SAS are commonly used in data analysis, they are not suited to working with large tables. Executing queries on millions of records can take hours and in many cases fail. We have provided a point and click interface for using transforms to make using a perhaps unfamiliar coding language easier.
Fundamentals
Tables are data structures containing rows and variables
Variables represent fields in the table that all rows contain values for
Rows represent an individual observation in table with values in across all variables
A SQL query will always operate on data from one or more tables and output a single data table.
Building a query
When you are using the interface or SQL code, it is best to start the process by figuring out what you would like your output table to look like. What are the variables? What do you want each row of the table to represent?
Once you figure out what you want your output to look like, you can work backwards and formulate the steps of your query to build the shape you'd like.
Last updated