Core concepts

Overview

It might be useful to think of a project as a large folder. It contains datasets, alongside transforms – which query those dataset tables – and their output tables, and any notebooks used to analyze your outputs. In a Redivis project, these entities are visually arranged to make it easy to see how your transforms, tables, and notebooks are related to each other, and to make it easier to make changes that affect your whole project.

The left half of the screen is where you'll see all entities that currently exist in your project. By default this is laid out as a tree of connected nodes so you can better understand the connections between tables. You can also switch this view to a list. If you created this project from a dataset, you'll initially see a rectangle with the dataset name next to it.

Node types

Each shape, or node, on the project tree represents a different entity in your project.

Dataset nodes represent datasets from across Redivis that have been added to your project.

Table nodes are either dataset tables, or the resulting output table of the upstream transform.

Transform nodes are queries used to shape data by creating new tables. They don't contain any data themselves.

Notebook nodes are code blocks (and their outputs) which are used to analyze data.

If you ever get lost, you can use the Search button on the left side of the black menu bar and enter the name of a node to jump to it.

Datasets

A dataset node is a copy of a dataset in a project.

Dataset nodes display a list of the tables they contain. You can click on any table to view its contents, or click "Query" to build a transform on it.

Samples

Some large datasets have 1% samples which are useful for quickly testing querying strategies before running transforms against the full dataset.

If a 1% sample is available for a dataset, it will automatically be added to your project by default instead of the full sample. Samples are indicated by the dark circle icon to the top left of a dataset node in the left panel and in the list of the dataset's tables.

All sampled tables in the same dataset will be sampled on the same variable with the same group of values (so joining two tables in the same dataset with 1% samples will still result in a 1% sample).
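To see why a join between sampled tables still yields a 1% sample, consider the following minimal Python sketch. It assumes a hash-based rule for choosing which values of the sampling variable are kept; how Redivis actually selects the values is not described here, and the names `in_sample`, `patients`, and `visits` are hypothetical.

```python
import hashlib

def in_sample(value, percent=1):
    """Deterministically decide whether a value of the sampling variable is kept.

    Assumption: a hash of the value picks the sample (illustrative only).
    """
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent

# Two hypothetical tables in the same dataset, both keyed on patient_id.
patients = [{"patient_id": i, "age": 20 + i % 60} for i in range(10_000)]
visits = [{"patient_id": n % 10_000, "visit": n} for n in range(30_000)]

# Both tables are sampled on the same variable with the same rule,
# so the same group of patient_id values survives in each.
patients_sample = [r for r in patients if in_sample(r["patient_id"])]
visits_sample = [r for r in visits if in_sample(r["patient_id"])]

# Inner join on patient_id: every sampled visit still finds its patient.
patients_by_id = {r["patient_id"]: r for r in patients_sample}
joined = [
    {**v, **patients_by_id[v["patient_id"]]}
    for v in visits_sample
    if v["patient_id"] in patients_by_id
]

sampled_ids = {r["patient_id"] for r in patients_sample}
```

If each table were instead sampled independently, most sampled visits would not find a matching sampled patient, and the join would keep far fewer rows than 1%.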

To switch to the full sample, click the "Sample" button in the top right of the menu bar when you have a dataset selected.

Your downstream transforms and tables will become stale, since an upstream change has been made. You can run these nodes individually to update their contents, or use the run all functionality by clicking on the project's name in the top menu bar.

Versions

When a new version of a dataset is released by an administrator, the corresponding dataset node on your project minimap will become purple. To upgrade the dataset's version, click the "Version" button in the top right of the menu bar when you have a dataset selected.

You can select whichever version you want to use here, or view the full version history on the dataset page.

After updating, your downstream transforms and tables will become stale. You can run these nodes individually to update their contents, or use the run all functionality by clicking on the project's name in the top menu bar.

Tables

A dataset table refers to a single table (a unique set of rows and columns) that was uploaded to the dataset by the owner. These tables are shown directly underneath the dataset when you create a transform or notebook from that table.

An output table is automatically created when you create a transform node. Running the transform generates the data in this output table.

All table nodes have one upstream parent. You can view the table's data and metadata similarly to other tables on Redivis. You cannot edit or update the metadata here.

You can create multiple transforms to operate on a single table, which allows you to create various branches within your project. To create a new transform, select the table or dataset and click the small + icon that appears under the bottom right corner of the node.

Sanity check output

After you run a transform, you can investigate the downstream output table to get feedback on the success and validity of your querying operation – both the filtering criteria you've applied and the new features you've created.

Understanding the content of an output table allows you to perform important sanity checks at each step of your research process, answering questions like:

  • Did my filtering criteria remove the rows I expected?

  • Do my new variables contain the information I expect?

  • Does the distribution of values in a given variable make sense?

  • Have I dropped unnecessary variables?

To sanity check the contents of a table node, you can inspect the general table characteristics, check the summary statistics of different variables, look at the table's cells, or create a notebook for more in-depth analysis.
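As a rough illustration of the questions above, the following Python sketch runs each check on a tiny in-memory table. The rows, the variable names, and the transform logic (keep adults, add an `is_senior` flag, drop `notes`) are all hypothetical; in practice you would run similar checks against your own output table, for example in a notebook.

```python
from collections import Counter
from statistics import mean

# Hypothetical rows from an upstream table.
source_rows = [
    {"id": 1, "age": 34, "state": "CA", "notes": "a"},
    {"id": 2, "age": 17, "state": "NY", "notes": "b"},
    {"id": 3, "age": 52, "state": "CA", "notes": "c"},
    {"id": 4, "age": 70, "state": "TX", "notes": "d"},
]

# Assumed transform: keep adults, add an is_senior flag, drop the notes variable.
output_rows = [
    {"id": r["id"], "age": r["age"], "state": r["state"], "is_senior": r["age"] >= 65}
    for r in source_rows
    if r["age"] >= 18
]

# Did my filtering criteria remove the rows I expected?
removed_ids = {r["id"] for r in source_rows} - {r["id"] for r in output_rows}

# Do my new variables contain the information I expect?
senior_flags_ok = all(r["is_senior"] == (r["age"] >= 65) for r in output_rows)

# Does the distribution of values in a given variable make sense?
state_counts = Counter(r["state"] for r in output_rows)
ages = [r["age"] for r in output_rows]
age_summary = {"min": min(ages), "max": max(ages), "mean": mean(ages)}

# Have I dropped unnecessary variables?
output_variables = set(output_rows[0])
```

Each check produces a small value (`removed_ids`, `state_counts`, and so on) that can be compared against what you expected before moving on to the next step.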

Transforms

Transforms are at the core of every project, allowing for comprehensive data merges and transformations. Learn more about building transforms in the Transform documentation.

Create a new transform by clicking the + button beneath any table. Transforms can only reference tables that are present in this project.

To copy a transform, right-click the transform and select Copy transform. This copies the transform, including all parameters specified in the detail view, and allows you to insert it somewhere else in your project tree to re-use querying logic. Note that tables cannot be copied alone; copying a transform node will copy the transform and its downstream table.

You can also insert a transform between two tables by right-clicking on another transform.

Notebooks

Notebook nodes allow you to work with data in a Jupyter notebook interface, taking advantage of the open-source community and scientific computing toolkit available in Python, R, Stata, or SAS. Learn more about using notebooks in the Notebooks documentation.

Create a notebook by clicking the + button beneath any table. Notebooks can only reference tables that are present in this project.

To copy a notebook, right-click it and select Copy notebook. You can paste the copied notebook by right-clicking on the background of the project's tree view to the left, and selecting Paste copied notebook.

Node layout

The project tree automatically creates a grid layout of all the nodes in your project, helping to keep it organized as your project grows.

Sometimes, you may wish to reorganize certain nodes in your project. To shift a dataset or notebook node, hover and click the arrow to the side of the node.

This will move the node to the right or left, and reorganize your tree according to the new horizontal order of datasets at the top of your project tree (or notebooks at the bottom of your project tree). Note that shifting nodes is purely an organizational tool; it has no effect on the data produced in the project.

Node states

Empty

Display: White background

A transform node will be white if it has never been run.

A notebook or table node will be white if it contains no data.

Executed

Display: Grey background

A transform will be grey when it has previously been run and has not since been edited or had anything change upstream.

A notebook or table node will be grey if it contains data and no upstream transforms have been edited (if there was an upstream change, everything downstream would be stale).

Invalid

Display: Black exclamation icon

A transform will be invalid when it is unable to be run. This might be because you haven't finished building the steps, or because something changed upstream which made its current configuration impossible to execute again.

Errored

Display: Red exclamation icon

A transform will be errored when you run it and the run can't be completed. This might be due to an incorrect input you've set that our validator can't catch, or something might have gone wrong while executing, in which case you'll just need to rerun it.

Edited

Display: Yellow background with diagonal hash lines

A transform will be edited when you revisit a successfully run transform and change a parameter. You can either Run this transform or Revert to its previously run state to resolve it. Editing a transform makes any downstream nodes stale.

Stale

Display: Yellow background

A transform, table, or notebook will be stale when an upstream change has been made. For tables and notebooks immediately downstream from an edited node, this means that the data contents might no longer be the results of the previous transform.

You'll need to re-run any edited upstream transforms to propagate new data into downstream tables and nodes, or revert an upstream edited node to return to the previously executed state.
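The relationship between the edited and stale states can be sketched as a walk over the project tree. This is only a conceptual model in Python, not how Redivis implements it; the node names, the `downstream` mapping, and the `mark_edited` helper are hypothetical.

```python
from collections import deque

# Hypothetical project tree: each node maps to the nodes directly downstream of it.
downstream = {
    "dataset_table": ["transform_a"],
    "transform_a": ["table_a"],
    "table_a": ["transform_b", "notebook_1"],
    "transform_b": ["table_b"],
    "table_b": [],
    "notebook_1": [],
}
state = {name: "executed" for name in downstream}

def mark_edited(node):
    """Mark a transform as edited and everything downstream of it as stale."""
    state[node] = "edited"
    queue = deque(downstream[node])
    while queue:
        current = queue.popleft()
        if state[current] == "stale":
            continue  # already reached through another branch
        state[current] = "stale"
        queue.extend(downstream[current])

mark_edited("transform_a")
```

After the call, `transform_a` is edited, every node below it is stale, and the upstream `dataset_table` is unaffected.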

Running and queued

Display: Double arrows rotating

Transforms have this icon when the node is currently being run (if the icon is spinning) or it is queued to run after upstream nodes have finished running (icon isn't moving).

You can cancel queued and running transforms individually, or cancel them all by clicking the Run menu in the top bar and selecting Cancel all. If a node is currently running, it might not be possible to cancel it, depending on what point in the process it has reached.

Incomplete access

Display: All black background, or dashed borders

For all nodes, this means that you don't have full access to the node. Click on these nodes and then the Incomplete access button to begin applying for access to the relevant datasets.

Sampled

Display: Black circle with 1% icon

For datasets this means that you are using a 1% sample of the data. When a dataset has a sample, it will automatically default to it when added to a project. You can change this to the full sample and back at any time in the dataset node.

Outdated version

Display: Purple background

For datasets this means that you are not using the latest version. This means that you have either intentionally switched to using an older version, or that this dataset's administrator has released a new version that you can upgrade to.

Working in bulk

At any point you might realize that you need to change a parameter of a query that will affect many downstream tables. This will make these tables stale, and you'll see their color turn to yellow on the map.

After finishing your updates you can run each transform individually to propagate changes or you can use the Run button in the top menu to run many nodes in sequence. This menu gives you the option to run all stale nodes, or all downstream or upstream nodes (from the node you have selected).
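Running many nodes in sequence implies that each transform runs only after the transforms upstream of it. One way to picture this ordering is a topological sort, sketched below in Python using the standard-library `graphlib` module (Python 3.9+). The transform names, the `upstream` mapping, and the `run_order` helper are hypothetical, and this is not a description of how Redivis schedules runs internally.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical project: each transform maps to the transforms it depends on.
upstream = {
    "transform_a": set(),
    "transform_b": {"transform_a"},
    "transform_c": {"transform_a"},
    "transform_d": {"transform_b", "transform_c"},
}
stale = {"transform_b", "transform_c", "transform_d"}

def run_order(selected):
    """Return the selected transforms ordered so upstream ones come first."""
    full_order = TopologicalSorter(upstream).static_order()
    return [name for name in full_order if name in selected]

order = run_order(stale)
```

Here `transform_d` always comes last because it depends on both `transform_b` and `transform_c`, while `transform_a` is skipped because it is not stale.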

Deleting nodes

To delete a node, right-click on a dataset or transform node and select Delete.

When deleting a transform, both the transform and its output table will be deleted, since every transform must have an output table to record its results. If the project tree has additional nodes downstream, the transform and output table will be 'spliced' out, i.e. the upstream node nearest the deleted transform will be connected to the downstream node nearest to the deleted output table. Note that this deletion will cause the next downstream transform to receive new input variables from the node that's directly upstream. (In the above example, deleting the selected transform will result in the 'Optum SES Inpatient Confinement' dataset being connected directly to the remaining transform, which will change the variables available to work with in that transform.)
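The splicing behavior can be pictured with a small Python sketch of a linear branch. The node names, the `next_node` mapping, and the `delete_transform` helper are hypothetical, and the sketch assumes each node has a single downstream node, which real project trees do not require.

```python
# Hypothetical linear branch, stored as node -> the single node downstream of it.
next_node = {
    "dataset_table": "transform_1",
    "transform_1": "output_1",
    "output_1": "transform_2",
    "transform_2": "output_2",
    "output_2": None,
}

def delete_transform(transform):
    """Remove a transform and its output table, reconnecting the neighbors."""
    output_table = next_node[transform]
    after_output = next_node[output_table]
    # Find the node directly upstream of the deleted transform.
    parent = next(n for n, child in next_node.items() if child == transform)
    # Splice: the parent now feeds whatever followed the deleted output table.
    next_node[parent] = after_output
    del next_node[transform]
    del next_node[output_table]

delete_transform("transform_1")
```

After the call, `dataset_table` feeds `transform_2` directly, which is why the variables available to that transform can change.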

When deleting a dataset or dataset table, the dataset and all downstream nodes will be deleted. If additional branches are joined into the branch downstream of the deleted dataset, those branches will be retained up to but not including the transform located in the deleted branch.

Since you can't undo a deletion, you'll receive a warning message before proceeding.

As you make changes in a project you will change the status of different nodes connected to it. These changes in status are shown in the left panel of the project to help you keep track of any changes.