Workflow concepts

Overview

Workflows are used to analyze any type of data on Redivis, at any scale. They allow you to organize your analysis into discrete steps, where you can easily validate your results and develop well-documented, reproducible analyses.

Workflows are owned by either a user or an organization, and can be shared with other users and organizations.

Creating a workflow

To create a workflow, navigate to a dataset that you're interested in working with and press the Analyze data in a workflow button. If you do not have data access to this dataset you may need to apply for access first.

You can also create a workflow from the "Workflows" tab of your workspace, or from the administrator panel of an organization (in this latter case, the workflow will be "owned" by the organization and its administrators).

Once you've created your workflow, you'll be able to add any dataset or workflow that you have access to as a data source.

The workflow page

The workflow page consists of a top title bar, a left panel, and a right panel.

The left panel displays the workflow tree, allowing you to visualize how data is moving through the workflow and its nodes.

The right panel shows the contents of the currently selected node in the tree. If no node is selected, this panel will display the workflow's documentation. You can click on the workflow title, or empty space in the workflow tree, to return to the workflow documentation at any time.

The title bar provides an entry point to common actions, broken into two sections: the left section contains actions that are global to the workflow, while the right section contains actions relevant to the currently selected node (e.g., running a transform).

If no node is selected, information about the workflow will be populated in the right panel:

Overview

The workflow overview contains various metadata, provenance information, and narrative about the workflow.

Abstract

The abstract is limited to 256 characters and will show up in previews and search results for the workflow. This should be a concise, high-level summary of the workflow.

Provenance

View and update information about this workflow's creators and contributors, citation, and any related identifiers detected by Redivis automatically or added by a workflow editor. This is also where you can issue a DOI for this workflow.

Provenance → Creators

This section defaults to displaying the owner of the workflow. Workflow editors can add or remove anyone from this list. Anyone included here will be added to the citation generated for this workflow.

Provenance → Contributors

This section automatically includes anyone who edited this workflow. Workflow editors can add or remove anyone from this list.

Provenance → Citation

This section shows the automatically generated citation for this workflow in your chosen format. This can be copied or downloaded for use elsewhere.

Changes made to the "Creators" field will be reflected in this citation. Any DOI issued for this workflow will automatically be included in this citation.

Provenance → Related identifiers

This section automatically includes any datasets or workflows referenced by this workflow, including data sources, study collaboration, or what this workflow was forked from. Workflow editors can add any related identifiers from outside of Redivis through links or DOIs, including DMPs, papers referenced, and more.

Provenance → Bibliography

You can launch a bibliography which displays the citation of this workflow and all of its related identifiers.

Methodology

Document the details of your research aim or data analysis strategy. You can also embed links or images.

Sharing

You can give other users access to view or edit your workflow, or transfer ownership to another user in the Sharing section. You can also set this workflow's visibility and discoverability. Anyone viewing your workflow will need to have gained data access to any restricted datasets to view the relevant node contents.

Study

You can add your workflow to a study in order to facilitate collaboration with others. For certain restricted datasets, your workflow will need to be part of an approved study in order to run queries.

Tags

You can add up to 25 tags to your workflow, which will help researchers discover and understand it.

Usage

You can see the date of the last workflow activity and, if it is a public workflow, how often the workflow was forked or viewed.

Other metadata

Additionally, information about the dataset's size and temporal range will be automatically computed from the metadata on its tables. Additional table documentation, as well as the variable metadata, will be indexed and surfaced as part of the dataset discovery process.

Data sources

A filterable list of all Data sources within the workflow. Clicking on an item will navigate to the corresponding data source in the workflow tree.

Tables

A filterable list of all Tables within the workflow. Clicking on an item will navigate to the corresponding table in the workflow tree.

Transforms

A filterable list of all Transforms within the workflow. Clicking on an item will navigate to the corresponding transform in the workflow tree.

Notebooks

A filterable list of all Notebooks within the workflow. Clicking on an item will navigate to the corresponding notebook in the workflow tree.

The workflow tree

The workflow "tree" is represented visually in the left pane of the workflow. This tree is made up of a collection of nodes, with each node having various inputs and outputs, such that the output (result) of one node can serve as the input of another.

Data in the tree flows from the top to bottom, and circular relationships are not allowed. Formally, this is known as a "Directed Acyclic Graph" (DAG).
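As a rough illustration of what "no circular relationships" means, the sketch below models a workflow as a mapping from each node to the nodes it reads from, then checks whether a top-to-bottom ordering exists. The node names and helper functions are hypothetical and are not part of any Redivis API; the sketch uses only Python's standard `graphlib` module.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each node maps to the set of nodes it reads from.
workflow = {
    "source_table": set(),
    "transform": {"source_table"},
    "output_table": {"transform"},
    "notebook": {"output_table"},
}


def top_to_bottom(graph):
    """Return the nodes in an order where every input precedes its consumers."""
    return list(TopologicalSorter(graph).static_order())


def is_valid_tree(graph):
    """Return True if the graph has no circular relationships (i.e. is a DAG)."""
    try:
        top_to_bottom(graph)
        return True
    except CycleError:
        return False


order = top_to_bottom(workflow)

# A circular relationship: each node would need the other's output first.
circular = {"transform_a": {"transform_b"}, "transform_b": {"transform_a"}}
```

Here `is_valid_tree(workflow)` holds, while `is_valid_tree(circular)` does not, since no ordering can satisfy both dependencies.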

Clicking on a node within the tree will display that node's contents within the right pane of the workflow, while highlighting the ancestors and descendants of that node on the tree.
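To make "ancestors" and "descendants" concrete, here is a minimal sketch that walks a small, hypothetical tree in both directions from a selected node. The edge dictionary and function names are illustrative assumptions only and do not reflect how Redivis stores a workflow.

```python
# Hypothetical edges: node -> nodes directly downstream of it.
downstream = {
    "data_source": ["table_a"],
    "table_a": ["transform_1"],
    "transform_1": ["table_b"],
    "table_b": ["notebook_1"],
    "notebook_1": [],
}


def reachable(start, edges):
    """Collect every node reachable from `start` by following `edges`."""
    seen = set()
    stack = list(edges.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return seen


def invert(edges):
    """Flip edge direction so downstream links become upstream links."""
    upstream = {node: [] for node in edges}
    for parent, children in edges.items():
        for child in children:
            upstream.setdefault(child, []).append(parent)
    return upstream


# Nodes that would be highlighted when "transform_1" is selected.
descendants = reachable("transform_1", downstream)
ancestors = reachable("transform_1", invert(downstream))
```

Descendants are found by following edges downward; ancestors are found by running the same walk over the inverted edges.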

You can right-click on any node for a list of other options, or if preferred, click on the node and then click the three-dot "More" menu at the top-right.

Workflow nodes

The workflow tree is made up of the following node types:

Data sources represent datasets or workflows that have been added to your workflow, and are the mechanism for bringing data into your workflow.

Tables are either tables associated with a data source, or the resulting output table of a transform or notebook.

Transforms are queries that are used to reshape and combine data, always creating a single table as an output.

Notebooks are flexible, interactive programming environments, which can optionally produce a table as an output.

Building a workflow

The main way to build your workflow is to add and edit nodes. You will start by adding data to your workflow, and then create a series of additional nodes that reshape and analyze the data.

Add data to a workflow

You can click the Add data button in the top left corner of the workflow toolbar to select a dataset or another workflow you want to add to this workflow. This will add a copy of the selected data source to the workflow and allow you to reference its tables.

Each data source can only be added to a workflow one time. By default, all datasets are added at their current version, but you can right-click on the dataset in this modal to select a different version to add.

Reshape and analyze data

All data cutting, reshaping, and analysis on Redivis happens in either a transform or a notebook. These nodes must be attached to a source table, so they can only be created after you've added a data source.

To create a transform or notebook, click on a table and select either the transform or notebook button that appears beneath it. If the table already has a downstream node you can press the plus icon beneath it instead.

Transforms vs. notebooks?

There are two mechanisms for working with data in workflows: transforms and notebooks. Understanding when to use each tool is key to taking full advantage of the capabilities of Redivis, particularly when working with big datasets.

Transforms are better for:

Reshaping + combining tabular and geospatial data
Working with large tables, especially at the many GB to TB scale
Preference for a no-code interface, or preference for programming in SQL
Declarative, easily documented data operations

Notebooks are better for:

Interactive exploration of any data type, including unstructured data files
Working with smaller tables (though working with bigger data is possible)
Preference for Python, R, Stata, or SAS
Interactive visualizations and figure generation

Copy and paste nodes

You can right-click on any transform or notebook in the workflow tree to copy it. Once you've copied a node, you can right-click on any table to paste the copied transform or notebook.

Insert nodes

If you would like to place a copied transform or notebook between other nodes, you can click on a transform or notebook and select Insert transform.

If you have a transform copied to the clipboard you can insert it between other nodes by right clicking on a transform or notebook and selecting Paste copied transform above. This will insert both the transform and its output table into the branch of the workflow you've selected.

Split and combine transforms

All transforms can be split at the step level into two different transforms by clicking Split in any step's menu. Additionally, two transforms can be combined into one by right clicking on a table to Remove it.

You might want to split a transform above a tricky step to see what the output table would look like at that point in the process. This can be a key tool in troubleshooting any issues and understanding what might be going wrong.

After splitting a transform to check an output table, the next logical step might be to combine these two transforms back into one again. Or perhaps you have a string of transforms which you no longer need the output tables for and want to reduce the size of your workflow.

Delete nodes

To delete a node, right click on the node and select Delete. Tables cannot be deleted directly, but are rather deleted when their parent node is deleted.

When deleting a transform or notebook:

The transform or notebook and its output table will be deleted.
If the workflow tree has additional nodes downstream, the transform or notebook and its output table will be 'spliced' out, i.e. the upstream node nearest the deleted transform will be connected to the downstream node nearest to the deleted output table.
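The splicing behavior can be pictured with a small sketch. It assumes a single linear branch in which each transform's output table sits directly below it; the list representation and the `delete_transform` helper are hypothetical, chosen only to illustrate the reconnection.

```python
# Hypothetical linear branch, listed from top (upstream) to bottom (downstream).
branch = ["table_a", "transform_1", "table_b", "transform_2", "table_c"]


def delete_transform(nodes, transform):
    """Remove a transform and its output table, reconnecting the neighbours."""
    i = nodes.index(transform)
    # Assumption: the output table is the node directly below its transform.
    return nodes[:i] + nodes[i + 2:]


spliced = delete_transform(branch, "transform_1")
```

After the deletion, `table_a` (the upstream node nearest the deleted transform) feeds directly into `transform_2` (the downstream node nearest the deleted output table).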

When deleting a data source:

The data source and all directly downstream nodes will be deleted. If additional branches are joined into the branch downstream of the deleted dataset, those branches will be retained up to but not including the transform located in the deleted branch.

Since you can't undo a deletion, you'll receive a warning message before proceeding.

Node states

As you build out a workflow, node colors and symbols will change on the tree to help you keep track of your work progress.

Detailed information about each of these states can be found in the documentation for each node, though some common states are outlined here.

Stale nodes

Stale nodes are indicated with a yellow background. If a node is stale, it means that its upstream content has changed since the node was last run, and likely that the node should be re-run to reflect these upstream changes.
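One way to think about staleness is as a comparison between the current state of the upstream content and the state a node saw when it last ran. The sketch below is a simplified model under that assumption; the `Node` class and its version counters are hypothetical and are not how Redivis tracks state internally.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    content_version: int = 0        # bumped whenever this node's content changes
    seen_upstream_version: int = 0  # upstream version observed at the last run


def is_stale(node, upstream):
    """A node is stale if its upstream changed after the node last ran."""
    return upstream.content_version > node.seen_upstream_version


def run(node, upstream):
    """Re-run the node: record what it saw and produce new content."""
    node.seen_upstream_version = upstream.content_version
    node.content_version += 1


table = Node("source_table")
transform = Node("transform")

table.content_version += 1               # upstream content changes
stale_before = is_stale(transform, table)
run(transform, table)                    # re-running clears the stale state
stale_after = is_stale(transform, table)
```

In this model the transform is stale after the table changes and is no longer stale once it has been re-run.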

Edited nodes

If a node has been edited since it was last run, it will be indicated with hashed vertical lines.

Tree-level actions

Run all

You can select the Map button on the left side of the workflow toolbar to begin a run of all stale nodes in the workflow. This will execute all transform and notebook nodes in a logical sequence to update the workflow completely.
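The idea of a "logical sequence" can be sketched as follows: visit nodes so that each one comes after everything it depends on, and execute only those that are stale. The dependency map, the set of stale nodes, and the `run_all` helper are hypothetical; this is an illustration of the ordering principle, not the actual scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical runnable nodes, mapped to the runnable nodes they depend on.
depends_on = {
    "transform_1": set(),
    "transform_2": {"transform_1"},
    "notebook_1": {"transform_2"},
    "transform_3": set(),
}
stale = {"transform_2", "notebook_1"}


def run_all(depends_on, stale, execute):
    """Execute stale nodes so that each runs after the nodes it depends on."""
    executed = []
    for node in TopologicalSorter(depends_on).static_order():
        if node in stale:
            execute(node)
            executed.append(node)
    return executed


# `execute` stands in for whatever actually runs a transform or notebook.
ran = run_all(depends_on, stale, execute=lambda name: None)
```

Because `notebook_1` depends on `transform_2`, the transform is always executed first; nodes that are not stale are skipped.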

Shift nodes

To shift a node, select the node and click the arrow that appears next to most nodes when selected. Shifting nodes is purely an organizational tool and it has no effect on the data produced in the workflow.

Navigate nodes

Along with clicking a node to select it, all nodes have a Source and Output navigation button on the right side of the workflow toolbar. You can click this button to jump directly to the immediate upstream or downstream node.

Saving

Workflows are continuously saved as you work, and any analyses will continue to run in the background if you close your browser window. You can always navigate back to this workflow later from the "Workflows" tab of your workspace.

A complete version history is available for all transforms and notebooks in your workflow, allowing you to access historic code and revert your work back to a previous point in time.

Workflow ownership and sharing

All workflows are owned by either a user or an organization, and can then be shared with other users and organizations. When a workflow is owned by or shared with an organization, all administrators of that organization will have corresponding access.

Workflows can also be associated with a study, which may be necessary if access to certain datasets in the workflow was granted to that study. In this case, you can specify a level of access to the workflow for other collaborators on the study.

Workflow collaborators will still need access to the underlying data in a workflow to view node contents. For more information, see Workflow access & sharing.

Forking workflows

You have a couple of options for how this workflow can be reused in other analyses. Click the Fork button in the toolbar to get started.

Add to another workflow
- Select this option to choose a workflow you'd like to add this workflow to, as a data source. This will be a linked copy that will update as the original workflow is updated. This can be a very powerful tool in building out complex workflows that all reference the same source analysis or cohort.
Clone this workflow
- This will create a duplicate copy of the workflow, with a link back to the original workflow encoded in its provenance information.

Collaboration

Workflows are made to be a space for collaborative analysis. You can easily share access to your workflow with colleagues.

Comments

Any node in a workflow can be annotated with a comment by workflow collaborators. Comments are intended to be a space for conversation grounded in a specific area of a workflow. They can be replied to in threads by multiple collaborators and resolved when the conversation is complete.

Simultaneous editors

Multiple users with edit access can be working on a workflow at the same time. When this is the case, you will see their picture in the top menu bar alongside your own and a colored dot on the workflow tree to the right of the node they currently have selected. When a notebook is started, you will see any collaborators' code edits in real time.

Workflow DOIs

Any workflow editor can issue a DOI (Digital Object Identifier) for a workflow. A DOI is a persistent identifier that can be used to reference this workflow in citations. DOIs are issued through DataCite and do not require any configuration with your own or your organization's DataCite account.

Open the Provenance section and click Issue DOI. Once created, you will be able to see the DOI and view the record on DataCite.

Draft status

When DOIs are issued they enter a "Draft" status where the identifier is assigned but it has not been permanently created. All DOIs issued for workflows will remain in this draft status for seven days to allow for removal of the DOI.

You can start referencing the DOI immediately while it is still in draft status, since the final DOI will not change once it becomes permanent. After the seven-day draft period, the DOI will automatically become permanent if your workflow is set to be publicly visible.
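The draft rules can be summarized in a short sketch. The function name and status strings are hypothetical. Note one assumption in particular: this page does not say what status a DOI has after seven days if the workflow is not publicly visible, so the sketch simply keeps it as a draft.

```python
from datetime import datetime, timedelta

DRAFT_PERIOD = timedelta(days=7)


def doi_status(issued_at, now, publicly_visible):
    """Return "draft" or "permanent" under the rules described above."""
    if now - issued_at < DRAFT_PERIOD:
        return "draft"
    # Assumption: a DOI for a non-public workflow stays in draft;
    # the documentation only states that public workflows become permanent.
    return "permanent" if publicly_visible else "draft"


issued = datetime(2024, 1, 1)
during_draft = doi_status(issued, datetime(2024, 1, 3), publicly_visible=True)
after_draft = doi_status(issued, datetime(2024, 1, 9), publicly_visible=True)
```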

Since DOIs are intended for public reference, they will not be issued for workflows that remain fully private.

Note that granting public access to a workflow does not grant access to any restricted data it contains. Any individual viewing the workflow will need to also gain data access to see workflow nodes that reference restricted data.

Reproducibility and change management

Every time a transform or notebook is run in your workflow, a snapshot of the code in that node is permanently saved. On any transform or notebook, you will see a "History" button that will bring up all of the previous executions of that node, with the ability to view its historic contents and revert to a previous version of the code. This historic code will also be associated with the corresponding log entry in your workspace.

While the tables within a workflow should be considered "live" in that their data can regularly change as upstream nodes are modified, the ability to permanently persist code (alongside the built in version-control for datasets) ensures that any historic output can be reproduced by simply re-running the historic code that produced a given output.
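As a conceptual model of this history, the sketch below saves a snapshot of a node's code on every run and allows reverting to any earlier snapshot. The `NodeHistory` class is hypothetical and only illustrates the idea; it is not the Redivis implementation.

```python
from dataclasses import dataclass, field


@dataclass
class NodeHistory:
    code: str = ""
    snapshots: list = field(default_factory=list)

    def run(self):
        """Save a snapshot of the code as it was at execution time."""
        self.snapshots.append(self.code)

    def revert(self, index):
        """Restore the code from an earlier run; snapshots are left untouched."""
        self.code = self.snapshots[index]


node = NodeHistory(code="SELECT * FROM visits")
node.run()
node.code = "SELECT id FROM visits WHERE year = 2020"
node.run()
node.revert(0)  # go back to the code from the first run
```

Because snapshots are never overwritten, reverting does not lose the later version: it remains in the history and can be restored in the same way.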

Last updated 3 months ago
