Redivis Documentation

Redivis for open science

Redivis is a platform for data distribution and analysis that prioritizes open access and FAIR data practices. We believe that supporting open science principles drives better research outcomes, collaboration, and scientific impact.

Redivis is creating tools not just for data distribution and analysis, but also to help drive forward standards for transparent research that is accessible to all.

Reproducibility

Every Redivis dataset has a versioned history that is automatically generated as the data owner uploads and edits data contents. This history is fully exposed to end users, and the dataset can be accessed at any version. Citations are generated for every version of a dataset and include the ORCID iDs of all creators. Data owners can opt to create a Digital Object Identifier (DOI) for each version of their dataset, which is available at a persistent URL along with the required metadata describing the content at that point in time.

As users analyze data in a Redivis workflow, each step they take is recorded in a visual tree that can be navigated sequentially. Every step has a full history which can be viewed at any point in time.

Availability

Datasets in any format can be deposited into Redivis and made available immediately to the world via open access licenses at no cost to the end user.

All Redivis datasets and analysis workflows are available to anyone on the internet without a Redivis account unless the owner decides to add restrictions. In this case someone wishing to view the work would need to create a free Redivis account to request access from the owner.

Longevity

The stable technical infrastructure and dedicated Redivis team are funded by a coalition of academic institutions paying license fees for large-scale data distribution. This diverse group of funders ensures the long-term availability of the datasets.

Example workflows

We've started building a library of common workflow types and analytic tasks that might be useful when you're getting started working with data.

Please let us know if there are other actions or concepts you would like us to provide examples for by reaching out or emailing [email protected]

Data retention policy

Overview

Retaining data is vital to establishing long-lasting data linkages and making research available. Redivis' data retention policy is aligned with the goal of keeping all data available and accessible whenever possible.

Owner-initiated deletion

Data owners always have the ability to remove their data from Redivis or make it inaccessible if they choose to do so.

Redivis also has policies in place in case of an accidental deletion:

  • All data is fully backed up, with point-in-time recovery, over a 7-day rolling window. Deletions and modifications to data are permanent after 7 days.

  • All metadata is backed up with point-in-time recovery over a 7-day rolling window, with daily backups of metadata extending to 1 year.

Redivis-initiated deletion

Data storage that exceeds the free tier must be paid for by the data owner or other sponsor. Upon payment failure, every effort will be made to contact the data owner and re-establish payment.

If unsuccessful, data may be destroyed after 30 days of non-payment, though all metadata and permanent landing pages will be persisted.

Example tasks

We've started building a library of common actions that you might take when administering an organization.

Please let us know if there are other actions or concepts you would like us to provide examples for by reaching out or emailing [email protected]

Introduction

What is Redivis?

Redivis is an online data platform for research.

Redivis offers data distributors – from research centers and institutions to labs and individual investigators – the tools to securely host and distribute data in alignment with FAIR data practices.

Redivis provides researchers the means to easily discover, access, and analyze data in a collaborative and reproducible manner.

Whether you are working with terabytes of high-risk data or lightweight public data files, Redivis endeavors to make data-driven research accessible to all.

Interested in bringing your organization on to Redivis? Contact us!

Who uses Redivis?

Research institutions provide a unified discovery and administrative layer for their organizations' data across Redivis.

Research organizations upload, catalogue, and grant access to datasets.

Researchers upload their own datasets, discover existing data, and execute flexible, high-performance analytic workflows in a collaborative and reproducible manner.

Redivis makes it easy for everyone to:

  • Discover datasets

  • Apply for access

  • Query large datasets

  • Analyze data

  • Collaborate

  • Host, brand, and distribute data within an organization

  • Version & document datasets

  • Integrate with existing data pipelines

Using this documentation

This documentation is broken up into Guides, Examples, and Reference sections.

We recommend that new users begin by following the relevant guides and examples, while the reference section can serve as a comprehensive overview of everything that Redivis can do.

Certain users may want to dive into the more technical API documentation for interfacing with Redivis via R, Python, JavaScript, and the HTTP REST API.

Variable selection

Overview

In the bottom pane of the transform you will need to select the variables you'd like to keep in your output table. One or more variables must be kept in order to run the transform.

Usage

Click on a variable and use the center buttons to move it between the two lists. You can use command-/control- and shift-click to select multiple variables at once.

For larger variable lists, clicking the search icon will allow you to quickly filter the available variable options down based on search terms.

Distinct

You can select the Distinct box to drop any records that are an exact copy of another record across values in all variables in your output table.

This can be useful in situations where you have created variables using analytic methods, or where your data's format produces duplicate records. Otherwise, it might slow down your transform run time unnecessarily.

Source indicators

Hovering on any variable will show a tooltip with information about where the variable came from, its type, and any assigned label.

To the left of each variable in the list you will also see an icon showing the variable's source. For variables originating in the source table, the icon will show t0; for variables from subsequently joined tables it will show t1, t2, etc. For variables created in this transform it will show v.

Selection options

Some steps will add or remove variables from the list available to keep or discard. Join steps (and occasionally Merge steps) will bring in variables from other tables. Create variables steps will make new variables. Aggregate steps will remove all variables that are not included in the collapse, but may add newly created aggregate variables.

Variable creation methods

When using a create variables step (or when creating aggregate variables in an aggregation or pivot step) you will need to select a method to specify how the variable will be created.

Every method is documented here with what information the Redivis interface needs, as well as a link to the underlying BigQuery documentation for more details.

Citations

Citing datasets

It's important to make sure that you fully cite any datasets used in your work. Each dataset page includes citation information in the provenance section in multiple format options.

For example, the Global Historical Climatology Network dataset hosted by the Stanford Center for Population Health Sciences would be cited in APA format as:

Stanford Center for Population Health Sciences. (2022). GHCN Daily Weather Data (Version 1.2) [Data set]. Redivis. https://doi.org/10.57761/h9ff-vy04

Citing workflows

Workflows across Redivis also include citation information in the provenance section in multiple format options.

Bibliography

Datasets and workflows on Redivis can be supplemented with related identifiers that situate a resource in its broader context. You can view the bibliography for a dataset or workflow from the corresponding provenance section.

Upload unstructured data as files

Overview

Any type of file can be uploaded to a Redivis dataset, and we support previews for the most common file types. These data can be analyzed in notebooks within workflows.

Note that if you have tabular data we strongly recommend uploading it as a table (rather than a file) so you and your researchers can take advantage of our extensive toolkits for previewing and manipulating tabular data.

This guide assumes you have already started by Creating a dataset.

1. Locate the data you want to upload

You can upload data directly from your computer, or import from a linked account.

If importing, you'll want to get the relevant external account configured to your Redivis account before getting started.

The import tools allow for multiple uploads, so it's helpful to have them all in the same place.

2. Upload files

When you're ready, click the Upload files button on the Files tab of the dataset page and select the location your files will be coming from.

If you select "Computer" then you will get a browser window where you can choose the file, files, or folder you want to upload from. Choosing a different location will bring up the corresponding mechanism to select your files.

Once selected, you will need to choose a folder to put them in. All files on a dataset live within a folder, and each folder has a corresponding index table to help researchers understand and navigate those files. You can change this later if needed. If this is your first time uploading files to this dataset, you will need to create your first folder.

Click the "Upload" button to start the process. If you are uploading from your computer you'll need to wait on this screen until it's complete. Closing your window will end the upload process. If uploading from a linked source, you can close this window to allow this process to continue in the background.

3. Manage files and folders

You can click on the file name in the Files tab to preview the file and see its information.

You can also go to the Tables tab to view the index table for the folder you've uploaded files to. View the cells and hover or click on the file ID variable to see the preview here. This table can be used in Transforms within a Workflow to work with files on a large scale.

If you want to rename a folder you can do so on the Files tab by right clicking on the folder name in the right bar.

You can also create new folders and move files between two existing folders from this bar. When moving between folders, you can use a conditional statement to only move some files that match your conditions.

Next steps

Continue uploading your dataset

Great metadata makes your dataset usable. Complete your metadata, along with configuring access, creating a sample, and releasing this version.

Learn more in the Create & manage datasets guide.

Studies

Overview

A study is a group of users working together with a common goal. Studies allow users to apply as a group for access to datasets that have study requirements. They also make sharing workflows and user-created datasets easier.

Studies

You can create a study by clicking the + New button on the studies tab of your workspace. Any study you are a part of will appear in your workspace. Any edits you make will appear for all study collaborators.

Edit

You can add collaborators to your study. Collaborators can edit the study and will by default have edit access to all workflows created in this study.

When you create a study, you are assigned as the study's PI in the collaborators list. Organization administrators use this field to better track studies applying for their data.

If you add more users to a study you can change the PI assignment. Hover over a user's name and click Make PI to transfer the role. Note that once you change the PI you will no longer have the ability to change the PI assignment.

You can also give a study a description, which will help administrators and other study members better understand its goals.

Workflow

Any workflow created in this study will share edit access with all other collaborators in this study by default.

Datasets

The list of datasets will be automatically populated as you use datasets in workflows in this study. You can also add datasets to it directly in order to make the access process easier later on.

Compute credits and billing

Overview

Compute credits can be used to provision advanced compute environments for notebooks.

The default notebook configuration is a free resource on Redivis, and has access to 2 CPUs and 32GB working memory, alongside a 60GB SSD disk and gigabit networking. This is similar to most personal computers, and for many analyses should be plenty of compute power!

If you are working with larger tables, creating an ML model, or need to use a GPU you can choose to configure more advanced compute resources when setting up your notebook.

Advanced compute environments will cost a number of compute credits per hour to run depending on the machine type. This amount will be clear when configuring and starting the notebook. In order to start a notebook with an advanced compute configuration you must have enough compute credits to cover at least 15 minutes of running time.

Purchase compute credits

Credits cost $0.10 each and can be purchased in increments of 100 ($10), 200 ($20), 500 ($50), or 1000 ($100). Click the Purchase credits button and choose the amount you'd like to purchase.

You can purchase credits immediately using a credit card or bank account, or through an invoice.

Credit card / Bank

When you click Checkout you will be redirected to a Stripe payment processing page. The card you enter there is processed by Stripe and is never seen by Redivis or stored in our databases.

Invoice

You can choose to generate an invoice for your credit purchase. Any information you enter in the custom fields section will appear on the invoice and might be required by the organization paying it. Once generated, the invoice will be downloaded as a PDF to your computer. You can also return to view the invoice and copy a link to pay it electronically. Once the invoice is paid, the credits will appear in your account.

Auto-purchase compute credits

You can set up your account to purchase credits every time your account dips below 10 compute credits. This ensures that your notebooks will never halt mid-session due to a lack of compute credits. Here, you can select the amount of credits you'd like to purchase every time this condition is met and the card you would like to be charged for it. If you'd like to change the amount or cancel auto-purchase you can do so by returning to this screen.

Refund compute credits

Credits on Redivis never expire and can be refunded at any time. If you would like to refund the compute credits on your account please contact us to initiate the refund process.

Billing dashboard

Redivis uses Stripe for all payment processing. If you'd like to view or edit your card on file or view previous purchases you can do so in the Stripe billing dashboard, which is linked from this page.

Usage limits

User accounts are currently limited to 10GB of storage. You can see your current usage on the Settings page of your Workspace, under the Compute Credits and Billing tab.

If you'd like to increase that limit and host larger datasets, Contact us about creating an Organization!

Tables

Overview

All datasets that contain data will have at least one table (and up to 1,000), displayed on the Tables tab of the dataset page. Tables are the "container" for all data on Redivis, including for geospatial and unstructured data types.

Tables are created by data editors when they upload data, and users' access to these tables will be governed by the dataset's access configuration.

Tables can be explored from the dataset page, including the ability to view variables and summary statistics, cell contents, and run one-off queries against the table.

These tables are then further utilized from within a workflow.

See here for full documentation concerning tables and their functionality.

Creating tables

New tables can be created on the Tables tab of the dataset editor page by clicking on + New table. Tables can only be created on the unreleased next version of the dataset – if the dataset doesn't have a next version, you'll need to create the next version first.
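
Tables can also be created programmatically. The snippet below is a minimal sketch that reuses the client library calls shown in the Programmatic uploads examples elsewhere in this documentation; the username, dataset name, and table name are placeholders, and the dataset must already have an unreleased next version.

import redivis

# Placeholders: substitute your own username and dataset/table names
dataset = redivis.user("your_username").dataset("Dataset name")

# Creates the table on the dataset's unreleased next version
table = dataset.table("New table name").create(description="Some description")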

Updating tables

Table metadata, such as the table name, description, and entity, can be edited at any time from the dataset editor page, even on released versions. The table's data can only be updated on an unreleased version; see the uploading data documentation for more information.

Deleting tables

Tables can only be deleted on the unreleased next version of the dataset – if the dataset doesn't have a next version, you'll need to create one first. Note that deleting a table only deletes it on the version you're editing. It will still exist on any historic versions of the dataset.

FAIR data practices

Overview

Redivis is built on FAIR data principles to support data discovery and reusability at all levels. Data practices adhering to these principles emphasize data that is:

Findable

Data is findable when (1) data and metadata are assigned a globally unique and persistent identifier, (2) data are described with rich metadata, (3) metadata clearly and explicitly include the identifier of the data they describe, and (4) metadata and data are registered or indexed in a searchable resource.

At Redivis:

  • DOIs can be issued for datasets, facilitating authoritative citation and attribution.

  • All datasets have a provenance section of documentation which auto-populates with administrator actions and allows for additional linking to other artifacts that were part of the dataset creation process.

  • Redivis has comprehensive search tools that index all aspects of a dataset, including the metadata, documentation, variable names, and variable documentation.

Accessible

Data is accessible when: (1) data and metadata are retrievable by their identifier using a standardized communications protocol, (2) the protocol is open, free, and universally implementable, (3) the protocol allows for an authentication and authorization procedure, where necessary, (4) metadata are accessible, even when the data are no longer available.

At Redivis:

  • Data can be explored through any web browser.

  • Public data and analyses can be explored without an account.

  • Researcher accounts to apply for data access or do analyses are always free.

  • Researcher accounts can be linked to institutional login credentials.

Interoperable

Data is interoperable when: (1) data and metadata use a formal, accessible, shared, and broadly applicable language for knowledge representation, (2) data and metadata use vocabularies that follow FAIR principles, (3) data and metadata include qualified references to other data and metadata.

At Redivis:

  • Robust APIs support interoperability with other tools.

  • Analysis tools use common languages such as SQL, Python, R, Stata, and SAS.

  • Data and metadata are available for download in multiple common formats.

Reusable

Data is reusable when: (1) data and metadata are richly described with a plurality of accurate and relevant attributes, (2) data and metadata are released with a clear and accessible data usage license, (3) data and metadata are associated with detailed provenance, and (4) data and metadata meet domain-relevant community standards

At Redivis:

  • Datasets are automatically version controlled.

  • Many dataset, table, and variable metadata fields are automatically populated based on user action, with the ability to be manually adjusted.

  • All datasets have a Usage tab with information on how they have been viewed and used across Redivis.

  • Redivis cloud-based analysis tools encourage data users to do analysis alongside the data rather than downloading it and breaking linkages.

  • Workflows containing analyses are self-documenting and capture the full analysis pipeline.

  • Anyone can fork a workflow they have access to in order to continue an analysis.

Create a study

Overview

Studies are a way for one or more collaborators to work with data on the same conceptual topic. You can add multiple collaborators, datasets, and workflows to a study to organize your workflow. In addition, some restricted data requires you to submit access applications as a study via study requirements.

1. Create a new study

On the studies tab of your workspace you can create a new study and see all studies you are a part of. This study might represent a topic or group you are working with, maybe to investigate a similar cluster of research questions. You can give the study a name and description to reflect that purpose.

2. Add collaborators

If you are working with anyone else on this study you can add them as a collaborator. Anyone who is a collaborator on this study will be able to view and edit the study just as you can. Once added, this study will appear in their workspace.

PI

One person on the study can be designated as the PI. This designation does not have meaning within Redivis interfaces; it is there to inform others in the group, or the administrators you are applying to for access, of the group structure.

3. Add datasets

You can add any dataset to this study that you plan on working with, and remove it later if your topic shifts. This space is intended for you to gather resources that you are using in one place to make it easier to create workflows or manage data access. Datasets can be added to any number of studies.

4. Add workflows

You can create a new workflow in this study or move an existing one in. A workflow can be in at most one study, and you can see which study a workflow is in on the workflow's overview page.

Granting workflow access to study members

One of the options for sharing this workflow is to grant view or edit access to all members of the study it's in. This can be an easy way to give everyone in a group access to a workflow you are working on.

As with all workflows, even though someone might have access to the workflow they will need to have independent access to all the datasets it contains in order to view or query them.

5. Apply for data access via a study

If you're working with restricted data you might come across a dataset that has a study requirement as part of its access requirements. You can identify this by the study icon next to the requirement name.

Study requirements are filled out and submitted on behalf of an entire study, rather than one person. An approved study requirement will be valid for all collaborators on the study, unlike member requirements where each individual needs to complete the requirement on their own account.

To submit a study requirement you can navigate to a restricted dataset and open the access modal. You will need to select your study from the dropdown menu, and a submit button will appear. One member of your study will fill this out and submit it on behalf of your group. Once approved, all study collaborators will see an approved requirement in their access modal.

Data usage

When restricted data has a study requirement, the data administrator can approve your application submission for use of the data within that study. In order to query data in a workflow, that workflow will need to be in the approved study. If it is not, you will see a badge noting Limited access and you will not be allowed to run transforms or query this data in a notebook.

Next steps

Start analyzing your data

Once you've gained access and set up a study it's time to add your datasets to a workflow to transform and analyze them leveraging lightning fast tools from your browser.

Learn more in the Analyze data in a workflow guide.

Export & publish your work

Overview

Redivis combines powerful functionality to reshape and analyze data on platform, with an easy export and publishing flow, to ensure the results of your work can be displayed in the format of your choice and shared with your collaborators and research community.

First working out the broader picture of the content you'd like to publish, and where you'd like to publish it, will help you then determine the desired shape and format of the assets to be exported, and finally the specific sources of the data in your Redivis workspace.

1. Develop a publishing strategy

Ask yourself: What story are you trying to tell? Who/where is the audience? What component pieces are crucial to building the narrative? Your ideal package of assets may be very different if you're hoping to share progress on a workflow with a principal investigator, take a snapshot of a single variable distribution for a colleague, publish results in a journal, or build a custom dashboard to highlight multiple trends.

You may publish a combination of tables, unstructured files, code snippets, graphs, and descriptive text, so sketching out these component pieces in increasing detail will help define your end product.

2. Choose your desired formats

With an initial strategy in mind, learning about the different types of Redivis exports will help define your list of assets to generate.

Tabular data containing rows and columns can be exported in a variety of formats, accessed from many environments via our client libraries, or embedded in a website.

Unstructured data files of any type can be previewed and downloaded in their original format, or accessed programmatically.

Notebooks containing code inputs and corresponding outputs can be exported as an .ipynb file, PDF, or HTML, or (coming soon!) embedded in your site.

Learn more in the Export to other environments guide.

3. Identify the data sources

In a workflow, you can generate output tables to capture the result of a set of data transformations, or use notebooks to show a line-by-line data analysis flow and relevant figures.

As a dataset creator, you can upload your own tabular data and unstructured files to create assets to use in your workflows or share with others.

Whether you're manipulating data in a workflow to showcase results or hosting your own dataset, you'll want to build a set of specific tables or notebooks you're trying to share. As you iteratively modify the shape and content of these assets – transforming data in a workflow to output new tables or using notebooks to build analysis flows in Python or R – you'll fine-tune each piece of your final publication.

Next steps

Upload your own datasets

Augment your data analysis in Redivis by uploading your own datasets, with the option to share with your collaborators (or even the broader research community).

Learn more in the Create & manage datasets guide.

Build a custom dashboard

For full customizability, you can publish a static site that accesses Redivis data to power an interactive visual dashboard – a more permanent, web-based approach to highlighting results of your data analysis or generated data.

Learn more in the Build your own site guide.

Video guides

For more extensive full-workflow video walkthroughs, see our Events and press page, or read through our Example workflows.

Datasets

Overview

Datasets are a core component of Redivis. Datasets contain various metadata and documentation, as well as one or more tables and/or files containing data.

All datasets have their own persistent URL and are uploaded by either a user or an organization. Datasets can be added to workflows to analyze and combine with other datasets across Redivis. Datasets are automatically versioned, and you can always update the version you are viewing or working with.

Some components of a dataset may not be available to you until you are granted access. In order to see the existence of a dataset, you must at least have overview access.

You can create your own datasets to use in your workflows and share with colleagues, or create datasets within any organization that you administer.

New to Redivis? Learn more in our Create & manage datasets guide.

Managing logins

Overview

In order to create a Redivis account, you must authenticate through your academic institution, through any Google account, or via passwordless login. Over 2,500 academic institutions from around the world are currently supported, with more being added regularly.

Redivis will request some basic information when you authenticate, such as your name, email, and any relevant affiliations; it will not have access to any other information through your Google or institutional account.

Adding emails / authentications to your account

It is strongly encouraged that each individual have only one Redivis account, and that your identity on Redivis map to your real-world identity. However, it is common to have multiple emails that you want to associate with your Redivis account: a personal email, a university email, a visiting-scholar university email.

In order to add an email to your account, navigate to the settings page of your workspace and click Add authentication in the "Authentications" tab. A new browser window will open requesting the relevant credentials, after which the new email will be associated with your account.

If you lose access to all of the emails associated with your account, you will no longer be able to log in to Redivis.

For this reason, it is strongly encouraged that you add a personal email to your account. This way, if you ever lose access to your institutional email, you will still be able to log in to your Redivis account (though you may not have access to certain datasets that required your previous institutional affiliation - see below).

Logins and organization membership

When you join an organization, your membership in that organization will be associated with a particular login. In order to work with that organization's restricted data, you must have recently authenticated with the login associated with your membership. If you sign in to Redivis using a different login (for example, you signed in with your personal email while your membership is associated with your academic institution's credentials), you'll be prompted to re-authenticate with the relevant login before you can work with restricted data.

If you lose access to the login originally associated with your membership — for example, if you change institutions — you can request that your membership be updated to associate it with a new login. An administrator of the organization will then need to approve this request before access through your membership will be restored.

Merging accounts

If you have more than one Redivis account, we strongly encourage you to merge them. In order to merge multiple accounts, you should perform the following steps:

  1. Identify which account should be your "primary" Redivis account once the process is complete.

  2. Log in to your non-primary account.

  3. Transfer ownership of all datasets, workflows, and studies to your primary account. Alternatively, you can delete any datasets, workflows, and/or studies that are no longer relevant.

  4. Delete the non-primary account by clicking "Delete account" within your workspace settings. WARNING: This action cannot be undone.

  5. Log in to your primary account, and add the authentication(s) that were previously associated with your non-primary account to your primary account.

Exporting content

Overview

Datasets provide a persistent, version-controlled store of data and their metadata. Datasets can be queried and analyzed within workflows, and in most cases it will make sense to first perform initial analyses on Redivis before downloading data (particularly for larger datasets).

However, in certain situations you may want to download some or all of the dataset contents, or its metadata, for future reference and analysis on external systems.

Download metadata

Top-level metadata about the dataset can be downloaded by clicking on the Download metadata link in the dataset overview. Metadata can be downloaded in the DataCite, Schema.org, and Redivis API schema specifications.

Download citations

From the dataset overview, click on the Bibliography button to view a full bibliography of your dataset and its data sources. You can copy or download this citation information in APA, CFF, or BibTex formats.

Download and Export data

Any table within a dataset can be downloaded or exported to a supported environment, or read into another environment through the Redivis API, pursuant to any export restrictions applied by the dataset's administrators. See the full documentation on exporting tables to learn more.

Extract content via the API

The REST API (and its Python and R wrappers) provides numerous methods for interfacing with a dataset and its contents, and in many cases will be the most flexible mechanism to extract dataset metadata and data to external systems.
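
As one example, reading a table into a pandas DataFrame with the redivis-python library might look like the sketch below. The owner, dataset, and table names are placeholders, and the to_pandas_dataframe method name is an assumption based on the redivis-python documentation – confirm the exact signature there. Any reads remain subject to export restrictions applied by the dataset's administrators.

import redivis

# Placeholders: substitute the dataset owner, dataset name, and table name
table = (
    redivis
    .user("dataset_owner")
    .dataset("Dataset name")
    .table("Table name")
)

# Assumed method per the redivis-python client library documentation
df = table.to_pandas_dataframe()
print(df.head())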

Your account

Overview

Your Redivis account establishes your identity on Redivis, and provides a home for various forms of data-driven investigation and collaboration.

Because researchers regularly move between institutions, and often want to be able to reference and access historic work, it is strongly encouraged that you:

  • Only have a single Redivis account, much as you would only have a single Google Scholar account.

  • Add multiple logins (e.g., university credentials, personal Google account) to your Redivis account, to ensure that you can always access it.

Communication and privacy

Redivis will never, ever distribute your personal information to a third party. We will send data access notifications to your contact email, though this can be disabled in your workspace settings. You may opt in to receive occasional product updates, though this is turned off by default.

When you apply for access to an organization's datasets, they will be able to see some information pertinent to your membership in their organization:

  • Full name

  • Contact email

  • Email / institutional identifier

  • Affiliation, if provided by your institution's login provider

Creating an account

Overview

To create a new account, click the Create account button at the top of any page. You can create an account using your academic institution's login credentials (e.g., [email protected]), by logging in with any Google account, or via passwordless authentication through any email. Your account can then be managed from your workspace.

It is strongly encouraged that each individual has only one Redivis account, and your identity on Redivis should map to your real-world identity. You can (and should) tie multiple emails to one account — e.g., personal email, university email, visiting-scholar university email.

By creating an account on Redivis, you can gain access to hundreds of restricted datasets and powerful querying tools. There is no cost to creating an account, and you may permanently delete your account at any time.

Your account tracks all of your workflows, queries, collaborations, and data approvals with different organizations. After creating an account, you will be able to apply for membership with various organizations on Redivis and request access to their restricted data.

Unstructured files

Overview

In order to upload unstructured data to a dataset you'll need to upload it as a file on the Files tab in the dataset uploader.

It is possible to upload tabular data as a file, but it will not be represented as a table. This means it is not possible to preview the variable statistics, cells, or query interfaces, use it in a transform node, or control access at different levels, so this is generally not recommended.

Uploading files

From the Files tab of the dataset editor, click the "Upload files" button to import one or many files. You can import files from a variety of data sources, or perform uploads programmatically.

Quotas & limits

Limits for upload file size and max files per dataset are specified here.

Folder management

When preparing your upload you will need to select a destination folder. All files must be in one folder. Each folder has a corresponding index table which will allow researchers to more easily work with the files in bulk. If you are uploading many files, your folder structure will likely be important – for example, a different folder for different categories of imaging files.

You can create folders on the grey bar on the right side of the Files tab in a dataset. Folder names must be unique. You can click on the ⋮ menu next to any folder name to manage that folder.

To move files between folders you can right click on an individual file to change its folder. If you have multiple folders you can click the Move files button below files in the right bar to move files between folders.

Workflows

Overview

Redivis workflows are high-performance, collaborative environments for analyzing data. In a workflow, you can combine datasets across Redivis, and build out your analysis in an iterative, reproducible manner. Workflows can contain:

  • Data sources, which bring data from across Redivis into your workflow. Data sources can either be a dataset that you have access to, or another workflow, allowing you to build on top of others' analyses.

  • Transforms, which join, filter, aggregate, and reshape tabular data. Transforms are particularly important when working with large tables, allowing you to query billions of records in seconds.

  • Notebooks, which execute Python, R, Stata, or SAS in a customizable, interactive compute environment.

  • Tables, which are created as outputs of transforms and notebooks. On any table, you can validate your results, build out further analyses, create new Redivis datasets, or export the data to another environment.

Workflows are owned by either a user or organization, and can be shared with other users and organizations as you work together in real time.

New to Redivis? Learn more in our Analyze data in a workflow guide.

Case (if/else)

This common method utilizes if-then-else logic, assigning a result when the corresponding condition evaluates to true, otherwise assigning the final ("else") result:

CASE
  WHEN @condition THEN @result
  [ ... ]
  [ ELSE @else_result ]
  END

Return type

dynamic (input-dependent)

Parameters

Name | Type | Allowed values | Required | Placeholder (in UI)
@condition | (nested) conditions – like those used in a Filter | any Redivis type | true | -
@result | variable or literal | any Redivis type | true | (Variable or value)
@else_result | variable or literal | any Redivis type | true | (Variable or value)

Programmatic uploads

Overview

In addition to uploading data through the browser interface, you can leverage the redivis-python and redivis-r libraries, as well as the generic REST API, to automate data ingest and data release pipelines. These libraries can be used for individual file uploads similar to the interface, as well as for streaming data ingest pipelines.

Basic examples are provided below. Consult the complete client library documentation (Python, R) for more details and additional examples.

redivis-r

library(redivis)

# Could also create a dataset under an organization:
# dataset <- redivis$organization("your_organization")$dataset("Dataset name")
dataset <- redivis$user("your_username")$dataset("Dataset name")

# public_access_level can be one of ('none', 'overview', 'metadata', 'sample', 'data')
dataset$create(public_access_level="overview")

# Create a table on the dataset. Datasets may have multiple tables
table <- (
    dataset
    $table("Table name")
    $create(description="Some description")
)

# Upload a file to the table.
# You can create multiple uploads per table, in which case they'll be appended together.
upload <- table$upload()$create(
    "./data.csv",           # Path to file, data.frame, raw vector, etc
    type="delimited",       # Inferred from file extension if not provided
    ...                     # See documentation for all parameters
)

# Optional: add more uploads to this table, or create other tables.
# Multiple uploads on the same table will be appended together.

# Release version
dataset$release()

redivis-python

import redivis

# Could also create a dataset under an organization:
# dataset = redivis.organization("your_organization").dataset("Dataset name")
dataset = redivis.user("your_username").dataset("Dataset name")

# public_access_level can be one of ('none', 'overview', 'metadata', 'sample', 'data')
dataset.create(public_access_level="overview")

# Create a table on the dataset. Datasets may have multiple tables
table = (
    dataset
    .table("Table name")
    .create(description="Some description")
)

# Upload a file to the table:
upload = table.upload().create(
    "./data.csv",           # Path to file, data frame, raw bytes, etc
    type="delimited",       # Inferred from file extension if not provided
    ...                     # See documentation for all parameters
)

# Optional: add more uploads to this table, or create other tables.
# Multiple uploads on the same table will be appended together.

# Release version
dataset.release()

Uploading data

Overview

Getting data into Redivis is the first and most important step in making it accessible for your research community. Redivis is designed to make it easy for you to securely ingest data, at scale. To begin the data upload process, first navigate to the dataset editor.

For a guided walkthrough of uploading data to a dataset, please see the Creating a dataset guide.

For a guided walkthrough of how to clean tabular data, please see the Cleaning tabular data guide.

Redivis supports a wide variety of data types, sources, and upload methodologies. When beginning to upload data, you should first ask:

What type of data is it?

  • Tabular data

  • Geospatial data

  • Unstructured data

Where is the data?

  • A computer

  • Google Drive

  • Box

  • Google Cloud

  • AWS

  • Another table on Redivis

  • ...etc

Where is the metadata?

  • Embedded in the file

  • In another file

How do I want to perform the upload?

  • Through the browser (default)

  • Programmatically, via Python or the API (see the sketch below)
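
If you choose the programmatic route, the core upload call is sketched below, reusing the API demonstrated in the Programmatic uploads examples; the username, dataset, and table names are placeholders, and the dataset must have an unreleased next version. See that section for complete end-to-end examples.

import redivis

# Placeholders: substitute your own username and dataset/table names
table = redivis.user("your_username").dataset("Dataset name").table("Table name")

# Append a local file to the table; type is inferred from the file extension if omitted
table.upload().create("./data.csv", type="delimited")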

Archival & deletion

Overview

Datasets on Redivis are intended as a persistent store of information, and this persistence is critical to ensure the reproducibility of analyses using the datasets (and specific versions thereof).

In some cases, it may be necessary to archive a dataset, so as to prevent future usage and save on cloud costs. In other scenarios, you may need to delete a specific version (e.g., to reduce storage costs, or permanently retract inappropriately released data). Finally, in some situations it may be necessary to fully delete a dataset, for example, when a license to that dataset has expired.

Archival operations are always reversible, whereas deletion of versions and datasets becomes permanent after 7 days.

When a version or dataset is deleted or archived, all derivative tables in researchers' workflows that correspond to that version (or dataset) will be archived, meaning that the underlying data in these derivative tables is deleted. Note that researchers will no longer be able to read from or query these derivative tables.

This is often helpful to ensure compliance with data deletion requirements, in that all derivative data is expunged at the time of deletion or archival. However, you should be aware of the potential impact this may cause to your researchers, and communicate with them appropriately.

Dataset archival

When a dataset is marked as archived, its data can no longer be queried or modified (though its metadata can still be edited). All derivative tables in workflows will be marked as archived, which prevents reading or querying the data. This can help ensure that a dataset remains inactive and limit storage costs, though the dataset may be unarchived at any time.

To archive a dataset, either right-click on the dataset when viewing the list of your datasets, or click the Archive button on the dataset settings.

Version deletion

Any version of a dataset can be deleted, as long as it is not the currently released version. Deleting this version will delete all metadata and data associated with it.

This version will no longer be available in any workflows, and researchers will lose access to any tables referencing it in their workflows. In order to continue working with the dataset, researchers will need to change their analyses to a non-deleted version.

If you delete an unreleased next version, this deletion is permanent and cannot be undone. If you delete a released version, there is a 7-day window within which the version can be undeleted, after which the deletion becomes permanent.

If you are deleting versions to reduce storage costs, be aware that Redivis stores data efficiently across versions – the storage used by a particular record will be deleted only if it is unique to the deleted version (or, if deleting a series of versions, if that record doesn't exist in any non-deleted version).

The total storage savings of deleting a version will be shown to you prior to deletion confirmation.

Dataset deletion

Once deleted, the dataset will no longer be discoverable, though it will still show up in users' workflows that reference the dataset, and bookmarked URLs and DOIs will still resolve to the dataset's landing page. You can also view a list of deleted datasets by navigating to your workspace or organization administrator panel, and filtering datasets by status: deleted.

To ensure future reproducibility, dataset metadata and documentation are preserved upon deletion. However, all data will be fully expunged, and the dataset will no longer be queryable.

The dataset's public access level will be persisted in its deleted state – meaning that if the dataset was previously visible, it will still be visible (but not discoverable) once deleted. Additionally, any users who explicitly had overview or metadata access to the dataset prior to deletion will have their access persisted upon deletion. These default access rules can be modified by navigating to the deleted dataset and reconfiguring access accordingly.

Datasets can be undeleted for 7 days after deletion by navigating to the dataset settings and clicking the Undelete button. After 7 days, deletion becomes permanent and the dataset's data will be unrecoverable.


Open access

Overview

The ability to view, reproduce, and build upon other works is a core tenet of the scientific process that Redivis is built to uphold and facilitate. Redivis tools for data storage and analysis comply with federal mandates for open access and automatically generate artifacts documenting all steps in the data process, ensuring reproducibility.

Background

In August 2022, the United States Office of Science and Technology Policy (OSTP) released a memorandum on public access to federally funded research (known as the “Nelson Memo”), which outlined new requirements affecting both faculty and students who conduct research using federal funding:

  1. Make publications and their supporting data resulting from federally funded research publicly accessible without an embargo on their free and public release.

  2. Enact transparent procedures that ensure scientific and research integrity is maintained in public access policies.

  3. Ensure equitable delivery of federally funded research results and data.

1. Public accessibility

Redivis is built to make data sharing as easy and safe as possible, defaulting to fully accessible data, metadata, and documentation. However, some data must be restricted for privacy and to comply with data licensing agreements. In these cases Redivis follows an "as much as securely possible" sharing philosophy.

  • Multiple levels of access allow data owners to grant access to documentation, metadata, samples, and data separately, so that as much access as possible can be granted.

  • When research is done using restricted data, analysis code is made available when possible.

  • Viewing public data and analysis workflows on Redivis is completely public and does not require a Redivis account.

  • If any piece of analysis or data is restricted, a user will need a Redivis account to apply for access. Redivis accounts are completely free with no restrictions.

2. Transparent procedures

Redivis systems are built to automatically capture and document any work done on datasets and analyses in standardized formats, while giving researchers and administrators the ability to supplement or override these when necessary.

  • Datasets are automatically versioned. All changes to variables and rows are recorded and made available to administrators and data viewers.

  • Redivis offers a no-code interface to build transform queries that compiles to SQL code. This code is available to the analyst and any viewers.

  • Workflows are self-documenting. Every step taken in an analysis workflow is recorded sequentially in a visual format that is easy to follow.

  • Workflows are version controlled. Every step is versioned and time-stamped, allowing users to revert to a previous iteration, or a viewer to understand how queries might have changed over time.

  • Workflows can be forked, preserving relationships between workflows and datasets.

3. Equitable delivery

Redivis is free and accessible to any researcher, reviewer, or casual data browser. By centralizing multiple complex data storage and analysis systems behind a clear UI, Redivis makes it easier to understand full workflows without specific technical knowledge.

  • Redivis uses transparent, open-source formats, allowing data, analyses, and workflows to be exported and used within other systems.

  • APIs make data and workflows available to researchers and administrators using their own systems.

  • No-code interfaces allow researchers at any skill level to build and understand research steps.

  • Common, open-source coding languages (SQL, Python, R) are foundational to Redivis tools and are supported in analysis workflows.

  • There is no cost to researchers to store personal data, or run data queries.

Grant access to data

Overview

One of the strengths of Redivis is that researchers can apply for, and gain access to, data on the same page where they discover the dataset. Moreover, all access applications are tracked, both for administrators' reference and to provide transparency and timely notifications to researchers.

Responding to access requests in a timely manner creates a great environment for researchers and takes full advantage of these automatic systems.

1. Configure notifications

You can configure multiple email addresses to receive an email when a request for access is made to your organization, as well as the frequency at which you receive them. Especially if you are a smaller organization that isn't getting access requests every day, we very strongly recommend that you have at least one administrator receiving emails so you know when to respond to a request.

Learn more in the Settings reference section.

2. Approve membership requests

Users must be approved as members of your organization before any other access requests can be approved.

When a membership request is pending, you will see an alert (!) on the Members tab of your administrator panel. Click on any member with an alert to see their pending submission awaiting approval.

Pending members will have an Approve / Reject button as soon as you click on their page, alongside the authentication information they submitted when requesting to be a member. It is important to verify that this person has a genuine email that seems consistent with who they say they are.

3. Approve requirement submissions

As users ask for approval you will receive an alert (!) in the administrator panel where your attention is needed.

Member requirements and export restriction exceptions will appear on the Member tab.

Study requirements will appear on the Studies tab.

Direct access requests will appear on the Datasets tab.

When all of the pending requests are resolved, the alert will disappear.

For each request you will have the option to approve or reject the submission. You can always come in later to revoke the approval if needed. You might also see that the member has submitted updates to their existing approval. In that case the approval is still valid while updates are pending, unless it is otherwise revoked or expired.

Learn more in the Approving access reference section.

4. Leave comments for clarity

Both administrators and members have the option to leave a comment on any requirement submission. The act of leaving a comment will trigger an alert for administrators, or a notification for members.

Especially when you are rejecting or revoking a requirement submission, we recommend leaving a comment to let the member know what they did wrong, or what they still need to do to gain access.

5. Audit access

It's often a good idea to double check which researchers have access to your datasets.

In order to audit a particular member's access to your organization's datasets, double click on that member, and navigate to the Access overview tab. This will list all of your organization's datasets, and the member's corresponding access to each.

In order to audit all members that have access to a particular dataset, click the filter icon to the right of the member search bar. This will allow you to filter all members by their access to, and usage of, your organization's datasets.

Next steps

Expand your reach

If you are part of a larger institution, you can contact us about getting an institution-wide Redivis page to help users discover new organizations and datasets.

Generate a report

Overview

All activity surrounding your organization and its data usage is logged and accessible to organization administrators. This can be helpful for year-end reports or any potential data security concerns!

You can easily organize, visualize, and download any of your organization's information by generating a report in your administrator panel.

1. Create a report

Navigate to the Reports tab of your administrator panel and click the + New report button.

In the future you can return to this page to see your existing reports and create duplicates to further edit.

2. Choose the entity of the report

Start by choosing the entity, or topic, of the report. Whatever you choose here will define what each entry in your report represents.

You can choose a Resource such as Members, Workflows, Studies, or Datasets which will allow you to get a sense of the named entities your organization owns or has interacted with.

Or you can choose a Usage event such as Queries, Notebooks, Exports (or all of these combined) which will allow you to see how your organization's data has been used.

3. Define constraints

From here, define the other criteria of your report:

Time frame

If you choose a usage event as your entity you will need to set a time frame to define which events will be included. (Resources don't have a time associated with them since they are not events.)

Aggregation

If you chose a usage event you can optionally choose to aggregate events based on a resource or period of time. Perhaps you want to see all queries grouped by member, or want to see all notebooks per day. You can also create custom aggregation groups by using member labels and dataset labels with colons. E.g. labeling members with school:medicine and school:business would allow you to aggregate a report by school.

Filter

You can additionally choose a filter that will restrict your report to only events or resources that meet your chosen criteria. For example, if you want to see all usage events that referenced a particular dataset.

Learn more in the Reports reference section.

4. Define the report's fields

Finally, you'll need to select what fields or topics are present in the report. Every field selected here will become a column in your resulting report table. Different fields are available based on what entity you have selected.

Some helpful fields are selected by default, but you can make your own selections. For example, maybe you have a report of data usage and you want to include the average number of tables referenced in each notebook session, or the max amount of compute used in queries per day.

If there are any fields or other report information you would like but don't see available in the interface, please contact us to let us know!

5. View the report results

Click the Create button to generate this report! This will generate a table with entries that match all the criteria you've defined above. You can browse this table to understand the data and make sure it looks how you'd expect.

If your report contains any numeric data you can click the Visualize tab to see a visual representation of 1-2 of your numeric fields. The type of graph will be chosen by default based on the shape of your data. You can hover on this graphic to learn more about each entry.

6. Iterate and export

You can update this report by clicking the Edit button. Changing any criteria will refresh the table and visual with new contents based on your new criteria.

This report is now saved in your organization's reports and you can access it in the future.

Any time you edit the report or come back to it after more than one hour has elapsed, the table will be automatically regenerated. Based on the time frame you've selected this might drastically change the contents.

If you'd like to work with the data in a different environment you can download the report table as a CSV by clicking the Download button.

Next steps

Expand your reach

If you are part of a larger institution, you can contact us about getting an institution-wide Redivis page to help users discover new organizations and datasets.

External sources

By default, you may upload data from your local computer, a public URL, or from another dataset or workflow on Redivis. However, Redivis supports numerous integrations for data ingest across common sources. You'll need to enable data sources in your account workspace settings in order to import the data they contain.

When you enable a data source, you'll be prompted to log into the corresponding account. Redivis will only ever read data from these sources when explicitly requested, and it will never modify or overwrite content.

Once configured, you'll see any added data sources appear as an option when uploading data.

Available data sources

Google Cloud Storage

You may import any object that you have read access to in GCS by specifying a bucket name and path to that object, in the form /my-bucket/path/to/file. You may import multiple objects at once by providing a prefix followed by wildcard characters, e.g.: /my-bucket/my-folder/* .

The following wildcard characters are supported:

  • * : Match any number of characters within the current directory level. For example, /my-bucket/my-folder/d* matches my-folder/data.csv , but not my-folder/data/text.csv

  • ** : Match any number of characters across directory boundaries. For example, my-folder/d** will match both examples provided above

  • ? : Match a single character. For example, /my-bucket/da??.csv matches /my-bucket/data.csv

  • [chars] : Match any of the specified characters once. For example, /my-bucket/[aeiou].csv matches any of the vowel characters followed by .csv

  • [char range] : Match any of the range of characters once. For example, /my-bucket/[0-9].csv matches any number followed by .csv
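As an illustration of these rules (not the importer's actual implementation), the sketch below converts a path pattern into a regular expression with the same *, **, ?, and [...] semantics, so you can check which object paths a pattern would cover before running an import. The glob_to_regex helper is hypothetical and exists only for this demonstration.

import re

def glob_to_regex(pattern):
    """Illustrative conversion of the wildcard syntax above into a regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "*":
            if pattern[i : i + 2] == "**":
                out.append(".*")       # '**' crosses directory boundaries
                i += 2
                continue
            out.append("[^/]*")        # '*' stays within the current level
        elif ch == "?":
            out.append("[^/]")         # '?' matches exactly one character
        elif ch == "[":
            end = pattern.index("]", i)
            out.append(pattern[i : end + 1])   # character sets/ranges pass through
            i = end + 1
            continue
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("^" + "".join(out) + "$")

matcher = glob_to_regex("/my-bucket/my-folder/d*")
print(bool(matcher.match("/my-bucket/my-folder/data.csv")))       # True
print(bool(matcher.match("/my-bucket/my-folder/data/text.csv")))  # False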

Amazon S3

You may import any object that you have read access to in S3 by specifying a bucket name and path to that object, in the form /my-bucket/path/to/file. You may import multiple objects at once by providing a prefix followed by a wildcard character, following the same syntax and rules as outlined for Google Cloud Storage above.

Google Drive

You may import any file of valid format that you have stored within your Drive, including Google Sheets. Upon choosing Google Drive as your import source, a modal will open that allows you to browse and select files from your Google Drive.

Google BigQuery

You may import any table that you have read access to in BigQuery, including views, materialized views, and external tables. You must specify the table in the form project.dataset.table . To import multiple tables within a dataset, you may use wildcards. E.g., project.dataset.* or project.dataset.prefix* .

Box

You may import any file of valid format that you have stored within Box. Upon choosing Box as your import source, a modal will open that allows you to browse and select files from your Box account.

OneDrive

Coming soon. Please contact [email protected] if this integration would be helpful for your use case so that we can prioritize.

Redivis

You can import any table, which can be particularly helpful with ETL workloads where you want to import a cleaned version of your data (example). You can also import any uploaded files into your table, supporting workflows where tabular data is initially loaded as a file before being loaded into a table.

You must both have data access and the ability to export any table or file that you import.

Importing tables

You can reference any table on Redivis using the form user|organization.dataset|workflow.table. That is, specify its owner (a user or organization), its containing entity (a dataset or workflow), as well as the table name, separated by periods.

Importing files

All files on Redivis belong to a file index table. To import a file, first specify the index table, followed by a forward slash (/), and then the file name. E.g.:

user_name.dataset_name.table_name/file_name.csv

To import multiple files at once, you can use wildcard characters, following the same pattern rules as specified for Google Cloud Storage above. E.g.:

user.dataset.table/prefix*.csv

Sampling

Overview

For datasets with large tables, it is often a good idea to include a 1% sample of the data, supporting faster exploratory queries as new researchers work to understand your data. Moreover, if a sample is configured, you will have the ability to control access to that sample separately from the full dataset.

Sampling is applied independently to each version of a dataset. You may modify the sampling methodology on a version at any time — even after it's been released — though keep in mind that this may affect researchers that are currently working with the dataset sample. As a best practice, it's good to configure and validate your sample before releasing a new version.

To configure sampling on your dataset, click the Configure sample button on the Tables tab of a dataset page.

Random sample

The simplest form of sampling, this will create a corresponding sample for every table in the dataset (including file index tables). Every record / file will have a 1% chance of occurring in the sample.

As a general rule, you should only use random samples if:

  • You have one table in your dataset, or

  • Researchers won't be joining multiple tables in your dataset together

If this isn't the case, consider sampling on a specific variable. Otherwise, as researchers join different tables together, they will start getting samples of a sample, since there is no consistent cohort of records between tables.

Sampling on a variable

For situations when you want researchers to be able to join tables within your dataset, consider generating a sample on a variable that exists in at least some of the tables in your dataset. Every value for this variable will have a 1% chance of being in the output set.

Importantly, this sampling is deterministic. This guarantees that the same values that fall in the 1% sample for one table will also occur in the 1% sample for another table in the same dataset. In fact, these sampled values will be consistent across Redivis, allowing researchers to even merge samples across datasets.

Note that the sample will be computed on the string representation of the variable. For example, if the value '1234' falls in the 1% sample, then we are guaranteed that the integer value 1234 will also fall within the sample. However, if this value is stored as a float (1234.0), it is unlikely to also fall in the sample, as the string representation of this float is '1234.0', which for the purposes of sampling is entirely different than the string '1234'.
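To make the determinism concrete, here is a minimal Python sketch of the general idea behind hash-based sampling. This is purely an illustration of the concept, not Redivis's actual sampling algorithm: the value's string representation is hashed, and a value is kept whenever its hash falls into a fixed 1% bucket, so the same string always lands on the same side of the cutoff in every table.

import hashlib

def in_one_percent_sample(value):
    """Illustrative deterministic 1% bucket test on a value's string form."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 == 0  # same string -> same bucket, every time

print(in_one_percent_sample("1234"))   # always the same result for '1234'
print(in_one_percent_sample(1234))     # identical, since str(1234) == '1234'
print(in_one_percent_sample(1234.0))   # may differ: str(1234.0) == '1234.0'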

When sampling on a variable, only tables with that variable will be sampled. This is useful for the case when some tables contain supplementary information to your primary cohort. For example, consider the case when your dataset has a "Patients" table, a "Hospitalizations" table, and a "Hospitals" table. We'd likely want to create a sample on the patient_id variable, which would create a 1% subset of patients and the corresponding hospitalizations for those patients. However, this wouldn't create a sample on the "Hospitals" table — which is what we want, given that the sample of patients is still distributed across a large number of hospitals.

If your dataset contains unstructured data files, you probably want to sample on either the file_name or file_id variables.

If only some of the dataset's tables are sampled, users with sample access to the dataset will have sample access to the sampled tables and full data access to the unsampled tables. While this is likely necessary for researchers to meaningfully work with the dataset sample (see paragraph above), it may have ramifications for how you configure your access rules.

Learn more about controlling sample access in the data access reference.

Step: Limit

Overview

The Limit step reduces the table to a set number of rows.

Example starting data:

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 | sam     | 74     |
 | pat     | 62     |
 *---------+--------*/

Example output data:

Limit to 2 rows

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 *---------+--------*/

Step structure

  • There will be one limit block where you will define your limit.

Field definitions

Limit

The number of rows you would like your output data to contain.

The limit step is non-deterministic if the result set isn't ordered. Different rows may be returned from subsequent queries.

Example

Let's say we are doing initial exploratory work on data. We did a number of complex steps and want to quickly execute the query so we can iterate.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields: Limit to 4 rows

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Numbering

Cumulative distribution

Return the relative rank of a row defined as NP/NR. NP is defined to be the number of rows that either precede or are peers with the current row. NR is the number of rows in the partition. –> learn more

CUME_DIST()

Return type

float

Parameters

Dense rank

Returns the ordinal (1-based) rank of each row within the ordered partition. All peer rows receive the same rank value, and the subsequent rank value is incremented by one. –> learn more

DENSE_RANK()

Return type

integer

Parameters

N-tile

Divides the rows into a set number of buckets based on row ordering and returns the 1-based bucket number that is assigned to each row. –> learn more

NTILE(@literal)

Return type

integer

Parameters

Name: @literal | Type: literal | Allowed values: any integer | Required: true | Placeholder (in UI): (integer)

Percent rank

Return the percentile rank of a row defined as (RK-1)/(NR-1), where RK is the RANK of the row and NR is the number of rows in the partition. –> learn more

PERCENT_RANK()

Return type

float

Parameters

Rank

Returns the ordinal (1-based) rank of each row within the ordered partition. All peer rows receive the same rank value. The next row or set of peer rows receives a rank value which increments by the number of peers with the previous rank value –> learn more

RANK()

Return type

integer

Parameters

Row number

Returns the sequential row ordinal (1-based) of each row for each ordered partition. If no order condition is specified, then the result is non-deterministic. –> learn more

ROW_NUMBER()

Return type

integer

Parameters
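If it helps to see how these functions differ on the same input, the sketch below computes rough pandas analogues for a single ordered partition. It assumes a Python environment with pandas installed, simply mirrors the definitions above, and is not how Redivis evaluates these methods.

import pandas as pd

scores = pd.Series([35, 62, 74, 74, 83], name="score")
n = len(scores)

rank = scores.rank(method="min")            # RANK(): peers share the lowest rank
dense_rank = scores.rank(method="dense")    # DENSE_RANK(): no gaps after ties
row_number = scores.rank(method="first")    # ROW_NUMBER(): ties broken by position
cume_dist = scores.rank(method="max") / n   # CUME_DIST(): NP / NR
percent_rank = (rank - 1) / (n - 1)         # PERCENT_RANK(): (RK - 1) / (NR - 1)

print(pd.DataFrame({
    "score": scores, "rank": rank, "dense_rank": dense_rank,
    "row_number": row_number, "cume_dist": cume_dist, "percent_rank": percent_rank,
}))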

Emailing subsets of members

Redivis has some built-in methods for helping you find and manage all of your organization's members, workflows, datasets, and access systems that you can use to ease your workflows.

In this example we will imagine that we need to get in touch with everyone working with a specific dataset.

1. Locate the dataset in the administrator panel

We can go to the Datasets tab of the panel and search or filter to locate our dataset.

2. Find relevant workflows

We can right click on this dataset and hover on View related to see all of the options relating to this dataset. In this case we want to see all workflows that have used this dataset.

Clicking this option navigates us to the Workflows tab, with a filter active for this dataset. This list is showing all workflows that contain this dataset.

3. See related members

Click the Select all checkbox to select all of these workflows. With these selected we can right click on any of them (or click the Actions button that appears in the top right) to View related activity.

In this menu we should select Members in workflows. This will navigate us to the Members tab filtered by all members who are owners or collaborators on the workflows we had selected.

4. Show emails

Now that we have narrowed it down to the members we are interested in, we can right click on the header (or the options menu in the top right of the list) to edit the columns shown. In this menu we can add the Contact email column.

The quickest way to get this information out of Redivis is to go back to that same menu and click Download CSV to begin downloading a CSV of the list exactly as we've configured it (all columns, filters, and sorting will be preserved).

Next steps

We can use these similar steps to view:

  • Queries run on a specific set of datasets

  • Exports made from a workflow

  • Requirements a member has filled out

And many more!

JSON

JSON query

Extracts a JSON value and converts it to a SQL JSON-formatted STRING. –> learn more

JSON_QUERY(@expression, @expression_2)

Return type

string

Parameters

Name: @expression | Type: variable or literal | Allowed values: any string | Required: true | Placeholder (in UI): -

Name: @expression_2 | Type: literal | Allowed values: any string | Required: true | Placeholder (in UI): -

JSON value

Extracts a JSON scalar value and converts it to a SQL JSON-formatted STRING, removing outermost quotes and un-escaping return values –> learn more

JSON_VALUE(@expression, @expression_2)

Return type

string

Parameters

Name: @expression | Type: variable or literal | Allowed values: any string | Required: true | Placeholder (in UI): -

Name: @expression_2 | Type: literal | Allowed values: any string | Required: false | Placeholder (in UI): ($)
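The practical difference between the two functions is whether the result stays JSON-formatted or comes back as a bare scalar. The sketch below mimics that distinction with Python's json module purely as an illustration; it is not how the SQL functions are implemented.

import json

payload = '{"name": "jane", "scores": {"final": 92}}'
doc = json.loads(payload)

# JSON_QUERY-style result: the matched value, still JSON-formatted
# (objects keep their braces, strings keep their quotes).
query_like = json.dumps(doc["scores"])   # '{"final": 92}'

# JSON_VALUE-style result: a scalar with the outermost quotes removed
# and escape sequences resolved.
value_like = doc["name"]                 # 'jane'

print(query_like, value_like)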


Citing & Describing Redivis

Citing Redivis

Publications and other outputs should cite Redivis as research software, using the appropriate conventions for your publication medium.

  • The Digital Object Identifier (DOI) for Redivis is: https://doi.org/10.71778/V2DW-7A53

  • The canonical URL for Redivis is: https://redivis.com

If referencing Redivis inline, the abbreviated citation can be used:

Redivis [https://doi.org/10.71778/V2DW-7A53]

The following BibTeX entry can be used:

@misc{https://doi.org/10.71778/V2DW-7A53,
  doi = {10.71778/V2DW-7A53},
  url = {https://redivis.com},
  author = {Redivis},
  keywords = {FOS: Software},
  language = {en},
  title = {Redivis},
  publisher = {Redivis},
  year = {2025}
}

These identifiers are also associated with Redivis, though do not need to be included in all citations:

  • Re3data identifier: https://www.re3data.org/repository/r3d100014398

  • RRID: SCR_023111

Describing Redivis in grants and publications

We provide the language below to help you describe Redivis in your publication or grant application. It may be adapted and modified as needed. No attribution is necessary when using or modifying this content.

Overview

Redivis is a secure, scalable, cloud-based data platform developed to meet the needs of academic research. Redivis was first developed in collaboration with the Stanford Center for Population Health Sciences and is now deployed across a number of leading research institutions [1], supporting the distribution and analysis of large, sensitive datasets across multiple disciplines in alignment with FAIR data practices.

Users of Redivis are able to upload large-scale numeric, text, structured, and unstructured data directly through the browser or via APIs and other integrations. Built-in tools are available to curate and tag rich metadata, maximizing the shareability of datasets. The platform allows for robust search and exploration across multiple datasets and their metadata and variables.

Additionally, Redivis provides a rich toolset for analysis and exploration. Users can filter, merge, analyze and visualize billions of records in real-time, and can easily bring together disparate datasets to answer novel questions. They can leverage a massively-parallelized architecture to execute SQL queries (either composed as code or through a graphical user interface), in addition to customizable Jupyter notebooks running Python, R, Stata, and SAS. REST APIs give users and applications additional programmatic interfaces to the data, ensuring interoperability with other tools and ecosystems.

Technical infrastructure

The platform is built on Google Cloud Platform infrastructure using open-source software. The main application runs in containerized services orchestrated with Kubernetes. It integrates high-performance tools including Google BigQuery and its ANSI-SQL interface for large-scale tabular data processing and JupyterLab for interactive analytics. Researchers may provision environments preconfigured for R, Python, SAS, and Stata, with support for customizable environments. Compute capacity scales dynamically to meet workload demands, with configurations available up to 416 CPUs, 11.5 TB of RAM, and 16 NVIDIA A100 GPUs, facilitating complex statistical and machine learning workflows on terabyte-scale datasets.

Reproducibility

Redivis is designed such that reproducibility is an automatic byproduct of researchers' use of the platform. A novel version control system for datasets enables efficient data updates without duplication, supporting full reproducibility and cost-efficient storage management. All analytical activity — including code, queries, and derivative outputs — is tracked and recoverable, ensuring transparency and compliance with NIH data-sharing policies.

All datasets, workflows, and versions thereof can be assigned a unique Digital Object Identifier (DOI), allowing for researchers to persistently link to a fully-reproducible artifact of their research. Future investigators, assuming they have appropriate access to the underlying data, can then re-run these analyses and produce identical results, in turn modifying and building upon prior works.

Security

Redivis supports rigorous data governance with a tiered attribute-based access control (ABAC) system, customizable user agreements, egress restrictions, and an intuitive, searchable audit trail.

Redivis has been audited and approved for the use of FERPA, PII, PHI, and HIPAA data. The platform is SOC2 and NIST 800-171 (rev 3) compliant and undergoes regular audits and penetration testing. All data and metadata are stored in a multi-redundant, AES256 encrypted datastore. All connections to the platform are over an encrypted TLS 1.2 or greater protocol. User login is managed by single sign-on via eduGAIN and InCommon, allowing users and collaborators to use their institution credentials to authenticate. The platform supports HTTP strict transport security (HSTS) and is on the HSTS preload list for all major browsers, preventing users from establishing an unencrypted connection.

Data administrators on Redivis have access to detailed, searchable audit logs, and reports can easily be generated for auditability and traceability. The platform allows for the restriction of data exports and downloads as well as the automatic expiration of dataset access. For highly sensitive data, multi-layer protection exists to prevent accidental data sharing, viewing, and downloading. Redivis also allows for the complete deletion of data, including the automatic and instantaneous deletion of all data derivatives, as needed.

—

[1] Stanford, Columbia, UCLA Libraries, Kellogg Business School, Duke Libraries, Mass General Brigham CSPH

Discover & access data

Overview

Most data on Redivis is uploaded through an Organization, and that is the best place to find Datasets to work with. Some datasets will be public, while others will require certain steps before you can get full access to the data.

1. Find datasets

From an Organization's home page, click on the Datasets tab to browse all datasets and filter by their metadata.

You can use filters on the left bar to narrow down your results. All searches perform a full-text search across a dataset, its documentation, tables, variables, and rich metadata content.

If your organization is part of an institution (such as Stanford or Columbia) you can also go to the institution page to search all datasets in your institution.

Not finding what you need? You can also upload your own data to Redivis through your workspace to augment a research project you're working on and to share with other researchers. Or if you represent a research group, center, or institution, contact us about setting up your own Organization page.

2. Get to know a dataset

Click on any dataset title to go to that Dataset page.

You can view the metadata (including an abstract, documentation, citation information, and a version history) and dig into the data directly on the Tables and/or Files tab.

Each table contains a data browser to view cells, generate summary statistics, and display metadata for this table and each variable. Each file can be previewed and downloaded. Both tables and files can be added to workflows for analysis.

Many datasets on Redivis are public, while others have requirements for certain levels of access enforced by the data owner.

On the right side of the dataset, you can see your current access level. If you have Metadata access you can see the variable information and summary statistics, but you'll need to gain Data access in order to view the data contents or work with this data.

Learn more in the Discover datasets guide.

3. Apply for access

On any restricted dataset you can click the Apply for access button in the top right of the page to see a list of steps required for you to gain different access levels to this dataset.

The first step is to become a member of this organization. When you go to apply you will be prompted to become a member of this organization using the credentials on your account.

Once you're a member of this organization, you might need to fill out some requirements that administrators have set up, or request access directly from them.

Requirements are forms that you will need to complete on this page and then submit. Once these are approved by your organization's administrators you will gain access to the dataset.

Learn more in the Apply to access restricted data guide.

Next steps

Start working with your data

Once you have data you're interested in, bring it into a workflow to transform and analyze it leveraging lightning fast tools from your browser.

Learn more in the Analyze data in a workflow guide.

Upload your own datasets

Augment your data analysis in Redivis by uploading your own datasets, with the option to share with your collaborators (or even the broader research community).

Learn more in the Create & manage datasets guide.

Apply to access restricted data

Overview

Many datasets on Redivis are public, while others have requirements for certain levels of access enforced by the data owner. If the dataset is public, you can skip over this section, otherwise you'll need to gain access before you can fully utilize the data.

1. View access rules

On the right side of all restricted dataset pages is a box which displays your current access level. If this doesn't say Data access, then your next step will be to apply for access.

Click the Apply for access button on the top of the page to open the access modal. This modal allows you to manage your access to this particular dataset throughout the platform.

Access levels

Dataset access has five levels:

  1. Overview: the ability to see a dataset and its documentation.

  2. Metadata: the ability to view variable names and summary statistics.

  3. Sample: the ability to view and query a dataset's 1% sample. This will only exist for datasets that have a sample configured.

  4. Data: the ability to view and query a dataset's tables, and work with them in workflows.

  5. Edit: the ability to edit the dataset and release new versions.

Access levels are cumulative. For example, in order to gain data access you will need to have gained metadata access as well.

Usage rules

Usage rules are restrictions placed on the way you can use this data once you've gained access. For example, data exports may be restricted to certain export environments, and/or require administrator approval.

While no action is required when you first apply for access to a dataset, it's helpful to understand these restrictions to better understand what limitations, if any, they might pose to your research.

2. Apply for access

The access modal for the dataset will include all steps designated by this dataset's administrators to gain access to work with data.

Membership

To apply for access to datasets hosted by a Redivis organization, you will first need to Join the organization.

To join an organization, you will need to provide information about your identity via an email authentication, usually the email associated with your Redivis account.

After submitting your membership application, it will either be automatically approved or will be marked as pending while an administrator reviews your submission. You'll receive a notification if your submission is approved or rejected.

Requirements

Some datasets have requirements as part of their access applications. Requirements are global to an organization and usually contain some type of form. Once all requirements for a particular level of access are approved, you will gain access to the relevant dataset(s).

In the access modal, click the Apply button on any requirement to fill out and submit the form.

When you submit a requirement, an administrator of the organization will be alerted and can review your submission for approval (in some cases, the requirement may be configured for auto-approval upon submission). You will receive a notification if your submission is rejected or approved, or in the future when it's about to expire. You can also leave a comment for an organization's administrators here, and they can reply to you.

There are two types of requirements - those you fill out on behalf of yourself (member requirements) and those you fill out on behalf of a study (study requirements). If the dataset you are interested in has a study requirement, you'll need to create a study and select it in the access modal in order to submit the requirement. Additionally, any workflows that utilize this dataset will need to be a part of the approved study.

Direct access

Sometimes you don't need to fill out any requirements and need to simply Request access at a specific level. You can do so in this modal, and will receive a notification when you are granted access by the dataset owner.

3. Wait for data owner approval

For some membership and requirement submissions (and all direct access requests), you will need to wait for the data owner to respond to your submission. If the dataset is owned by a user, they will need to approve your request; if it is owned by an organization, any administrator at that organization can approve it.

The data owner will receive a notification as soon as the requirement is submitted, and you will receive a notification when it is approved, rejected, or for any requests for additional information. By default, you will receive an email as well as a notification within Redivis. You can customize this notification behavior in your workspace settings.

4. Re-authenticate (if necessary)

If you are accessing restricted data hosted by an organization, you may have to re-authenticate after a period of inactivity anytime you perform an action governed by the "Data" access level.

After a period of inactivity (set between 15 minutes and 3 days by the organization administrator), a popup will ask you to re-authenticate anytime you view data, query tables in a transform, or export data from Redivis.

Next steps

Start working with your data

Add this dataset to a workflow to transform and analyze it leveraging lightning fast tools from your browser.

Learn more in the Work with data in a workflow guide.

Metadata

Overview

Redivis determines variable names and types during data upload. Additionally, it will automatically parse certain metadata based on the uploaded file format:

  • SAS (.sas7bdat): labels

  • Stata (.dta): labels and value labels

  • SPSS (.sav): labels and value labels

For other file types (e.g., csv), you will need to augment the metadata directly. To apply metadata in bulk, you can upload a file containing metadata information directly from your computer. This file can either be a CSV or JSON.

Is your metadata stuck in a PDF? We're truly sorry — if you can, please let the data provider know that it is essential that they provide metadata in a machine-readable format; hopefully in time this will change.

While you can just upload the PDF to the dataset's documentation, you'll be doing your researchers a huge service if you can add structured metadata to the variables. That might mean some manual copying and pasting from the PDF, or you could consider the various (and imperfect) online PDF to CSV conversion tools, or this Python library.

If you don't have the bandwidth, consider asking your researchers to contribute by making them a dataset editor.

CSV metadata format

The CSV should be formatted without a header, with each row corresponding to a variable, with column 1 as the name, 2 as the label, 3 as the description. If the variable doesn't have a label or description, leave these columns empty.

For example:

variable1_name,variable1_label,variable1_description
variable2_name,variable2_label,variable2_description
sex,patient sex,patient's recorded sex
id,patient identifier,unique patient identifier

JSON metadata format

When uploading a JSON file, specify the name, label, description, and valueLabels using the appropriately named attributes in the object corresponding to each variable. If the variable doesn't have a label, description, or value labels, you don't need to include these attributes.

For example:

// JSON format is an array of objects, with each object representing a variable
[
    {
        "name": "sex",
        "label": "patient sex",
        "description": "patient's recorded sex",
        "valueLabels": [
            {
                "value": 1,
                "label": "Male"
            },
            {
                "value": 2,
                "label": "Female"
            }
        ]
    }
]

To upload value labels in bulk, you must use the JSON format. We no longer support bulk upload of value labels via CSV.
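If your metadata starts out as a CSV but you also need value labels, one option is to generate the JSON file programmatically. Below is a minimal sketch; the file names and the value-label mapping are placeholders for illustration.

import csv
import json

# Hypothetical value labels to merge in, keyed by variable name.
value_labels = {
    "sex": [{"value": 1, "label": "Male"}, {"value": 2, "label": "Female"}],
}

variables = []
with open("metadata.csv", newline="") as f:          # name,label,description per row
    for name, label, description in csv.reader(f):
        entry = {"name": name, "label": label, "description": description}
        if name in value_labels:
            entry["valueLabels"] = value_labels[name]
        variables.append(entry)

with open("metadata.json", "w") as f:
    json.dump(variables, f, indent=2)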

Discover datasets

Overview

Datasets are the core entity on Redivis, and finding the dataset(s) you want to work with is generally one of your first steps. All datasets are either uploaded by an organization or a user. If a user has uploaded a dataset they will need to share it with you directly, but you can browse datasets uploaded by any organization on their home page.

1. Find your organization

There are many ways to find your organization. If you know what you're looking for you can type it into the "Go to" bar on the top of the page. To browse all organizations you can go to the Organizations tab of your workspace where you can Find organization.

If your organization is part of an institution (such as Stanford or Columbia) you can also go to the institution page to browse all organizations in that institution.

Note that some organizations may be hidden from public view, in which case you will need to contact someone at the organization to get the direct link.

2. Browse available datasets

On the organization home page you can see all of the organization's datasets that you have at least Overview access to. If you're looking for a dataset you expect to be here and you don't see it, you might need to contact the organization administrators to get access.

You can see the datasets that this organization has featured on the left section of the page, type a term into the search bar, or click the Datasets tab to browse everything. By default these datasets are sorted by popularity but you can reorder them however you'd like to find what you want.

You can also click the Workflows tab to see workflows featured by this organization which showcase how their data can be used. More on workflows later!

3. Search for data

Once on the dataset tab you can use multiple filters along with search terms to find what you'd like. All searches take a comprehensive look at information from the dataset documentation as well as variable metadata, including value labels.

You can also use the filters to find specific information you're looking for. If you know the name of the variable or year of the data you're interested in, narrow your search here.

4. Preview tables

Each dataset that appears here matches all search and filter configurations. You can click on the dataset tile to see more information, including a preview of all of the tables this dataset contains.

Click on any of these tables to dig into it further. As long as you have metadata access you can see the variable information and summary statistics.

Once you find a dataset you want to explore further, click on the title to go to the dataset page.

5. Explore the dataset page

This dataset page has all of the information you'll need to familiarize yourself with the data and get started working with it further. Add this dataset to your dataset library to easily find it later on by clicking the button alongside the dataset title.

Overview

This tab has a short summary of the dataset along with different sections filled out by the administrators such as the Methodology, Documentation, and Provenance information. This Provenance section will also have citation information and a DOI if one has been created. You can also see your current access level on the right side of the page.

Tables

This tab contains all of the dataset's tabular data. You can see each table alongside information about its contents. Click on each one to dig deeper into the data.

Files

This tab contains all of the dataset's unstructured data. You can preview each file by clicking on it.

Usage

This tab has information about how other people have used this dataset on Redivis. You can see the most popular variables across all tables to get a sense of which ones might be a good starting place for understanding the structure of the data.

Next steps

Apply for access to restricted data

If you don't have data access to this dataset you'll need to apply for access before you can work with it further

Learn more in the Apply to access restricted data guide.

Start working with your data

Add this dataset to a workflow to transform and analyze it leveraging lightning fast tools from your browser.

Learn more in the Work with data in a workflow guide.

Tables

Overview

Tables are brought into a workflow via its data sources, and are then created as the output of various transforms and notebooks in the workflow. Tables often serve to materialize intermediate results, sanity-check an output, and represent the final data derivative that is the culmination of your analysis pipeline.

Workflow tables follow the behavior and functionality of tables across Redivis.

Usage in workflows

All table nodes have one upstream parent, either a data source, a transform, or a notebook.

On each table, you can create any number of transforms or notebooks, which then reference that table as their "source". Tables can also be joined into transforms and notebooks that do not have the table as their primary source.

Table node states

As you work in a workflow, node colors and symbols will change on the tree view to help you keep track of your work progress.

Empty (white background): A table node will be empty when it contains no data because the upstream node has not been executed.

Executed (grey background): A table node will be grey when it has data that aligns with the contents of its upstream node.

Stale (yellow background): A table node will be stale when an upstream change has been made, meaning the content of the node does not match the content of the node above it.

Sampled (black circle with 1% icon): Only relevant to dataset tables; this means that you are using a 1% sample of the table. When a dataset has a sample, it will automatically default to it when added to a workflow. You can switch between the sample and the full table at any time in the dataset node.

Incomplete access (all black background, or dashed borders): You don't have full access to the table. Click on the Incomplete access button in the top bar to begin applying for access to the relevant datasets.

Archived ("archived" pill): The table has been archived, and only its metadata is available. You'll need to re-run the upstream node in order to unarchive the table.

Single Sign-On (SSO)

Overview

When working with non-public data on Redivis, it's often important for users to be able to authoritatively attest to their affiliation with a given institution or other entity. To this end, Redivis supports Single Sign-On (SSO) through most academic institutions, as well as the ability to establish identity through a Google account or validated email address.

Institution SSO via SAML

Redivis is a registered service provider within the US-based InCommon federation, which in turn is part of the eduGAIN federation, enabling secure, authoritative SSO across thousands of universities around the world, via the SAML 2.0 protocol. If you are a member of an academic institution (as well as certain other research enterprises), you can search for your institution by name and log in to Redivis through your institution's sign-in page.

Troubleshooting institution SSO

In most cases, logging in with your institution will "just work". However, due to inconsistencies in how certain standards are applied around the world, you may run into issues when logging in through your institution. These issues can often be resolved with a quick ticket with your IT support desk – we recommend that you direct them to this page and copy [email protected] so that we may provide further technical information if needed.

Some common issues are outlined below:

Your institution does not support the Redivis service provider

If, when choosing your institution to log in, you are immediately presented with an error page (before you can type in your password), this likely means that your institution needs to add Redivis to some sort of "service provider allowlist". As a registered service provider within InCommon / eduGAIN, most institutions will automatically accept login requests from Redivis – but some require manual configuration. In this case, your IT desk will need to take a quick action to enable Redivis – it will likely be helpful to direct them to Redivis's SAML metadata, found here: https://redivis.com/auth/saml/metadata

Redivis was unable to determine identity

This error will occur after you've logged in with your institution, upon being redirected back to Redivis. In this case, the authentication request completed successfully, but your institution didn't provide enough information for Redivis to know who you are (which is important in order for you to apply for restricted data, so that the data distributor can be confident of who they're granting access to!).

Some institutions allow you to configure privacy options associated with your login. If this is the case, navigate to the appropriate settings page within your institutional account, and make sure that your name, email, and institutional identifier / username are released.

Redivis requires all institution identity providers to provide some minimal information about the individual, such as name, email, and a persistent identifier. These are codified as the "research and scholarship attribute bundle". If your institution uses OpenAthens for SSO, you can view their documentation to learn more about releasing these attributes.

For identity provider administrators

Redivis requires the following attributes:

  • eduPersonPrincipalName (urn:oid:1.3.6.1.4.1.5923.1.1.1.6)

  • email (urn:oid:0.9.2342.19200300.100.1.3)

  • name (urn:oid:2.16.840.1.113730.3.1.241)

The following attributes are optional but encouraged if available:

  • affiliation (urn:oid:1.3.6.1.4.1.5923.1.1.1.1) or scopedAffiliation (urn:oid:1.3.6.1.4.1.5923.1.1.1.9)

  • orcid (urn:oid:1.3.6.1.4.1.5923.1.1.1.16)

  • pairwiseId (urn:oasis:names:tc:SAML:attribute:pairwise-id)

  • eduPersonTargetedId (urn:oid:1.3.6.1.4.1.5923.1.1.1.10)

Other error messages

While uncommon, it's certainly possible that other errors might occur when logging in through your institutional credentials. If you encounter one, please contact [email protected] and we'd be happy to help you troubleshoot.

SSO via Google

Redivis also supports the ability to sign in via any Google account. This can be a personal gmail account, or via your organization if it supports Google single sign-on. When you sign in with Google, your name, email, and an opaque persistent identifier will be shared with Redivis.

If your institution supports Google sign-on, but is also listed as a SAML identity provider (see above), the SAML SSO will be preferred. If you try logging in via Google, you will be redirected to your institution's login page.

Email sign-on

If your institution isn't listed and doesn't support SSO through Google (e.g., many @.gov emails), you can also sign in via any email address.

Redivis will send a unique code to this email every time you log in, making it such that the account owner continuously "proves" their ownership of the given email address.

For security purposes, you must enter the code sent to your email in the same window from which it was initially requested. If you want to log in from a new window / device, you can request a new code.

Export to other environments

Overview

Redivis workflows contain powerful tools to reshape and analyze data, but if you prefer to export data into a different environment you can easily do so. Redivis systems use open source tools and common formats to make the transition as easy as possible.

1. Check export restrictions

Some datasets on Redivis have export restrictions which prohibit or limit removing data from the system.

You can check this by going to the dataset page and clicking the top right button Manage access, or by right-clicking on a dataset in a workflow and selecting the View access option.

The bottom section of this access modal defines any export restrictions in place. It might not have any restrictions, might be completely restricted, or might have some limited options for export.

If there are limited options for export, this section will detail the available locations and any restrictions on using those (such as the size of the table being exported). These restrictions are enforced automatically when you try to take actions within the system.

If there are no options for export, you still have the option to work with your data in a Redivis workflow, where you can reshape and clean your data using transforms, and then analyze it in notebooks.

Learn more in the Work with data in a workflow guide.

2. Prepare your data

If you would like to export the entire table, you can skip this step.

Otherwise, we recommend that you use transforms to cut your data down to a smaller size and reshape it into the table format you need for analysis before exporting it. Especially for large datasets, Redivis tools are created specifically for these purposes and work seamlessly in the browser with no additional setup.

Learn more in the Reshape tables in transform guide.

3. Export data

You can open the Export modal to initiate an export from any table. You can do this on the dataset page by right clicking a table and selecting the Export menu option, or by right clicking on any table in a workflow and selecting the same option.

Here you can see all the options available for your table to export.

If your table has export restrictions set by the data owner, options will be disabled here. In the case where your table does not meet export requirements but the data owner allows exception requests, you will see the option to do so here.

Download

The first tab of this modal gives you different options for formats to download your data. Most common data formats are supported, including csv, json, avro, parquet, SAS, Stata, and SPSS. Select the format you'd like and click the Download button to start downloading the file.

Learn more in the Downloads reference section.

Programmatic reference

You can reference this table from your computer or another computational environment using the Redivis Python and R libraries. This modal gives specific information on how to reference it, and you can reference our docs for other options.
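As a minimal illustration, referencing a table with the redivis-python library might look like the sketch below. The owner, dataset, and table names are placeholders; the modal shows the exact reference string for your table, and the client library docs cover additional options.

import redivis

# Reference a table by owner, dataset (or workflow), and table name,
# then pull it into a pandas DataFrame for local analysis.
table = (
    redivis.user("user_name")
    .dataset("dataset_name")
    .table("table_name")
)
df = table.to_pandas_dataframe()
print(df.head())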

Integrations: Google Looker Studio

This is a free Google dashboard visualization program you can directly link your table to. Use their point and click interface to build common visuals that will update as the underlying table updates.

Learn more in the Google Looker Studio reference section.

Integrations: Google Cloud Storage

You can export your table directly into a Google Cloud Storage bucket that you have access to.

Learn more in the Google Cloud Storage reference section.

Integrations: Google BigQuery

You can export your table directly into a Google BigQuery project that you have access to.

Learn more in the Google BigQuery reference section.

Next steps

Cite datasets in your publications

If the work you're doing leads to a publication, make sure to reference the dataset pages from datasets you've used for information from the data administrators on how to correctly cite it.


Optimization and errors

Query optimization

The Redivis transform connects to a highly performant, parallelized data store. Queries on terabytes of data can complete in seconds, often utilizing the resources of thousands of compute nodes. These are some best practices that can help you increase performance.

Limit output table size

Table writes are generally much slower than table reads — if your output table is exceptionally large, it may take the querying engine several minutes to materialize the output. Try restricting the number of rows returned by applying row filters to your transforms when possible, and be cognizant of joins that may substantially increase your row count. Avoid keeping variables that aren't needed.

When you are performing initial exploration, you may consider using a limit step in your transform to reduce the output table size.

Reduce the number of new variables

Each new variable adds to the computational complexity of the query; the new variable must be computed for every row in the table.

A common anti-pattern is to construct numerous new boolean variables via CASE statements, and then use the result of these new variables in the row filter(s). If possible, it is far more efficient to inline the CASE logic within the row filters, or within fewer new variables, as this allows for improved efficiency in logical short-circuiting.

Optimize join patterns

When your query utilizes a join step, consider the order in which you are joining the data. The best practice is to place the largest table first, followed by the smallest, and then by decreasing size.

While the query optimizer can determine which table should be on which side of the join, it is still recommended to order your joined tables appropriately.

If all of your joins are INNER joins, the join order will have no impact on the final output. If your query leverages combinations of left / right / inner joins, the join order may affect your output; be careful in these cases.

Common errors

The Redivis transform will prevent you from initiating a run when any steps are invalid, and errors should be rare. However, some errors can only be detected as the query is run — in these cases the job will fail as soon as it encounters an error, logging the error message to the top of the transform.

If you come across an error message not on this list, please email [email protected] for further assistance.

Resources exceeded during query execution

This error occurs when a query utilizes too much memory, yet it is often easily resolvable and due to unintended behavior within the transform. This error is caused by a certain component of the query not being parallelizable, which often occurs when combining and / or ordering many distinct values. We recommend investigating the following culprits:

  1. Investigate any order clauses in your transform — either at the bottom of the transform or in a partitioned query. Often, attempts to order on hundreds of millions or billions of distinct values will fail. Note that ordering rows at the bottom of the transform does not affect your output, and is only useful in preparing your data for export.

  2. Confirm that none of your aggregation methods are creating massive cells. For example, using the String aggregate method on an exceptionally large partition can collapse and concatenate many values into one record — if the cell becomes too big, this error will be thrown.

Cast / type conversion errors

When converting between variable types, all values must be appropriately formatted for conversion to the new type. For example, the value "1,000" is not a valid integer and will throw an error when being converted from a string to an integer.

There are several options for getting around cast errors:

  1. Choose the "If invalid for type, set to null" option in your retype blocks. Note that this will set all invalid values to NULL, potentially causing unintended side effects. Use with caution.

  2. Filter out all records that have incompatible values.

  3. Create a new variable, using the Case method to convert any invalid values to something that can be appropriately cast.

Maximum table size

If an output table is more than 1TB, it cannot exceed the combined size of all source tables by more than 10%. Very large output tables that substantially exceed their inputs are typically the result of a misconfigured join that generates a cross-product between a one-to-many or many-to-many relationship between multiple tables. If you encounter this error, try to apply filter and aggregation steps first, and also validate that your join conditions are appropriately specific.

Too many tables, views and user-defined functions for query: Max: 1000

This error may occur when running queries on tables belonging to an unreleased version, particularly when these tables are made up of hundreds of independent uploads. Under the hood, these unreleased tables are represented as a logical view that stitches the various uploads together into a single table. If your query references multiple unreleased tables, with each approaching the 500 per-table upload limit, it's possible to exceed the total allowed number of tables referenced by a query.

To work around this issue, you can create a transform that simply selects all variables from the unreleased table, materializing the result in your workflow. This output table will now only count as a single table in your query, avoiding this error.

This error will also no longer be an issue once the version is released, as the table is materialized shortly after a version is released.

Other errors

If you come across any other errors or issues while using the transform please contact us directly at [email protected]


Getting started

Overview

Redivis is a platform that allows researchers to seamlessly discover, access, and analyze data. This brief guide will walk you through the basics to get up and running and provide a launching point for exploring other resources in this documentation.

1. Create your account

Many datasets on Redivis are public, and you can browse them without creating an account. However, you'll need an account in order to analyze data, as well as to apply for access to restricted datasets.

Click the Create account button in the top right of any page to sign up. You can use your academic institution's login credentials or any Google account.

Once you create an account, you'll be navigated to your Workspace, which is your private area for creating workflows to work with data, uploading your own datasets, and managing your Redivis account.

2. Join your organization

The main entity on Redivis is a dataset. Most datasets on Redivis are uploaded by organizations.

You might already know the organization that you’ll be working with, or you can browse organizations by clicking the Find organizations button on the Organizations tab of your Workspace. You might know your organization's URL already, or see it listed in the Recent section of the Dashboard if you were already viewing that page.

Once you join an organization, you'll be able to apply for any of their restricted datasets, and also see a link on the left bar of your workspace to quickly get back to the organization's home page.

3. Find datasets

The best place to find datasets on Redivis is to search on an organization's home page. Here you can click on the Datasets tab to browse all datasets and filter by their metadata. All searches perform a full-text search across a dataset, its documentation, tables, variables, and rich metadata content.

If your organization is part of an institution (such as Stanford or Columbia) you can also go to the institution page to search all datasets in your institution.

Click on any dataset title to go to that Dataset page where you can view the data and metadata it contains. The data for this dataset will be available on the Tables tab, which you can explore further.

You will also probably come across restricted datasets. For these you will need to click the Apply for access button on the top right of the Dataset page and complete the requirements to gain approval from the dataset's administrators.

Learn more in the Discover & access data guide.

4. Analyze data

Once you've found a dataset that you want to work with, you can add it to a Workflow. Workflows are the fundamental analysis interface on Redivis, where you can query, merge, reshape, and analyze any dataset that you have access to — all from within your web browser.

Add a dataset to a workflow by clicking the Add to workflow button on the top right of the dataset page.

In a workflow we can create a Transform by selecting any table and clicking the +Transform button, which allows us to combine and reshape our data into a final output table that best serves our analysis. These transforms use a powerful SQL engine under the hood, allowing us to query incredibly large tables - even billions of records - in seconds.

After creating output tables for analysis, we can create a computational Notebook in R, Python, Stata, or SAS to further analyze our data and develop our final figures. Select any table and click the +Notebook button to get started. The notebook will initialize and pull in the table you've selected (or a sample of the table if it is a large table).

We can also export and query this table from external environments by clicking the Export table button on the right side of any table, allowing us to use whatever analytical tools best suit our research question. Here we can download the table in a number of common formats, interface with it programmatically via the API, or export the table to supported integrations.
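
As a sketch of the programmatic route, the Redivis Python client lets you query a table directly from an external environment. The table reference and variable names below are illustrative; consult apidocs.redivis.com for the authoritative client methods.

import redivis

# Query a table via the API and pull the result into a pandas dataframe
df = redivis.query("""
    SELECT id, name, latitude, longitude
    FROM `demo.ghcn_daily_weather_data.stations`
    LIMIT 100
""").to_pandas_dataframe()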

Learn more in the Analyze data in a workflow guide.

Next steps

Upload your own datasets

Augment your data analysis in Redivis by uploading your own datasets, with the option to share with your collaborators (or even the broader research community).

Learn more in the Create & manage datasets guide.

Administer your organization

Organizations allow for research groups and centers to securely distribute data to their research community. Organization administrators can create datasets, manage access, review logs, and create customized reports of their data utilization.

Contact an existing administrator to add you to their organization, or contact us to set up a new organization. Learn more in the Administer an organization guide.

Running ML workloads

Overview

Notebooks on Redivis offer a performant and highly flexible environment for doing data analysis. This includes the ability to run state-of-the-art machine learning (ML) models: training new models, fine-tuning existing ones, and using them to perform inference and generate novel outputs.

This guide covers a number of common use cases when running ML workloads on Redivis. We generally focus on the Hugging Face + PyTorch ecosystem in Python, though these examples are broadly applicable to other ML libraries and languages.

For a detailed example of using Redivis to fine-tune a large language model, see the complementary example:

1. Create a notebook with appropriate computation capacity

Training ML models and running inference can require a substantial amount of compute capacity, depending on various factors such as your model and dataset size, usage parameters, and performance goals.

The default, free notebook on Redivis offers 2 CPUs and 32GB of RAM. While this may work for initial exploration, running practical machine learning workflows typically requires a GPU. When creating your notebook, you can choose a custom compute configuration to match your needs.

Redivis offers a number of custom compute configurations, mapping to the various machine types available on Google Cloud. We recommend starting with a more modest GPU for initial exploration, and upgrading as needed when computational or performance bottlenecks are reached. For this example, we'll use the NVIDIA L4 GPU, which provides solid performance at a reasonable cost.
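
Once the notebook starts, a quick way to confirm that PyTorch can see the GPU is:

import torch

# Confirm that a CUDA-capable GPU is attached and visible to PyTorch
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA L4"
else:
    print("No GPU detected -- check the notebook's compute configuration")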

2. Define dependencies

The Redivis python notebook is based on the jupyter-pytorch notebook image, with PyTorch, CUDA bindings, and various common data science libraries pre-installed. However, if you require additional dependencies for your work, you can specify them under the "dependencies" section of your notebook.

2a. [Optional]: Pre-load external models when internet is disabled

Models can also be stored within a dataset on Redivis, in which case you won't need to pre-load the model, but can rather import it directly from the Redivis dataset (see step 3 below)

If your notebook references export-restricted data, for security reasons, internet will be disabled while the notebook is running. This can present a challenge for some common approaches to ML in python, such as downloading a model or dataset from Hugging Face. For example, we might reference a model as follows:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")

This code typically downloads the model from Hugging Face and caches it to our local disk. However, if internet is disabled, this command will hang and ultimately fail. Instead, we need to download the model during notebook startup, before the internet is disabled, as part of the post_install.sh script under the notebook's dependencies:

python -c '
from huggingface_hub import snapshot_download
snapshot_download(repo_id="sentence-transformers/all-MiniLM-L6-v2")
'

This will download the model weights and other files to the default Hugging Face cache directory, ~/.cache/huggingface/hub.

Now, within our notebook, we can load the cached model. Make sure to set local_files_only=True, so that Hugging Face doesn't try to connect to the internet to check for a newer version of the model.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2", local_files_only=True)

Some libraries may require you to provide a full path to the model files. For the model mentioned above, this would be ~/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2/snapshots/<snapshot_id> .

To find the appropriate path, list the contents of the ~/.cache/huggingface/hub directory and its descendants.
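
For example, a small snippet like the following (a sketch using only the Python standard library) will print the snapshot directory for every cached model:

from pathlib import Path

# Each cached model lives at models--<org>--<name>/snapshots/<snapshot_id>
for snapshot in Path.home().glob(".cache/huggingface/hub/models--*/snapshots/*"):
    print(snapshot)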

3. Load a model

3a. Load a model from Redivis

Machine learning models can be stored directly within a Redivis dataset as unstructured files. For example, this dataset contains the various files that make up the bert-base-cased model on Hugging Face. We can then download the model to our notebook's local filesystem:

import redivis
table = redivis.organization("demo").dataset("huggingface_models").table("bert_base_cased")
table.download_files("/scratch/bert-base-cased")

And then reference this as a local model. E.g.:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("/scratch/bert-base-cased", num_labels=5)
tokenizer = AutoTokenizer.from_pretrained("/scratch/bert-base-cased")

When using models stored on Redivis, we don't have to worry about whether our notebook has internet access, nor do we need to rely on the future availability of that particular model on Hugging Face.

To download a model from Hugging Face and save it to a Redivis dataset, you can either download the files from Hugging Face and re-upload them to Redivis, or you can use a notebook to programmatically upload the files. E.g.:

from huggingface_hub import snapshot_download
import redivis

# Download the model files from Hugging Face
snapshot_download(repo_id="google-bert/bert-base-cased")

# Specify an existing dataset and table on Redivis. 
# Consult the python docs for how to programmatically create datasets (apidocs.redivis.com)
table = redivis.organization("demo").dataset("huggingface_models").table("bert_base_cased")

# Add the downloaded model files to the table
table.add_files(directory='/home/root/.cache/huggingface/hub/models--google-bert--bert-base-cased/snapshots/cd5ef92a9fb2f889e972770a36d4ed042daf221e')

3b. Load a model from an external source

If your notebook has internet access, you can also use any other models that may be available on the internet. For example, we can load the same bert-base-cased model directly from Hugging Face:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

# Do cool things!

Note that if you are working in a notebook with disabled internet, this approach won't work, and you'll need to use the methods mentioned in either 2a or 3a above.

4. Load data

As a final step, you'll likely want to load data into your notebook to either fine-tune the model, perform inference, or otherwise experiment. There are thousands of datasets on Redivis, and you can upload your own data as well. You can learn more about loading data into a notebook in the python notebooks documentation. As a quick example:

import redivis

# Load tabular data as a pandas data frame, or a number of other formats (arrow, polars, iterator, etc)
table = redivis.organization("demo").dataset("ghcn_daily_weather_data").table("stations")
df = table.to_pandas_dataframe()

# Download unstructured data as files to your local disk
files_table = redivis.organization("demo").dataset("chest_x_ray_8").table("images")
files_table.download_files("/scratch/xray_images")

Of course, assuming your notebook has access to the external internet, you can also call various APIs to load external data sources.

from datasets import load_dataset

# load a dataset from Hugging Face
hf_dataset = load_dataset("Yelp/yelp_review_full") 

Next steps

At this point, you have all the tools at your disposal to perform cutting edge ML research. But of course, what you do next is totally up to you. We do recommend further familiarizing yourself with the examples and detailed documentation to take full advantage of the capabilities of Redivis notebooks:

  • Example workflow, Fine tuning a Large Language Model

  • Python notebook reference

  • R notebook reference

Fine tuning a Large Language Model (LLM)

This guide demonstrates using a Redivis workflow to import an existing LLM, fine-tune it on relevant data, and then run it on another, similar set of data we are interested in.

Workflow objective

Here, we want to fine-tune a pre-trained "foundational" LLM so that it can be used to score reviews. We will leverage an existing dataset that contains a collection of Yelp reviews and their scores to perform the fine-tuning, and then apply this classification model to other reviews (from Reddit) that do not contain an accompanying score. The goal here is to demonstrate how Redivis can be used to leverage, modify, and ultimately apply state-of-the-art LLMs to novel data.

This workflow is on Redivis! We suggest recreating it as you follow along to best learn the process.

1. Choose and explore data

For this workflow we'll need our initial data to train the model on (in this case Yelp reviews) and the data we want to apply the model to (Reddit posts). These data are already on Redivis, split across two datasets uploaded to the Redivis Demo organization: Yelp Reviews (Hugging Face) and Reddit.

Yelp reviews

To get started we want to understand this dataset and what information is in each table. We can look at the dataset page to learn more about it, including its overview information, metadata, and variable summary statistics. Since this dataset is public we can also look directly at the data to confirm it has the information we need.

It looks like there are two tables, one with reviews for testing a model and another with reviews for training a model. Clicking on each table in this interface shows that they both have two variables (label and text) and that the Train table has 650,000 records while the Test table has 50,000 records.

This data seems to be formatted exactly how we'll want to use it, so we don't need to do additional cleaning.

Reddit

This dataset contains over 150 million Reddit posts plus subreddit information, split across two tables. We can look more closely at the 33 variables in the Reddit posts table, including univariate statistics.

For this workflow, we just want to look at reviews from one specific subreddit dedicated to reviewing computer mice: MouseReview. If we click on the Subreddit variable name, we can see a searchable frequency table with all of this variable's values. Searching for MouseReview shows that this dataset contains 26,801 posts from that subreddit.

To move forward with this workflow we'll want to train a model on the Yelp dataset, and filter and clean the Reddit table to make it more usable with our model. In order to clean or transform data and do our analysis, we'll need to create a workflow.

2. Identify a base model to fine-tune

We want to leverage an existing model that understands language and can generally be used for language classification. There are many open-source models that might meet our needs here; in this example, we'll use Google's BERT-base-cased model.

This model is hosted on Hugging Face, so we could load it directly into our notebook at runtime. However, if our notebook uses restricted data, it might not have access to the external internet, in which case we'll need to load the model into a dataset on Redivis.

The Redivis dataset for this model can be found here. You can also learn more about loading ML models into Redivis datasets in our accompanying guide.

3. Create a workflow

At the top of any dataset page, we can click the Analyze in workflow button to get started working with this data.

You can add this dataset to an existing workflow you already have access to, or create a new workflow to start from scratch.

Add the additional datasets by clicking the + Add data button in the top left corner of the workflow and searching for the dataset by name.

4. Create a notebook and load model + data

Once we've added all our datasets to the workflow, we can get started. To begin, we'll create a python notebook based on the Yelp reviews training data, by selecting that table and clicking the + Notebook button.

To enable GPU acceleration, before starting the notebook, we'll choose a custom compute configuration with an NVIDIA-L4 GPU, which costs about $0.75 per hour to run (we could use the default, free notebook for this analysis, but it would take substantially longer to execute).

Configure our notebook to include a GPU

We'll also need to install a few additional dependencies to perform training and inference via Hugging Face python packages:

Specify dependencies by clicking the "Dependencies" button

With that, we can start our notebook! The full annotated notebook is embedded below, and also viewable on Redivis.

The general steps here are as follows:

  1. Load the training and test data from the Yelp reviews dataset

  2. Load the pretrained BERT model

  3. Train this base model on the Yelp data to create a fine-tuned model that can classify text reviews with a score of 0-4.
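
Condensed, those three steps look roughly like the sketch below. The table reference and checkpoint paths are illustrative, and the full annotated notebook on Redivis remains the authoritative version.

import redivis
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# 1. Load the Yelp training reviews (columns: "label" 0-4 and "text") from the workflow
train_df = redivis.table("yelp_train").to_pandas_dataframe()
train_ds = Dataset.from_pandas(train_df)

tokenizer = AutoTokenizer.from_pretrained("/scratch/bert-base-cased")
train_ds = train_ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

# 2. Load the pretrained BERT model with a 5-class classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "/scratch/bert-base-cased", num_labels=5
)

# 3. Fine-tune on the Yelp data
training_args = TrainingArguments(output_dir="/scratch/checkpoints",
                                  per_device_train_batch_size=16,
                                  num_train_epochs=1)
Trainer(model=model, args=training_args, train_dataset=train_ds).train()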

5. Prepare the Reddit data for inference

The Yelp data was ready to go as-is, with a simple text field for the review and integer value for the score. For the Reddit data, we just need to run a quick filter to choose posts from the appropriate sub-reddit.

We will use a transform to clean the data, as transforms are best suited for reshaping data at scale. Even though we might be more comfortable with Python or R, this dataset table is 83GB and it will be much easier and faster to filter it in a transform rather than a notebook.

Create a transform

Click on the Posts table in the Reddit dataset and press the + Transform button. This is the interface we will use to build our query.

Add a Filter step. Conceptually we want to keep records that are part of the subreddit we are interested in, and are not empty, deleted, or removed.

The final step in a transform is selecting which variables we would like to populate the resulting output table. In this case we just need the variables title and selftext.
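
For reference, the same logic could also be written as a single SQL query step. The variable names follow the Reddit posts table, while the exact placeholder values for empty or deleted posts are illustrative.

SELECT title, selftext
FROM _source_
WHERE subreddit = 'MouseReview'
    AND selftext IS NOT NULL
    AND selftext NOT IN ('', '[deleted]', '[removed]')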

With everything in place we will run this transform to create a new table, by pressing the Run button in the top right corner.

6. Use the fine-tuned model to classify Reddit reviews

Finally, we can apply our fine-tuned model to the subset of Reddit posts that we want to analyze. Ultimately, we produce a single output table from the notebook, containing each Reddit post and the associated score generated by our model.
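
A minimal sketch of this inference step is shown below. It assumes the fine-tuned model and its tokenizer were saved to the notebook's scratch disk, the filtered Reddit table is referenced by an illustrative name, and the output-table helper follows the pattern in the Python notebook reference.

import redivis
from transformers import pipeline

# Load the filtered Reddit posts produced by the transform (table name is illustrative)
posts = redivis.table("posts_output").to_pandas_dataframe()

# Score each post with the fine-tuned classifier (labels 0-4)
classifier = pipeline("text-classification", model="/scratch/fine_tuned_model", device=0)
posts["predicted_score"] = [
    result["label"]
    for result in classifier(posts["selftext"].fillna("").tolist(), truncation=True)
]

# Materialize the scored posts as the notebook's output table
redivis.current_notebook().create_output_table(posts)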

Next steps

Workflows are iterative, and at any point you can go back and change the source data, transform configurations, or notebooks and rerun them. Perhaps you want to look at other subreddits, or run the model on a larger sample of the Yelp data.

You can also fork this workflow to work on a similar analysis, or export any table in this workflow to analyze elsewhere.

We do recommend further familiarizing yourself with the examples and detailed documentation to take full advantage of the capabilities of Redivis notebooks:

  • Guide to running ML Workloads on Redivis

  • Python notebook reference

  • R notebook reference

Create & manage datasets

Overview

Datasets are a core component of Redivis. They are a versioned collection of tables containing data, alongside rich documentation and metadata.

Datasets can be hosted by organizations or individual users, and every dataset has its own Dataset page. Datasets can be shared with other users on Redivis according to their access configuration.

1. Create a dataset

Administrators can create datasets for their organization from the Datasets tab of their organization's Administrator panel. These datasets can be seen and managed by any administrator in the organization. When released, they will be visible on the organization’s home page to anyone who has overview access to the dataset.

Alternatively, anyone with a Redivis account can create a dataset on the Datasets tab of their Workspace. These datasets are by default only visible to their owner, and have simplified options to support sharing with your collaborators.

When you first create a dataset, it will be unpublished and only visible to other editors. This means you can edit the dataset, validate everything, and configure its access rules before releasing it.

2. Import data

At the core of every dataset is the data it contains, so we recommend starting here.

All data in a dataset is stored in tables. You can create a new table on the Tables tab of your dataset and start importing data. Redivis can upload data from your computer or another location you’ve linked, such as Box, Google Drive, AWS, and Google Cloud.

Once your data is finished importing you can validate that this table looks as you expect it to.

You can create more tables here if this dataset has multiple separate tables.

However, if your data is split across multiple files that all follow the same structure (such as a different file for each state, or each year of data but with generally the same variables) you will want to import all of these files to the same table, where they will be automatically appended together.

Learn more in the Upload tabular data as tables guide.

3. Populate metadata

Metadata is essential to helping your researchers find and utilize your dataset. While some metadata will be generated automatically, such as variable summary statistics and counts, other metadata will require additional input.

Dataset metadata

On the Overview tab of the dataset there are multiple suggested sections to help break down information. You can fill out any of these that apply, such as the methodology, tags, contact information, etc.

Redivis will automatically generate citation and provenance information based on what we know about the dataset, but you can update this information with anything more specific.

If you have additional information that you want to include that doesn't fit one of these headers, you can create a custom section. Custom sections can also be set to be visible only to certain access levels if you have sensitive information.

Table metadata

You should also populate the metadata on each table. Tables can have a description, as well as an entity field that defines what each row in the table represents. You can also define the temporal and geographic range on the table, when relevant.

Variable metadata

Each variable within a table has its own metadata. The variable name and type will be pre-determined from your data, but you should add a short label and longer description to each variable to help researchers understand what that variable measures.

Additionally, some variables will contain coded values, in which case you should provide value labels that represent the human-readable term for each code.

Learn more in the Create and populate a dataset guide.

4. Release

Once you are ready to make your dataset available to others, you'll need to release it. You can click the Review and publish or Review and release button in the top right of the page.

You'll want to double check all of your data before moving forward. While you can continue to edit the documentation and metadata after the version is released, the data in a version cannot be changed.

You can unrelease a version for up to 7 days after release, though this should generally be avoided. When updating your data to correct for mistakes, you’ll need to release a new version.

You should also confirm the access settings you set up when creating this dataset.

Once this dataset is released, it will become visible and available to anyone who would qualify for access.

Next steps

Edit data

You can use tools right on Redivis to create new versions of your dataset.

Learn more in the Edit data in a dataset guide.

Administer your organization

Organizations allow for groups, centers, and institutions to more easily work with data by providing administrators with tools to effectively version and distribute data from a central location.

Contact us to set up a new organization, and learn more in the Administer an organization guide.

Cleaning tabular data

Overview

Redivis datasets are a great place to host data for interrogation in a workflow, but you can also edit the underlying data.

Perhaps you found an issue with the source, or want to restructure it, before making it available to others. Redivis has all the tools you'll need to do this in a versioned, step-by-step process with transparency into the changes you've made.

1. Upload the raw data

If you haven't yet, upload the data you want to work with to a Redivis dataset and release the version. A personal dataset or one belonging to an organization works just as well.

2. Add this dataset to a workflow

Create a new workflow and add this dataset to it.

If you want to share the data transformation process with others for transparency, you can make this workflow public in the share modal. (Even though the workflow is public, only people with access to the underlying data will be able to see the data within it.)

For this guide, you can follow along in the Demo tables edits workflow.

3. Transform data

Select the table you want to make changes to and create a new transform.

Tip: you can reference the names of specific uploads within your table using upload pseudo-variables.

Use this transform to edit this table. Some common actions include:

Example edit 1: Rename and retype a variable

Create new steps to rename and retype any variables you'd like to update.

Example edit 2: Recoding the values of a variable

Create a new variable with the same name. "Keep" this variable and "Discard" the original variable. Select the method as "Case (if/else)" and create the conditions you want to recode to.

Example edit 3: Add a new variable

This example shows a new variable with today's date as "date_uploaded", but you could create any variable you want from the data in the table. Perhaps an aggregation that would be helpful to see with this data, or the sum of multiple other variables?

You can transform this table however you want, using the Graphical interface or SQL code. Make sure to move all variables in this table from the "Discard" section to the "Keep" section except for any variables that have been replaced and will be left behind (in this case the store_and_fwd_flag from the source table).
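
If you prefer code over the graphical interface, the three example edits above can be expressed together in one SQL query step. The variable names below are illustrative, apart from store_and_fwd_flag and date_uploaded.

SELECT
    * EXCEPT (store_and_fwd_flag, pickup_datetime),
    -- Example edit 1: rename and retype a variable
    CAST(pickup_datetime AS DATE) AS pickup_date,
    -- Example edit 2: recode the values of a variable, keeping its name
    CASE WHEN store_and_fwd_flag = 'Y' THEN 'yes' ELSE 'no' END AS store_and_fwd_flag,
    -- Example edit 3: add a new variable with today's date
    CURRENT_DATE() AS date_uploaded
FROM _source_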

Validate that this new table looks correct by looking at the output table below this transform!

You can make changes to any other tables in this dataset in this same workflow.

4. Re-upload the finished tables

Go back to the original dataset, and create a new version.

Open a table that you made changes to in your workflow, and click to "Import data."

Choose your merge strategy as "Replace" since you want to replace the existing table with the new one you created.

Select the data source as "Redivis" and type in the information from the workflow where you made your changes. The sequence here will be:

  1. Your user name

  2. The name of the workflow (underscores replace spaces)

  3. The table name (underscores replace spaces)

For example: username.testworkflow.table_name

Once it's done uploading you can validate that the data looks like it's supposed to.

Now you can close this table and repeat this process for any other tables in this dataset you've edited.

Once this is finished, you can add a link to the workflow in this dataset's documentation or release notes, along with a note on the changes. Remember that if you made the workflow public, anyone who has data access to this dataset can view the changes you made.

Now release this version! This new version of the data contains the edited data. Anyone using this data in a workflow will see that this dataset has a new version next time they open a workflow. You can see the release notes and the updated table of our Demo tables updates live.

Redivis stores data as compactly as possible, so if you are concerned about storage costs, only newly created records will add to your storage footprint. If you would like to delete the first version of the data, you can do so by opening the version modal in the dataset editor and clicking "Delete version".

Next steps

Start working with your data

Once your dataset is released, bring it into a workflow to transform and analyze it leveraging lightning fast tools from your browser.

Learn more in the Analyze data in a workflow guide.

Notebooks

Overview

Notebooks provide a flexible computation environment for data analysis, connecting Redivis tables to the scientific compute stack in Python, R, Stata, and SAS. Redivis notebooks are built on top of Jupyter notebooks and the .ipynb notebook format.

From within a notebook, you will be able to query and analyze any table within your workflow, and (optionally) generate an output table from the notebook. Because data referenced in a notebook never leaves Redivis, you can securely analyze data that would otherwise be bound by export restrictions.

See our Analyze data in notebooks guide for a step-by-step walkthrough of using notebooks on Redivis.

Notebook node states

As you work in a workflow, node colors and symbols will change on the tree view to help you keep track of your work progress.

State
Display
Details

Starting

Spinning circular icon

The notebook is being provisioned and will be available for use once it has started.

Running

Purple dot on the node corner

The notebook has started and cells can be executed. It will continue running until it is stopped or times out.

Stopped

Grey background

The notebook has stopped and only a read-only view is available.

Errored

Red exclamation icon

An error has occurred during the startup process or an issue has happened requiring a restart.

Stale

Yellow background

An upstream table has been changed since the notebook was last started.

Incomplete access

All black background, or dashed borders

You don't have full access to the node. Click on the Incomplete access button in the top bar to begin applying for access to the relevant datasets.

Step: Rename

Overview

The Rename step changes the name of an existing variable.

Example starting data:

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 | sam     | 74     |
 | pat     | 62     |
 *---------+--------*/

Example output data:

Rename score to final_score

/*---------+--------------*
 | student | final_score  |
 +---------+--------------+
 | jane    | 83           |
 | neal    | 35           |
 | sam     | 74           |
 | pat     | 62           |
 *---------+--------------*/

Step structure

  • There will be at least one rename block where you will define a variable and a new name.

Input field definitions

Field
Definition

Source variable

The variable that you want to rename.

New name

The name that your selected source variable will be given. Note that this input needs to follow Redivis variable naming rules.

Example

We can change the name of a variable to give more meaning to the data.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 100   | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 100   | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Source variable: The variable we want to rename is date so we select that here.

  • New name: date_taken would be a better name for this variable, so we enter it here.

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date_taken |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 100   | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 100   | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Transforms

Overview

A transform node in a Redivis workflow provides a powerful querying tool to reshape one or more tables into a new output table.

You will build your transform by choosing steps one at a time based on the actions you want to take on your data. At any point you can run the transform to execute each step sequentially and view the resulting table.

The most commonly used transform steps can:

  • Aggregate a table

  • Create variables based on existing data

  • Filter out rows you no longer need

  • Join additional tables

The transform interface translates your interface inputs to SQL code that you can view and edit. You can also intermix SQL code you've written between interface steps, or build an entire transform of your own SQL code if you wish.

See our Reshape tables in transforms guide for a step-by-step walkthrough of using transforms on Redivis.

Transform node states

As you work in a workflow, node colors and symbols will change on the tree view to help you keep track of your work progress.

State
Display
Details

Empty

White background

A run has never been attempted.

Executed

Grey background

A run has successfully completed and no changes have been made since then.

Invalid

Black exclamation icon

A run is not possible. This might be because you haven't finished building the steps or have input invalid information.

Errored

Red exclamation icon

A run has finished unsuccessfully. This might be due to an incorrect input you've set that our validator can't catch. Or something might have gone wrong while executing and you'll just need to rerun it.

Edited

Grey hash marks

Changes have been made to the contents of the transform since it was last run. You can either Run this transform or Revert to its previously run state to resolve it. Editing a transform makes the downstream table stale.

Stale

Yellow background

An upstream change has been made. This means the content of the node does not match the content of the node above it.

Queued

Static circular icon

Nodes that are waiting for an upstream run to finish before running themselves. When you run multiple nodes they will be executed in a logical order. Once upstream dependencies are finished, queued nodes will automatically start running.

Running

Spinning circular icon

A run is currently in progress. You can cancel running (or queued) transforms by clicking the Run menu in the top bar and selecting Cancel. Depending on how far along a running node is, it might not be possible to cancel it.

Incomplete access

All black background, or dashed borders

You don't have full access to the node. Click on the Incomplete access button in the top bar to begin applying for access to the relevant datasets.

Upload tabular data as tables

Overview

Redivis offers extensive tools for previewing tabular data and transforming it in workflows, but the data needs to be uploaded correctly in a tabular format for researchers to utilize these tools.

This guide assumes you have already started by Creating a dataset.

1. Locate the data you want to upload

You can upload data directly from your computer, or import from a linked account.

If importing, you'll want to get the relevant external account configured to your Redivis account before getting started.

The import tools allow for multiple uploads, so there's no need to combine files before importing them, but it's helpful to have them all in the same place.

2. Create tables

On your newly created dataset, the first step to uploading data is to create one or more tables that the data will be uploaded to.

The data files you currently have may or may not be how you want to store them on this dataset, so it's important to think about your data's structure before getting started.

For example, if you have multiple files that all follow the same schema, we strongly recommend uploading them as one table (for example, if you have a separate table for each year, or a separate table for each state, but the structure of each is the same). In the example of one file per state, this would allow researchers to query across all states, skipping the first step of doing up to 50 joins. Additionally, you generally shouldn't split out tables for performance reasons — even when querying billions of records, Redivis will execute in seconds.

When uploading files to tables, remember that every row in a table should represent the same "thing", or entity; we wouldn't want to combine county-level and state-level observations in one table.

If you haven't already, we very strongly recommend experimenting with the reshaping and analytic tools in a workflow which researchers will use to work with your dataset. Knowing how they will work with it might inform how you structure it during this setup process, and can save time for everyone. You can even add your unreleased dataset to a workflow for testing — click on "View dataset page" from the dataset overview, and then add the unreleased dataset to your workflow.

When you're ready click the Create new table button on the Tables tab of the dataset page and name your table to get started.

3. Upload tabular file(s) to create a table

To get started uploading, choose the data source. By default this is your computer, but you can choose any option from the dropdown menu.

Next, choose the file(s) or enter the paths of the file(s) you want to import.

If you select multiple files here, they will be automatically appended in this single table on upload based on common variable names. If a variable is missing in some of the files, that's ok, it will just be recorded as null for all records in that file.

For a full list of supported file types, as well as advanced functionality (such as wildcard imports) and error handling techniques, consult the Uploading data reference.

Once your files are selected, click the Import button. If the files are coming from your computer, you might need to wait until they are finished uploading to the browser before they can be imported into Redivis.

Learn more in the Uploading data reference section.

4. Verify uploads

As you upload files, you will see an overview of any files' progress and can click to view each file's data and additional information.

Once all uploads have completed, you can inspect the table (representing the concatenation of all of your uploads). Make sure to check the summary statistics and other analytical information to validate that the data are as you expected.

If you have more files to upload you can click the Manage imports button on the right side of the table at any time (up until releasing this version of the dataset).

Next steps

Continue uploading your dataset

Great metadata makes your dataset usable. Complete your metadata, along with configuring access, creating a sample, and releasing this version.

Learn more in the Create & manage datasets guide.

Step: SQL query

Overview

Redivis supports direct SQL queries for advanced functionality and for users who prefer to work in a programmatic interface. You can add one or more SQL query steps to any transform.

This step accepts SQL code and will execute in the order it has been placed within the rest of the steps.

Query syntax

SQL Dialect

Redivis supports the BigQuery Standard SQL Syntax and features, with certain limitations.

Referencing the source table

All queries must reference the SQL step's source, represented as _source_ in the query. If the SQL step is the first step in the transform, the source represents the transform's source table. Otherwise, the source represents the output of the previous step in the transform.

For example:

SELECT * 
FROM _source_
ORDER BY mean_drg_cost
LIMIT 1000

Joining tables

You can also reference any other table in the current workflow from your SQL step, following the same reference rules as those in the Redivis API. After typing the backtick character (`), available tables will be presented as autocomplete options.

For example:

SELECT t0.id, t0.name, t1.latitude, t1.longitude
FROM _source_ AS t0
INNER JOIN `demo.ghcn_daily_weather_data:7br5:v1_1.stations:g2q3` AS t1
    ON t0.id = t1.id

The query editor will automatically qualify referenced tables and value lists as you reference them, which ensures your query will continue to work if tables get renamed. In most cases, you can just reference the table by name, and the fully qualified reference will be inserted for you.

Referencing upstream variables

You can mix and match SQL steps with other step types within your transform. If you choose to do so, this will introduce some additional language semantics, since transforms allow variables with the same name to co-exist at any given step, delineated in the UI with the t1, t2, ... and v prefixes. The v prefix denotes any newly created variable, while the tN prefix denotes variables that have been joined from other tables.

In order to reference these variables, you must include the prefix in your SQL code before the variable name. The only exception to this rule is for variables prefixed by t0 — these should be referenced normally, allowing you to ignore these rules if you aren't mixing the SQL interface with other step types.

For example:

SELECT 
-- These variables are from the transform's source table, denoted by `t0` in the UI
    id, name, 
-- These variables were joined upstream, denoted by `t1` in the UI
    t1_latitude, t1_longitude, 
-- This variable was created upstream, denoted by `v` in the UI
    v_extracted_year
FROM _source_

Referencing value lists

You may also use value lists in your query — they can be referenced as a SQL parameter, prefixed with the @ symbol. Note that all value lists are explicitly typed, and you will need to cast the value list if there is a type mismatch.

For example:

SELECT provider_id 
FROM _source_
WHERE provider_state IN UNNEST(@`states:8a2h`) 
    -- If a value list is in the incorrect type, use the following code to convert it
    OR provider_fips IN UNNEST(
        (
            SELECT ARRAY_AGG(CAST(val AS STRING))
            FROM UNNEST(@`fips_codes:134a`)val
        )
    )
ORDER BY mean_drg_cost
LIMIT 1000

-- @states = ['CA', 'TX', etc...]
-- @fips_codes = [6071, 2999, 3242]

Limitations

  • DDL / DML queries are not supported

  • All queries must return fields of a valid Redivis type. The following BigQuery types are not supported in the result set, though they may be used internally within the query:

    • TIMESTAMP

    • NUMERIC

    • BIGNUMERIC

    • BYTES

    • INTERVAL

    • JSON

    • ARRAY

    • STRUCT

  • BigQuery ML syntax is not supported

  • BigQuery scripting language is not supported

  • Statements may only reference tables that exist in your workflow

  • Unnamed variables in the result are not supported

  • Duplicate variable names in the result are not supported

Errors

If your query is invalid, you will see an invalid icon and the error message displayed above your code. Transforms can't be run while any step is invalid.

Compute resources

Notebooks on Redivis provide a highly flexible computational environment. Notebooks can be used for anything from quick visualizations to training sophisticated ML models on a large corpus of data.

Understanding the compute resources available, and when to modify which parameters, can help you take full and efficient advantage of the high-performance computing resources on Redivis.

Default (free) notebooks

The default notebook configuration on Redivis is always free, and provides a performant environment for working with most datasets. The computational resources in the default notebook are comparable to a typical personal computer, though likely with substantially better network performance.

The default free notebook configuration offers:

  • 2 vCPUs (Intel Ice Lake or Cascade Lake)

  • 32GB RAM

  • 100GB SSD:

    • IOPS: 170,000 read | 90,000 write

    • Throughput: 660MB/s read | 350MB/s write

  • 16Gbps networking

  • No GPU (see custom environments below)

  • 6 hr max duration

  • 30min idle timeout (no code is being written or executed)

Custom compute configurations

For scenarios where you need additional computational resources, you can choose a custom compute configuration for your notebook. This enables you to specify CPU, memory, GPU, and hard disk resources, while also giving you control over the notebook's max duration and idle timeout.

In order to customize the compute configuration for your notebook, click the Edit compute configuration button in the notebook start modal or toolbar.

Choose from hundreds of machine types to allocate additional resources to your notebook

Custom machine types

Redivis offers nearly every machine type available on Google Cloud. These machines can scale from small servers all the way to massively powerful VMs with thousands of cores, terabytes of memory, and dozens of state-of-the-art GPUs.

These machine types are classified by four high-level compute platforms: general purpose, memory optimized, compute optimized, and GPU. Choose the platform, and the machine type therein, that is most appropriate for your workload.

When your notebook is running, keep an eye on the utilization widgets to see how much CPU / RAM / GPU your notebook is actually using. This can help inform whether your code is actually taking advantage of all of the compute resources.

For example, if your code is single-threaded, adding more CPUs won't do much to improve performance. Similarly, you might need to adjust your model training and inference workflows to take advantage of more than one GPU.

If you find that you've under- or over-provisioned resources, you can simply update your machine configuration and restart the notebook.

Custom machine costs

All custom machines have an associated hourly cost (charged by the second). This cost is determined by the then-current price for that machine configuration on Google Cloud.

In order to run a custom machine, you must first purchase compute credits, and have enough credits to run the notebook for at least 15 minutes. If you run low on credits and don't have credit auto-purchase configured, you will receive various alerts as your credits run low, and ultimately the notebook will shut down when you are out of credits.

Maximizing notebook performance

All notebooks on Redivis use either Python, R, Stata, or SAS. While Redivis notebooks are highly performant and scalable, the coding paradigms in these languages can introduce bottlenecks when working with very large tabular data. If you are running into issues with performance, we suggest:

  • Use transforms to clean and reduce the size of your data before analyzing them further in a notebook. When possible, this will often be the most performant and cost-efficient approach.

  • Adjust your programming model to load data lazily or on-disk to avoid exceeding memory limits. See our suggestions for working with larger tables in Python, R, Stata, and SAS, as well as the sketch following this list.

  • Adjust the compute resources in your notebook. This may help to resolve these bottlenecks depending on what is causing them!
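
As a sketch of the lazy-loading suggestion above, the Redivis Python client can hand tabular data to polars as a lazy frame, so that filters are applied before the full table is pulled into memory. The method, table, and column names here are assumptions to verify against the Python notebook reference.

import polars as pl
import redivis

# Reference a large table (identifiers here are illustrative)
table = redivis.organization("demo").dataset("ghcn_daily_weather_data").table("daily_observations")

# Build a lazy query; only the filtered result is collected into memory
rainfall = (
    table.to_polars_lazyframe()           # assumed method name; see the notebook reference
    .filter(pl.col("element") == "PRCP")  # illustrative column and value
    .collect()
)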

Some quick rules of thumb:

  • < 1GB: probably doesn't matter, use what suits you!

  • 1-10GB: probably fine for a notebook, though a transform might be faster.

  • 10-100GB: maybe doable in a notebook, but you'll want to make sure to apply the right programming methodologies. Try to pre-cut your data if you can.

  • >100GB: You should probably cut the data first in a transform, unless you really know what you're doing.

Data sources

Overview

A data source node contains data from across Redivis that you want to work with and have access to. These are usually datasets, but can also be other workflows. These nodes display overview information about the dataset or workflow they represent, and a list of the tables it contains.

You can click on any table to view its contents, or click "Transform" to build a transform on it.

Adding data source nodes to a workflow

Adding a data source to a workflow will make a copy of that dataset or workflow in the form of a circular node at the top of your workflow tree. You can add data to a workflow in any of the following ways:

  • Click the + Add data button in the top left of the toolbar within a workflow.

  • Click the Analyze in workflow button on a dataset page.

  • Click the Fork button in the toolbar of any workflow.

Restrictions

A workflow cannot contain two copies of the same data source, or two copies of the same version of a dataset. You can add a different version of a dataset to your workflow by right-clicking on the dataset name in the + Add data modal.

Datasets as a data source

Datasets are the most common data source you will add to your workflow. They contain the data you want to work with in the original state curated by the data owner. Usually datasets will contain one or more tables you can choose to transform or analyze in a notebook.

Dataset samples

Some large datasets have 1% samples, which are useful for quickly testing querying strategies before running transforms against the full dataset.

If a 1% sample is available for a dataset, it will automatically be added to your workflow by default instead of the full dataset. Samples are indicated by the dark circle icon to the top left of a dataset node in the left panel and in the list of the dataset's tables.

All sampled tables in the same dataset will be sampled on the same variable with the same group of values (so joining two tables in the same dataset with 1% samples will still result in a 1% sample).

To switch to the full dataset, click the "Sample" button in the top right of the menu bar when you have a dataset selected.

Your downstream transforms and tables will become stale, since an upstream change has been made. You can run these nodes individually to update their contents, or use the run all functionality by clicking on the workflow's name in the top menu bar.

Dataset versions

When a new version of a dataset is released by an administrator, the corresponding dataset node on your workflow tree will become purple. To upgrade the dataset's version, click the "Version" button in the top right of the menu bar when you have a dataset selected.

You can view version diffs and select whichever version you want to use here.

After updating, your downstream transforms and tables will become stale. You can run these nodes individually to update their contents, or use the run all functionality by clicking on the workflow's name in the top menu bar.

Workflows as a data source

Workflows can be added to another workflow in order to build off existing analysis. You might want to continue an analytical pipeline that you've built elsewhere, or elaborate on someone else's analysis. You will have access to all tables this workflow contains.

All workflow data sources are linked to their original workflow and will automatically update with any changes made to the original.

Data source node states

As you work in a workflow, node colors and symbols will change on the tree view to help you keep track of your work progress.


Administer an organization

Overview

Organizations are where administrators centrally manage datasets and who can access them. All organization administration happens from the Administrator panel which administrators can access from the Organization home page.

If you're interested in using Redivis for your research group, center, or institution, contact us to get an organization set up.

1. Add other administrators

You can add any Redivis account to be an administrator of your organization in the Administrator panel from the Settings tab.

Note that all administrators will have full access to the organization and its datasets, including the ability to modify access rules, approve access requests, and add or remove other administrators, including you.

2. Build your organization's presence

To help brand your organization and best communicate the organization’s purpose to others, you'll want to fully populate and brand the Organization home page.

In the Administrator panel on the Settings tab / Public profile section you can customize:

  • The organization's full name

  • An organization description which will appear at the top of your about section, and in any institution searches

  • A custom brand color

  • The organization's logo

  • The data portal's cover photo

  • A header image that will appear in the top left of users' windows when on your data portal

  • Header links to other resources, such as organization-specific documentation, events, and other resources.

  • Rich text and images that will display in the main content section of the data portal. This is a place to provide high-level information about your organization and how researchers should use your data portal.

Want a custom URL for your data portal (e.g., data.university.edu)? Contact us and we'll work with your IT team to get it configured.

3. Create datasets

You'll need to create datasets to upload and distribute data to your researchers. You can create datasets from the Datasets tab of your Administrator panel.

Here are some pieces to consider before getting started:

Dataset structure

It's worth thinking through your broader data management strategy. Assess the data that you currently have — what format are the files in? Where are they located? How are they organized?

Consider what datasets you'll want to create. Datasets on Redivis are made up of one or more semantically related tables, and each table can contain data from one or more uploaded files. Moreover, all permissions happen at the dataset level — if you grant a user access to a dataset, they will have access to all of its tables.

Citations and DOIs

If you’d like to issue DOIs for your organization’s datasets, you can configure your organization to do so in the Advanced section of the Settings tab. DOIs will make it easier for researchers to cite your data when they publish, and allow you to track publications using your data as well as their broader impact.

Learn more in the guide.

4. Build access systems

Before you can configure access to a dataset, you'll want to consider and build out the relevant access systems for your organization as a whole.

Direct access & Member access

If you have a smaller organization where you know everyone working with your data, you might want to consider directly granting access to datasets to specific members, or granting access to everyone who is a member.

This can be set up on any dataset individually, or you can create a Permission group on the Permission groups tab to save the entire configuration of access levels and assign it to multiple datasets.

Requirement-based access

In many cases, however, you’ll want to develop more process-driven access control systems and gather additional information from your members. To do so, you can set up requirements that must be completed by members or their studies, and approved by an administrator, before gaining access.

You can build requirements on the Requirements tab of the administrator panel, and save groups of them together as Permission groups on the Permission groups tab which you can assign to multiple datasets.

Export restrictions

Finally, you may want to control how your data is exported from Redivis, if at all. While members with data access will always be able to work with data in a Redivis workflow, you can limit their ability to export data. For example, you may choose to allow exports only to certain environments, or only upon administrator approval, or not at all.

Learn more in the guide.

5. Manage members and studies

As members and studies apply to access data, you may need to take action to approve (or reject) their access requests. You can find all of your pending access requests wherever you see an alert on the administrator panel.

Members

As users join your organization and apply to work with data, they will appear in the Members tab of your administrator panel. You can also invite anyone to become a member by adding them to the members list.

Studies

Studies allow for researchers to organize their work around a conceptual study, or research project, and for you to grant access in the context of that study. Any studies that are working with your organization's data will appear in the Studies tab of the administrator panel automatically, and you can approve any requirements for studies here.

To be alerted to new access requests, you should configure a notification email (and frequency) from the Settings tab of the administrator panel.

Learn more in the guide.

6. Gain insight into usage

Redivis provides tools to help you keep an eye on your organization and how its data is being utilized.

You can filter, sort, and change the columns for any list in your administrator panel, and download the corresponding report for a snapshot of current information. You can also dig into the logs and generate analytical reports for more information.

Next steps

Expand your reach

If you are part of a larger institution, you can contact us about getting an institution-wide Redivis page to help users discover new organizations and datasets.

Geospatial data

Overview

Geospatial data on Redivis behaves quite similarly to tabular data: each feature is ingested as a single row within a table, alongside any metadata for that feature. This approach mirrors tools like PostGIS, R spatial features, and geopandas, allowing you to query and join your geospatial data at scale.

Supported file types

Redivis supports importing geospatial data from several common GIS formats: geojson, shp, shp.zip, kml, and parquet. Parquet files with geospatial metadata (often referred to as GeoParquet) are the most performant and robust option, though as a newer standard, these files are less common. For other geospatial file types, Redivis first converts the file to a geojson representation (using the relevant conversion tooling), and then imports the geojson into a table.

Each feature will be imported as one row, with the geometry column containing the WKT representation for that feature. Additional feature properties will be mapped to variables in your table, with any nested properties flattened using the . separator. Note that Redivis only supports 2-dimensional, unprojected (WGS 84) geometries. Other projections might cause the import to fail, and any extra dimensions will be stripped during ingest. If you are uploading a .shp.zip that contains projection information, the geometries will automatically be reprojected as part of the import process.


Geography data in text-delimited files

In addition to uploading geospatial data using one of the formats listed above, you can also import geographic data encoded within a text-delimited file (e.g., a csv). In this case, the geographic data should be encoded as strings using the Well-Known Text (WKT) representation. This is the same format used when exporting geography variables as a CSV. WKT in CSVs will be auto-detected during data ingest.
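As a minimal sketch, a WKT string column could also be converted to a geography value in a SQL step; the table and variable names below (county_centroids, geo_wkt) are hypothetical.

    -- Parse WKT strings such as 'POINT(-122.17 37.43)' into a geography value
    SELECT
      county_code,
      ST_GEOGFROMTEXT(geo_wkt) AS geometry
    FROM county_centroids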

Quotas & limits

Limits for upload file size, maximum number of variables, and other parameters are specified here.

Version control

Overview

All datasets on Redivis are automatically versioned. Any change to a dataset's data, whether in its tables or its files, will require a new version. A version is a locked copy of the data, supporting future reproducibility and confidence in the persistence of researchers' data workflows.

Changes to documentation and metadata do not create a new version, though different versions do have independent documentation and metadata. For example, if a new version contains a new table, or new data, you will likely want to document this information separately from the previous version. However, if you only want to enrich the existing metadata on the current version, you can do so without creating a new version.

If you are a dataset editor and have new data, find a mistake to correct, or would otherwise like to modify the existing content of a dataset, you can create and release a new version.

Semantic version tags

To help researchers better understand the differences across versions, Redivis uses semantic versioning, of the form v[major].[minor]. The first version of every dataset is v1.0. For subsequent versions, the tag will be incremented automatically depending on the changes being released.

  • Major update: Existing code may not run.

    • Triggered when variables in the new version are renamed, deleted, or retyped. Also occurs if any tables from the previous version were deleted.

  • Minor update: Existing code will generally run.

    • Changes are limited to adding / removing records, recoding variables, adding variables, and adding tables.

Version history

On any dataset page, you can view the current version tag next to the dataset title, and click on this tag to view the full version history and switch to a different version of this dataset.

Within a workflow, you can change the version of a dataset by selecting the dataset node and clicking the Version button at top right. If there is a new version available, the dataset node will be highlighted to indicate that you might want to upgrade.

Creating a new version

When it's time to update a dataset's data, you'll want to create a new version. To do this, navigate to the dataset editor and click Create next version.

Before this version is released, it will be tagged as next. Only dataset editors will be able to see the next version on the dataset page and use it in their workflows.

A dataset can have up to 1,000 versions. If your use case exceeds this limit, consider creating a new dataset that imports the previous dataset's tables once this limit has been reached.

Version storage

All versions of a dataset contribute to that dataset's total size, which in turn will count towards your usage quotas or organization billing (depending on whether the dataset is owned by you or an organization).

This total size will be displayed in the dataset editor, alongside the size for the current version. For datasets with one version, this total size may be slightly larger than the current version, as Redivis stores certain metadata behind the scenes to support future versioning.

As new versions are created, Redivis will efficiently compute a row-level difference between the versions — only additions and updates to existing data will contribute to the dataset's total storage size, preventing data that is consistent across versions from being double-counted.

Adding or removing columns (or changing column types) won’t affect row uniqueness, as the underlying storage represents all values as strings. Only the storage size of the new column would be added.

Version unrelease

If the most recent version of a dataset has been released in the last 7 days, and there is no Next version already created, you'll have the option to unrelease it.

This will revert the dataset to the exact state it was in before the version was released. If anyone who is not a dataset editor has this version in a workflow, they will lose access to the data, though they can revert to a previous version if one exists.

Unreleased and published datasets

When a dataset is first created it will be marked as unreleased.

As you work on the initial version of the dataset, you can see the changes you make to the data in the version history modal. In this unreleased state you can add the dataset to a workflow and analyze the data it contains. Note that the only people who can see it in the workflow are dataset editors. If you make changes to the data or delete it, your workflow will change instantly to reflect those changes.

To add an unreleased or unpublished dataset to a workflow, click the View dataset tab from the edit interface, and click the Analyze in workflow button. Since this dataset is not published it will not appear in the Add dataset interface in workflows.

Once you are ready to make your data available to non-editors, you will need to publish it. This will simultaneously release your version as well as make your dataset discoverable. As soon as it is published it will be accessible to all who meet its access rules.

If this is an organization dataset, it can be unpublished at any time. This will "unlist" the dataset so that it will not be listed on your organization home page and even users who meet the access requirements will not be able to see it. Anywhere it is used in a workflow it will instantly be made unavailable.

Unpublishing might be used if you need to temporarily halt usage of the dataset but don't want to disrupt all of the access rules.

Working with previous versions

When analyzing data in a workflow you can change any dataset to a previous version to instantly change every table to the corresponding table in that version. To do this, click on the dataset and click the Version button in the toolbar. Select the version you'd like to analyze and confirm. You can always switch versions at any point.

Normally you can only have one copy of a dataset in a workflow, but it's possible to add a second version if you'd like to compare them. In any workflow, click the + Add data button and locate the dataset. Right click on the dataset and choose to add another version to the workflow.

.parquet

GeoParquet

The GeoParquet specification is a modern standard for working with column-oriented geospatial data. If available, this format is the most robust and performant way to ingest geospatial features into Redivis.

.geojson

GeoJSON

Assumes an object with a "features" property, containing an array of valid geojson features. Each feature will be imported as one row, with additional properties mapped to columns in the table. Nested properties will be flattened using the . separator. Note that Redivis only supports 2-dimensional, unprojected (WGS84) geometries. Other projections might cause the import to fail, and any extra dimensions will be stripped during ingest. See working with geospatial data for more information.

.geojsonl, .ndgeojson, .geojsons

Newline-delimited GeoJSON

Same as the .geojson specification outlined above, except each feature is given its own line. Importing .geojsonl (as opposed to .geojson) will be significantly faster.

.kml

Keyhole Markup Language

Will be internally converted to .geojson (via ogr2ogr), and then imported as specified above.

.shp

Shapefile

Will be internally converted to .geojson (via ogr2ogr), and then imported as specified above. Note that the shapefile must use the WGS84 (aka EPSG:4326) projection. If you have additional files associated with your shapefile (e.g., .shx, .prj, .dbf), create a .zip of this folder and import according to the .shp.zip specification below.

.shp.zip

Zipped ESRI shapefile directory

Many shapefiles will be collocated with additional files containing metadata and projection information. These files are often essential to parsing the shapefile correctly, and should be uploaded together. To do so, create a .zip directory of the folder containing your .shp file and supplemental files. The zip file must end in .shp.zip. These will then be converted to .geojson (via ogr2ogr), and imported as specified for the .geojson format.

If projection information is available, the source geometries will be reprojected into WGS84. If no projection information is available, your data must be projected as WGS84, or the import will fail. Note that only one layer can be imported at a time. If you have a directory containing multiple shapefiles, create a separate .shp.zip for each layer.


Dataset type

Dataset icon in the middle of the circle

This data source is a copy of a dataset on Redivis.

Workflow type

Workflow icon in the middle of the circle

This data source is a copy of another workflow on Redivis.

Sampled

Black circle with 1% icon

Only possible for dataset source nodes. This means that you are using a 1% sample of the data. When a dataset has a sample, it will automatically default to it when added to a workflow. You can change this to the full data and back at any time in the dataset node.

Outdated version

Purple background on version number

Only possible for dataset source nodes. This means that you are not using the latest version of the dataset: either you have intentionally switched to an older version, or this dataset's administrator has released a new version that you can switch to.

Incomplete access

All black background, or dashed borders

You don't have full access to the node. Click on the Incomplete access button in the top bar to begin applying for access to the relevant datasets.


Transform concepts

Overview

A transform is a series of steps detailing a data transformation that are executed in sequential order when the transform is run. These steps and a final variable selection create an output table that can be inspected and further transformed.

Transforms vs. notebooks?

There are two mechanisms for working with data in workflows: transforms and notebooks. Understanding when to use each tool is key to taking full advantage of the capabilities of Redivis, particularly when working with big datasets.

Transforms are better for:

  • Reshaping + combining tabular and geospatial data

  • Working with large tables, especially at the many GB to TB scale

  • Preference for a no-code interface, or preference for programming in SQL

  • Declarative, easily documented data operations

Notebooks are better for:

  • Interactive exploration of any data type, including unstructured data files

  • Working with smaller tables (though working with bigger data is possible)

  • Preference for Python, R, Stata, or SAS

  • Interactive visualizations and figure generation

Steps

The majority of building a transform is choosing and completing your data transformation steps. Steps represent one specific action (e.g. filter rows, join tables) and are completed in the Redivis transform interface. Every step is fundamentally a piece of SQL code that can be examined and reproduced by running it in any environment with the same data.

When running the transform, steps will be executed in order, building off of what directly came before them.
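To make this concrete, here is a rough sketch (with hypothetical table and variable names) of how two sequential steps, a row filter followed by a create-variables step, can be expressed in SQL, with each step reading from the result of the one before it:

    -- Step 1: filter rows (hypothetical source table and variables)
    WITH step_1 AS (
      SELECT *
      FROM encounters
      WHERE encounter_date >= DATE '2015-01-01'
    ),
    -- Step 2: create a new variable from the result of step 1
    step_2 AS (
      SELECT
        *,
        EXTRACT(YEAR FROM encounter_date) AS encounter_year
      FROM step_1
    )
    -- Final output variable selection
    SELECT patient_id, encounter_date, encounter_year
    FROM step_2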

Steps have additional options in the ⋮ menu to help in the building process:

  • Collapse / expand

  • Annotate

  • Disable / enable

  • Reorder (this might change the outcome)

Output variables

The final task in constructing a transform is to select which variables you want to have in your output table. You will use the variable selector in the bottom pane of the transform to choose which to keep.

Running transforms

Transforms are iterative and are designed to easily check your progress as you work. As long as a transform is valid, you can press the Run button to generate your output table. Click on this output table node in the workflow tree to see your results once the run is complete.

Invalid state

The transform interface will prevent you from running a transform that is invalid. This will be displayed with an invalid icon on the step or the part of a step where the issue is. Usually when you have an invalid alert it means some crucial information is missing and you can follow the invalid icons to find the locations you'll need to update.

History and revert

Every time you run the transform, a snapshot of the transform at that point in time is saved. You can click on History to view the log of these runs, and click on any snapshot to revert the transform to that point in time.

Your transform will then be in the Edited state and no changes will be made to the output table. Once you run it, this configuration will become a new snapshot at the top of the transform history list.

Checking the output

Output tables don't just store data; you can use the variable statistics generated in the table node to make sure your transform did what you expected it to do.

Common things to check might be total number of rows and variables in your output table, distinct and null counts of a key variable, or even running a query on the Query tab of the table to filter by a particular value.
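For example, a quick sanity-check query might count rows, distinct values of a key variable, and nulls. This is only a sketch; the table and variable names are placeholders.

    SELECT
      COUNT(*) AS row_count,
      COUNT(DISTINCT patient_id) AS distinct_patients,
      COUNTIF(encounter_date IS NULL) AS null_encounter_dates
    FROM my_output_table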

If you find the table isn't what you want, you can go back to the transform to make changes and rerun it. Transforms were created to be iterative!

Additional tools for working with transforms

Name and rename

By default transforms are named with a number based on the order they were created. We suggest renaming your transforms with descriptive names so you and your collaborators can quickly find them later on. You can rename a transform by clicking on the title of the node, or on the Rename option in the transform ⋮ More menu.

If you have not specifically renamed the connected output table, renaming a transform will also rename this output table.

Change the source table

All transform nodes have one source table. You can join in multiple tables, but the source table designation will affect how the joins execute. You can change the source table from the ⋮ More menu in the transform. Note that if the new source table has different variables your transform might become invalid until you update it.

Split and combine transforms

All transforms can be split at the step level into two different transforms by clicking Split in any step's menu. Additionally, two transforms can be combined into one by right clicking on a table to Remove it.

You might want to split a transform above a tricky step to see what the output table would look like at that point in the process. This can be a key tool in troubleshooting any issues and understanding what might be going wrong.

After splitting a transform to check an output table, the next logical step might be to combine these two transforms back into one again. Or perhaps you have a string of transforms which you no longer need the output tables for and want to reduce the size of your workflow.

Copy, paste and insert

As you go, you might find yourself wanting to take certain data cleaning actions multiple times or move them around in your process. You can right click on any transform in the workflow tree to see options for copying it or deleting it. Right click on any table node to paste the copied transform. You can also insert transforms above other transforms.

Steps and parts of steps can also be copied and pasted within the same transform or across multiple transforms.

View SQL code

All transforms generate SQL, which will be executed when the transform is run. To view the code for a particular step, click on View SQL in the step menu. You can also see the SQL for the entire transform from within the ⋮ More menu of the transform.

You can convert this code into a SQL query step and run it to achieve the same outcome, or edit the code directly.

SQL Concepts

Redivis uses SQL as the basis for data transformation for many reasons: it is designed for working quickly and efficiently with large tables, its core concepts and basic operations are easy to learn, and it is highly reproducible.

While Python, R, Stata, and SAS are commonly used in data analysis, they are not well suited to working with very large tables. Executing queries on millions of records can take hours and in many cases fail. We have provided a point and click interface for building transforms to make working in a perhaps unfamiliar query language easier.

Fundamentals

  • Tables are data structures containing rows and variables

  • Variables represent fields in the table that all rows contain values for

  • Rows represent an individual observation in the table, with values across all variables

  • A SQL query will always operate on data from one or more tables and output a single data table.

Building a query

When you are using the interface or SQL code, it is best to start the process by figuring out what you would like your output table to look like. What are the variables? What do you want each row of the table to represent?

Once you figure out what you want your output to look like, you can work backwards and formulate the steps of your query to build the shape you'd like.
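For example, if the desired output is one row per patient with a count of their visits, the query can be built backwards from that shape. This is a sketch with hypothetical table and variable names:

    -- Desired shape: one row per patient, with a visit count
    SELECT
      patient_id,
      COUNT(*) AS visit_count
    FROM encounters
    GROUP BY patient_id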


Configure access systems

Overview

One of the main tasks of organization administrators is to manage who has access to datasets. Redivis has tools to support any level of complexity in data systems for your restricted data and makes it easy for organization administrators to define, manage, and audit access rules across these various use cases.

1. Consider your organization's access needs

Do you mostly have public datasets? Will you have a small group of researchers with access to pretty much every dataset? Will you need to gather information from researchers before they use your datasets? Do you have high risk data that can't ever leave the system? Or some combination of the above?

Redivis supports access systems for all of these configurations – for organizations of varying sizes – but you'll want to use different tools to achieve these very different goals.

For example, if you have few datasets, you might find that permission groups for bulk managing access rules aren't particularly helpful. Or if you have a small group of researchers who you all know closely, you might not need to set up process-driven access rules through requirements, and can instead grant access to certain researchers on an individual basis.

We'll walk through all the tools available to organizations for managing access, but you'll likely want to pick and choose from what's available to meet your organization's specific needs.

2. Understand access levels

All interactions with data on Redivis require the user to have the appropriate access level to the data for a given action. Ideally your data would be as permissive as possible to allow for the greatest exploration from researchers before they need to start applying for access.

Dataset access has five levels:

  1. Overview: the ability to see a dataset and its documentation.

  2. Metadata: the ability to view variable names and univariate summary statistics, but not to retrieve any identifiable information or multivariate relationships from the data.

  3. Sample: the ability to view and query a dataset's 1% sample. This will only exist for datasets that have a sample configured.

  4. Data: the ability to view and query a dataset's tables, and work with them in workflows.

  5. Edit: the ability to edit the dataset and release new versions.

Access levels are cumulative. For example, in order to gain data access you will need to have gained metadata access as well.

We strongly recommend making your dataset's metadata as open as possible. This will reveal variable names and aggregate summary statistics, but will not allow researchers to view, query, or export the raw data in any way.

Being able to see metadata greatly improves researchers' discovery experience, allows them to better assess a dataset's utility upfront, and can even reduce your administrative workload. If researchers can understand a dataset before applying for access, they'll submit fewer access applications for datasets that are ultimately a dead end.

Learn more in the Access levels reference section.

3. Understand available access tools

Membership

Anyone wanting to apply for access to restricted datasets in your organization must first be a member. You can configure whether memberships are restricted to certain identity providers (such as your institutional login), and whether they are approved automatically or require administrator review. You also have the option to configure access to datasets to "All members."

Direct access

Permission granted directly to a researcher to instantly gain access to a dataset at a specific level. Researchers can also request access to datasets with this configuration.

Example usage: a dataset that will only be shared with a small number of people who are already known to administrators

Member requirements

A form for members to fill out. This can be set to require approval from an administrator or be automatically approved. It can also have an expiration date. These are global to your organization, and when assigned to multiple datasets a user will only need to fill it out once.

Example usage: a demographic form gathering researcher personal information, or a data use agreement signed PDF

Study requirements

Similar to requirements, but instead of each user needing to fill them out individually, only one requirement needs to be completed for the entire study, which can include multiple researchers. A single researcher may also have multiple studies. Each study working with the dataset will need to fill out its own study requirement, and any queries or exports of the data will be tied to that study.

Example usage: a funding proposal for a research project

Data export restrictions

A rule defining that a dataset can only be exported to a specific export environment, as configured on the Administrator panel Settings tab.

Example usage: limiting exports to a specific server environment

Learn more in the Configuring access reference section.

4. Create requirements

If you want to work with requirements, you'll want to get started making them and planning out how they will work across datasets.

Perhaps you want one requirement for all members to fill out about their field of research, which is necessary to gain access to any of your datasets, but another 4 requirements with different data use agreements that will apply only to their specific datasets.

To get started, go to the Requirements tab of the administrator panel and click the New requirement button. You will need to start by selecting if this will be a Member requirement or a Study requirement.

You can use the form builder to collect different information, including standard form responses, file uploads, and e-signatures.

Learn more in the Requirements reference section.

5. Create permission groups

A permission group is an access configuration that can be assigned to multiple datasets and managed centrally.

You don't need to use permission groups, and it might not make sense to do so if each of your datasets has a different access configuration and you aren't using requirements. But if you have any overlap between datasets and want to enforce consistency, or want to use requirements, you'll want to make one.

To get started, go to the Permission groups tab of the administrator panel and click the New permission group button.

This interface requires you to set an access paradigm for each access level of the dataset.

Perhaps you will set the overview access level to be public, the metadata access level to be available to all members, and data access level to be direct access (meaning you will have to directly grant access to users, or respond to their requests for access).

Or perhaps you will set the overview access level to public, and assign multiple requirements to the metadata and data access levels (meaning that anyone who is approved for all of the requirements will automatically gain that access level).

For any case you can assign data export restrictions here and choose whether you want to manage access to the dataset's sample (if it exists) differently than the full data.

If overview access to a dataset isn't public, non-approved users will not be able to see the dataset or its name in any way. In some cases, this may be the intended behavior, but remember there will be no way for researchers to apply for these datasets.

Instead, for these hidden datasets, an administrator will need to first explicitly grant overview access before researchers can view the dataset and request further access.

Learn more in the Permission groups reference section.

6. Assign access permissions to datasets

Finally you'll need to apply these access permissions to actual data!

Open any of your datasets and click the Configure access button on the top of the page. This configuration setup will look very similar to configuring the permission group.

You can either create a custom configuration here, or you can assign this dataset to one of the Permission groups by clicking the dropdown menu in the top right corner of this modal.

This is also where you will manage any direct access requests for the dataset.

7. Verify your setup

As an administrator of this organization, you will have access to all datasets no matter what your access configuration is or what requirements you have filled out.

We recommend checking that your access system works as you expect by either looking at a dataset while you are logged out or in your browser's incognito mode, or by making a second Redivis account using a different email address that is not linked to your administrator account.

Next steps

Grant access to data

You have some shiny access systems, but they won't work if you don't approve user requests for access.

Learn more in the Grant access to data guide.

Expand your reach

If you are part of a larger institution, you can contact us about getting an institution-wide Redivis page to help users discover new organizations and datasets.

Workspace

Overview

The workspace is the home of your content on Redivis. It is created when you make a Redivis account and is accessible by clicking the My workspace button at the right of the header bar from anywhere on the site. Your workspace is only visible to you and is where you will manage your work on Redivis.

Your workspace dashboard has links and information to help you orient yourself in the work you are doing on Redivis. Beyond the dashboard, you will see pages listed on the left bar with different aspects of your account.

Workflows

Any workflows that you've created, or that have been shared with you, will show up here. You can create a workflow by clicking the + New workflow button on this page in order to start working with data. Clicking on a workflow here will open up the workflow interface.

Datasets

Any datasets that you've uploaded or can edit will appear here. You can create a dataset by clicking the + New dataset button on this page in order to start uploading and sharing your data. Clicking on a dataset in this list will open the dataset editor which will allow you to work on the dataset.

Organizations

This page lists all organizations that you're a member of. If you have applied for access to any restricted data, a copy of your submissions will be listed under the organization you applied to. You can search for new organizations to join by clicking the + Find new organization button.

Studies

Studies allow users to more easily work in groups. You can create a study by clicking the + New study button on the studies tab of your workspace. Any study you are a part of will appear in your workspace. Any edits you make will appear for all study collaborators.

Dataset library

This section contains any dataset that you have bookmarked. You can manually bookmark a dataset by clicking on the bookmark button next to the dataset title on any dataset page. Redivis will also automatically bookmark datasets that you add to workflows, or where you have applied for access. You can remove these at any time by right clicking on a dataset in this list and selecting the Remove from library option.

Logs

All activity you take when working with data on Redivis is logged. You can visit the Logs page of your account to view the details of your Query, Notebook, and Export actions.

Settings

Public profile

You can specify the following information, public to all users and organizations on Redivis:

  • Name: This is your full name displayed across Redivis. This will appear when collaborators search for you to share their workflows, and the name administrators will see when interacting with any access steps you've submitted. It does not need to be unique.

  • Username: A unique handle for your account.

  • Profile picture: A picture visible next to your name throughout the platform.

  • ORCID iD: Link your ORCID iD so organizations can see research you are working on. See orcid.org for more information on setting up an account.

  • Disciplines: Areas of interest and study that best represent your work and research.

Changing your username

If you would like to change your username, you may do so at any time. Any URLs or API references to your old username will automatically redirect. You may change your username up to 10 times.

Authentications

You can use any email listed here to log in to your account. If you have more than one email address, the one marked as 'Primary' will be the one to receive any emails. You can authenticate with additional institutional or Google emails by clicking the Add authentication button. If the email you add is associated with an institution or organization, the authenticated information we receive from them will be listed here. A single authentication will be used by organizations to verify your identity when you apply for membership.

Data sources

To allow you to import your data from wherever it resides, Redivis supports multiple integrations with outside services.

Redivis will only ever read data from these sources; it will never modify or overwrite content. Additionally, Redivis will only access and transfer content when you specify a location in one of those sources as part of the data upload process.

API tokens

You can create and manage tokens here for working with the Redivis API. See the API documentation for more information.

Compute credits and billing

You can use compute credits to purchase advanced compute environments for notebook nodes within workflows.

Secrets

Secrets allow you to define simple key/value pairs that are securely stored and can later be referenced from within a Redivis notebook.

Common use cases for secrets include storing external API tokens and other credentials for systems outside of Redivis.

Secrets are only accessible to you from within a Redivis notebook.

Learn more about working with secrets in notebooks >

Advanced

This tab allows configuration of the following:

  • Contact email: Specify a contact email to receive email notifications for any update to your data access applications. If you apply to access data hosted by an organization, that organization will be able to see your contact email. By default, this is the email you used initially to create your Redivis account.

  • Communications: Allows you to configure email notifications for any update to your data access applications, as well as occasional product update emails regarding changes to Redivis. Note that, regardless of your email notification settings, you will still see a comprehensive list of notifications in the web platform.

  • Security: Shows all current sessions of Redivis across devices and browsers, and provides the ability to log out of any of these sessions.

  • Delete account: Allows permanent deletion of your account, including all workflows you have created.

Note you cannot delete your account if you are the only remaining administrator of an organization, or if you still have any non-deleted datasets.

Stata license

In order to enable Stata notebooks for your account, you can provide your Stata license information here (alternatively, you'll be able to use Stata if one of your organizations provides a license). Specifically, you'll need the license "code", "authorization", and "serial number", which should all have been provided as part of purchasing Stata.

SAS license

In order to enable SAS notebooks for your account, you can provide your license information here (alternatively, you'll be able to use SAS if one of your organizations provides a license). Because of how SAS handles cloud deployments, the steps to enable SAS are a bit more complicated than for Stata. Specifically, you'll need to complete the following steps:

  1. Run SAS Deployment Wizard to install SAS Studio on a supported Linux 64-bit operating system. During the installation, change the default location for the SAS Studio installation to /usr/local/SASHome.

  2. For details about how to install SAS Studio, refer to the installation and configuration instructions provided by the product.

  3. Create a TAR file that includes the SASHome directory:

    tar -czvf SASHome.tar.gz /usr/local/SASHome

Once this tar file has been created, please reach out to [email protected] to share the file so that we can enable SAS for your account.

Select first/last encounter

E.g.: I want the date of the first and last hospital admission for every patient in my table

Let's take an example table which contains all hospital visits in one facility over many years. Each record contains information about what happened in the visit, including a patient id to identify who was present. An individual patient might be represented in one or many records in this table since each record represents a separate hospital visit.

/*------------+----------------*
 | patient_id | encounter_date |
 +------------+----------------+
 | 1          | 2012-01-01     |
 | 1          | 2011-01-01     |
 | 1          | 2010-01-01     |
 | 1          | 2013-01-01     |
 | 2          | 2009-01-01     |
 | 2          | 2008-01-01     |
 | 2          | 2015-01-01     |
 | 3          | 2014-02-01     |
 *------------+----------------*/

Let's say that for our research we want to find the first and last encounter — or more formally, the min and max encounter date for each patient.

Computing a min or max value is an aggregate operation that scans all records and chooses the highest or lowest value. To compute the min or max value in a variable we will want to use an analytic new variable method. We can do this either by aggregating our table (which would drop records and/or variables) or by using a partition to calculate a new variable without changing the rest of the data.

In this example we will use a partition to create a new variable since we want the rest of the data to remain unchanged.

Variable concepts

Conceptually we will define our aggregation variable as the date of the encounter since that is where we want to look for the min and max values.

We will define our partition as the patient id since we want to find the min or max for each patient. This will limit the min or max value scan to all values of our aggregation variable across each unique value of our partitioned variable (patient id).

You can see more examples of using partitions with analytic methods on the Create variables step page.

Defining the new variable

For this particular case, we want to find the MIN and MAX of the encounter_date of each patient; in other words, partitioned on patient_id. This will create two new variables, min_encounter_date and max_encounter_date, for each patient.
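Expressed as SQL, this analytic method corresponds roughly to window functions partitioned on patient_id. This is a sketch; the table name is hypothetical.

    SELECT
      patient_id,
      encounter_date,
      MIN(encounter_date) OVER (PARTITION BY patient_id) AS min_encounter_date,
      MAX(encounter_date) OVER (PARTITION BY patient_id) AS max_encounter_date
    FROM encounters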

When we run this query, note that no records will be dropped; rather, these new variables will have consistent values for any given patient_id.

/*------------+----------------+--------------------+--------------------*
 | patient_id | encounter_date | min_encounter_date | max_encounter_date |
 +------------+----------------+--------------------+--------------------+
 | 1          | 2012-01-01     | 2010-01-01         | 2013-01-01         |
 | 1          | 2011-01-01     | 2010-01-01         | 2013-01-01         |
 | 1          | 2010-01-01     | 2010-01-01         | 2013-01-01         |
 | 1          | 2013-01-01     | 2010-01-01         | 2013-01-01         |
 | 2          | 2009-01-01     | 2008-01-01         | 2015-01-01         |
 | 2          | 2008-01-01     | 2008-01-01         | 2015-01-01         |
 | 2          | 2015-01-01     | 2008-01-01         | 2015-01-01         |
 | 3          | 2014-02-01     | 2014-02-01         | 2014-02-01         |
 *------------+----------------+--------------------+--------------------*/

Now, let's say we only want to keep records which contain each patient's first and last encounter. We can do this easily by creating a new row filter:
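In SQL terms, that row filter is roughly equivalent to the following sketch, using this example's variable names and a hypothetical name for the table produced above:

    -- Keep only each patient's first and last encounters
    SELECT *
    FROM encounters_with_min_max
    WHERE encounter_date = min_encounter_date
       OR encounter_date = max_encounter_date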

We can also generalize this approach to find the Nth encounter for each patient — take a look at the RANK, DENSE_RANK, and ROW_NUMBER methods.

And if we want to be more specific in our partition (e.g., for the first / last encounter in a given calendar year), we can always add additional variables (e.g., year) when defining our partition.
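A sketch of that generalization, ranking encounters within each patient and calendar year via ROW_NUMBER (the table name is hypothetical; a rank of 1 marks the first encounter of that year):

    SELECT
      patient_id,
      encounter_date,
      ROW_NUMBER() OVER (
        PARTITION BY patient_id, EXTRACT(YEAR FROM encounter_date)
        ORDER BY encounter_date
      ) AS encounter_rank_in_year
    FROM encounters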

Step: Order

Overview

The Order step sorts the table based on one or more variables.

Example starting data:

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 | sam     | 74     |
 | pat     | 62     |
 *---------+--------*/

Example output data:

Order by score descending

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | sam     | 74     |
 | pat     | 62     |
 | neal    | 35     |
 *---------+--------*/

Ordering large tables (>1GB) across variables with a large number of unique values requires substantial memory and isn't parallelizable. Order clauses in such cases may significantly slow your transform or cause it to fail.

Step structure

  • There will be at least one order block where you will define a variable and a sort order.

  • When multiple blocks exist, the variables will be ordered and then sub-ordered in sequence.

Input field definitions

Field
Definition

Order by

The variable containing the values that will be sorted.

Sort

A choice of how all values in the Order by variable will be sorted: ASC (nulls first), ASC (nulls last), DESC (nulls first), or DESC (nulls last). Note that variables of the geography data type aren't sortable.
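Like other steps, an Order step compiles to SQL. Two order blocks translate roughly to the following clause; this is a sketch with hypothetical table and variable names.

    SELECT *
    FROM scores
    ORDER BY
      score DESC NULLS LAST,        -- first order block
      test_date ASC NULLS FIRST     -- second order block (sub-ordering)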

Examples

Example 1: Basic order

We can sort a table to quickly see the highest scores.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 100   | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 100   | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields

  • Order by: The score variable has the data we want to sort on, so we select it here

  • Sort: We want the highest scores at the top, so we choose DESC. There are no null values in this table, so we can choose either nulls first or nulls last and get the same result.

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | midterm | 100   | sam     | 2020-05-01 |
 | final   | 100   | sam     | 2020-06-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | quiz    | 83    | jane    | 2020-04-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 *---------+-------+---------+------------*/

Example 2: Ordering on multiple variables

Let's say instead we want to sort first by score, and then break any ties using the most recent test date.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 100   | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 100   | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

First block

  • Order by: The first variable we want the data sorted on is score, so we choose it in the first block.

  • Sort: We want the highest scores first, so we choose descending. There are no null values in this variable, so it doesn't matter whether nulls appear first or last; we choose DESC (nulls last).

Second block

  • Order by: The second variable we want to sort on is date so we put it here.

  • Sort: Since we want the most recent (highest) values first, we want it to be descending. There are no null values in this variable so where we put the nulls does not matter. We choose DESC (nulls first).

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | final   | 100   | sam     | 2020-06-01 |
 | midterm | 100   | sam     | 2020-05-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | quiz    | 83    | jane    | 2020-04-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 *---------+-------+---------+------------*/

Analyze data in a workflow

Overview

Workflows are where you work with data on Redivis. In a workflow you can query, merge, reshape, and analyze any data that you have access to, all from within your web browser.

In a workflow, you can construct reproducible data transformations and analyses, and share and collaborate with your peers in real time.

1. Create a workflow

Add a dataset to a new or existing workflow from any Dataset page where you have "Data access" by clicking the Analyze in workflow button.

Within workflows you can navigate between entities on the left side of the screen, and inspect them further on the right panel. You can inspect your dataset further by clicking on any table to see its cells and summary statistics.

To add more data to this workflow you can click the Add data button in the top left of the workflow toolbar. It is also possible to add other linked workflows to this workflow. This is useful as you develop more complex analyses that you want to segment into discrete pieces of work that you can link together.

You can find this workflow later by going back to your workspace.

2. Transform data

Transforming tables is a crucial step in working with data on Redivis. Conceptually, transforms execute a query on source table(s), whose results are materialized in a new output table. In most cases you'll want to use transforms to reshape your data to contain the information you're interested in, before analyzing that table in a notebook or exporting it for further use.

To create a Transform, select a table in this dataset and click the +Transform button. You can get started here building a query through the point and click interface or writing SQL code by adding a SQL step.
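As a minimal sketch, a SQL step might look something like the following; the table and variable names are placeholders, and you should consult the SQL step documentation for how to reference your source table:

    -- Keep recent records and rename a variable (placeholder names)
    SELECT
      patient_id,
      encounter_date AS visit_date
    FROM my_source_table
    WHERE encounter_date >= DATE '2018-01-01'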

For all transforms you will need to select which variables you want to keep in your output table. The rest of the steps are up to you. Some common operations you can get started with include:

  • Joining in any other dataset table or output table in this workflow

  • Creating variables

  • Filtering records to match defined parameters

  • Renaming variables or changing their type

  • Aggregating data

Once you've built your query, execute it by clicking the Run button in the top right of the transform. This will create a new output table, where you can inspect the output of your query by clicking on the table beneath the transform in the tree and making sure it contains the data you would expect.

From here you can create a new transform from this output table to continue reshaping your data, or go back to your original transform to make changes and rerun it.

As you become more familiar with transforms, you can start doing more advanced work such as geospatial joins, complex aggregations, and statistical analyses.

Learn more in the Reshape data in transform guide.

3. Analyze data in a notebook

Once you have a table you're ready to analyze, you can select any table and click the + Notebook button to create a notebook that references this table.

Notebooks are available in Python and R, as well as Stata or SAS (with a corresponding license). Notebooks come pre-installed with common libraries in the data science toolkit, but you can also customize the notebook’s dependencies and startup script to create a custom, reproducible analysis environment that meets your needs.

The default notebook configuration is free, and provides access to 2 CPUs and 32GB working memory, alongside a 60GB (SSD) disk and gigabit network. The computational power of these default notebooks is comparable to most personal computers, and will be more than enough for many analyses.

If you're working with larger tables, creating an ML model, or performing other particularly intensive tasks, you may choose to configure additional compute resources for the notebook. This will cost an hourly rate to run based on your chosen environment, and require you to purchase compute credits on your account.

Notebooks come pre-populated with some starter code you can use to import data, and the API docs contain comprehensive documentation and further examples.

From here it’s all up to you in how you want to analyze and visualize your data. Once you’ve finalized your notebook, you can easily export it in different formats to share your findings!

To learn more about analyzing data, see our Work with data in notebooks guide.

4. Share and collaborate

You can share your in-progress work or finished results with collaborators by sharing this workflow.

Researchers can work side by side in this workflow in real-time. Leave comments to communicate, and see a visual cue for what each person is working on. You can even collaborate within a running notebook at the same time.

If any of the data in your workflow is restricted, your collaborator must also have access to those datasets in order to view their derivatives within your workflow.

Next steps

Share and collaborate

Redivis workflows are built for collaboration and include real-time visuals to see where collaborators with edit access are working in the workflow, and a comments interface to discuss changes asynchronously.

Share your workflow to work with collaborators in real time, and make it public so that others can fork off of and build upon your work.

Export data

If you'd like to export data to a different system, you can download it in various file formats, reference in Python / R, or visualize in tools such as Google Data Studio.

Learn more in the Export to other environments guide.

Browse our example workflows

Redivis workflows excel at working with large tables, whether it's filtering and joining, complex aggregation and date manipulation, or visualization and analysis.

Learn more in the Example workflows guide.

Other

Cast

Converts the type of a variable. Consider using the "retype" step for additional functionality. –> learn more

CASE WHEN @expression IS NULL THEN NULL ELSE COALESCE(SAFE_CAST(@expression AS @castType), ERROR(FORMAT('Could not cast @expression to @castType, encountered value: %t', @expression))) END

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

@castType

any of: INT64, FLOAT64, STRING, TIME, DATETIME, DATE, BOOLEAN

true

(Choose a type)

@safe

any

true

-

Coalesce

Takes the first non-null value of a set of values. –> learn more

COALESCE(@expression)

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

variables or literals

any

true

-

Constant

Create a constant value as a variable –> learn more

undefined

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@literal

variable or literal

any

true

-

Hash

Returns an MD5 hash of all values as a base64 encoded string. Non-string values will be first coerced to strings. Note that order of inputs will affect the hash. –> learn more

TO_BASE64(MD5(CONCAT(@expression)))

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

variables or literals

any

true

-

Sample

Create a random, deterministic value in the range of [0, 1) based on a specific set of variable(s). –> learn more

(FARM_FINGERPRINT(CONCAT(@variable)) + POW(2, 63)) / POW(2, 64)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

variables

any

true

-
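As a usage sketch, the generated value can feed a row filter to keep a deterministic subsample; the table and variable names below are hypothetical.

    -- Keep a reproducible ~10% sample keyed on patient_id
    SELECT *
    FROM (
      SELECT
        *,
        (FARM_FINGERPRINT(CONCAT(CAST(patient_id AS STRING))) + POW(2, 63)) / POW(2, 64) AS sample_value
      FROM encounters
    )
    WHERE sample_value < 0.1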

UUID

Generate a random universally unique identifier (UUID) –> learn more

GENERATE_UUID()

Return type

string

Parameters

Build your own site with Observable

This guide outlines an example static site hosted by the Redivis team. Follow along at the Observable on Redivis Github repository to see all the details and build your own!

Overview

It's easy to make use of Redivis data anywhere on the web, whether by interfacing programmatically via our client libraries, visualizing it via one of our export integrations, or embedding a table directly in another webpage.

To present Redivis data in a fully customizable way, you may want to build your own dashboard-style site. The following example shows a static site built with Observable Framework, deployed via Github Pages, that connects to Redivis data using our client libraries.

To try a lighter-weight "dashboard" for analyzing and visualizing your data, you can use integrated Redivis notebooks. With notebooks, you can work directly in your Redivis workflow alongside relevant source data, leverage a breadth of techniques and libraries available in the Python, R, Stata or SAS ecosystems, and easily maintain a central asset to annotate, collaborate, and share.

1. Identify your Redivis data

Identify the data sources on Redivis that will provide the content to build your site. You may be interested in presenting facets of a dataset already hosted on Redivis, highlighting tables you uploaded yourself, or showing outputs generated in one of your workflows. In our example, we'll use weather station locations contained in the GHCN Daily Weather dataset, and outputs from a public Redivis workflow that uses the same source data, to visualize locations and aggregate precipitation measurements from around the world.

To reference a Redivis table, choose the "Export table" option and navigate to the "Programmatic" tab (e.g., for the GHCN Daily Weather "Stations" table), and see the code snippet for unique identifiers for the owner, dataset/workflow, and table. More details about referencing Redivis resources can be found in the API documentation.

2. Build your Observable project

With chosen data content in mind, the next step is to choose a set of tools to build our custom dashboard and deploy it to the web. Of the numerous choices, Observable Framework provides an approachable option with support for working with data in many programming languages (not just javascript), powerful visualization libraries built in, and excellent documentation. To start with a copy of our example dashboard, you can clone our public Github repo and follow the README.md instructions to develop a similar site in a local environment. To create a brand new project, and to reference additional details and development strategies, see the Observable Getting started guide.

3. Load data with a client library of choice

You'll need to generate a Redivis API access token with appropriate permissions. See the Authorization section in our API documentation to create your own token to read data from Redivis.

To successfully authorize data fetching functionality in your development environment, be sure to export a REDIVIS_API_TOKEN environment variable, with the following terminal command: export REDIVIS_API_TOKEN='MY_API_TOKEN'

IMPORTANT: API access tokens operate like passwords, and can allow another party to access resources on your behalf.

You should never share tokens, and you should avoid committing them to source control where collaborators may have access (either now or in the future). See the Deploy and share your site section below for details on how to use secrets to store the Redivis API token necessary for your project.

Writing a "data loader"

In Observable, a data loader is a file that accesses data and writes it to a standard output channel for consumption by renderers, to be ultimately displayed on your site.

In our example, we'll illustrate the use of both javascript and python to pull data from Redivis into our dashboard, using the Redivis API via the js and python client libraries.

Using the redivis-js client library

In a web environment, the redivis-js client library allows for fetching and manipulating data via simple javascript functions.

In our example, we first install the library by specifying the latest version in our package.json and running npm install in our development environment to install or update all specified packages.

Then, in the Observable data loader file redivis-js-precipitation.json.js, we authorize access, and then listVariables and listRows from our specified precipitation table.

Using the redivis-python client library

In a python environment, the redivis-python client library allows for fetching and manipulating data via simple python functions.

Using the redivis-python client library requires that you develop your web application in a virtual python environment. Observable documents this process – essentially, create an environment with uv, activate it on your machine, and install the relevant packages to the environment.

In our example, we'll create a requirements.txt file that specifies the latest version of the client library, both for development and future deployment. Running the following commands in your virtual python environment will upgrade pip, and then install any package specified in the requirements.txt file:

    python3 -m pip install --upgrade pip
    pip install -r requirements.txt

Then, in the Observable data loader file redivis-python-geodata.json.py, we create a pandas dataframe from our specified stations table, and output the records.

4. Visualize data

With data loaded and parsed into JSON format, it can now be manipulated and presented as you see fit. Among a wealth of modern web visualization tools is Observable Plot, which provides a polished selection of simple visualizations and corresponding functions to group and filter data, and pairs nicely with our development framework.

In our example, we use several Observable Plot functions to display our data on a map – using Plot.plot() with Plot.geo() to show a globe – and in a hex-binned histogram – using Plot.dot() and Plot.hexgrid() for data aggregation. We also add a range input to allow viewers to rotate the globe visualization.

Then, we use the built-in display command to print the JSON-shaped payloads, to give the viewer a quick look at the raw data.

5. Deploy and share your site

With our dashboard finalized, you can deploy your project anywhere on the web. Observable provides a quick command (npm run build) to generate static files to be hosted on a server of your choice, as well as a deployment service if you'd like to host with Observable.

Deploying to Observable

You can run npm run deploy in the command line, and follow the prompts to deploy your project to an address on Observable. Further configuration and permissioning is available through your Observable account.

Deploying to Github Pages

Github Pages provides a hosting service that's easy to integrate with your Github repository and any Github Actions needed to deploy your project.

In our example, we deployed our project to Github Pages in a few simple steps.

First, we need to specify a REDIVIS_API_TOKEN to support Redivis client library authorization in the live deployment. We used an "action secret" within the Settings of the Github repo, which sets the environment variable REDIVIS_API_TOKEN to our appropriately scoped token and remains completely private while our repository stays public.

Next, we write a Github Action to build and deploy the project, which will run anytime we push to the main branch of our repository. The action is specified via the deploy.yaml file, and relies on a set of open-source actions and our own custom code to do the following steps:

  • Checkout the code

  • Set up a Node.js environment, with a specific node version (20.6)

  • Install our python dependencies (via pip, specified in requirements.txt)

  • Build the static files for the site

  • Deploy the files (in the generated dist folder) to Github Pages via a specific branch (gh-pages)

After adding this file to your Github repository, any push to the main branch will trigger a deployment following the steps above, the status of which can be viewed under the Actions tab of the repository.

Finally, Redivis has hosted a number of different team projects at a custom endpoint specified in our organization's Github Pages settings, which you can see at labs.redivis.com.


No-code visualization

This guide demonstrates using a Redivis workflow to gather key variables from different tables, clean them, and consolidate them into a single table for analysis in Google Looker, all using point and click interfaces.

Workflow objective

We'll use Gini coefficient data (a common measure of income distribution) from the American Community Survey (ACS) to build a map that shows the range of income inequality by US county in 2018.

This workflow is on Redivis! We also suggest you recreate this workflow as we go to best learn the process.

1. Explore data and start a workflow

We have already uploaded all relevant data for this workflow into a single dataset in the Redivis Demo organization called Example Workflow Dataset, which contains three tables:

  • ACS Gini coefficient table

  • ACS table that measures population by county

  • National Weather Service table that maps US counties to lat/long coordinates.

Add this dataset to your workflow and click on the dataset node to see all of the available tables. Select Gini Index by County and look at the data format by clicking Cells.

2. Clean data

We want to map the Gini estimate and population by US county in our final data visualization. To map data points by county, we will need a variable that represents the latitude, longitude coordinates of that county. We also want the state, county name, and five-digit county code.

Therefore, we want our final dataset to include six distinct variables:‌

  • Gini Estimate

  • State

  • County Name

  • County Code

  • Latitude, Longitude

  • Population

We will get Lat/Long and Population by joining our table with other tables later on in the workflow. For now, we can create a transform to start reshaping our data.

Rename: Gini Estimate

We already have this variable in our dataset: B19083_001E. We will create a Rename step to rename this to gini_estimate so we can keep better track of it.

New variables: County and State Names

We want to break up our NAME variable into two separate columns, county and state. So we will add the Create variables step and use the Regexp extract method.

This method allows users to create a new variable from the characters of an existing variable that match a specific pattern. For example, to create a new variable for county, we can select all characters in NAME preceding the comma. We can then add a new block and do a similar process on the NAME variable to get the State.

New variable: County Code

The last five digits in the GEO_ID variable represent the county code. We create another new variable block and name our new variable County_code. Select the Substring method and select all characters starting at index 10, up to a max length of five.
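
To make these two new-variable steps concrete, here is roughly what they do to a single ACS-style row (a quick Python sketch purely for illustration; the NAME and GEO_ID values are hypothetical examples of the ACS format, and the transform itself requires no code):

import re

# Hypothetical ACS-style values for one county row
NAME = "Autauga County, Alabama"
GEO_ID = "0500000US01001"

# Regexp extract: characters before / after the comma
county = re.match(r"^(.*?),", NAME).group(1)      # "Autauga County"
state = re.search(r",\s*(.*)$", NAME).group(1)    # "Alabama"

# Substring: 5 characters starting at index 10 (1-based), i.e. the county code
county_code = GEO_ID[9:14]                        # "01001"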

Variable selection

Finally, we choose which variables to keep or discard in our transform. We’ll keep all variables except for GEO_ID, NAME (county and state combined), and B19083_001M (the margin of error for the Gini estimate).

Click the run button in the upper right-hand corner to run this transform.

3. Sanity check the output table

Click on the newly generated output table to investigate the output of the transform. We can see that this table has the same number of records as our initial table and has the four variables we selected. If we click on the County and State variables we can see in the frequency tables that they look like we wanted them to.

Since this table looks like we expect we can move on to the next step! Otherwise we'd need to go back to the initial transform to change our inputs.

4. ‌Join geographic information

We now have a table with the Gini estimate, state, county name, and county code, but we need the latitude and longitude information in order to map each county and we also want to account for the size of each county with a population variable. To do this, we'll perform two joins.

We could continue to work in our initial transform but we are choosing to create new transforms for each join to keep steps separated conceptually and provide output tables to reference along the way.

Matching county to latitude, longitude data

First, we will match each county code to its lat/long coordinates by joining with the "County to Lat/Long Coordinates" table.

Create a new transform and add a Join step.

A Left Join returns the complete source table (Gini Index by County) and all matching records from the right table (County to Lat/Long). We set the County_code variable in the source table equal to the County_code variable in the right table, which matches a set of lat/long coordinates to each county code (if there is no coordinate, it will return Null).

When you set two variables equal to each other in a Join, they must be of the same type. If, for example, you set a string type variable equal to an integer type variable, you will have to retype the string as an int (or vice versa).

Some county codes, however, map to more than one set of latitude, longitude coordinates so we must create two new variables in this transform: unique_lat and unique_long.

For unique_lat, we will partition on all kept variables except LAT and take the average of the LAT variable. This says: for each county, compute the average of all possible latitudes and store that average in the variable unique_lat.

We will select all variables from the source table as well as unique_lat and unique_long in our output table. Finally, we select "distinct" in our variable selector so that duplicate records are dropped and only distinct rows remain.

Running this transform outputs the following table, where each county now corresponds to a unique set of latitude and longitude variables.

Transform #2: Output‌

Matching county to population data

We now have all the desired variables in our dataset except for population per county, which we will need when we create the final visualization.‌

Create a new transform and add a Join step.

Using the same steps from the previous join, we will perform a second join to match each county to its total population by incorporating the Population by County table (also in the original dataset). We select a Left Join and set the County_code variable in the source table (Gini Index by County) equal to the County_code variable in the right table (Population by County).

Selecting all variables to remain in our output table and running this transform outputs the following table.

5. Final data cleaning

We will be using Google Looker Studio to visualize this data so we will need to update our data to make it work smoothly with their specified format for geographic data. Using their reference sheet, we can see that we will need our latitude and longitude information in the format:

Comma separated latitude and longitude decimal values (e.g., "51.5074,-0.1278" specifies London, England)

We will need to combine our two separate values and add a comma between them.

We can do that using the Concat method for creating a new variable, but that method only accepts string inputs, so we'll first need to retype the unique_lat and unique_long variables from float to string using the Retype step.

Then we use the Create new variables step to make a new variable named latitude_longitude and use the Concat method to concatenate the unique_lat string, a string containing a comma, and the unique_long string.

We can now discard unique_lat and unique_long and keep the combined variable, latitude_longitude.

Running this transform yields the final table. From here, we can edit variable metadata and/or download our new dataset in the format that we'd like.

6. Export data to Looker Studio

The next step would normally be to create a notebook in this workflow and use Python, R, or Stata to analyze this table. However, if we want to create a quick visualization and aren't familiar with any of those coding languages, we have easy options to export data to other systems such as Google Looker Studio.

For this example we will link this table directly to Google Looker Studio by clicking Export table on this table.

We could download our final table and then re-upload it into Looker Studio, but we choose to link the table through the Redivis connector so that if we come back to this original workflow and make changes, they will be reflected in the visual we're about to make there.

Follow the prompts of the connector to log in and authorize access, then we'll need to indicate the table we'd like to use. For that we'll need:

  • The owner of the workflow's ID (shown in the top menu bar of the workflow)

  • The name of the workflow (shown in the middle of the black workflow toolbar)

  • The name of the table (shown on the title of the table node)

Note that the connector won't recognize some symbols (such as :) so you might need to update one of them if there is an issue connecting.

When the table is being imported we will have the option to change the type of our variables. We'll need to change the type of our latitude_longitude variable from string to the Latitude, Longitude geography type.

Then click Create report to get started!

7. Build a visualization

For this example we are going to build a bubble map that shows the size of population against the intensity of gini disparity in certain regions.

Get started by adding a chart and selecting the bubble map type. We will need to select Redivis as the data source for this map.

Then we will need to define which of our variables map to which parts of the visual. We will also want to change how they are aggregated, from SUM to AVG.

  • Location: latitude_longitude

  • Tooltip: County

  • Size: (AVG) population

  • Color metric: (SUM) gini_estimate

The resulting map is automatically interactive. Users can hover over each bubble and view the county name, population, and Gini coefficient. The bubble size is determined by the county population and the bubble color is determined by the Gini value. View the interactive report:

You can continue to create a variety of visualizations in this report, including scatter plots, Google maps, and stacked bar charts.

Refer to the Looker Studio Help pages for additional specific guidance!

Next steps

Perhaps we see something in this workflow we want to tweak, or we want to go back and change some of our data decisions. Workflows are iterative, and at any point you can go back and change your source data, your transform configurations, or your notebooks and rerun them.

Notebooks can also create output tables, which allow you to sanity check the work you did in the notebook or create a table to use in another notebook or transform. You can also fork this workflow to work on a similar analysis, or export any table in this workflow for work elsewhere.

Step: Stack

Overview

The Stack step adds or removes rows based on another table whose variables align with your existing variable structure.

Example starting data:

Source (t0)             table1 (t1)
/*---------+--------*   /*---------+--------+
 | student | score  |    | student | score  |
 +---------+--------+    +---------+--------+
 | jane    | 83     |    | tom     | 83     |
 | kim     | 35     |    | sherri  | 92     |
 | sam     | 74     |    | pat     | 48     |
 | zay     | 62     |    | jade    | 87     |
 *---------+--------*/   *---------+--------*/

Example output data:

Union rows from table1.

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | kim     | 35     |
 | sam     | 74     |
 | zay     | 62     |
 | tom     | 83     |
 | sherri  | 92     |
 | pat     | 48     |
 | jade    | 87     |
 *---------+--------*/

Step structure

Field descriptions

Field
Description

Stacked table

The table containing the rows you wish to add.

Stack type

How the rows will be added. (More information below.)

Variable alignment

How you will define the variable structure between the source and the Stacked table.

  • All matching variables: Will automatically match variables and only include variables with a matching name; the rest will be dropped.

  • All variables: Will automatically match variables and keep all variables in the output. Values will be marked null where there is no match for an existing variable.

  • Manually match variables: Will require you to align variables between the two tables.

Retype variables

Variables will be retyped in order to be aligned.

Only keep distinct rows

If any added rows are an exact duplicate of an existing row, they will be dropped from the output.

Create variable for table name (Union only)

To record where each row came from, you can opt to create a new variable containing the name of the table it originated in.

Stack types

The most common stack type is a Union, which can be particularly useful when combining data which has been broken up into multiple tables with the same structure.

Union

Appends all of the stacked table’s rows to the source table.

Except

Keep only those distinct rows in the source that do not match any row in the stacked table.

Intersect

Keep only those distinct rows that appear in both the source and the stacked table.
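
For intuition only, here is a rough pandas analogy of the three stack types, using the example tables above (this is not how the transform runs; it simply mirrors the row logic):

import pandas as pd

source = pd.DataFrame({"student": ["jane", "kim", "sam", "zay"], "score": [83, 35, 74, 62]})
table1 = pd.DataFrame({"student": ["tom", "sherri", "pat", "jade"], "score": [83, 92, 48, 87]})

# Union: append all of the stacked table's rows to the source table
union = pd.concat([source, table1], ignore_index=True)

# Except: keep distinct source rows with no exact match in the stacked table
marked = source.merge(table1, how="left", indicator=True)
except_rows = marked[marked["_merge"] == "left_only"].drop(columns="_merge").drop_duplicates()

# Intersect: keep distinct rows that appear in both tables
intersect = source.merge(table1, how="inner").drop_duplicates()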

Examples

Example 1: Basic union

Let's say I have data broken out into two tables with the same structure. I want to add all rows together as they are.

Starting data:

Source (t0)             table1 (t1)
/*---------+--------*   /*---------+--------+
 | student | score  |    | student | score  |
 +---------+--------+    +---------+--------+
 | jane    | 83     |    | tom     | 83     |
 | kim     | 35     |    | sherri  | 92     |
 | sam     | 74     |    | pat     | 48     |
 | zay     | 62     |    | jade    | 87     |
 *---------+--------*/   *---------+--------*/

Input fields:

  • Stacked table: Our data is in table1, so we select it here.

  • Stack type: We want to add all rows so we select Union here.

  • Variable alignment: Somehow our variable names shifted between tables so we can't automatically align tables. We choose Manually match variables here and then fill in the variable names under Source variable and Stacked table variable.

    • Where the variable names are different, the output will keep the name of the Source variable when it exists.

Output data:

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | kim     | 35     |
 | sam     | 74     |
 | zay     | 62     |
 | tom     | 83     |
 | sherri  | 92     |
 | pat     | 48     |
 | jade    | 87     |
 *---------+--------*/   

Example 2: Bulk union and variable misalignment

To continue the previous example, let's say we have a third table of data, and it has additional information (a date variable) that was not present in the first two. We want to keep this information and also keep track of which rows came from which table.

Starting data:

Source (t0)             table1 (t1)             table2 (t2)
/*---------+--------*   /*---------+--------*   /*---------+--------+------------*
 | student | score  |    | student | score  |    | student | score  | date       |
 +---------+--------+    +---------+--------+    +---------+--------+------------+
 | jane    | 83     |    | tom     | 83     |    | barb    | 46     | 2020-01-01 |
 | kim     | 35     |    | sherri  | 92     |    | mitch   | 79     | 2020-01-01 |
 | sam     | 74     |    | pat     | 48     |    | oleg    | 68     | 2020-01-01 |
 | zay     | 62     |    | jade    | 87     |    | maria   | 85     | 2020-01-01 |
 *---------+--------*/   *---------+--------*/   *---------+--------+------------*/

Input fields:

  • We create one block for the table1 union, and a second block for the table2 union

  • Variable alignment: Since we want to keep all variables no matter if they have matches, we select All variables.

  • Create variable for table name: We check this box since we want to keep track of which table each row came from.

    • Note that Source table name is automatically populated with the name of this transform's source table.

Output data:

/*---------+--------+------------+---------------*
 | student | score  | date       | source_table  |
 +---------+--------+------------+---------------+
 | jane    | 83     | null       | cohort        |
 | kim     | 35     | null       | cohort        |
 | sam     | 74     | null       | cohort        |
 | zay     | 62     | null       | cohort        |
 | tom     | 83     | null       | table1        |
 | sherri  | 92     | null       | table1        |
 | pat     | 48     | null       | table1        |
 | jade    | 87     | null       | table1        |
 | barb    | 46     | 2020-01-01 | table2        |
 | mitch   | 79     | 2020-01-01 | table2        |
 | oleg    | 68     | 2020-01-01 | table2        |
 | maria   | 85     | 2020-01-01 | table2        |
 *---------+--------+------------+---------------*/  

Example 3: Except

Let's say we have a table with information about students and their test scores, and we have identified some students that have dropped the course and that we no longer need in the original table. We have gathered those students in table1.

Starting data:

Source (t0)             table1 (t1)
/*--------+-------*      /*------+-------*
 | name   | score |       | name | score |
 +--------+-------+       +------+-------+
 | jane   | 83    |       | kim  | 35    |
 | kim    | 35    |       | pat  | 62    |
 | sam    | 74    |       *------+-------*/
 | pat    | 62    | 
 | tom    | 83    |
 | sherri | 92    |
 *--------+-------*/       

Input fields:

Output data:

/*--------+-------*
 | name   | score |
 +--------+-------+
 | jane   | 83    |
 | sam    | 74    |
 | tom    | 83    |
 | sherri | 92    |
 *--------+-------*/       

The rows that were in both tables are now removed.

Example 4: Intersect

Let's say we have two tables full of students, one with students who took our fall semester class and one with students from the spring semester class. We want to only keep rows for students who took both classes, i.e. those who are present in both tables.

Source (t0)             table1 (t1)
/*--------+-------*      /*------+-------*
 | name   | id    |       | name | id    |
 +--------+-------+       +------+-------+
 | jane   | 101   |       | kim  | 104   |
 | kim    | 104   |       | pat  | 108   |
 | sam    | 105   |       | mae  | 109   |       
 | pat    | 108   |       | zay  | 110   | 
 | tom    | 112   |       | jade | 111   |
 | sherri | 117   |       | tom  | 112   |
 *--------+-------*/      *------+-------*/      

Input fields

Output data:

/*------+-------*
 | name | id    |
 +------+-------+
 | kim  | 104   |
 | pat  | 108   |
 | tom  | 112   |
 *------+-------*/     

Only rows that were in both tables now remain.

Analyzing large tabular data

This guide demonstrates using a Redivis workflow to gather key variables from different tables, clean them, and consolidate them into a single table for analysis in a notebook.

Workflow objective

We want to take weather data collected around the world and use it to understand how precipitation trends in the US have changed over time.

This workflow is on Redivis! We also suggest you recreate this workflow as we go to best learn the process.

1. Explore data

All the weather data we need is contained in the Demo organization dataset GHCN Daily Weather Data.

To get started we want to understand this dataset and what information is in each table. We can look at the dataset page to learn more about it, including its overview information, metadata, and variable summary statistics. Since this dataset is public we can also look directly at the data to confirm it has the information we need. Some tables jump out as ones we will want to work with:

Daily observations

This table has nearly 3 billion records and seems to be the table in the dataset with the most information so we will start here. There is a variable that looks like it will be very helpful for our goal: element

If we click on the variable name, we can see a searchable frequency table with all this variable's values. One value that jumps out is PRCP which we can see on the value label represents the precipitation. Paired with this is the variable value which contains a numeric value recording the magnitude of the element. We can learn even more about this by clicking on the data dictionary in the dataset's overview page to see that this value is measured in either mm or inches.

We can see that this table doesn't contain any information about location though, so we'll need to find that somewhere else.

Stations

This table contains latitude and longitude variables as well as a linking variable id which will allow us to link in precipitation information from the Daily observations table, since it has the same id variable.

Since there are two tables we want information from we know that we will need to do a join as part of our data reshaping.

2. Create a workflow

At the top of this dataset page we can click the Analyze in workflow button to get started working with this data.

You can add this dataset to an existing workflow you already have access to, or create a new workflow to start from scratch.

3. Clean and reshape data

We will use transforms to clean the data, as they are best suited for reshaping data and will quickly output new tables we can continue to work with. Even though we might be more comfortable with Python or R, the table we want to work with is 132GB and we would not be able to work with it in a notebook without specialized equipment.

Filter for precipitation

If we click on the element variable we can see that there are many types of elements recorded in this table. Since we only want information about precipitation (PRCP), we will start by filtering out any records with a different value in the element variable.

Start by creating a new transform on the Daily observations table and add a Filter step. Configure the filter to select records where element = PRCP.

Compute annual precipitation

Looking at the data, it's clear there are many observations of precipitation at each location over time. We know we want to look at annual trends so we can start by aggregating this data to only have one value per year per location.

Conceptually, we will need to aggregate on the year of each record, but if we look at the date variable in this table it also contains the month and day. For aggregation we will need a field which contains only the year which we can collapse on.

In this same transform we will add a Create variables step and make a new variable named year using the date extract method, which will pull the year out of the date value.

Then we will add an Aggregate step to aggregate the values based on our new year variable. The first thing we will do is select the variables to aggregate on, that is, the variables whose duplicate value combinations we want grouped together. Since we want information on precipitation per station per year, we should choose to aggregate on the id variable (station) and the year variable. When executed, this will group all records with the same combination of values in id and year, drop all other variables from the table, and then collapse duplicate records down to just one record per unique combination of those two variables.

But the most important step here is that we want to gather information about the duplicate records that were dropped! If there were 400 records of precipitation in a year for a particular station, we want to know what those records all add up to. To do this we will Create a new aggregate variable within this aggregation step named annual_precip. We want to aggregate the information in the value column, since that contains the numeric amount of the corresponding element variable. Since we want the total amount across all dropped records we will use the Sum method.
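
For intuition, the filter, year extraction, and aggregation described above are roughly equivalent to the following pandas operations (a sketch only; the transform performs this at database scale, and df stands in for the Daily observations table):

import pandas as pd

# df is assumed to contain the Daily observations table (id, date, element, value, ...)
prcp = df[df["element"] == "PRCP"].copy()

# Create a year variable from the date, then sum precipitation per station per year
prcp["year"] = pd.to_datetime(prcp["date"]).dt.year
annual = (
    prcp.groupby(["id", "year"], as_index=False)["value"]
        .sum()
        .rename(columns={"value": "annual_precip"})
)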

Select variables and run

The final step in a transform is selecting which variables we would like to populate the resulting output table. Since we did an aggregation step, this list becomes limited to only the variables that we aggregated on. Since we want both of these variables in our output we make sure both are selected and visible in the right side list of the footer.

With everything in place we will run this transform to create a new table, by pressing the Run button in the top right corner!

4. Sanity check the output table

Now that we have created a new table, we can inspect it to make sure our steps accomplished what we expected them to.

Table characteristics

Click on the output table below the transform to view it. We can see that it contains the two variables we expected it to, based on our variable selection. The table has almost 3 million records, down from the almost 3 billion in our original table, which makes sense given our aggregation step.

Variable characteristics

We can also inspect each variable further. If we click on the year variable we can see that all values are four digit years, and the min and max values in the summary statistics show that the years range from 1781 - 2021, which makes sense. Looking at the frequency table we can see that each year is represented many times in the records, which is what we expect since there are multiple ids (locations) per year.

If we click on the annual precipitation variable we can see a max value of 792,402 (mm or in) which seems very high but possible for a given year. The minimum value is -148,917 which doesn't seem right. When we look at the data dictionary it doesn't indicate that there should be negative values.

We can investigate our data further by clicking on the Query tab of the table, choosing "Sort on variable" from the templates menu, and sorting on our annual_precip variable ascending (ASC). It looks like there are relatively few stations with negative values.

If we were doing a rigorous analysis of this data we might dig deeper into why these values exist in this dataset, or decide to exclude these outliers. However, since this is an example, we can decide to leave them in and see how they affect our outputs. We can easily go back and exclude them from the workflow at a later point.

5. Join geographic information

Now that we have a table we are satisfied with that contains annual precipitation information, we want to join in the latitude and longitude information about each station so we can make a geographical visual as part of our analysis.

This geographic information is included in the Stations table, so we will need to do a join which connects records from two tables.

While we could do this transformation step in our previous transform and rerun it, let's create a new transform to make this join so that we will have our previous output table saved to compare the output to.

In the new transform, add a Join step. Set the joined table to the Stations table from our original dataset. The join type will depend on which records we want to keep based on how the matching process goes. Since we only want to keep records that have both information from our source table (annual precipitation) AND information from our joined table (station latitude and longitude), we set the join type to Inner join.

Since our identifier in both tables is the variable id, set id from the source table (t0) as the left side of the join condition and id from the joined table (t1) as the right side of the join condition. This means that for every record in our source table, we look for a corresponding value in the Stations table and join in the values from additional columns in that table.

When you set two variables equal to each other in a Join, they must be of the same type (string, integer, etc). In this example both id variables are string type so it works, but if one had been an integer we would have needed to retype one of them first to match.

There are a number of additional variables in the Stations table, but we are only interested in the latitude and longitude variables. In our variable selector we keep all variables from our source table (year, id, annual_precip) as well as latitude and longitude from the joined table.
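
Conceptually, this inner join behaves like the following pandas merge (illustrative only; annual and stations stand in for our output table and the Stations table):

# Keep only stations that appear in both tables, carrying over their coordinates
joined = annual.merge(
    stations[["id", "latitude", "longitude"]],
    on="id",
    how="inner",
)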

With everything set up, run this transform and sanity check the output table! We started with a 132GB table and have narrowed it down to 134MB containing only the data we want in the format we will need.

6. Analyze outputs in a notebook

Next we want to generate an interactive visual to understand annual precipitation in different locations over time. To do this we will create a Notebook node on our output table.

For this workflow we will use Python but you can also use R, Stata, or SAS if you'd like. When you create the notebook for the first time it will start up. Notebooks must be running to execute code.

Redivis notebooks come with many common packages preinstalled and we will use those in this example. If you'd like you can install additional packages by clicking the Dependencies button.

Since this notebook contains only public data we can install packages at any time, but for restricted data notebooks do not have internet access and packages can only be installed when they are stopped.

Reference data

Newly created notebooks come with standard code to import the Redivis library and reference the source table in a pandas dataframe within the notebook.

You can use this pandas code or replace it to use the dataframe of your choice. To use the standard code, click inside this cell and press the run button, or press Shift + Enter.

Select data from the United States

Now we will create a new cell to organize our work by pressing the + button in the top bar.

In this cell we want to limit our records to only ones whose latitude and longitude values fall in a specific range of the continental United States.
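
us_df = df[
    (df.latitude >= 24.396308)
    & (df.latitude <= 49.384358)
    & (df.longitude >= -124.848974)
    & (df.longitude <= -66.885444)
]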

Compute average change

We want to see how our annual precipitation variable has changed at stations in the United States, so for each station we will group by the id variable and compute the average annual_precip for two multi-decade periods, then calculate the percent change between them.
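
# Compute avg precip for each station between 1980-2010
df_1980_2010_average = us_df[(us_df.year > 1980) & (us_df.year < 2010)].drop(columns=['year'])
df_1980_2010_average = df_1980_2010_average.groupby(['id']).mean()
df_1980_2010_average.rename(columns={"annual_precip": "_1980_2010_avg"}, inplace=True)

# Compute avg precip for each station between 1990-2020
df_1990_2020_average = us_df[(us_df.year > 1990) & (us_df.year < 2020)].drop(columns=['year'])
df_1990_2020_average = df_1990_2020_average.groupby(['id']).mean()
df_1990_2020_average.rename(columns={"annual_precip": "_1990_2020_avg"}, inplace=True)

# Compute the percent change between the two periods and drop extreme outliers
diff_df = df_1980_2010_average.join(df_1990_2020_average["_1990_2020_avg"])
diff_df['deviation'] = (diff_df._1990_2020_avg - diff_df._1980_2010_avg) / diff_df._1980_2010_avg * 100
diff_df = diff_df[(diff_df.deviation < 25) & (diff_df.deviation > -25)]

diff_df = diff_df.dropna()
diff_df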

Plot the results

Now that we have the information calculated, we can plot it! We'll use Plotly to do the heavy lifting, and point it at our relevant variables.
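
import plotly.figure_factory as ff

import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
px.set_mapbox_access_token("pk.eyJ1IjoiaW1hdGhld3MiLCJhIjoiY2thdnl2cGVsMGtldTJ6cGl3c2tvM2NweSJ9.TXtG4gARAf4bUbnPVxk6uA")

# Hex-binned map of the percent change in precipitation between the two periods
fig = ff.create_hexbin_mapbox(
    data_frame=diff_df, lat="latitude", lon="longitude",
    color="deviation",
    agg_func=np.mean,
    title="% Change precipitation, 1981-2010 vs 1991-2020",
    range_color=[-15,15],
    nx_hexagon=50, opacity=0.4, labels={"color": "Percent change"}, color_continuous_scale="Icefire_r",
)

fig.update_layout(margin=dict(b=0, t=0, l=0, r=0))
fig.show("nteract")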

Now we've created an interactive figure showing the change in precipitation by location in the United States.

Next steps

Perhaps we see something in this plot we want to investigate further, or we want to go back and change some of our assumptions. Workflows are iterative, and at any point you can go back and change your source data, your transform configurations, or your notebooks and rerun them.

Notebooks can also create output tables, which allow you to sanity check the work done in the notebook or create a table to use in another notebook or transform. You can also fork this workflow to work on a similar analysis, or export any table in this workflow for work elsewhere.

Create and populate a dataset

Overview

Datasets are a core component of Redivis. Consisting of documentation, metadata, and tables, datasets allow you to store, version, and distribute a wide variety of data.

Anyone with a Redivis account can create a dataset in their workspace, and organization administrators can upload one to an organization via the administrator panel.

1. Create the dataset

You can create a new dataset by navigating to the Datasets tab of your workspace or your organization's administrator panel and clicking the New dataset button.

All datasets must have a name (unique to datasets for the user / organization).

You can set up your dataset in whatever order you'd like, but we recommend following the order below when getting started.

2. Upload data

This data might be in a tabular format (.csv, .tsv, .sas, etc.) or, more rarely, unstructured data such as images and text files.

Tabular data

All tabular data is associated with a table, and each dataset can have one or more tables. While you may release a dataset without any tables, this will be of limited use to other researchers, as Redivis provides numerous tools for understanding, querying, and generally working with tabular data.

If you haven't already worked with data in a workflow, we strongly recommend exploring that before creating a dataset so you can understand how researchers will work with your data.

When you're ready to upload data, we have broken that out into a separate guide.

Learn more in the Upload tabular data as tables guide.

Unstructured data

For unstructured data, go to the Files tab, where you can upload files from your computer or from another location where your data is stored, via an integration. You can put these into folders and create index tables to better keep track of them.

Note that any files uploaded here can't be transformed in the workflow tool or queried across Redivis (which require the table format).

Make sure any files you upload here contain this dataset's data. Any files with information about the data (such as data dictionaries or usage guides) should be uploaded as documentation on the Overview tab.

Learn more in the Upload unstructured data as files guide.

3. Edit metadata

It's easy to feel "done" after uploading your data, but documentation and metadata are essential to the usability of your dataset. Moreover, rich metadata will improve the discoverability of your dataset by providing more information and terms to the Redivis search engine.

Metadata can always be updated after your dataset has been released. While good metadata is essential, producing it can be a time consuming and iterative process, so you might prefer to provide some basic content initially and then improve it over time.

Dataset metadata

On the overview tab of the dataset, you can provide an abstract, detailed documentation blocks, supporting files and links, and subject tags for the dataset.

The abstract should be a brief overview of the dataset, while the rest of the documentation can be as thorough as you'd like. Each documentation block has a header for suggested content, and any you don't fill out won't be shown on the dataset page. These blocks contain a rich text editor, complete with support for embedded images. Most of this information will be visible to anyone with overview access, though you can also create custom documentation sections that require a higher level of access.

Make sure to audit your dataset's provenance information to give attribution to everyone who worked on the data. If this dataset is part of an organization, you can configure a DataCite account to issue a DOI for each dataset. Note that if your organization is configured to issue DOIs, one will automatically be issued for this dataset when you first publish it.

Table metadata

To help users understand what each table represents, you should update the description, entity, and temporal range for each table in the dataset. The entity should define what each row in a table represents: is it a person? an event? a charge? The temporal range can be tied to a specific variable (using the min/max of that variable) or defined explicitly.

Variable metadata

The tables in your dataset are made up of named variables, though rarely is this name enough to understand what the variable measures. On any table, click "Edit variable metadata" in order to populate the variable metadata.

On each variable, Redivis supports a label, a description, and value labels. The label is the most essential item; think of it as a more human-readable variable name. The description should contain more detailed information, everything from caveats and notes to collection methodology. Value labels are only applicable when the variable is encoded with keys (often integers or short strings) that map to the actual value. For example, a survey response might be encoded as 0: "No", 1: "Yes", 2: "Don't know", 3: "Declined to answer".

Editing variable metadata can be a tedious process, but Redivis does support the ability to import metadata from a file, and it will also automatically extract metadata when it's present in the uploaded data files (e.g., Stata or SAS upload types).

Learn more in the Documentation reference section.

4. Create a sample

If your dataset is particularly large, or if you want to control access to a sample of the data separately from the whole dataset, you should configure sampling on your dataset. This will allow researchers to work with a 1% sample of the data during initial exploration, and allow you to grant access to the sample independently of the full dataset.

To update the dataset's sample configuration, click on any table, and then click "Configure sample". When configuring the sample, you can generate a random sample for each table, or sample on a particular variable that is common across tables. If researchers will be joining tables across your dataset, it is highly recommended that you sample on that common join variable so that researchers can obtain a consistent 1% sample as they work with your data.

Learn more in the Dataset sampling reference section.

5. Configure access

Before releasing your dataset, it is important to define who can access the dataset and what the procedures are for applying and gaining access. Click the Configure access button on the top of the page to set up the access configuration.

Datasets owned by organizations have more options for access than datasets owned by users.

Access levels

Dataset access has five levels:

  1. Overview: the ability to see a dataset and its documentation.

  2. Metadata: the ability to view variable names and summary statistics.

  3. Sample: the ability to view and query a dataset's 1% sample. This will only exist for datasets that have a sample configured.

  4. Data: the ability to view and query a dataset's tables, and work with them in workflows.

  5. Edit: the ability to edit the dataset and release new versions.

Access levels are cumulative. For example, in order to gain data access you will need to have gained metadata access as well.

Usage rules

Even with data access, you may want to limit what other users can do with your dataset. Currently, you can configure export restrictions that limit:

  • The download location (e.g., to prevent researchers from downloading to their personal computer)

  • The download size, in bytes and/or rows

  • Whether admin approval is required before any export

Editors

You may also add additional dataset editors to help upload data and provide metadata content. These editors will be able to create and release new versions, and will have full access to the underlying data, though they cannot add other users, modify the access configuration, or bypass the dataset usage rules.

If the dataset is hosted by an organization, all administrators of the organization will be able to edit the dataset as well as its access configuration.

Permission groups

If the dataset is hosted by an organization, you will have additional options for configuring access to the dataset. The dataset can be assigned to a permission group to help standardize access procedures, and this permission group can contain requirements that help data managers fulfill contractual obligations and gather relevant information about the research being done on the dataset.

Learn more in the Configure access systems guide.

6. Release the dataset

Congratulations! Your dataset is ready to be released and utilized by the research community. But first, it is highly recommended that you validate and audit your dataset beforehand. Take a look at the number of rows, variables, and uploads in each table. Validate some of the variable summary statistics against what you expect. And to be truly thorough, add the dataset to a workflow and run some queries as if you were a researcher. Catching a mistake now will prevent headaches down the line if researchers uncover unexpected discrepancies in the data.

Once a version has been released, the data can no longer be edited. While you can unrelease a version within 7 days, this should generally be avoided; you'll need to release a new version to modify the data.

When you're confident that you're ready to go, click the "Release" button on the top of the page. If the button is disabled, hover over it to understand what issues are currently preventing you from releasing.

After clicking the button, you'll be presented with a final checklist of tasks. When you click the Release version button, the dataset will be immediately released and available to all users with access.

This dataset is now also considered Published. If you need to pause all activity and access to this dataset, you can return to this page in the future and Unpublish it temporarily.

7. Make updates as new versions

Once a dataset is released, you can return to it to make changes at any time. Changes to datasets are tracked in Redivis as versions. Anyone with access to a dataset can view and work with any of its versions.

How to work with versions when updating a dataset:

  • Any edits to the data content in tables will need to be released as a new version.

  • Edits to the dataset information, table information, or variable metadata can be made on the current version (or historic versions) and will be live as soon as they're saved.

  • Edits to the dataset name and access configuration will always affect all versions.

Creating the next version

All data within a dataset is encapsulated in discrete, immutable versions. Every part of the dataset except for the name and access settings are versioned. All tables in a dataset are versioned together.

After releasing the first version of the dataset, you can choose to create a new version at any time by clicking the button in the top right "Create next version". This version will be created as vNext, and you may toggle between this and historic versions at any time.

Subsequent versions always build on the previous version of the dataset, and changes made in the next version will have no effect on previous versions. Alongside modifications to the dataset's metadata, you may create, update, or delete any of the previous version's tables.

Replacing vs appending data

When uploading data to an existing table, you can choose whether to append the new uploads to your existing data or replace the entire table with the new data.

Version storage costs

Redivis computes row-level diffs for each version, efficiently storing the complete version history in one master table. This allows you to regularly release new versions and maintain a robust version history without ballooning storage costs.

Learn more in the Usage and limits for users and Billing for organizations reference sections.

Next steps

Start working with your data

Once your dataset is released, bring it into a workflow to transform and analyze it leveraging lightning fast tools from your browser.

Learn more in the Analyze data in a workflow guide.

Navigation

First value

Returns the value of a variable for the first row in a given analytic window.

Return type

dynamic (input-dependent)

Parameters

FIRST_VALUE(@variable[ @ignore_null NULLS])

Name          Type      Allowed values           Required  Placeholder (in UI)
@variable     variable  any Redivis type         true      -
@ignore_null  enum      any of: IGNORE, RESPECT  false     (Ignore nulls (default))

Lag

Returns the value of a variable on a preceding row within the analytic window.

Return type

dynamic (input-dependent)

Parameters

LAG(@variable, @literal[, @default_value])

Name            Type      Allowed values    Required  Placeholder (in UI)
@variable       variable  any Redivis type  true      -
@literal        literal   any integer       false     (1)
@default_value  literal   any Redivis type  false     (NULL)

Last value

Returns the value of a variable for the last row in a given analytic window.

Return type

dynamic (input-dependent)

Parameters

LAST_VALUE(@variable[ @ignore_null NULLS])

Name          Type      Allowed values           Required  Placeholder (in UI)
@variable     variable  any Redivis type         true      -
@ignore_null  enum      any of: IGNORE, RESPECT  false     (Ignore nulls (default))

Lead

Returns the value of a variable on a subsequent row within the analytic window.

Return type

dynamic (input-dependent)

Parameters

LEAD(@variable, @literal[, @default_value])

Name            Type      Allowed values    Required  Placeholder (in UI)
@variable       variable  any Redivis type  true      -
@literal        literal   any integer       false     (1)
@default_value  literal   any Redivis type  false     (NULL)

Nth value

Returns the value at the Nth row of a given window frame.

Return type

dynamic (input-dependent)

Parameters

NTH_VALUE(@variable, @literal[ @ignore_null NULLS])

Name          Type      Allowed values           Required  Placeholder (in UI)
@variable     variable  any Redivis type         true      -
@literal      literal   any integer              true      -
@ignore_null  enum      any of: IGNORE, RESPECT  false     (Ignore nulls (default))

Percentile (continuous)

Computes the specified percentile value for a variable within an ordered partition, with linear interpolation.

Return type

float

Parameters

PERCENTILE_CONT(@variable, @literal[ @ignore_null NULLS])

Name          Type      Allowed values           Required  Placeholder (in UI)
@variable     variable  any integer, float       true      (variable)
@literal      literal   any float                true      (Value between [0, 1])
@ignore_null  enum      any of: IGNORE, RESPECT  false     (Ignore nulls (default))

Percentile (discrete)

Computes the specified percentile value for a variable within an ordered partition. Returns the first sorted value with cumulative distribution greater than or equal to the percentile.

Return type

float

Parameters

PERCENTILE_DISC(@variable, @literal[ @ignore_null NULLS])

Name          Type      Allowed values           Required  Placeholder (in UI)
@variable     variable  any integer, float       true      (variable)
@literal      literal   any float                true      (Value between [0, 1])
@ignore_null  enum      any of: IGNORE, RESPECT  false     (Ignore nulls (default))

Files

Overview

Files are data entities uploaded to datasets on Redivis, used to store non-tabular (a.k.a. unstructured) data of any file type. You can view a dataset's files by clicking on the Files tab of any dataset, or on a dataset node in a workflow.

Inspecting files

You can inspect an individual file by clicking on its name to launch the file viewer. Any file can be downloaded or referenced within a notebook, and many file types can be previewed directly within Redivis, including:

  • 3D models

  • Audio files

  • CIF + PDB files (molecular + protein structures)

  • FITS files (common in astronomy)

  • DICOM

  • HDF5

  • HTML

  • Images

  • PDFs

  • Videos

  • Text/code

  • TIFFs

  • ZIPs

  • TEI

If you have a file type that you think should be supported that isn't, please let us know!

You can view example files on the Redivis Demo organization.

Folders and index tables

All files are assigned to a "folder" within the dataset. You can click on the folder name in the right grey bar to filter this dataset's files by the folder they are in.

Each folder has a corresponding index table that is present on the Tables tab of the dataset. These will match the folder name and have a File index label.

In these index tables, each row of the table represents a file in that folder.

Variable name
Description

file_id
A unique, system generated identifier for the file. Use the file_id to reference and download specific file(s) in your analysis.

file_name
The name of the file

size
Size of the file in bytes

added_at
Timestamp for when the file was added to this folder

md5_hash
MD5 checksum of the file contents, encoded as a base64 string

Working with files

By representing files within an index table, we can query and subset the files within a workflow, while joining them with other tabular metadata. In this way, file index tables behave like any other tabular data on Redivis.

To do deeper analysis, we can load these files in a notebook. Consult the documentation for your preferred programming language to learn more:

  • Working with unstructured data files in Python

  • Working with unstructured data files in R
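
For example, here is a minimal Python sketch of reading a file index table and downloading its files with the redivis-python client (the organization, dataset, and table names are placeholders, and you should confirm the exact listing and download helpers against the client documentation):

import redivis

# Reference a file index table (placeholder identifiers)
table = redivis.organization("demo").dataset("example_dataset").table("images_file_index")

# The index table can be queried like any other table
index_df = table.to_pandas_dataframe()
print(index_df[["file_id", "file_name", "size"]].head())

# Download the underlying files for deeper analysis (assumed helper methods)
for f in table.list_files(max_results=10):
    f.download("./data/")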


Common elements

Parameters

When creating a new variable, you'll use a variety of inputs to specify the parameters required to fully define a given method. Each parameter is one of the following types:

variable

Refers to a Redivis variable, often limited to a specific subset of types. When using variable parameters in a given transform, you'll be able to select variables from the source table, variables in tables referenced by joins, or new variables created upstream of the parameter.

literal

Refers to a string, integer, boolean, or floating point value. When using literal parameters, you'll be able to type any constant value (e.g., false, 1.0, 'test'), or use a value list to reference the same set of values anywhere in your workflow. If using a literal parameter alongside another parameter, e.g., in a filter comparison, you may have to match the type of the literal with that parameter.

enum

Refers to a set of distinct values, usually of a homogenous type. For example, if a required parameter used a "timezone" enum with 3 options (PST, MST, or EST), you would have to select one of the three options (e.g., PST) as the value of that parameter.

boolean

Refers to values true or false.

Null

In many places throughout the transform you will have an option in the system menu to set a cell value(s) to NULL. Conceptually this means that the cell is empty and contains no value. If you look at the Cells view of a table, you will see NULL values shown in a grey color to indicate that these cells are empty.

Note that this is different from a cell containing a string with the characters NULL. In that case the cell does have contents and will be treated as such.

You will see throughout the transform references to how nulls are included or excluded for steps (such as the Order step) and in calculating summary statistics.

Format elements for dates and time methods

Some new variable methods (PARSE... and FORMAT...) allow for manipulation of Date, Time, or DateTime variable types. To do so, you'll need to define how the data is formatted using format elements.

For example, to work with dates in a mm/dd/yy format (common in the U.S.; e.g. 03/22/89), we would specify the format string %m/%d/%y. For a DateTime displayed as Mon Oct 17 2016 17:32:56, we would specify %a %b %d %Y %H:%M:%S.
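
For intuition, many of these elements behave like the strftime/strptime-style codes found in other tools; here is a quick illustration using Python's standard library (Python is purely illustrative here; in a transform you simply supply the format string to the PARSE.../FORMAT... method):

from datetime import datetime

# Parse a U.S.-style date using the %m/%d/%y format string
d = datetime.strptime("03/22/89", "%m/%d/%y")                              # 1989-03-22
dt = datetime.strptime("Mon Oct 17 2016 17:32:56", "%a %b %d %Y %H:%M:%S")

# Format back out with a different set of elements
print(d.strftime("%Y-%m-%d"))   # "1989-03-22"
print(dt.strftime("%H:%M"))     # "17:32"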

Note that different format elements are used for the Retype step (CAST method) which are detailed elsewhere.

Below is a complete list of format elements and descriptions:

Format element

Description

%A

The full weekday name.

%a

The abbreviated weekday name.

%B

The full month name.

%b or %h

The abbreviated month name.

%C

The century (a year divided by 100 and truncated to an integer) as a decimal number (00-99).

%c

The date and time representation.

%D

The date in the format %m/%d/%y.

%d

The day of the month as a decimal number (01-31).

%e

The day of month as a decimal number (1-31); single digits are preceded by a space.

%F

The date in the format %Y-%m-%d.

%G

The ISO 8601 year with century as a decimal number. Each ISO year begins on the Monday before the first Thursday of the Gregorian calendar year. Note that %G and %Y may produce different results near Gregorian year boundaries, where the Gregorian year and ISO year can diverge.

%g

The ISO 8601 year without century as a decimal number (00-99). Each ISO year begins on the Monday before the first Thursday of the Gregorian calendar year. Note that %g and %y may produce different results near Gregorian year boundaries, where the Gregorian year and ISO year can diverge.

%H

The hour (24-hour clock) as a decimal number (00-23).

%I

The hour (12-hour clock) as a decimal number (01-12).

%j

The day of the year as a decimal number (001-366).

%k

The hour (24-hour clock) as a decimal number (0-23); single digits are preceded by a space.

%l

The hour (12-hour clock) as a decimal number (1-12); single digits are preceded by a space.

%M

The minute as a decimal number (00-59).

%m

The month as a decimal number (01-12).

%n

A newline character.

%P

Either am or pm.

%p

Either AM or PM.

%R

The time in the format %H:%M.

%r

The 12-hour clock time using AM/PM notation.

%S

The second as a decimal number (00-60).

%s

The number of seconds since 1970-01-01 00:00:00. Always overrides all other format elements, independent of where %s appears in the string. If multiple %s elements appear, then the last one takes precedence.

%T

The time in the format %H:%M:%S.

%t

A tab character.

%U

The week number of the year (Sunday as the first day of the week) as a decimal number (00-53).

%u

The weekday (Monday as the first day of the week) as a decimal number (1-7).

%V

The week number of the year (Monday as the first day of the week) as a decimal number (01-53). If the week containing January 1 has four or more days in the new year, then it is week 1; otherwise it is week 53 of the previous year, and the next week is week 1.

%W

The week number of the year (Monday as the first day of the week) as a decimal number (00-53).

%w

The weekday (Sunday as the first day of the week) as a decimal number (0-6).

%X

The time representation in HH:MM:SS format.

%x

The date representation in MM/DD/YY format.

%Y

The year with century as a decimal number.

%y

The year without century as a decimal number (00-99), with an optional leading zero. Can be mixed with %C. If %C is not specified, years 00-68 are 2000s, while years 69-99 are 1900s.

%%

A single % character.

%E#S

Seconds with # digits of fractional precision.

%E*S

Seconds with full fractional precision (a literal '*').

%E4Y

Four-character years (0001 ... 9999). Note that %Y produces as many characters as it takes to fully render the year.

Accessor elements for JSON methods

Some new variable methods (JSON extract..., JSON scalar, etc) allow for accessing subsets of the data contained within JSON-formatted strings. To do so, you'll need to define how the data is accessed using a JSONPath element.

For example, to access the firstName attribute within a JSON-formatted string { "firstName": "John", "lastName": "Doe" }, you'd specify a JSONPath element $.firstName.

More information is available in BigQuery's JSONPath format documentation.
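For example, the accessor above picks out the same value as a plain dictionary lookup after parsing the string (a quick Python illustration, not the transform's implementation):

import json

payload = '{ "firstName": "John", "lastName": "Doe" }'

# The JSONPath element $.firstName selects the top-level firstName attribute
value = json.loads(payload)["firstName"]   # "John"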

us_df = df[
    (df.latitude >= 24.396308)
    & (df.latitude <= 49.384358)
    & (df.longitude >= -124.848974)
    & (df.longitude <= -66.885444)
# Compute avg precip for each station between 1980-2010
df_1980_2010_average = us_df[(us_df.year > 1980) & (us_df.year < 2010)].drop(columns=['year'])
df_1980_2010_average = df_1980_2010_average.groupby(['id']).mean()
df_1980_2010_average.rename(columns={"annual_precip": "_1980_2010_avg"}, inplace=True)

# Compute avg precip for each station between 1990-2020
df_1990_2020_average = us_df[(us_df.year > 1990) & (us_df.year < 2020)].drop(columns=['year'])
df_1990_2020_average = df_1990_2020_average.groupby(['id']).mean()
df_1990_2020_average.rename(columns={"annual_precip": "_1990_2020_avg"}, inplace=True)

diff_df = df_1980_2010_average.join(df_1990_2020_average["_1990_2020_avg"])
diff_df['deviation'] = (diff_df._1990_2020_avg - diff_df._1980_2010_avg) / diff_df._1980_2010_avg * 100
diff_df = diff_df[(diff_df.deviation < 25) & (diff_df.deviation > -25)]

diff_df = diff_df.dropna()
diff_df
import plotly.figure_factory as ff

import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
px.set_mapbox_access_token("pk.eyJ1IjoiaW1hdGhld3MiLCJhIjoiY2thdnl2cGVsMGtldTJ6cGl3c2tvM2NweSJ9.TXtG4gARAf4bUbnPVxk6uA")

fig = ff.create_hexbin_mapbox(
    data_frame=diff_df, lat="latitude", lon="longitude",
    color="deviation",
    agg_func=np.mean,
    title="% Change precipitation, 1981-2010 vs 1991-2020",
    range_color=[-15,15],
    nx_hexagon=50, opacity=0.4, labels={"color": "Percent change"}, color_continuous_scale="Icefire_r",
)

fig.update_layout(margin=dict(b=0, t=0, l=0, r=0))
fig.show("nteract")
This workflow is available on Redivis: it brings in the GHCN Daily Weather Data dataset, processes it with Filter, Create variables, Aggregate, and Join steps, and loads the result into a notebook as a dataframe for plotting with Plotly. From the workflow page you can rerun it, create output tables, fork it, or export results.

Step: Pivot

Overview

The Pivot step rotates data values into new variables using aggregation.

Example starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 *---------+-------+---------+------------*/

Example output data:

Pivot on the test variable.

/*---------+------+---------+-------*
 | student | quiz | midterm | final |
 +---------+------+---------+-------+
 | jane    | 83   | 74      | 77    |
 | pat     | 35   | 62      | 59    |
 | sam     | 89   | 93      | NULL  |
 *---------+------+---------+-------*/

Step structure

  • There is one pivot block where we will define the pivot variable and values.

  • There are one or more aggregation blocks where we will define how the table will reshape and what data populates the newly created variables.

    • There must always be at least one aggregate block.

    • The total number of new variables in your output table will be the number of Pivot values defined multiplied by the number of aggregation blocks.

Input field definitions

Pivot block:

Field
Definition

Pivot variable

The variable containing the values that will become new variables.

Variables to collapse on

All variables you want to include in your output, except for variables defined elsewhere in the pivot operation (pivot variable, variable to aggregate on).

Pivot value

A value from your pivot variable which will become a new variable in your output table. You can add multiple values. You do not need to include every value of this variable.

New variable name

An option to rename any of the new variables created in the Pivot value field. Leaving this field blank will use the pivot value as the new variable's name.

Aggregation block:

Field
Definition

Alias

A value that will be prefixed to your new variable names. If you are aggregating one variable this field is optional, but if you are aggregating multiple then you'll need to specify a value for each one.

Aggregation method

How the variable to aggregate will be summarized (e.g. SUM, COUNT).

Variable to aggregate

Which variable's data will populate the new variables you are creating in the pivot value field.

Examples

Example 1: Basic pivot

Let's say we've recorded data about tests given and student scores and we want to pivot in order to see information sorted by student.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | quiz    | 74    | jane    | 2020-04-15 |
 | quiz    | 83    | sam     | 2020-04-15 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 91    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Pivot variable: Since we want to create new variables based on values in the student variable, we select that as our Pivot variable.

  • Variables to collapse on: We want to include all variables here that will form the final shape of our pivoted table and that are not already defined as our pivot variable or variable to aggregate. We include test and date in this example.

  • Pivot value: We need to choose one or more values from our selected pivot variable (student) to become new variables in our output table. This does not have to include every value in this variable, even though we use all of them in this example.

  • New variable name: We don't want to rename any of these new variables, so we can leave this blank.

  • Alias (optional): Since we only have one variable to aggregate, we can leave this blank.

  • Aggregation method: We need to choose how our variable to aggregate (score) will be aggregated. In this example data we don't have duplicated values so what we choose here won't matter. So we choose SUM to validate the process.

  • Variable to aggregate: This is the variable (score) that will be removed from this table, and its values will be redistributed to our newly created variables (jane, pat, and sam) with the aggregation method we selected handling any duplicate values (in this example, SUMming them).

Output data:

/*---------+------------+------+------+------*
 | test    | date       | jane | pat  | sam  |
 +---------+------------+------+------+------+
 | quiz    | 2020-04-01 | 83   | 35   | 89   |
 | quiz    | 2020-04-15 | 74   | NULL | 83   |
 | midterm | 2020-05-01 | 74   | 62   | 93   |
 | final   | 2020-06-01 | 77   | NULL | 91   |
 *---------+------------+------+------+------*/
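
For intuition, here is a hedged pandas sketch that reproduces this output. Redivis executes the Pivot step as a SQL query, not pandas; the sketch only illustrates how the input fields map onto the reshaping:

import pandas as pd

scores = pd.DataFrame({
    "test":    ["quiz"] * 5 + ["midterm"] * 3 + ["final"] * 2,
    "score":   [83, 35, 89, 74, 83, 74, 62, 93, 77, 91],
    "student": ["jane", "pat", "sam", "jane", "sam", "jane", "pat", "sam", "jane", "sam"],
    "date":    ["2020-04-01"] * 3 + ["2020-04-15"] * 2 + ["2020-05-01"] * 3 + ["2020-06-01"] * 2,
})

pivoted = scores.pivot_table(
    index=["test", "date"],   # variables to collapse on
    columns="student",        # pivot variable
    values="score",           # variable to aggregate
    aggfunc="sum",            # aggregation method
).reset_index()
print(pivoted)

Swapping aggfunc to "mean" and dropping date from the index mirrors Example 2 below, where duplicate quiz rows collapse into an average.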

Example 2: Pivot with collapse

Continuing this example, let's say we want to average scores from the two quizzes. We can collapse the table by not including the date variable. Without that variable there would be two identical values in test (quiz) which will be collapsed into one.

Starting table:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | quiz    | 74    | jane    | 2020-04-15 |
 | quiz    | 83    | sam     | 2020-04-15 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 91    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

In this case it does matter what we select for Aggregation method, since the duplicate values for quiz will get aggregated. We select Average here.

Pivoted output table:

/*---------+------+------+------*
 | test    | jane | pat  | sam  |
 +---------+------+------+------+
 | quiz    | 78.5 | 35   | 86   |
 | midterm | 74   | 62   | 93   |
 | final   | 77   | NULL | 91   |
 *---------+------+------+------*/

Example 3: Multiple aggregation variables

We can add another variable aggregation block to gather more information about the variables being aggregated. Continuing this example, let's say we not only want to average the quiz scores, but count the number of averaged quizzes.

Starting table:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | quiz    | 74    | jane    | 2020-04-15 |
 | quiz    | 83    | sam     | 2020-04-15 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 91    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

Note that the Alias field is no longer optional since we have two aggregation blocks.

We make sure to select score for our Variable to count since we specifically want to know how many scores were included in our average. Since we have some null scores, selecting a different variable here might give us a different result.

Pivoted output table:

/*---------+---------------+--------------------+--------------+----------------*
 | test    | jane_scoreavg | jane_scorecount    | pat_scoreavg | pat_scorecount |
 +---------+---------------+--------------------+--------------+----------------+
 | quiz    | 78.5          | 2                  | 35           | 1              |
 | midterm | 74            | 1                  | 62           | 1              |
 | final   | 77            | 1                  | NULL         | 1              |
 *---------+---------------+--------------------+--------------+----------------*/

The jane_scoreavg variable contains data from the scoreavg aggregation block's method of Average for the values of student Jane, while the jane_scorecount variable contains data from the scorecount aggregation block.

Dataset concepts

On Redivis, datasets are the fundamental container for persistent data storage. They are created by users or organizations, and can be distributed via simple sharing or more complex access configurations.

All data in datasets are stored within Tables (including geospatial features and unstructured files). Dataset editors can upload data to the dataset, modify metadata, and then release a new version for use in researchers' workflows.

A sample dataset landing page: https://redivis.com/datasets/7br5-41440fjzk

The dataset page

The dataset page is the user-facing view of the dataset. This page combines various narrative and provenance information alongside structured metadata, the actual data content, and usage information.

At the top of the page is the dataset title, as well as a version indicator and bookmark button. Clicking the version indicator will open the version history and allow you to change the dataset's version; clicking the bookmark icon will add or remove the dataset from your dataset library.

To the top-right are two buttons: View / Apply for access and Analyze in workflow. The first will allow you to view the access rules for the dataset and apply as needed, whereas the latter will allow you to add the dataset to a new or existing workflow for analysis.

The rest of the page is organized into four tabs: Overview, Tables, Files, and Usage. If you are a dataset editor, you will also see an Edit dataset link to the right of the usage tab.

Overview

The Overview tab contains the dataset's documentation, provenance information, and top level metadata. The following information can be populated on a dataset:

Abstract

The abstract is limited to 256 characters and will show up in previews and search results for the dataset. This should be a concise, high-level summary of this dataset.

Provenance

This section is intended to display information about where this dataset came from and how it came to be in its current form. Redivis will auto-populate fields where possible but you can add additional information or override it.

Provenance

→ Creator

This field should be the individual(s) or organization(s) responsible for creating the content of this dataset. This will be linked with the appropriate ORCID iD or ROR if the individual or organization has attached them to their Redivis account. You can also include individuals and organizations that don't have Redivis accounts and include their identifier. If you have multiple creators you can edit the order they are shown, which will also be reflected in the citation.

Provenance

→ Contributor

This field is to attribute the work that different individuals have done to get the dataset into its current state. Redivis will automatically add anyone who edits the dataset to this field. If they have attached their ORCID iD to their Redivis account, that linkage will be shown as well. The contributor type options are aligned with DataCite standards.

Provenance

→ DOI

If your dataset belongs to an organization, you can issue a DOI (Digital Object Identifier) by configuring your organization to issue DOIs. Any DOI issued for this dataset will remain in a draft status for seven days to allow for version unrelease. After seven days the DOI will become permanent.

Provenance

→ Citation

This section shows the automatically generated citation for this dataset in your chosen format. This can be copied or downloaded for use elsewhere.

Changes made to the "Creators" field will be reflected in this citation. Any DOI issued for this dataset will automatically be included in this citation.

Provenance

→ Citation instructions

If this dataset must be cited in a specific way, you can make that clear in this field. Citation instructions will be included beneath the citation wherever it is shown.

Provenance

→ Related identifiers

This section is for linking other identifiers (URLs or DOIs) that are relevant to this dataset. The related identifier type options are aligned with DataCite standards. You might want to include identifiers for DMPs, papers, and more.

Provenance

→ Bibliography

You can launch a bibliography which displays the citation of this dataset and every related identifier listed above.

Methodology and Usage notes

These documentation sections are intended for more extensive information and can contain formatted text and images. These will be indexed for the Redivis dataset search.

Supporting files

Files of any type and up to 100MB can be uploaded to the dataset page where anyone with access can download them. These should not contain any data for this dataset, as access to them is managed separately.

Links

Links can be added with display names to direct someone to another URL with more information.

License

This is where you can add the license information about your dataset's redistribution policies. If this data is governed by a common redistribution license you can select it here from the menu of standard licenses. If you want to reference a license that isn't listed here you can include the link, or upload a custom license. This will be displayed on the dataset front page to let others know how they can use your data. This information will be included on the dataset's DOI.

Do you think a common license is missing? Contact us to let us know what you'd like to see here.

Funding

If this dataset was funded by an institution you'd like to recognize, this is the section where you can include information about funder(s). You'll need the funding organization's name and ROR, as well as an award number if applicable. You can add multiple funders to each dataset. This information will be included on the dataset's DOI.

Contact

This section should be used to let someone viewing this dataset know how to get in touch if there is any issue or question.

Custom sections

You can create documentation sections with their own titles and assign them custom access levels.

By default, all dataset documentation is visible to anyone with overview access to the dataset. However, there may be some content in the documentation that is sensitive, for example information about named variables that would require metadata access.

To protect this information you can create a custom documentation section with a more restrictive access level. Users without the appropriate level of access will only see a placeholder for that section of the documentation.

Tags

In addition to documentation, you may add up to 25 tags to your dataset, which will help researchers discover and understand the dataset.

Other metadata

Additionally, information about the dataset's size and temporal range will be automatically computed from the metadata on its tables. Additional table documentation, as well as the variable metadata, will be indexed and surfaced as part of the dataset discovery process.

Tables

The Tables tab contains a list of all tables within the current version of the dataset. All datasets that contain data will have at least one table, as tables are used to represent all data types stored on Redivis, including geospatial and unstructured data. Learn more about dataset tables.

Files

The Files tab contains a list of all the unstructured files that have been uploaded to the dataset. These files will also be mirrored in corresponding file index tables, though this interface can provide a more familiar view as a directory structure. Learn more about dataset files.

Usage

The information on the Usage tab of the dataset page is automatically generated based on how researchers are working with the dataset on Redivis, including:

  • Views: How many times this dataset page was loaded. These are only counted for logged-in users, and a user is only recounted after 1hr of not viewing the dataset.

  • Workflows: How many workflows include this dataset.

  • Featured workflows: How many workflows featured by organizations include this dataset.

  • Variable usage: This list shows all variables across all tables in the dataset, sorted by their popularity. This popularity is calculated based on the number of times that variable is referenced when this table is used in workflows.

Creating a dataset

All datasets on Redivis are owned by either an individual user or an organization.

To create a dataset owned by you, navigate to your workspace, and under Datasets, click the + New dataset button. To create a dataset owned by an organization that you administer, navigate to the organization's administrator panel, and under Datasets, choose to create a new dataset.

For step-by-step guides for creating and editing datasets, see the accompanying Create & manage datasets guide

Editing datasets

To edit a dataset, navigate to your workspace (or organization administrator panel), and click on the dataset that you would like to edit. You will also see an Edit dataset link at the top of the page for all datasets that you are an editor of.

From within the dataset editor, you will be able to update data, metadata, and release a new version of the dataset.

On the Overview tab, you can populate the various metadata fields of the dataset. Note that any fields that you leave blank will not be shown to users on the dataset's page.

On the Table and Files tabs, you can upload the various data contents associated with the dataset. Learn more about uploading data here.

Assigning editors

If a dataset is owned by a user, that user has full ability to edit and modify the dataset. If a dataset is owned by an organization, all administrators of the organization will have such rights.

You can also add other editors to the dataset by selecting Configure access at the top right of the dataset, and adding the specific individuals as editors. Editors will be able to upload data, modify metadata, and release new versions of the dataset, but they cannot modify any of the dataset's access rules or change the dataset's published status.

Dataset editors will be able to find this dataset in their workspace on the Datasets tab.

Dataset settings

On the dataset editor page, you will see an additional Settings tab, where various options for the dataset can be configured:

Dataset name

You can rename the dataset here (or by clicking on the dataset title within the dataset editor).

Administrator notes

These notes are only visible to editors of the dataset, and may provide helpful documentation for administrative processes around maintaining the dataset.

Published status

Control whether the dataset is published. If a dataset is unpublished, only editors will have access to the dataset, though you can configure access for other users such that they'll gain access when the dataset becomes published.

All datasets are initially unpublished when created, and become published when the first version is released. If you ever need to quickly remove a dataset from circulation, you can unpublish the dataset.

Featured status

[Organization datasets only]. Whether the dataset should be featured on your organization's landing page.

DOI status

[Organization datasets only]. If your organization has configured a DOI provider, you can enable DOI issuance for your organization's datasets. Based on your organization's settings, this will either be default-enabled or default-disabled for new datasets.

If you enable DOI issuance on an existing dataset, DOIs will be back-issued for all non-deleted versions of the dataset. These DOIs will become permanent after 7 days.

If you disable DOI issuance on an existing dataset, any draft DOIs (less than 7 days old, or on an unreleased version) will be deleted. Any future versions will not be issued a DOI.

Learn more about dataset DOIs.

Transferring datasets

Within the dataset settings, you can transfer a dataset to a new owner. The following transfers are currently supported:

  • User -> Organization: the individual performing the transfer must be the owner or an editor of the dataset, and an administrator of the receiving organization.

  • Organization -> Organization: the individual performing the transfer must be an administrator of both organizations.

Access & visibility

When first creating a dataset, you will be able to specify certain access rules for the dataset. However, these access rules will only take effect once the dataset is published, and can be modified before the dataset's publication.

While in the unpublished state, only dataset editors will be able to view the dataset, and the dataset won't be accessible to other users nor will it show up in any search results. Once a dataset has been published, visibility to the dataset will be governed by its access configuration. Datasets can be unpublished at any time to immediately cut off access for all non-editors.

Once published, and pursuant to the dataset's access configuration, all other users on Redivis will have one of the following access levels to the dataset: none, overview, metadata, sample, data. Learn more about dataset access levels.

About dataset DOIs

When enabled, DOIs are issued for all released versions of a dataset. For datasets with more than one version, a "canonical" DOI for the dataset will be issued as well.

DOI lifecycle

When issued, DOIs remain in a draft state for 7 days, and will be deleted if the dataset's DOI configuration is disabled while still a draft. After 7 days, DOIs become permanent (this 7 day counter only begins once a version is released; unreleased versions always have a draft DOI).

If a dataset is not publicly visible, its DOI will be registered, but it won't be findable – meaning that web crawlers won't index the DOI, and it won't be part of search results on platforms like DataCite. In such a case, the dataset's metadata will be redacted as well (see below).

When a dataset or version is deleted, any draft DOIs will be deleted. All other DOIs will be moved to the registered (non-findable) state, and they will continue to resolve to an appropriate "tombstone" page for the dataset.

DOI metadata

DOIs are more than just persistent identifiers — they are accompanied by rich metadata that allows other tools to surface your datasets and link them to the broader body of academic work. Every field on the dataset page maps to corresponding entries in the DataCite metadata; you can view these metadata by clicking Metadata -> DataCite on the right of the "Overview" section of the dataset page. Redivis will automatically sync the DataCite metadata whenever your dataset is updated.

A note on metadata and access:

Redivis will only ever publish metadata to DataCite that is publicly visible. This means that if your dataset is completely hidden from public view, no metadata will be propagated to DataCite (including the dataset's name). Instead, a placeholder mentioning that "This dataset has been hidden from public view" will be used in its place, and only the name of the dataset owner and the dataset's contact information (if present) will be published.

While this situation isn't ideal, it allows DOIs to still be used for disambiguation when discussing the dataset, and provides a potential pathway for future investigators to contact the data owner.

Note that metadata associated with your dataset will automatically be updated if you change the public access level of your dataset.

DOIs and dataset versions

Redivis creates a DOI for every version of a dataset, allowing for authoritative references and reproducibility of data used in researchers' work. Once a dataset has more than one version, Redivis will also issue a DOI for the dataset as a whole, whose metadata will reflect the latest version of the dataset. All version DOIs will point to this canonical DOI, and also link to each other (as previous / next versions) in the DataCite metadata. This structure improves discoverability and disambiguation when referencing versioned datasets.

Discontinue or prevent DOIs

If you've uploaded your DOI issuing credentials in the Settings tab, all new datasets will by default be issued a DOI upon publishing and on every version release. If you would like to disable issuing DOIs for a specific dataset you can do so in the Settings tab of that dataset, where you'll see a switch you can turn off to stop issuing DOIs. If the dataset is already published, the DOIs already issued for the dataset and its published versions will not be removed. If you turn this switch back on at any point in the future, DOIs will be back-issued for all versions.


Step: Filter

Overview

The Filter step will select for rows that meet a certain set of conditions.

Example starting data:

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 | sam     | 74     |
 | pat     | 62     |
 *---------+--------*/

Example output data:

Filter out rows with scores less than 70.

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | sam     | 74     |
 *---------+--------*/

Step structure:

Basic state

  • A filter step will be made up of one or more filter blocks with a completed condition.

  • When there are multiple filter blocks in a step, conditions in all blocks must be met for rows to be kept.

  • If you have a more complex filter statement that is dependent on multiple nested conditions you can press the + button to expand the filter block.

Expanded state

  • When multiple conditions are needed in a block, you must specify how they relate to each other (AND vs OR)

  • Any nested conditions are resolved before higher level conditions.

Field descriptions

Variable(s)

The variable whose values will be evaluated.

Depending on the operator selected, it may contain multiple variables. Use wildcard characters (e.g. *) for bulk selection.

[Operator]

How the two sides of the condition will be evaluated.

Value(s) or Variable(s)

The variable or value which the previously chosen Variable(s) will be evaluated against. Depending on the operator selected, it may contain multiple variables, values, or a value list. Use wildcard characters (e.g. *) for bulk selection.

Examples

Example 1: Basic filter

Let's say we want to reduce our table to contain only information about results from the final test.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Variable(s): The variable test contains the value (final) that we want to evaluate on, so we select it here.

  • [Operator]: We want fields that exactly match, so we choose =.

  • Value(s) or variable(s): We want to only keep rows where final is present, so we put that here.

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Example 2: Multiple conditions

Let's say we don't just want values from the final but only those from the final with a score above 60.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • We input the data as in the above example, but since we now have two conditions, we have to decide how they relate to each other. In this case we want data that meets all the conditions, so we select All conditions must be satisfied (AND).

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Example 3: Nested conditions

Let's say we want to keep all data from the final with a score greater than 60, or any scores above 85.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • We want to keep all rows where scores are over 85 OR are from the final and over 60. So we set up the final and over 60 conditions under an AND umbrella, and nest that under the OR umbrella alongside our condition about scores over 85.

  • When executing, the nested conditions (scores from the final over 60) will be evaluated first as true or false. Then the higher-level condition (scores over 85 OR (scores from the final and over 60)) will be evaluated. Any rows that meet this higher-level condition will be kept in the output, and any that do not will be discarded.

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/
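
For intuition, the same nested logic expressed as a hedged pandas sketch (Redivis runs this as SQL; the sketch only mirrors the boolean structure):

import pandas as pd

grades = pd.DataFrame({
    "test":    ["quiz", "quiz", "quiz", "midterm", "midterm", "midterm", "final", "final", "final"],
    "score":   [83, 35, 89, 74, 62, 93, 77, 59, 92],
    "student": ["jane", "pat", "sam", "jane", "pat", "sam", "jane", "pat", "sam"],
})

# Keep rows where score > 85 OR (test == "final" AND score > 60);
# the nested AND resolves before the outer OR.
kept = grades[(grades.score > 85) | ((grades.test == "final") & (grades.score > 60))]
print(kept)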

Example 4: Comparing variables and value lists

Let's say we only want to keep rows where scores are greater than the average score, and that are for our selected students. Right now our selected students are Jane and Pat but we know that might change in the future.

Starting data:

/*---------+-------+---------+------------+---------------*
 | test    | score | student | date       | score_average |
 +---------+-------+---------+------------+---------------+
 | quiz    | 83    | jane    | 2020-04-01 | 69            |
 | quiz    | 35    | pat     | 2020-04-01 | 69            |
 | quiz    | 89    | sam     | 2020-04-01 | 69            |
 | midterm | 74    | jane    | 2020-05-01 | 76.3333       |
 | midterm | 62    | pat     | 2020-05-01 | 76.3333       |
 | midterm | 93    | sam     | 2020-05-01 | 76.3333       |
 | final   | 77    | jane    | 2020-06-01 | 76            |
 | final   | 59    | pat     | 2020-06-01 | 76            |
 | final   | 92    | sam     | 2020-06-01 | 76            |
 *---------+-------+---------+------------+---------------*/

Input fields:

  • First, we create a value list called selected_students with the values jane and pat on it. Then we can select it to compare against student.

    • With multiple inputs on either side of a condition, the condition is evaluated as an OR. So this condition will evaluate to true if the value in student equals any value on the list (jane or pat).

    • While we could have input jane and pat into the right side of this condition and gotten the same result, using a list makes it easy in the future to change our selected students centrally and update it everywhere the list is used.

  • For our second condition, we want to evaluate how two values in the same row compare to each other.

    • When executed, this condition will look at each row to check if the value in score is greater than the value in score_average. If so, then the condition is met.

Output data:

/*---------+-------+---------+------------+---------------*
 | test    | score | student | date       | score_average |
 +---------+-------+---------+------------+---------------+
 | quiz    | 83    | jane    | 2020-04-01 | 69            |
 | midterm | 74    | jane    | 2020-05-01 | 76.3333       |
 | final   | 77    | jane    | 2020-06-01 | 76            |
 *---------+-------+---------+------------+---------------*/
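
As a hedged pandas analogue of this example, the value list behaves like membership in a Python list, and the second condition is a row-wise comparison of two columns:

import pandas as pd

df = pd.DataFrame({
    "test":          ["quiz", "quiz", "quiz", "final", "final", "final"],
    "score":         [83, 35, 89, 77, 59, 92],
    "student":       ["jane", "pat", "sam", "jane", "pat", "sam"],
    "score_average": [69, 69, 69, 76, 76, 76],
})

selected_students = ["jane", "pat"]          # the value list
kept = df[
    df["student"].isin(selected_students)    # student equals any value on the list
    & (df["score"] > df["score_average"])    # compare two variables within each row
]
print(kept)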

Reference: Comparison statements

A comparison always evaluates to either TRUE or FALSE.

Comparisons are made up of one or more comparison rows. A comparison row always evaluates to TRUE or FALSE; multiple rows can be nested together with a logical AND/OR.

A comparison row is made up of three components:

Left expression

The left expression can contain variable parameters from the source table, joined tables, as well as any newly created variables. All referenced new variables must be "upstream" from the current comparison, with the exception of joins, which may reference new variables that are constructed by any variables upstream of that join.

Depending on the operator selected, the left expression may contain multiple variables. In this case, each left expression will be evaluated against the right expression(s), and logically joined via an OR. If you want to take the logical AND of multiple variables, create a separate comparison row for each variable.

Multiple left hand values are only supported for = and like operators.

Where multiple variables can be entered in a comparison, you can use * in the interface to select all matching results.

For example, typing DIAG* and pressing enter will add all variables beginning with "DIAG" to this field.

Operator

Redivis supports the following operators:

=, !=

Checks if any of the value(s) in the left expression are (not) equal to any values in the right expression. NULLs are treated as equivalent (NULL == NULL -> TRUE and NULL != NULL -> FALSE).

>, >=, <, <=

Checks if the value in the left expression is less than, greater than, etc. the right expression. String comparisons are ordered lexicographically; other data types are compared based on their numeric / temporal order. Comparisons between NULL values will always be false.

like / !like

Checks if the string(s) in the left hand expression matches specified pattern(s) in the right hand expression. The pattern may contain the following characters:

  • A percent sign "%" matches any number of characters

  • An underscore "_" matches a single character

  • You can escape "\", "_", or "%" using one backslash. For example, "\%"
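
The pattern semantics can be illustrated with a small, hedged Python sketch (not Redivis's implementation; backslash escapes are ignored here for brevity):

import re

def sql_like(value: str, pattern: str) -> bool:
    # "%" matches any run of characters, "_" matches exactly one character
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value) is not None

print(sql_like("DIAG01", "DIAG%"))   # True
print(sql_like("DIAG01", "DIAG__"))  # True  (exactly two trailing characters)
print(sql_like("DIAG", "DIAG_"))     # False (no character for the underscore to match)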

Right expression

The right expression can contain any variables allowed in the left expression, as well as literal values and lists. The comparison row will evaluate to TRUE when the left hand expression matches any of the right hand expressions, except for the != and !like comparators, where the comparison will evaluate to true if all values are not equal to / not like the left expression.

To match against a null datum (empty cell), you must specify the special literal value NULL here.

Tabular data

Overview

In order to create new tables, or update existing tables with new data, you will need to ingest a valid tabular data file from a supported data source. Redivis supports numerous tabular data formats, with robust error handling. You can also perform scripted imports via the API.

Tables on Redivis can be made up of one or more tabular file uploads. This use case applies when data with the same general schema is broken up across files; for example, a dataset where there is a separate file for each year. In general, it's best to combine such files into a single table, as it is easier for researchers to query a single "tall" table than multiple tables (and Redivis's high-performance query engine keeps these queries quick, even at terabyte scale).

Supported tabular file types

Redivis will try to detect the file type based on the file extension, though you can manually specify the type as needed.

Type
Description
Notes

Uploading compressed (gzipped) files:

Generally, you should upload uncompressed data files to Redivis, as uncompressed files can be read in parallel and thus upload substantially faster.

If you prefer to store your source data in a compressed format, Avro, Parquet, and ORC are the preferred data formats, as these support parallelized compressed data ingestion at the row level.

Redivis will decompress text-delimited files, though the data ingest process may be substantially slower. If your file is compressed, it must have the .gz file extension if you're uploading locally (e.g., my_data.csv.gz) or have its header set to Content-Encoding: gzip if served from a URL or cloud storage location.

Quotas & limits

Most upload types are limited to 5TB per upload. Stata, SPSS, XLS, and Shapefile ZIP directories are limited to 100GB.

Full documentation of the limits for upload file size, max variables, and other parameters is available in the quotas & limits reference.

Working with delimited files

A text-delimited file is a file that uses a specific character (the delimiter) to separate columns, with newlines separating rows.

Delimited file requirements

  • Must be UTF-8 encoded (ASCII is a valid subset of UTF-8)

  • Quote characters in cells must be properly escaped (see the sketch after this list). For example, if a cell contains the content: Jane said, "Why hasn't this been figured out by now?" it must be encoded as: "Jane said, ""Why hasn't this been figured out by now?"""

  • The quote character must be used to escape the quote character. For example, the sequence \" is not valid for an escaped quote; it must be ""

  • Empty strings will be converted to null values
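
The quote-escaping requirement above can be produced automatically; for example, Python's csv module doubles quote characters by default (a minimal sketch with a hypothetical file name):

import csv

with open("example.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)   # doublequote=True is the default
    writer.writerow(["quote_text"])
    writer.writerow(['Jane said, "Why hasn\'t this been figured out by now?"'])

# example.csv now contains:
# quote_text
# "Jane said, ""Why hasn't this been figured out by now?"""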

Advanced options for delimited files

Has header row

Specifies whether the first row is a header containing the variable names. This will cause data to be read beginning on the 2nd row. If you don't provide a header in your file, variables will be automatically created as var1, var2, var3, etc...

Skip invalid records

By default, an upload will fail if an invalid record is encountered. This includes a record that has a mismatched number of columns, or is otherwise not parsable. If this box is checked, invalid records will be skipped instead, and the number of skipped records will be displayed on each upload once it has been imported.

Allow jagged rows

Whether to allow rows that contain fewer or more columns than the first row of your file. It is recommended to leave this option unchecked, as jagged rows are generally a sign of a parsing error that should be remedied by changing other options or fixing the file.

Cells contain line breaks

Whether newlines exist within specific data cells (e.g., paragraphs of text). If set to Auto, Redivis will determine the value based on analysis of the beginning of the file.

It is best to only set this value to "Yes" if you know your data contain line breaks, as this will slow down the import and may cause incorrect error reporting.

Delimiter

The delimiter will be auto-inferred based upon an analysis of the file being uploaded. In rare cases, this inference may fail; you can specify the delimiter to override this inference.

Quote character

Specify the character used to escape delimiters. Generally ", though some files may not have a quote character (in which case, they must not include the delimiter within any cells).

Escape character

Cells containing a quote character must have that character escaped.

Typically, the escape sequence character is the same as the quote character, but some files may use a different value, such as a backslash (\).

Null markers

A list of up to 10 values (case-sensitive) that should be interpreted as NULL on import. For example, if your file contains the value NA to represent nulls, you can specify this value here such that these values are read as nulls, rather than the string literal "NA".

Variable names and types

Naming variables

Variable names are automatically inferred from the source data. They can only contain alphanumeric or underscore characters, and must start with a letter or underscore. Any invalid characters will be replaced with an underscore (_).

Variable names must be unique within the table. If the same variable is found more than once in any given file, it will automatically have a counter appended to it (e.g., "variable_2").

The max number of characters for a variable name is 60. Any names with more characters will be truncated.

Variable type inference

All values of a variable must be compatible with its type. Redivis will automatically choose the most specific type for a variable, with string being the default type.

Please note the following rules:

  • If all values of a variable are null, its type will be string

  • Numeric values with leading zeros will be stored as string in order to preserve the leading zeros (e.g., 000583)

  • Data stored with decimal values will be stored as a float, even if that value is a valid integer (e.g., 1.0).

  • Temporal data types should be formatted using the canonical types below. Redivis will attempt to parse other common date(time) formats, though this will only be successful when the format is unambiguous and internally consistent.

    • Date: YYYY-[M]M-[D]D

    • DateTime: YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.DDDDDD]]

    • Time: [H]H:[M]M:[S]S[.DDDDDD]

Working with multiple uploads

You can create up to 500 uploads per table, per version. Files will automatically be appended to each other based on their variable names (case insensitive), with the goal of creating one continuous table with a consistent schema.

Missing variables

If a variable is missing in some of the files you uploaded, the values for the missing variable will be set to null for all rows in the upload.

Conflicting variable types

If files have conflicting types for a given variable, the lowest-denominator type for that variable is chosen when the files are combined.

Pseudo-variables associated with uploads

In certain situations, there may be additional metadata associated with an upload that isn't otherwise present in your data (e.g., a date that is encoded in the file name). To allow you to add such information within your transforms, Redivis exposes two "pseudo" variables on all unreleased tables:

  • _IMPORT_ID: The unique identifier associated with the upload. This will most often be used when initiating uploads programmatically.

  • _IMPORT_NAME: The name of the upload, typically the associated file name.

These pseudo-variables won't be shown in the table's list of variables and won't be present when listing rows, but they can be queried via SQL and will show up alongside other variables when cleaning unreleased data in transforms. After a table is released, these pseudo-variables will no longer be available.

Error handling

A file may fail to import due to several reasons; in each case, Redivis endeavors to provide a clear error message for you to fix the error.

In order to view full error information, including a snapshot of where the error occurred in your source file (when applicable), click on the failed upload in the upload manager.

Network issues

When transferring a file from your computer (or more rarely, from other import sources), there may be an interruption to the internet connection that prevents the file from being fully uploaded. In these cases, you should simply try uploading the file again.

Invalid or corrupted source data

Data invalidity is most common when uploading text-delimited files, though it can happen with any file format. While some data invalidity errors may require further investigation outside of Redivis, others may be due to incorrect options provided in the file upload process. When possible, Redivis will display ~1000 characters near the error in the source file, allowing you to identify the potential source of failure.

Common import errors

The Redivis data import tool has been built to gracefully handle a wide range of data formats and encodings. However, errors can still occur if the source data is "invalid"; some common problems (and their solutions) are outlined below.

If you're still unable to resolve the issue, please don't hesitate to reach out to [email protected]; we'd be happy to assist!

Bad CSV dump from SQL database

Some SQL databases and tutorials will generate invalid CSV escape sequences by default. Specifically, they escape quote characters within a cell with a backslash (\") rather than by doubling the quote character; the "proper" escape sequence is a doubling of the quote character (""). For example, MySQL's SELECT ... INTO OUTFILE uses a backslash escape character by default.

If you only have access to an invalid file generated by a previous database dump, you can specify a custom Escape character of \ in the advanced import options, and Redivis will reformat the file as part of the ingest process (Redivis will also auto-detect this custom escape sequence in many scenarios). Using a custom escape sequence may cause data import processing to take a bit longer.

Line breaks within cells

If your data has paragraphs of text within a particular data cell, and the "Cells contain line breaks" advanced option isn't set, the data import may fail. Redivis will automatically set this option to true if it identifies a quoted newline in the top ~1000 records of the file, but if quoted newlines don't occur until later, you'll need to make sure to set this option manually for the import to succeed.

Connectivity and timeout errors

While rare, it is always possible that data transfers will be interrupted by the vagaries of networking. If this happens, we recommend simply retrying your upload. If the problem persists, please reach out to [email protected].

Step: Create variables

Overview

A Create variables block uses methods to make new variables based on existing data. The method selected will dictate how the block operates.

Example starting data:

Example output data:

Creating a new variable for what letter grade each student got on their test.

Step structure

  • There will be at least one new variable block where you will define and complete a new variable method.

  • When multiple blocks exist, the variables will be created in sequence and can reference each other.

Field definitions

Field
Definition

Some methods are only available for certain variable types, so you might need to retype variables before you can use them in the method you've chosen.

Analytic methods

Analytic methods are a special category of method that allow a value to be computed for each row individually. When using an analytic new variable method, new tools become available:

  • A partition segments data based on values in the selected partition variables, and computes the analytic method within those segments separately.

  • A window defines which records are used to compute the analytic method. Usually this would be accompanied with an order clause.

Examples 3 and 4 below go into more detail for analytic methods.

Examples

Example 1: Date extract

A simple new variable format is extracting one part of a variable into a new one. In our data, we have a full date including year, month, and day, but we want to extract the month for use elsewhere.

Starting data:

Input fields:

  • Name: We name our new variable test_month.

  • Method: Once we choose Date extract as our method, new fields appear:

    • Variable to extract: The variable with the data we want to extract from (date).

    • Date part: The part of the variable we want to extract. Since date is a Date type variable the information about date part is stored in the format and can be easily extracted. We choose Month since that is what we want to create.

  • More methods to choose from can be found in the variable methods reference.
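
As a rough pandas analogue (a sketch only; the transform itself runs as SQL), Date extract with the Month date part corresponds to:

import pandas as pd

scores = pd.DataFrame({"date": ["2020-04-01", "2020-05-01", "2020-06-01"]})
scores["test_month"] = pd.to_datetime(scores["date"]).dt.month   # 4, 5, 6
print(scores)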

Output data:

Example 2: Case method

The Case method (if/else) allows us to specify one or more conditions to create the values of the new variable.

For example, we can create a variable capturing the grade of each test in our data.

Starting data:

Input fields:

  • When run, each row of the table will be evaluated in each section of the case statement (If, Else if, Else set to) until it matches something and is set to the corresponding value.

    • After the first section where a row meets the criteria, the new variable value will be set and no other sections will be evaluated.

  • Comparison statements used here operate the same as they do in the Filter step and can be nested in the same way.
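
A hedged sketch of the same if/else logic in Python (the letter-grade cutoffs here are hypothetical, since they depend on the conditions you configure):

import numpy as np
import pandas as pd

scores = pd.DataFrame({"score": [83, 35, 89, 74]})

# Conditions are checked in order and the first match wins
conditions = [scores["score"] >= 90, scores["score"] >= 80, scores["score"] >= 70]
choices = ["A", "B", "C"]
scores["grade"] = np.select(conditions, choices, default="F")
print(scores)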

Output data:

Example 3: Partitioned analytic methods

We can use an analytic method to compute a value for each row (rather than the entire table). By using a partition we can define groups of rows to calculate across.

For example, in our grades data we can calculate the average score of each test.

Starting data:

Input fields:

  • Method: We want to calculate the average so select this analytic method.

  • Partition: This is where we define the groups in which the average will be calculated.

    • If one or more variables are selected here, the average for our new score_average variable will be computed across all rows that are the same in the selected variable(s).

    • If no variables are entered here, then the average will be computed across the entire table.

    • We want to average scores from the same test so we select test here.

  • Variable to aggregate: This variable contains the data we want to average.

  • Window: We don't want to define a window in this average so we leave it on All rows in the partition.

Output data:

To create the new variable, for each row with the same value in test, the values in score will be averaged together. So there are 3 rows with the value quiz in test which average to 69. So all rows with a quiz will show the same average of 69 in our new variable.
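
A hedged pandas equivalent of this partitioned Average (for intuition only):

import pandas as pd

scores = pd.DataFrame({
    "test":  ["quiz", "quiz", "quiz", "midterm", "midterm", "midterm"],
    "score": [83, 35, 89, 74, 62, 93],
})

# Compute the mean within each test (the partition) and broadcast it back to every row
scores["score_average"] = scores.groupby("test")["score"].transform("mean")
print(scores)   # quiz rows all get 69.0, midterm rows all get 76.33...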

Example 4: Windowed analytic methods

We can use a window to calculate a moving average. A window will define how many rows before or after the current row to use when calculating the average.

Example data:

Input fields:

  • Partition: We choose not to use a partition in this example since our data does not need to be segmented.

  • Variable to aggregate: The numbers we want to average are in the score variable so we choose that here.

  • Window: We want to create a moving average based on one entry before and after the current row, so we select Rows here.

  • Order by: Our data is already ordered by date in this table, but if it wasn't we would definitely need to order on the relevant variable here.

  • Rows preceding / Rows following: This is where we define how many rows to include in the average.

Output data:

For each row, one row preceding and following is used to compute the score_average. So for the first row we average 10 and 10 (since no preceding rows exist, it is excluded). For the second row 10, 10, and 40 are averaged. This process repeats until the end of the table is reached.

Note that we could also use Range instead of Rows for our window if our question was time based (e.g. average score over 1 month preceding and 1 month following).
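
A hedged pandas sketch of this Rows window (the dates and trailing scores are hypothetical; only the first few values follow the example above):

import pandas as pd

scores = pd.DataFrame({
    "date":  pd.to_datetime(["2020-01-01", "2020-01-08", "2020-01-15", "2020-01-22"]),
    "score": [10, 10, 40, 40],
})

# 1 row preceding and 1 row following the current row, ordered by date
scores = scores.sort_values("date")
scores["score_average"] = scores["score"].rolling(window=3, center=True, min_periods=1).mean()
print(scores)   # 10.0, 20.0, 30.0, 40.0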

Example 5: Windowed & partitioned analytic methods

Continuing the previous example, if we had included a partition then this same process would be completed separately for the values in each partition. So if we had partitioned on test in this example, our outcome would look different.

Output data:

Since quiz2 is in a separate partition, those rows are averaged separately.

Time

Current time

Returns the current time.

Return type

time

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

Format time

Returns a formatted string from a time.

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

More details about format strings

Parse time

Parses a time from a string.

Return type

time

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

More details about format strings

New time

Constructs a time from an hour, minute, and second.

Return type

time

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

Time add

Add a period of time to a time.

Return type

time

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

Time diff

Calculate distance between two times.

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

Time subtract

Subtract a period of time from a time.

Return type

time

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

Time truncate

Truncates a time to the nearest boundary.

Return type

time

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

Create an image classification model

This guide demonstrates using a Redivis workflow to train an ML model on a set of images stored in a Redivis dataset.

Workflow objective

This is an example workflow demonstrating image classification via Convolutional Neural Networks. It imports an example dataset containing several thousand test and training images of cats and dogs, with which we can train and evaluate our model.

This workflow is heavily adapted from its initial publication at: https://gsurma.medium.com/image-classifier-cats-vs-dogs-with-convolutional-neural-networks-cnns-and-google-colabs-4e9af21ae7a8

We also suggest you recreate this workflow as you go, to best learn the process.

1. Explore data

All the image data we need is contained in an example dataset published by the Redivis Demo organization.

We can go to this dataset and browse its tables to understand the structure of the data it contains.

We see three tables here, and all of them are file index tables. That means that each table contains an index of the files (unstructured data) this dataset contains, sorted by the folder the administrator uploaded them into. We can click on the Files tab of the dataset to see each file individually, and click on it to see a preview.

This dataset has three groupings of files:

  • Training images (we will use these to build the model)

  • Test images (images not included in the training set that we can verify the model with)

  • Example file types (unrelated to this workflow)

If we click on the Tables tab, and click on the training images table, we can see high level information about this set of files. We can see that there are 25,000 files, and when we click the Cells tab, all of the file names we can see end in .jpg. We can hover on these to see a preview of the image, and we can click on the file_id variable to see a preview of the image with more information.

2. Create a workflow

At the top of this dataset page we can click the Analyze in workflow button to get started working with this data.

You can add this dataset to an existing workflow you already have access to, or create a new workflow to start from scratch.

3. Define a training set of images

We will use transforms to clean the data, as they are best suited for reshaping data and will quickly output new tables we can continue to work with.

Define training set

We need to start by defining the training set, which conceptually means the set of images we know are cats and know are dogs to train the model on. Information about whether an image is a cat or dog is in the file name, so we need to pull it out into a new variable we can more easily sort on.

Click on the table Training images and create a transform. This interface is where we will define a query which will run against our source table and create an output table. You can choose to write the query in SQL, but we will use the interface for this example since it is faster and easier to use.

Add a Create variables step and name the new variable is_cat. The method will be Regexp contains, which allows us to easily identify the presence of the string cat in the file_name variable. This new variable will be a boolean variable where true means the image contains a cat and false means it does not.

We only want to include some of our images in the training set used to train the model, since we want to leave some aside to validate the model. So here we want to include exactly 5000 cat images and 5000 dog images. To do this we will create a new variable rank and filter on it so that we only keep the first 5000 images of each type.

To do this, click + Add block in the Create variables step and use the Rank method. This is an analytic method, which means you will use the partition ability to partition on true and false values. For each partitioned value (true and false) a rank will be assigned.

Create a new Filter step. Conceptually we will keep records with a rank of up to 5000, which means the output will include 5000 true values and 5000 false values.

The final step in the transform is deciding which variables we want in our output table. We will keep our new boolean variable is_cat to use later, along with the file_id and file_name variables.
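For intuition, the logic of this transform is roughly equivalent to the following pandas sketch (illustrative only: the actual transform compiles to SQL and runs on Redivis, and the table reference name here is a placeholder):

import redivis

# Illustrative sketch of the transform's logic (the real work happens in SQL on Redivis)
files = redivis.table("training_images").to_pandas_dataframe()  # placeholder table reference

files["is_cat"] = files["file_name"].str.contains("cat")   # Create variables: Regexp contains
files["rank"] = files.groupby("is_cat").cumcount() + 1     # Create variables: Rank, partitioned on is_cat
training = files[files["rank"] <= 5000]                    # Filter: keep the first 5000 of each class
training = training[["file_id", "file_name", "is_cat"]]    # Variable selection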

With everything in place we can run this transform by clicking the Run button in the top right corner.

4. Sanity check the output table

Now that we have created a new table, we can inspect it to make sure our steps accomplished what we expected them to.

Click on the output table below the transform to view it. We can see that it contains 10,000 records, which is exactly what we expected.

We can also inspect each variable further. If we click on the is_cat variable we can see that there are 5000 true values and 5000 false values, which shows that our filtering was successful. We can also validate that the method we used to determine whether an image is a cat or a dog worked by clicking on the Cells tab. Here we can see that records marked true have "cat" in their file name, and when we hover on the file_id value to see a preview, the image clearly contains a cat.

Since this table looks like we expect we can move on to the next step! Otherwise we'd need to go back to the initial transform to change our inputs.

5. Define validation set of images

We need to create a set of image files separate from our training set where we know if the image contains a cat or dog. This will be used to validate the model training.

Create a new transform and take all the same steps as we did in the previous transform, but we will change the filter to keep images ranked 5001-7500, rather than 1-5000.

We will keep the same variables as we did in our training model, and then run this transform.

When we run this transform and inspect the output table we see what we expect here as well. There are 5000 total files and we can validate a few of them visually on the Cells tab.

6. Training the model in a notebook

Next we want to train and test a model using Python code, with the help of various Python libraries. Transforms are highly performant for reshaping data, but they are based on SQL and operate linearly, with only a single output table allowed. In order to work in Python, R, Stata, or SAS to generate visuals and other outputs, we will create a Notebook node on our training data output table.

When you create the notebook for the first time it will start up. Notebooks must be running to execute code.

Install packages

Redivis notebooks come with many common packages preinstalled, and you can install additional packages by clicking the Dependencies button and importing libraries in the code.

Since this notebook contains only public data we can install packages at any time, but for restricted data notebooks do not have internet access and packages can only be installed when they are stopped.

The main libraries used to create this model are Keras and TensorFlow. You can view their documentation for further details.

Load training and validation sets

Newly created notebooks come with standard code to import the Redivis library and reference the source table as a pandas dataframe within the notebook. For this example we will replace this sample code so that the data is imported in the form our chosen libraries expect.
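As a minimal sketch of what that might look like (the table name is a placeholder for the training output table in this workflow; the folder layout mirrors what the model code in this guide expects, since Keras infers class labels from subdirectory names):

import os
import redivis

# Placeholder table name; columns per this guide: file_id, file_name, is_cat
df = redivis.table("training_output").to_pandas_dataframe()

for row in df.itertuples():
    label = "cat" if row.is_cat else "dog"
    target_dir = f"{os.getcwd()}/cats_and_dogs/training/{label}"
    os.makedirs(target_dir, exist_ok=True)
    # Download each image file by its globally unique file_id
    redivis.file(row.file_id).download(target_dir)

The same approach can be repeated for the validation set, downloading into cats_and_dogs/validation/.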

Define model parameters

This is where we will heavily rely on our selected libraries to build the model.

Model training

This is where we will train the model we just built using the image data we cleaned.

Evaluate model results

Now we will use the validation set to see how well our model works.
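For a quick numeric summary, something like the following reports overall loss and accuracy on the validation generator defined in the training code (recent Keras versions accept a generator directly in evaluate; older versions expose an equivalent evaluate_generator method):

loss, accuracy = model.evaluate(
    validation_generator,
    steps=len(validation_generator.filenames) // BATCH_SIZE)
print(f"Validation loss: {loss:.3f}, accuracy: {accuracy:.3f}")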

Next steps

Perhaps we see something in this model we want to tweak, or we want to go back and change some of our underlying data. Workflows are iterative, and at any point you can go back and change the source data, transform configurations, or notebooks, and then rerun them.

Notebooks can also create output tables, which allow you to sanity check the work done in the notebook or create a table to use in another notebook or transform. You can also fork this workflow to work on a similar analysis, or export any table in this workflow for work elsewhere.
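For example, a hypothetical dataframe of per-image predictions could be materialized as an output table using the same pattern shown in the notebooks guide (the dataframe construction here is illustrative):

import pandas as pd
import redivis

# Hypothetical: collect the test-set probabilities computed in the notebook into a dataframe
predictions = pd.DataFrame({
    "file_name": test_generator.filenames[:len(probabilities)],
    "probability_dog": probabilities.flatten(),
})

redivis.current_notebook().create_output_table(predictions)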

Statistical

Correlation

Returns the Pearson coefficient of correlation of a set of number pairs.

Return type: float

Parameters: see the CORR signature below.

Covariance

Returns the covariance of a set of number pairs.

Return type: float

Parameters: see the COVAR_@sample_type signature below.

Standard deviation

Returns the standard deviation of all values.

Return type: float

Parameters: see the STDDEV_@sample_type signature below.

Variance

Returns the variance of all values.

Return type: float

Parameters: see the VAR_@sample_type signature below.

CURRENT_TIME([@time_zone])

  @time_zone: enum; required: false; placeholder: (UTC (default))
  Allowed values: ACDT, ACST, ACT, ACT, ACWST, ADT, AEDT, AEST, AFT, AKDT, AKST, AMST, AMT, AMT, ART, AST, AST, AWST, AZOST, AZOT, AZT, BDT, BIOT, BIT, BOT, BRST, BRT, BST, BST, BST, BTT, CAT, CCT, CDT, CDT, CEST, CET, CHADT, CHAST, CHOT, CHOST, CHST, CHUT, CIST, CIT, CKT, CLST, CLT, COST, COT, CST, CST, CST, CT, CVT, CWST, CXT, DAVT, DDUT, DFT, EASST, EAST, EAT, ECT, ECT, EDT, EEST, EET, EGST, EGT, EIT, EST, FET, FJT, FKST, FKT, FNT, GALT, GAMT, GET, GFT, GILT, GIT, GMT, GST, GST, GYT, HDT, HAEC, HST, HKT, HMT, HOVST, HOVT, ICT, IDLW, IDT, IOT, IRDT, IRKT, IRST, IST, IST, IST, JST, KALT, KGT, KOST, KRAT, KST, LHST, LHST, LINT, MAGT, MART, MAWT, MDT, MET, MEST, MHT, MIST, MIT, MMT, MSK, MST, MST, MUT, MVT, MYT, NCT, NDT, NFT, NPT, NST, NT, NUT, NZDT, NZST, OMST, ORAT, PDT, PET, PETT, PGT, PHOT, PHT, PKT, PMDT, PMST, PONT, PST, PST, PYST, PYT, RET, ROTT, SAKT, SAMT, SAST, SBT, SCT, SDT, SGT, SLST, SRET, SRT, SST, SST, SYOT, TAHT, THA, TFT, TJT, TKT, TLT, TMT, TRT, TOT, TVT, ULAST, ULAT, UTC, UYST, UYT, UZT, VET, VLAT, VOLT, VOST, VUT, WAKT, WAST, WAT, WEST, WET, WIT, WST, YAKT, YEKT

FORMAT_TIME(@format_string, @time_expression)

  @time_expression: variable or literal; allowed values: any time; required: true
  @format_string: literal; allowed values: any string; required: true; placeholder: (e.g., %H:%M:%S)

[@safe]PARSE_TIME(@format_string, @time_string)

  @time_string: variable or literal; allowed values: any string; required: true
  @format_string: literal; allowed values: any string; required: true; placeholder: (e.g., %H:%M:%S)
  @safe: boolean; allowed values: any boolean; required: true

TIME(@hour, @minute, @second)

  @hour: variable or literal; allowed values: any integer; required: true
  @minute: variable or literal; allowed values: any integer; required: true
  @second: variable or literal; allowed values: any integer; required: true

TIME_ADD(@time_expression, INTERVAL @integer_expression @time_part)

  @integer_expression: variable or literal; allowed values: any integer; required: true
  @time_part: enum; allowed values: hour, minute, second, millisecond, microsecond; required: true; placeholder: (e.g., hours)
  @time_expression: variable or literal; allowed values: any time; required: true

TIME_DIFF(@time_expression, @time_expression_2, @time_part)

  @time_expression_2: variable or literal; allowed values: any time; required: true
  @time_expression: variable or literal; allowed values: any time; required: true
  @time_part: enum; allowed values: hour, minute, second, millisecond, microsecond; required: true; placeholder: (e.g., hours)

TIME_SUB(@time_expression, INTERVAL @integer_expression @time_part)

  @integer_expression: variable or literal; allowed values: any integer; required: true
  @time_part: enum; allowed values: hour, minute, second, millisecond, microsecond; required: true; placeholder: (e.g., days)
  @time_expression: variable or literal; allowed values: any time; required: true

TIME_TRUNC(@time_expression, @time_truncate_part)

  @time_expression: variable or literal; allowed values: any time; required: true
  @time_truncate_part: enum; allowed values: hour, minute, second, millisecond, microsecond; required: true; placeholder: (e.g., hours)

CORR(@expression, @expression_2)

  @expression: variable or literal; allowed values: any float; required: true
  @expression_2: variable or literal; allowed values: any float; required: true

COVAR_@sample_type(@expression, @expression_2)

  @sample_type: enum; allowed values: SAMP, POP; required: false; placeholder: (Sample (default))
  @expression: variable or literal; allowed values: any float; required: true
  @expression_2: variable or literal; allowed values: any float; required: true

STDDEV_@sample_type(@expression)

  @sample_type: enum; allowed values: SAMP, POP; required: false; placeholder: (Sample (default))
  @expression: variable or literal; allowed values: any float; required: true

VAR_@sample_type(@expression)

  @sample_type: enum; allowed values: SAMP, POP; required: false; placeholder: (Sample (default))
  @expression: variable or literal; allowed values: any float; required: true

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 | sam     | 74     |
 | pat     | 62     |
 *---------+--------*/
/*---------+--------+-------*
 | student | score  | grade |
 +---------+--------+-------+
 | jane    | 83     | B     |
 | neal    | 35     | F     |
 | sam     | 74     | C     |
 | pat     | 62     | D     |
 *---------+--------+-------*/

Name

The name of the variable being created. This must follow all naming standards

Method

The way that the new variable will be created. Choosing this will bring up additional fields to complete specific to the chosen method. See all methods here.

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/
/*---------+-------+---------+------------+--------------*
 | test    | score | student | date       | test_month   |
 +---------+-------+---------+------------+--------------+
 | quiz    | 83    | jane    | 2020-04-01 | April        |
 | quiz    | 35    | pat     | 2020-04-01 | April        |
 | quiz    | 89    | sam     | 2020-04-01 | April        |
 | midterm | 74    | jane    | 2020-05-01 | May          |
 | midterm | 62    | pat     | 2020-05-01 | May          |
 | midterm | 93    | sam     | 2020-05-01 | May          |
 | final   | 77    | jane    | 2020-06-01 | June         |
 | final   | 59    | pat     | 2020-06-01 | June         |
 | final   | 92    | sam     | 2020-06-01 | June         |
 *---------+-------+---------+------------+--------------*/
/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/
/*---------+-------+---------+------------+----------------*
 | test    | score | student | date       | letter_grade   |
 +---------+-------+---------+------------+----------------+
 | quiz    | 83    | jane    | 2020-04-01 | B              |
 | quiz    | 35    | pat     | 2020-04-01 | F              |
 | quiz    | 89    | sam     | 2020-04-01 | B              |
 | midterm | 74    | jane    | 2020-05-01 | C              |
 | midterm | 62    | pat     | 2020-05-01 | F              |
 | midterm | 93    | sam     | 2020-05-01 | A              |
 | final   | 77    | jane    | 2020-06-01 | C              |
 | final   | 59    | pat     | 2020-06-01 | F              |
 | final   | 92    | sam     | 2020-06-01 | A              |
 *---------+-------+---------+------------+----------------*/
/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/
/*---------+-------+---------+------------+---------------*
 | test    | score | student | date       | score_average |
 +---------+-------+---------+------------+---------------+
 | quiz    | 83    | jane    | 2020-04-01 | 69            |
 | quiz    | 35    | pat     | 2020-04-01 | 69            |
 | quiz    | 89    | sam     | 2020-04-01 | 69            |
 | midterm | 74    | jane    | 2020-05-01 | 76.3333       |
 | midterm | 62    | pat     | 2020-05-01 | 76.3333       |
 | midterm | 93    | sam     | 2020-05-01 | 76.3333       |
 | final   | 77    | jane    | 2020-06-01 | 76            |
 | final   | 59    | pat     | 2020-06-01 | 76            |
 | final   | 92    | sam     | 2020-06-01 | 76            |
 *---------+-------+---------+------------+---------------*/
/*-------+---------+------------*
 | score | test    | date       |
 +-------+---------+------------+
 | 10    | quiz1   | 2020-01-01 |
 | 10    | quiz1   | 2020-02-01 |
 | 40    | quiz1   | 2020-03-01 |
 | 30    | quiz2   | 2020-04-01 |
 *-------+---------+------------*/
/*-------+---------+------------+---------------*
 | score | test    | date       | score_average |
 +-------+---------+------------+---------------+
 | 10    | quiz1   | 2020-01-01 | 10            |
 | 10    | quiz1   | 2020-02-01 | 20            |
 | 40    | quiz1   | 2020-03-01 | 26.6667       |
 | 30    | quiz2   | 2020-04-01 | 35            |
 *-------+---------+------------+---------------*/
/*-------+---------+------------+---------------*
 | score | test    | date       | score_average |
 +-------+---------+------------+---------------+
 | 10    | quiz1   | 2020-01-01 | 10            |
 | 10    | quiz1   | 2020-02-01 | 20            |
 | 40    | quiz1   | 2020-03-01 | 25            |
 | 30    | quiz2   | 2020-04-01 | 30            |
 *-------+---------+------------+---------------*/

Data repository characteristics

Overview

Redivis is designed to support data-driven research throughout the research lifecycle, and can serve as a permanent repository for your data, analytical workflows, and data derivatives.

In May 2022, the Subcommittee on Open Science (SOS) of the United States Office of Science and Technology Policy (OSTP) released a document outlining the "desirable characteristics of data repositories".

These characteristics are intended to help agencies direct Federally funded researchers toward repositories that enable management and sharing of research data consistent with the principles of FAIR data practices. Various agencies have adopted these guidelines, including the NIH.

Redivis is specifically designed to meet these desirable characteristics of a data repository, outlined below:

Desirable Characteristics for All Data Repositories

Unique Persistent Identifiers

✅ Assigns a persistent identifier (PID), such as a DOI

✅ Identifier points to a persistent landing page

Long-Term Sustainability

✅ Plan for long-term data management

✅ Maintain integrity, authenticity, and availability of datasets

✅ Stable technical infrastructure

✅ Stable funding plans

✅ Contingency plans to ensure data are available and maintained

Metadata

✅ Datasets accompanied by metadata

✅ Aids in the easy discovery, reuse, and citation of datasets

✅ Schema appropriate to relevant data communities

Curation and Quality Assurance

✅ Provide or allow others to provide expert curation

✅ Quality assurance for accuracy and integrity of datasets and metadata

Free and Easy Access

✅ Broad, equitable, and maximally open access to datasets and metadata

✅ Access is free of charge in a timely manner consistent with privacy

Broad and Measured Reuse

✅ Makes datasets and metadata available to reuse

✅ Provides ability to measure attribution, citation, and reuse of data

Clear Use Guidance

✅ Provides documentation for access and use

Security and Integrity

✅ Prevents unauthorized access, modification, and release of data

Confidentiality

✅ Ensures administrative, technical, and physical safeguards

✅ Continuous monitoring of requirements

Common Format

✅ Download, access, and export available in non-proprietary formats

Provenance

✅ Ability to record the origin, chain of custody, and modification of data or metadata

Retention Policy

✅ Provides policy for data retention

Additional Considerations for Human Data

Fidelity to Consent

✅ Utilizes consistent consent

Restricted Use Compliant

✅ Enforces data use restrictions

Privacy

✅ Implements measures to protect data from inappropriate access.

Plan for Breach

✅ Has a response plan for detected data breaches.

Download Control

✅ Controls and audits access to and download of datasets.

Violations

✅ Has procedures for addressing violations and data mismanagement.

Request Review

✅ Process for reviewing data access requests.


Detailed information

Unique Persistent Identifiers

Assigns datasets a citable, unique persistent identifier, such as a digital object identifier (DOI) or accession number, to support data discovery, reporting, and research assessment. The identifier points to a persistent landing page that remains accessible even if the dataset is de-accessioned or no longer available.

Data can be uploaded to datasets within an organization, where every version of that dataset is assigned a DOI through the organization's DOI-issuing credentials. DOIs will always resolve to the URL of the dataset page. In the case when a dataset is restricted or deleted, base metadata will remain available.

Long-Term Sustainability

Has a plan for long-term management of data, including maintaining integrity, authenticity, and availability of datasets; building on a stable technical infrastructure and funding plans; and having contingency plans to ensure data are available and maintained during and after unforeseen events.

Redivis uses highly-available and redundant Google Cloud infrastructure to ensure data is stored securely and to the highest technical standards. Redivis maintains a formal disaster recovery and business continuity plan that is regularly exercised to ensure our ability to maintain availability and data durability during unforeseen events.

Redivis undergoes annual security audits and penetration testing by an external firm, and utilizes a formalized software development and review process to maintain the soundness of its technical infrastructure.

Funding for Redivis is provided by recurring annual subscriptions from its member academic institutions. This model provides consistent annual revenue to support the ongoing maintenance of the platform. Redivis is an employee-owned company without any external investors with an equity stake, allowing us to focus solely on the needs of our customers, our employees, and our mission of improving accessibility in research data science.

Metadata

Ensures datasets are accompanied by metadata to enable discovery, reuse, and citation of datasets, using schema that are appropriate to, and ideally widely used across, the community(ies) the repository serves. Domain-specific repositories would generally have more detailed metadata than generalist repositories.

All Redivis datasets contain extensive metadata and documentation. Some metadata fields (including summary statistics for all variables) are automatically generated. Some fields are optional depending on the editor's insight and preference. Every dataset has space for short and long-form text, supporting files, and links, alongside variable labels and descriptions.

Metadata is available in various machine readable formats, such as schema.org and DataCite JSON, and can be viewed through the interface or downloaded via the API.

As a generalist repository, the metadata schema is intentionally broad and flexible, but specific groups on Redivis can choose to enforce more specific metadata standards within their datasets.

Curation and Quality Assurance

Provides, or has a mechanism for others to provide, expert curation and quality assurance to improve the accuracy and integrity of datasets and metadata.

All datasets on Redivis are owned either by an Organization (curated by any administrator) or an individual User. Additional users may be added as editors to a dataset, so as to provide further curation and quality assurance.

It is ultimately up to the editors of a dataset to provide curation, though Redivis is designed to support this process as much as possible. Redivis automatically computes checksums and runs fixity checks on all uploaded files, and computes univariate summary statistics of all variables to aid in the quality assurance process. Metadata completeness is also reported to editors, encouraging them to provide as much information as possible.

Free and Easy Access

Provides broad, equitable, and maximally open access to datasets and their metadata free of charge in a timely manner after submission, consistent with legal and ethical limits required to maintain privacy and confidentiality, Tribal sovereignty, and protection of other sensitive data.

Datasets and workflows can be explored on Redivis without any requirement to have an account or log in. To apply for access to restricted data, or to analyze and download data, a user will need to create an account. All individual accounts are completely free and require no specific affiliation. Data access restrictions are set and maintained by the data owner.

Broad and Measured Reuse

Makes datasets and their metadata available with broadest possible terms of reuse; and provides the ability to measure attribution, citation, and reuse of data (i.e., through assignment of adequate metadata and unique PIDs).

The data owner can choose to publish any dataset publicly or set appropriate access restrictions based on the sensitivity of the data. Redivis imposes no additional limits on the availability and reuse of data. Any dataset that has a DOI can be cited and tracked in publications, and any reuse on Redivis is displayed on the dataset's usage tab.

Clear Use Guidance

Provides accompanying documentation describing terms of dataset access and use (e.g., particular licenses, need for approval by a data use committee).

Dataset owners can describe any usage agreements in their access requirements and have space to document any additional usage rules. Redivis imposes no additional limits on the availability and use of data.

Security and Integrity

Has documented measures in place to meet generally accepted criteria for preventing unauthorized access to, modification of, or release of data, with levels of security that are appropriate to the sensitivity of data.

Redivis is SOC2 certified and prioritizes technical security. There are multiple layers of administrative controls to make it clear what actions data owners are taking, and all actions, whether taken by administrators or researchers, are automatically logged for review.

Redivis is well-designed to handle workflows around sharing of sensitive and high-risk data. It provides technical mechanisms to support the reuse of sensitive data when allowed, while enabling the enforcement of appropriate guardrails and access controls defined by data administrators.

Confidentiality

Has documented capabilities for ensuring that administrative, technical, and physical safeguards are employed to comply with applicable confidentiality, risk management, and continuous monitoring requirements for sensitive data.

Redivis is SOC2 certified and prioritizes technical security. All data is encrypted in transit and at rest, and stored on Google Cloud infrastructure that maintains robust technical and physical access controls.

In addition to an annual security audit, Redivis also undergoes annual penetration testing by an outside firm to further ensure the soundness of its security posture.

Common Format

Allows datasets and metadata downloaded, accessed, or exported from the repository to be in widely used, preferably non-proprietary, formats consistent with those used in the community(ies) the repository serves.

Data and metadata on Redivis can be imported and exported in multiple common formats. Data analysis is performed in common, generally open-source programming languages (SAS and Stata being available exceptions). Redivis does not introduce any of its own proprietary formats or programming languages.

Provenance

Has mechanisms in place to record the origin, chain of custody, and any modifications to submitted datasets and metadata.

All datasets contain Provenance documentation, which is automatically populated based on administrator actions. This information can be further edited or supplemented with additional related identifiers.

All modifications to a dataset are tracked in the administrative logs.

Retention Policy

Provides documentation on policies for data retention within the repository.

Redivis publishes a data retention policy. Data is always owned by the user or organization who uploads the dataset, and they have control over a dataset's presence and availability. The dataset owner may apply additional policies towards data retention.


Fidelity to Consent

Uses documented procedures to restrict dataset access and use to those that are consistent with participant consent and changes in consent.

Redivis provides extensive options to limit access on sensitive or restricted data, including allowing access on different levels (e.g. metadata, sample, data). Access is granted and revoked on Redivis instantly, allowing for immediate changes in access based on changing circumstances.

Access rules on Redivis are defined and enforced by the dataset owner.

Restricted Use Compliant

Uses documented procedures to communicate and enforce data use restrictions, such as preventing reidentification or redistribution to unauthorized users.

Redivis has built in data restrictions available to administrators. Data download or export can be restricted completely or limited only to administrator approval in order to prevent redistribution.

Data administrators can also communicate and collect formal acknowledgement of other use restrictions through access requirements. Moreover, data administrators can easily audit the use of restricted data in the audit logs to further check for and limit any non-compliance or other misuse.

Privacy

Implements and provides documentation of measures (for example, tiered access, credentialing of data users, security safeguards against potential breaches) to protect human subjects’ data from inappropriate access.

Redivis has a built-in tiered access system that allows administrators to restrict access to sensitive data. These controls apply to data derivatives as well, where any data output inherits the access rules of the source dataset(s) used to create that output. These controls on derivative data allow researchers to share analyses with colleagues without worrying about accidentally leaking sensitive information, since all collaborators will need to comply with the access rules in order to view those outputs.

Additional technical safeguards ensure data privacy. Data users can establish their identity through their institutional identity provider, and Redivis undergoes regular security audits to ensure the soundness of its security posture.

Plan for Breach

Has security measures that include a response plan for detected data breaches.

Redivis has detailed internal security protocols and documented security breach plans which are regularly exercised by technical personnel via tabletop exercises. All systems are continuously monitored for potential breaches, with immediate alert pathways and clear escalation protocols to respond to any breach.

Download Control

Controls and audits access to and download of datasets (if download is permitted).

Redivis has built in data restrictions available to administrators. Data download or export can be restricted completely or limited only to specific external systems / upon administrator approval. All data downloads are logged for subsequent audit and review. Downloads are only available to authenticated users who have access to the underlying data.

Violations

Has procedures for addressing violations of terms-of-use by users and data mismanagement by the repository.

Violations of the terms-of-use may lead to account suspension or revocation, as outlined in the terms. Additional technical controls are in place to prevent abuse or misuse of Redivis's systems.

As a matter of policy, Redivis aims to be as permissive as possible, recognizing that often misuse is the result of accidental behavior or misunderstanding. These controls are designed to protect the system for all users, and are not intended to ever be punitive towards good-faith actors.

Request Review

Makes use of an established and transparent process for reviewing data access requests.

Administrators can define access requirements on any restricted dataset. These access requirements are transparent to all users, and researchers must apply and be approved for a given set of requirements in order to gain access. Requirements also have space for both administrators and applicants to leave comments specifically in the context of the data application.

.csv, .tsv, .psv, .dsv, .txt, .tab, *

Delimited

Redivis will auto-infer the delimiter, or you can specify it manually. This is also the default format for files with missing file extensions.

See working with text-delimited files

.jsonl, .ndjson

JSON-lines (a.k.a. Newline-delimited JSON)

A newline-delimited list of JSON objects, with one object on each line. Each object's keys correspond to the variable names in the table.

.json

JSON

Must be a JSON array of objects, where each top-level object represents one row in the table. The keys of each top-level object correspond to the variable names in the table.

Importing newline-delimited JSON (see above) will be significantly faster and is recommended for larger files. If your file is formatted as GeoJSON, but has the ending .json, make sure to explicitly choose "GeoJSON" as the file format.

.avro

Avro format

Compressed data blocks using the DEFLATE and Snappy codecs are supported.

Nested and repeated fields are not supported.

.parquet

Parquet format

Nested and repeated fields are not supported.

.orc

Orc format

Nested and repeated fields are not supported.

.sas7bdat

SAS data file

Default formats will be interpreted to the corresponding variable type, and variable labels will automatically be imported.

User-defined formats (.sas7bcat) are not supported.

.dta

Stata data file

Variable labels and value labels will automatically be imported.

.sav

SPSS data file

Variable labels and value labels will automatically be imported.

.xls, .xlsx

Excel file

Only the first sheet will be ingested. The legacy .xls format will have all dates and times represented as dateTimes. Due to the variability of Excel files, and inconsistencies in how Excel internally represents dates, this format is typically not recommended if other options are available.

Google sheets

Sheets file stored in Google Drive

Only the first tab of data will be ingested.

Incorrect encoding (quote characters inside a quoted field should not be backslash-escaped):
val1,val2,"string with \"quotes\" inside"

Correct encoding (quote characters inside a quoted field are escaped by doubling them):
val1,val2,"string with ""quotes"" inside"
SELECT .... 
INTO OUTFILE '/.../out.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"', ESCAPED BY '"'
LINES TERMINATED BY '\n';
import keras
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential, Model
from tensorflow.keras.optimizers import RMSprop
from keras.layers import Activation, Dropout, Flatten, Dense, GlobalMaxPooling2D, Conv2D, MaxPooling2D
from keras.callbacks import CSVLogger
from livelossplot.keras import PlotLossesCallback
import efficientnet.keras as efn
import redivis
import os
TRAINING_LOGS_FILE = "training_logs.csv"
MODEL_SUMMARY_FILE = "model_summary.txt"
MODEL_FILE = "cats_vs_dogs.h5"

# Data
path = f"{os.getcwd()}/cats_and_dogs/"
training_data_dir = path + "training/"
validation_data_dir = path + "validation/" 
test_data_dir = path + "test/" 
# Hyperparams
IMAGE_SIZE = 200
IMAGE_WIDTH, IMAGE_HEIGHT = IMAGE_SIZE, IMAGE_SIZE
EPOCHS = 20
BATCH_SIZE = 32
TEST_SIZE = 30

input_shape = (IMAGE_WIDTH, IMAGE_HEIGHT, 3)
# CNN Model 5 (https://towardsdatascience.com/image-classifier-cats-vs-dogs-with-convolutional-neural-networks-cnns-and-google-colabs-4e9af21ae7a8)
model = Sequential()

# Four blocks of stacked 3x3 convolutions, each followed by 2x2 max pooling
model.add(Conv2D(32, (3, 3), padding='same', input_shape=input_shape, activation='relu'))
model.add(Conv2D(32, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Conv2D(128, (3, 3), padding='same', activation='relu'))
model.add(Conv2D(128, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Conv2D(256, (3, 3), padding='same', activation='relu'))
model.add(Conv2D(256, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
            optimizer=RMSprop(learning_rate=0.0001),
            metrics=['accuracy'])

with open(MODEL_SUMMARY_FILE,"w") as fh:
    model.summary(print_fn=lambda line: fh.write(line + "\n"))
# Data augmentation
training_data_generator = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True)
validation_data_generator = ImageDataGenerator(rescale=1./255)
test_data_generator = ImageDataGenerator(rescale=1./255)
# Data preparation
training_generator = training_data_generator.flow_from_directory(
    training_data_dir,
    target_size=(IMAGE_WIDTH, IMAGE_HEIGHT),
    batch_size=BATCH_SIZE,
    class_mode="binary")
validation_generator = validation_data_generator.flow_from_directory(
    validation_data_dir,
    target_size=(IMAGE_WIDTH, IMAGE_HEIGHT),
    batch_size=BATCH_SIZE,
    class_mode="binary")
test_generator = test_data_generator.flow_from_directory(
    test_data_dir,
    target_size=(IMAGE_WIDTH, IMAGE_HEIGHT),
    batch_size=1,
    class_mode="binary", 
    shuffle=False)
# Training
model.fit_generator(
    training_generator,
    steps_per_epoch=len(training_generator.filenames) // BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=validation_generator,
    validation_steps=len(validation_generator.filenames) // BATCH_SIZE,
    callbacks=[PlotLossesCallback(), CSVLogger(TRAINING_LOGS_FILE,
                                            append=False,
                                            separator=";")], 
    verbose=1)
model.save_weights(MODEL_FILE)
# Testing
probabilities = model.predict_generator(test_generator, TEST_SIZE)
for index, probability in enumerate(probabilities):
    image_path = test_data_dir + "/" +test_generator.filenames[index]
    img = mpimg.imread(image_path)
    plt.imshow(img)
    if probability[0] > 0.5:
        plt.title("%.2f" % (probability[0]*100) + "% dog")
    else:
        plt.title("%.2f" % ((1-probability[0])*100) + "% cat")
    plt.show()

Work with data in notebooks

Overview

Redivis notebooks provide a performant, flexible environment for analysis, allowing you to analyze and visualize data in workflows using Python, R, Stata, or SAS. With the notebook computation happening on Redivis, you don't need to configure an environment on a local machine or server, or export data from Redivis. This makes for easy iteration and collaboration, not to mention better security and data throughput.

Before working with a notebook you'll want to get started first by creating a workflow and adding data. You can then create a notebook off of any table in your workflow.

If you are working with very large tables (>10GB is a good rule of thumb), it's always a good idea to first reshape and reduce the data via transforms, since they can be significantly more performant for large data operations than running code in Python, R, Stata, or SAS.

1. Create a notebook

Once you have a table that you're ready to analyze, you can create a notebook by clicking the + Notebook button at any time. You'll need to name it and choose a kernel (Python, R, Stata, or SAS).

Notebooks can only reference tables within their workflow, so we recommend keeping all related work together in the same workflow.

Python

Python notebooks come pre-installed with a variety of common scientific packages for python. Learn more about working with python notebooks.

R

R notebooks come pre-installed with a variety of common scientific packages for R. Learn more about working with R notebooks.

Stata

Stata notebooks are based off of python notebooks, but offer affordances for moving data between Python and Stata. Learn more about working with Stata notebooks.

SAS

SAS notebooks are based off of python notebooks, but offer affordances for moving data between Python and SAS. Learn more about working with SAS notebooks.

2. Define dependencies

All notebooks come with a number of common packages pre-installed, depending on the notebook type. But if there is something specific you'd like to include, you can add versioned packages or write a pre-/post- install script by clicking the Edit dependencies button in the start modal or the toolbar.

Learn more in the Notebooks reference section.

3. Compute resources

The default notebook configuration is free, and provides access to 2 CPUs and 32GB working memory, alongside a 60GB (SSD) disk and gigabit network. The computational power of these default notebooks is comparable to most personal computers, and will be more than enough for many analyses.

If you're working with larger tables, creating an ML model, or performing other particularly intensive tasks, you may choose to configure additional compute resources for the notebook. This will cost an hourly rate to run based on your chosen environment, and require you to purchase compute credits on your account.

Clicking Edit compute configuration button in the start modal or the toolbar will allow you to choose from different preconfigured machine types. The notebook will then default to this compute configuration each time it starts up.

Learn more in the Compute resources reference section.

4. Start the notebook

Notebook nodes need to be started in order to edit or execute cells. When first clicking on a notebook node, you will see a read-only view of its contents (including cell outputs). Click the Start notebook button in the toolbar to connect this notebook to compute resources.

When you create a notebook for the first time it will start automatically.

5. Load data

To do meaningful work in your notebook, you'll want to bring the tabular and/or unstructured data that exists in your workflow into your notebook.

Referencing tables

Notebooks come pre-populated with templated code that pulls in data from the notebook's source table. You will need to run this cell to pull the data into the notebook, and you can see that it worked because this code will print a preview of the loaded data.

You can reference any other tables in this workflow by replicating this script and executing it with a different table reference. As a rule of thumb, notebooks will easily support interactive analysis of tables up to ~1GB; if your table is larger, try reducing it first by creating a transform, or make sure to familiarize yourself with the tools for working with larger tables in the notebook's programming language.

import redivis

# The source table of this notebook can always be referenced as "_source_"
table = redivis.table("_source_")

# Load table as a pandas dataframe. 
# Consult the documentation for more load options.
df = table.to_pandas_dataframe()

# We can also reference any other table in this workflow by name.
df2 = redivis.table("my_other_table").to_pandas_dataframe()

print(df)
print(df2)

See more examples in the Python notebooks reference.

# The source table of this notebook can always be referenced as "_source_"
redivis_table <- redivis$table("_source_")

# Load table as a tidyverse tibble. 
# Consult the documentation for more load options.
df <- redivis_table$to_tibble()

# We can also reference any other table in this workflow by name.
df2 <- redivis$table("my_other_table")$to_tibble()

print(df)
print(df2)

See more examples in the R notebooks reference.

# In order to load data into Stata, we first have to bring it into Python.
# This code loads the "_source_" table in the python variable `df`
# We can then pass this variable as our stata dataset.

import redivis

# The source table of this notebook can always be referenced as "_source_"
# Reference any other table in this workflow by name.
table = redivis.table("_source_")

df = table.to_pandas_dataframe(dtype_backend="numpy")
%%stata -d df -force
/*
# Use the %%stata magic to load our dataframe, specified by the -d parameter
# The -force flag replaces the the current working dataset in Stata

# The rest is just Stata code!
*/

describe

See more examples in the Stata notebooks reference.

import saspy
sas = saspy.SASsession(results='HTML')

# We first load the table via python, and then pass the dataframe into SAS
df = redivis.table("_source_").to_pandas_dataframe(dtype_backend="numpy")

sas_data = sas.df2sd(df, '_df') # second argument is the name of the table in SAS
sas_data.heatmap('msrp', 'horsepower')

See more examples in the SAS notebooks reference.

Referencing files

Any files with unstructured data stored in Redivis tables can be referenced by their globally unique file_id. You can also reference these file_id's in any derivative tables, allowing you to query and download specific subsets of files.

When working with large files, you'll want to consider saving the files to disk and/or working with the streaming interfaces to reduce memory overhead and improve performance.

redivis_file = redivis.file("rnwk-acs3famee.pVr4Gzq54L3S9pblMZTs5Q")

# Download the file
download_location = redivis_file.download("./my-downloads")
f = open(download_location, "r")

# Read the file into a variable
file_content = redivis_file.read(as_text=True)
print(file_content)

# Stream the file as bytes or text
with redivis_file.stream() as f:
  f.read(100) # read 100 bytes

with TextIOWrapper(redivis_file.stream()) as f:
  f.readline() # read first line
  
# We can also iterate over all files in a table
for redivis_file in redivis.table("_source_").list_files():
  pass # Do stuff with each file

See more examples in the Python notebooks reference.

redivis_file <- redivis$file("s335-8ey8zt7bx.qKmzpdttY2ZcaLB0wbRB7A")

# Download a file
redivis_file$download("/path/to/dir/", overwrite=TRUE)

# Read a file
data <- redivis_file$read(as_text=TRUE)
  
# Stream a file (callback gets called with each chunk)
data <- redivis_file$stream(function(x) {
  print(length(x))
})

# We can also iterate over all files in a table
for (redivis_file in redivis$table("_source_")$list_files()){
  # Do stuff with file
}

See more examples in the R notebooks reference.

6. Analyze data

At this point, you have all the tools you need to work with your data in your chosen language. The Python, R, Stata, and SAS ecosystems contain myriad tools and libraries for performing sophisticated data analysis and visualization.
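For instance, assuming the dataframe df loaded above and matplotlib (one of the common scientific packages pre-installed in Python notebooks), a quick exploratory look might be:

import matplotlib.pyplot as plt

# Summary statistics for all numeric columns
print(df.describe())

# Bar chart of the most frequent values in the first column (column choice is arbitrary here)
df[df.columns[0]].value_counts().head(20).plot(kind="bar")
plt.title("Most frequent values")
plt.show()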

The notebook interface is based off of Jupyter notebooks, and has similar capabilities. You can also export a read-only copy of your notebook as an .ipynb, PDF, or HTML file.

Learn more in the Notebooks reference section.

7. Create an output table

Notebooks can produce an output table, which you can sanity check and further analyze in your workflow by including in other notebooks or exporting to other systems.

# Read table into a pandas dataframe
df = redivis.table('_source_').to_pandas_dataframe()

# Perform various data manipulation actions
df2 = df.apply(some_processing_fn)

# Create an output table with the contents of this dataframe
redivis.current_notebook().create_output_table(df2)

See more examples in the Python notebooks reference.

# Read table into a tibble
tbl = redivis$table('_source_')$to_tibble()

# Perform various data manipulation actions
tbl2 = tbl %>% mutate(...)

# Create an output table with the contents of this dataframe
redivis$current_notebook()$create_output_table(tbl2)

See more examples in the R notebooks reference.

%%stata -doutd df2
/*
  Once this cell executes, the current dataset will be pushed 
  to the python variable df2
*/
rename v* newv*
# Then, in a subsequent Python cell, pass this dataframe to the output table
redivis.current_notebook().create_output_table(df2)

See more examples in the Stata notebooks reference.

# Convert a SAS table to a pandas dataframe
df = sas_table.to_df()

# Create an output table with the contents of this dataframe
redivis.current_notebook().create_output_table(df)

See more examples in the SAS notebooks reference.

Next steps

Share and collaborate

All Redivis notebooks support real-time collaboration, allowing multiple editors to edit and run cells in a running notebook. When another editor is active in a notebook, you will see a colored cursor associated with them (much like a Google Doc).

Share your workflow to work with collaborators in real time, and make it public so that others can fork off of and build upon your work.

Cite datasets in your publications

If the work you're doing leads to a publication, make sure to reference the dataset pages from datasets you've used for information from the data administrators on how to correctly cite it.

Reshape data in transforms

Overview

Transforming tables in a workflow is a crucial step when working with data on Redivis. Conceptually, transforms execute a query on a source table and results are materialized in a new output table. They are optimized to run on billions of records in seconds, and create a transparent, reproducible record of data transformation steps you've taken in your workflow.

In most cases you'll want to use transforms to create an output table containing all the information you're interested in before analyzing that table in a notebook or exporting it for further use.

1. Create a transform

Once you've created a workflow and added a dataset to it that you'd like to work with, get started by creating a Transform. You can do this by clicking on any dataset or table.

To build this transform, you will add steps which each take an action to shape the data in the output table. You can choose from many named steps which include a point and click interface that compile to SQL code, or you can add a SQL query step to write code directly. You can always view the code that your step is generating and switch to code if you'd like to edit it.

While SQL might not be a familiar language, it is optimized for data cleaning procedures and allows transforms to execute extremely quickly. It also allows you to write your data transformations in a declarative, reproducible manner.

2. Join additional tables

The first thing you'll want to do is decide if you have all the information you need in this table or if you'd like to join in an additional table or tables. You can reference any other table from either 1) this dataset, 2) another dataset, or 3) an output table from a different transform in this workflow. Any table you want to reference needs to be in this workflow, so you can click the Add dataset button to add a new dataset to the workflow if it's not here already.

To join a table, add a Join step and select the table you'd like to join. You'll then need to select what type of join it will be, and build your join condition. In most cases your join condition will be linking two variables of the same name and type together (e.g. join all records where id = id).

Learn more in the Joins reference section.

3. Create new variables

You might want to generate new columns in this table by creating a new variable. You can do so by adding a Create variables step. Start by giving this variable a name and then choosing what method you want to use to create it.

Some methods are restricted to the type of variable you are working with. For example there are options to add or subtract years from a date variable, or concatenate string variables.

One of the most common new variable methods is Case (if/else). This allows you to set up a statement that looks at the content of each record and evaluates it based on conditions you've set up to generate the value. For example you can say that in the new variable you're creating, if the amount in column A is greater than 1000, the value of this new variable will be set to "high" and if not, then it will be set to "low."

You can create any number of new variables, and they will execute sequentially, allowing you to reference variables you've created in other variables and subsequent sections.

Learn more in the Create variables reference section.

4. Filter records

You'll probably want to reduce the number of records in this table to exclude any that aren't relevant to the final output table you're creating. This will allow you to execute transforms quickly and get a better understanding of the number of relevant records you have.

To filter records, add a Filter step to your transform and start building the conditions that records will need to meet in order to stay in your output table. These statements can be nested and allow you to reference any variables or record values.

If you find yourself working with a standard list of values you're using in multiple places, this might also be a place to save time and enhance reproducibility by creating and referencing a value list.

Learn more in the Filters reference section.

5. Aggregate data

Depending on the structure of your data you might want to aggregate the data to collapse multiple records down to one, while preserving some information about what was dropped. Conceptually this might look like aggregating a table of charges from one record per charge into one record per person including a variable for the total charge per person.

To get started, add an Aggregate step and select the variables that you want to aggregate on. These will be the variables that exist in your output data after the aggregation is finished. All records that match exactly across these selected variables will be collapsed into a single record.

You can also capture information about records being dropped by creating a new aggregate variable. For example, maybe you are aggregating a table with multiple test scores per person down to a table with just one record per person. You can create an aggregate variable with the average test score, or the count of the number of tests each person took.

Learn more in the Aggregate reference section.

6. Select variables

Finally, before running your transform you'll always need to select which variables you want to keep in your output table. Perhaps you referenced a variable in this transform to create a new one, and now you don't need it anymore. Cutting variables means faster execution, and it is easy to add any variables back later and re-run the transform if you realize you need it, so in general try to keep this list as short as possible.

To propagate variables forward into your output table, they need to be in the right-hand box at the top of the transform labeled Keep. You can select any variable, or set of variables and click the > arrow button to move them over.

Learn more in the Variable selection reference section.

7. Run transform and sanity check output

When you're ready to execute the transform click the Run button in the top right of the toolbar.

If this button is disabled, it might be because the transform is invalid for some reason. Hovering on this button will give you more information, or you can look for the alert symbol (!) to see where you'll need to fix the transform in order to make it valid to run.

After you run a transform, you can investigate the downstream output table to get feedback on the success and validity of your querying operation – both the filtering criteria you've applied and the new features you've created.

Understanding the content of an output table allows you to perform important sanity checks at each step of your research process, answering questions like:

  • Did my filtering criteria remove the rows I expected?

  • Do my new variables contain the information I expect?

  • Does the distribution of values in a given variable make sense?

  • Have I dropped unnecessary variables?

To sanity check the contents of a table node, you can inspect the general table characteristics, check the summary statistics of different variables, look at the table's cells, or create a notebook for more in-depth analysis.
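For example, a minimal sanity check in a Python notebook on the transform's output table might look like this (the variable name is a placeholder for one you created):

import redivis

out = redivis.table("_source_").to_pandas_dataframe()

print(len(out))                                            # Did the filter keep the expected number of rows?
print(out["my_new_variable"].value_counts(dropna=False))   # Placeholder for a variable you created
print(out.describe())                                      # Distributions of numeric variables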

8. Make changes and re-run

If there are any issues with your output table, or if you decide to go in a different direction, or add another step, it is easy to go back to your transform and start making changes.

You'll notice that this transform is now yellow and so is its output table. These edited and stale state indications help you keep track of work you're doing in your workflow, and you can easily revert to the previous successfully executed transform state at any point.

Run this transform again to see changes in the output table. This guide describes all the steps you can take in a single transform but perhaps you want to do one step at a time and run it in between each step to sanity check your output. This system is designed for iteration so make as many changes as you want to experiment and build the output you want.

9. Create another transform

From here you can continue to create transforms on your output table, tables from the dataset you added, or add any other dataset on Redivis. You might want to fit as many steps as you can into one transform, or make a long chain to more easily track your work and communicate it to others.

As you build more transforms you'll see that sometimes actions you take create stale chains of transforms. You can easily make upstream changes in your workflow (such as upgrading a dataset from the sample to the full dataset, or updating to a new version) and then run all transforms in your workflow with one click (a Run all option is available in the Map button menu).

Next steps

Work with data in a notebook

You can use notebooks in a workflow to analyze data using Python, R, Stata, or SAS. These notebooks run in the browser with no additional configuration and seamless sharing with collaborators.

Learn more in the Work with data in notebooks guide.

Export data

If you'd like to export data to a different system, you can download it in various file formats, reference it programmatically in Python / R, or visualize in tools such as Google Data Studio.

Learn more in the Export to other environments guide.

Upload your own datasets

Augment your data analysis in Redivis by uploading your own datasets, with the option to share with your collaborators (or even the broader research community).

Learn more in the Create and populate a dataset guide.

Share and collaborate

Redivis workflows are built for collaboration and include real-time visuals to see where collaborators with edit access are in the workflow, and a comments interface to discuss changes asynchronously.

Share your workflow to work with collaborators in real time, and make it public so that others can fork off of and build upon your work.


Workflow concepts

Overview

Workflows are used to analyze any type of data on Redivis, at any scale. They allow you to organize your analysis into discrete steps, where you can easily validate your results and develop well-documented, reproducible analyses.

Workflows are owned by either a user or an organization, and can be shared with other users and organizations.

Creating a workflow

To create a workflow, navigate to a dataset that you're interested in working with and press the Analyze data in a workflow button. If you do not have data access to this dataset you may need to apply for access first.

You can also create a workflow from the "Workflows" tab of your workspace, or from the administrator panel of an organization (in this latter case, the workflow will be "owned" by the organization and its administrators).

Once you've created your workflow, you'll be able to add any dataset or workflow that you have access to as a data source.

The workflow page

The workflow page consists of a top title bar, a left panel, and a right panel.

The left panel displays the workflow tree, allowing you to visualize how data is moving through the workflow and its nodes.

The right panel shows the contents of the currently selected node in the tree. If no node is selected, this panel will display the workflow's documentation. You can click on the workflow title, or empty space in the workflow tree, to return to the workflow documentation at any time.

The title bar provides an entry point to common actions, broken into two sections: the left section contains actions that are global to the workflow, while the right section contains actions relevant to the currently selected node (e.g., running a transform).

If no node is selected, information about the workflow will be populated in the right panel:

Overview

The workflow overview contains various metadata, provenance information, and narrative about the workflow.

Data sources

A filterable list of all Data sources within the workflow. Clicking on an item will navigate to the corresponding data source in the workflow tree.

Tables

A filterable list of all Tables within the workflow. Clicking on an item will navigate to the corresponding table in the workflow tree.

Transforms

A filterable list of all Transforms within the workflow. Clicking on an item will navigate to the corresponding transform in the workflow tree.

Notebooks

A filterable list of all Notebooks within the workflow. Clicking on an item will navigate to the corresponding notebook in the workflow tree.

The workflow tree

The workflow "tree" is represented visually in the left pain of the workflow. This tree is made up of a collection of nodes, with each node having various inputs and outputs, such that the output (result) of one node can serve as the input of another.

Data in the tree flows from the top to bottom, and circular relationships are not allowed. Formally, this is known as a "Directed Acyclic Graph" (DAG).

Clicking on a node within the tree will display that node's contents within the right pane of the workflow, while highlighting the ancestors and descendants of that node on the tree.

You can right-click on any node for a list of other options, or if preferred, click on the node and then click the three-dot "More" menu at the top-right.

Workflow nodes

The workflow tree is made up of the following node types:

Data sources represent datasets or workflows that have been added to your workflow, and are the mechanism for bringing data into your workflow.

Tables are either tables associated with a data source, or the resulting output table of a transform or notebook.

Transforms are queries that are used to reshape and combine data, always creating a single table as an output.

Notebooks are flexible, interactive programming environments, which can optionally produce a table as an output.

Building a workflow

The main way to build your workflow is to add and edit nodes. You will start by adding data to your workflow, and then create a series of additional nodes that reshape and analyze the data.

Add data to a workflow

You can click the Add data button in the top left corner of the workflow toolbar to select a dataset or another workflow you want to add to this workflow. This will add a copy of the selected data source to the workflow and allow you to reference its tables.

Each data source can only be added to a workflow one time. By default, datasets are added at their current version, but you can right click on the dataset in this modal to select a different version to add.

Reshape and analyze data

All data cutting, reshaping, and analysis on Redivis happens in either a transform or a notebook. These nodes must be attached to a source table, so they can only be created after you've added a data source.

To create a transform or notebook, click on a table and select either the transform or notebook button that appears beneath it. If the table already has a downstream node you can press the plus icon beneath it instead.

Transforms vs. notebooks?

There are two mechanisms for working with data in workflows: transforms and notebooks. Understanding when to use each tool is key to taking full advantage of the capabilities of Redivis, particularly when working with big datasets.

Transforms are better for:

  • Reshaping + combining tabular and geospatial data

  • Working with large tables, especially at the many GB to TB scale

  • Preference for a no-code interface, or preference for programming in SQL

  • Declarative, easily documented data operations

Notebooks are better for:

  • Interactive exploration of any data type, including unstructured data files

  • Working with smaller tables (though working with bigger data is possible)

  • Preference for Python, R, Stata, or SAS

  • Interactive visualizations and figure generation

Copy and paste nodes

You can right click on any transform or notebook in the workflow tree to copy it. Once you've copied a node, you can right click on any table to paste the copied transform or notebook.

Insert nodes

If you would like to place a copied transform or notebook between other nodes, you can click on a transform or notebook and select Insert transform.

If you have a transform copied to the clipboard you can insert it between other nodes by right clicking on a transform or notebook and selecting Paste copied transform above. This will insert both the transform and its output table into the branch of the workflow you've selected.

Split and combine transforms

All transforms can be split at the step level into two different transforms by clicking Split in any step's menu. Additionally, two adjacent transforms can be combined into one by right clicking on the table between them and selecting Remove.

You might want to split a transform above a tricky step to see what the output table would look like at that point in the process. This can be a key tool in troubleshooting any issues and understanding what might be going wrong.

After splitting a transform to check an output table, the next logical step might be to combine these two transforms back into one again. Or perhaps you have a string of transforms which you no longer need the output tables for and want to reduce the size of your workflow.

Delete nodes

To delete a node, right click on the node and select Delete. Tables cannot be deleted directly, but are rather deleted when their parent node is deleted.

When deleting a transform or notebook:

  • The transform or notebook and its output table will be deleted.

  • If the workflow tree has additional nodes downstream, the transform or notebook and its output table will be 'spliced' out, i.e. the upstream node nearest the deleted transform will be connected to the downstream node nearest to the deleted output table.

When deleting a data source:

  • The data source and all directly downstream nodes will be deleted. If additional branches are joined into the branch downstream of the deleted dataset, those branches will be retained up to but not including the transform located in the deleted branch.

Since you can't undo a deletion, you'll receive a warning message before proceeding.

Node states

As you build out a workflow, node colors and symbols will change on the tree to help you keep track of your work progress.

Detailed information about each of these states can be found in the documentation for each node, though some common states are outlined here.

Stale nodes

Stale nodes are indicated with a yellow background. If a node is stale, it means that its upstream content has changed since the node was last run, and likely that the node should be re-run to reflect these upstream changes.

Edited nodes

If a node has been edited since it was last run, it will be indicated with hashed vertical lines.

Tree-level actions

Run all

You can select the Map button on the left side of the workflow toolbar to begin a run of all stale nodes in the workflow. This will execute all transform and notebook nodes in a logical sequence to update the workflow completely.

Shift nodes

To shift a node, select the node and click the arrow that appears next to most nodes when selected. Shifting nodes is purely an organizational tool and it has no effect on the data produced in the workflow.

Navigate nodes

Along with clicking a node to select it, all nodes have a Source and Output navigation button on the right side of the workflow toolbar. You can click this button to jump directly to the immediate upstream or downstream node.

Saving

Workflows are continuously saved as you work, and any analyses will continue to run in the background if you close your browser window. You can always navigate back to this workflow later from the "Workflows" tab of your workspace.

A complete version history is available for all transforms and notebooks in your workflow, allowing you to access historic code and revert your work back to a previous point in time.

Workflow ownership and sharing

All workflows are owned by either a user or an organization, and can then be shared with other users and organizations. When a workflow is owned by or shared with an organization, all administrators of that organization will have corresponding access.

Workflows can also be associated with a study, which may be necessary if access to certain datasets in the workflow was granted to that study. In this case, you can specify a level of access to the workflow for other collaborators on the study.

Workflow collaborators will still need access to the underlying data in a workflow to view node contents. For more information, see Workflow access & sharing.

Forking workflows

You have a couple options for how this workflow can be reused in other analyses. Click the Fork button in the toolbar to get started.

  • Add to another workflow

    • Select this option to choose a workflow you'd like to add this workflow to, as a data source. This will be a linked copy that will update as the original workflow is updated. This can be a very powerful tool in building out complex workflows that all reference the same source analysis or cohort.

  • Clone this workflow

    • This will create a duplicate copy of the workflow, with a link back to the original workflow encoded in its provenance information.

Collaboration

Workflows are made to be a space for collaborative analysis. You can easily share access to your workflow with colleagues.

Comments

Any node in a workflow can be annotated with a comment by workflow collaborators. Comments are intended to be a space for conversation grounded in a specific area of a workflow. They can be replied to in threads by multiple collaborators and resolved when the conversation is complete.

Simultaneous editors

Multiple users with edit access can work on a workflow at the same time. When this is the case, you will see their picture in the top menu bar alongside your own, and a colored dot on the workflow tree to the right of the node they currently have selected. When a notebook is started you will see any collaborator's code edits in real time.

Workflow DOIs

Any workflow editor can issue a DOI (Digital Object Identifier) for a workflow. A DOI is a persistent identifier that can be used to reference this workflow in citations. DOIs are issued through DataCite and do not require any configuration with your own or your organization's DataCite account.

Open the Provenance section and click Issue DOI. Once created, you will be able to see the DOI and view the record on DataCite.

Draft status

When DOIs are issued they enter a "Draft" status where the identifier is assigned but it has not been permanently created. All DOIs issued for workflows will remain in this draft status for seven days to allow for removal of the DOI.

You can start referencing the DOI immediately while it is still in draft status, since the final DOI will not change once it becomes permanent. After the seven day draft period, the DOI will automatically become permanent if your workflow is set to be publicly visible.

Since DOIs are intended for public reference, they will not be issued for workflows that remain fully private.

Note that granting public access to a workflow does not grant access to any restricted data it contains. Any individual viewing the workflow will need to also gain data access to see workflow nodes that reference restricted data.

Reproducibility and change management

Every time a transform or notebook is run in your workflow, a snapshot of the code in that node is permanently saved. On any transform or notebook, you will see a "History" button that will bring up all of the previous executions of that node, with the ability to view its historic contents and revert to a previous version of the code. This historic code will also be associated with the corresponding log entry in your workspace.

While the tables within a workflow should be considered "live" in that their data can regularly change as upstream nodes are modified, the ability to permanently persist code (alongside the built-in version control for datasets) ensures that any historic output can be reproduced by simply re-running the historic code that produced it.

Date

Current date

Returns the current date. Return type: date.

New date

Constructs a date from a year, month, and day. Return type: date.

Date add

Add a period of time to a date. Return type: date.

Date diff

Calculate the distance between two dates. Return type: integer.

Date subtract

Subtract a period of time from a date. Return type: date.

Date truncate

Truncates a date to the nearest boundary. Return type: date.

Date extract

Extracts the date part (e.g., month) from a date. Return type: integer.

Format date

Returns a formatted string from a date. Return type: string. More details about format strings are available in the linked documentation.

Parse date

Parses a date from a string. Return type: date. More details about format strings are available in the linked documentation.

The signature and parameters for each function are listed below.

CURRENT_DATE([@time_zone])

  • @time_zone (enum; optional; defaults to UTC):

any of: ACDT, ACST, ACT, ACT, ACWST, ADT, AEDT, AEST, AFT, AKDT, AKST, AMST, AMT, AMT, ART, AST, AST, AWST, AZOST, AZOT, AZT, BDT, BIOT, BIT, BOT, BRST, BRT, BST, BST, BST, BTT, CAT, CCT, CDT, CDT, CEST, CET, CHADT, CHAST, CHOT, CHOST, CHST, CHUT, CIST, CIT, CKT, CLST, CLT, COST, COT, CST, CST, CST, CT, CVT, CWST, CXT, DAVT, DDUT, DFT, EASST, EAST, EAT, ECT, ECT, EDT, EEST, EET, EGST, EGT, EIT, EST, FET, FJT, FKST, FKT, FNT, GALT, GAMT, GET, GFT, GILT, GIT, GMT, GST, GST, GYT, HDT, HAEC, HST, HKT, HMT, HOVST, HOVT, ICT, IDLW, IDT, IOT, IRDT, IRKT, IRST, IST, IST, IST, JST, KALT, KGT, KOST, KRAT, KST, LHST, LHST, LINT, MAGT, MART, MAWT, MDT, MET, MEST, MHT, MIST, MIT, MMT, MSK, MST, MST, MUT, MVT, MYT, NCT, NDT, NFT, NPT, NST, NT, NUT, NZDT, NZST, OMST, ORAT, PDT, PET, PETT, PGT, PHOT, PHT, PKT, PMDT, PMST, PONT, PST, PST, PYST, PYT, RET, ROTT, SAKT, SAMT, SAST, SBT, SCT, SDT, SGT, SLST, SRET, SRT, SST, SST, SYOT, TAHT, THA, TFT, TJT, TKT, TLT, TMT, TRT, TOT, TVT, ULAST, ULAT, UTC, UYST, UYT, UZT, VET, VLAT, VOLT, VOST, VUT, WAKT, WAST, WAT, WEST, WET, WIT, WST, YAKT, YEKT

DATE(@year, @month, @day)

  • @year (variable or literal; required): any integer
  • @month (variable or literal; required): any integer
  • @day (variable or literal; required): any integer

DATE_ADD(@date_expression, INTERVAL @integer_expression @date_part)

  • @integer_expression (variable or literal; required): any integer
  • @date_part (enum; required; placeholder: e.g., days): any of year, month, quarter, week, day
  • @date_expression (variable or literal; required): any date

DATE_DIFF(@date_expression, @date_expression_2, @date_part)

  • @date_expression_2 (variable or literal; required): any date
  • @date_expression (variable or literal; required): any date
  • @date_part (enum; required; placeholder: e.g., days): any of day, week, month, quarter, year

DATE_SUB(@date_expression, INTERVAL @integer_expression @date_part)

  • @integer_expression (variable or literal; required): any integer
  • @date_part (enum; required; placeholder: e.g., days): any of day, week, month, quarter, year
  • @date_expression (variable or literal; required): any date

DATE_TRUNC(@date_expression, @date_truncate_part)

  • @date_expression (variable or literal; required): any date
  • @date_truncate_part (enum; required; placeholder: e.g., month): any of day, month, quarter, year, week(sunday), week(monday), week(tuesday), week(wednesday), week(thursday), week(friday), week(saturday), ISOquarter, ISOyear

EXTRACT(@date_part FROM @date_expression)

  • @date_part (enum; required; placeholder: e.g., month): any of DAYOFWEEK, DAY, DAYOFYEAR, WEEK, MONTH, QUARTER, YEAR, ISOWEEK, ISOYEAR
  • @date_expression (variable or literal; required): any date, dateTime, time

FORMAT_DATE(@format_string, @date_expression)

  • @date_expression (variable or literal; required): any date
  • @format_string (literal; required; placeholder: e.g., %Y-%m-%d): any string

[@safe]PARSE_DATE(@format_string, @date_string)

  • @date_string (variable or literal; required): any string
  • @format_string (literal; required; placeholder: e.g., %Y-%m-%d): any string
  • @safe (boolean; required): any boolean
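For orientation, here are a few illustrative expressions built from the functions above. The literal values are examples only, and the results shown assume the BigQuery-style semantics implied by these signatures:

DATE(2020, 4, 17)                                   -- 2020-04-17
DATE_ADD(DATE(2020, 1, 15), INTERVAL 1 month)       -- 2020-02-15
DATE_SUB(DATE(2020, 1, 15), INTERVAL 2 week)        -- 2020-01-01
DATE_DIFF(DATE(2020, 6, 1), DATE(2020, 4, 1), day)  -- 61
DATE_TRUNC(DATE(2020, 4, 17), month)                -- 2020-04-01
EXTRACT(MONTH FROM DATE(2020, 4, 17))               -- 4
FORMAT_DATE("%Y-%m-%d", DATE(2020, 4, 17))          -- "2020-04-17"
PARSE_DATE("%Y-%m-%d", "2020-04-17")                -- 2020-04-17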


Abstract

The abstract is limited to 256 characters and will show up in previews and search results for the dataset. This should be a concise, high-level summary of this dataset.

Provenance

View and update information about this workflow's creators and contributors, citation, and any related identifiers detected by Redivis automatically or added by a workflow editor. This is also where you can issue a DOI for this workflow.

Provenance

→ Creators

This section automatically defaults to displaying the owner of the workflow. Workflow editors can add or remove anyone from this list. Anyone included here will be added to the citation generated for this workflow.

Provenance

→ Contributors

This section automatically includes anyone who edited this workflow. Workflow editors can add or remove anyone from this list.

Provenance

→ Citation

This section shows the automatically generated citation for this workflow in your chosen format. This can be copied or downloaded for use elsewhere.

Changes made to the "Creators" field will be reflected in this citation. Any DOI issued for this workflow will automatically be included in this citation.

Provenance

→ Related identifiers

This section automatically includes any datasets or workflows referenced by this workflow, including data sources, study collaborations, or the workflow this one was forked from. Workflow editors can add related identifiers from outside of Redivis through links or DOIs, including DMPs, referenced papers, and more.

Provenance

→ Bibliography

You can launch a bibliography which displays the citation of this workflow and all of its related identifiers.

Methodology

Document the details of your research aim or data analysis strategy. You can also embed links or images.

Sharing

You can give other users access to view or edit your workflow, or transfer ownership to another user in the Sharing section. You can also set this workflow's visibility and discoverability. Anyone viewing your workflow will need to have gained data access to any restricted datasets to view the relevant node contents.

Study

You can add your workflow to a study in order to facilitate collaboration with others. For certain restricted datasets, your workflow will need to be part of an approved study in order to run queries.

Tags

You can add up to 25 tags to your workflow, which will help researchers discover and understand it.

Usage

You can see the date of the last workflow activity, and how often the workflow has been forked or viewed (if it is a public workflow).

Tags

In addition to documentation, you may add up to 25 tags to your dataset, which will help researchers discover and understand the dataset.

Other metadata

Additionally, information about the dataset's size and temporal range will be automatically computed from the metadata on its tables. Additional table documentation, as well as the variable metadata, will be indexed and surfaced as part of the dataset discovery process.


DateTime

Current DateTime

Returns the current dateTime. Return type: dateTime.

New DateTime

Constructs a dateTime from a year, month, day, hour, minute, and second. Return type: dateTime.

DateTime add

Add a period of time to a dateTime. Return type: dateTime.

DateTime diff

Calculate the distance between two dateTimes. Return type: integer.

DateTime subtract

Subtract a period of time from a dateTime. Return type: dateTime.

DateTime truncate

Truncates a dateTime to the nearest boundary. Return type: dateTime.

DateTime extract

Extracts the date or time part (e.g., hour) from a dateTime. Return type: integer.

Format dateTime

Returns a formatted string from a dateTime. Return type: string. More details about format strings are available in the linked documentation.

Parse dateTime

Parses a dateTime from a string. Return type: dateTime. More details about format strings are available in the linked documentation.

The signature and parameters for each function are listed further below.

Notebook concepts

Overview

Notebooks provide a highly flexible compute environment for working with data on Redivis. In a notebook, you can reference any table in your workflow, install dependencies, perform analyses in Python, R, Stata, or SAS, store and download files, and generate an output table for downstream analysis.

Transforms vs. notebooks?

There are two mechanisms for working with data in workflows: transforms and notebooks. Understanding when to use each tool is key to taking full advantage of the capabilities of Redivis, particularly when working with big datasets.

Transforms are better for:

  • Reshaping + combining tabular and geospatial data

  • Working with large tables, especially at the many GB to TB scale

  • Preference for a no-code interface, or preference for programming in SQL

  • Declarative, easily documented data operations

Notebooks are better for:

  • Interactive exploration of any data type, including unstructured data files

  • Working with smaller tables (though working with bigger data is possible)

  • Preference for Python, R, Stata, or SAS

  • Interactive visualizations and figure generation

Working with data

Loading data

From within your notebook, you can load any data available in your workflow. You can reference the primary source table of the notebook via the special _source_ identifier, or reference any other table in the workflow by its name. To ensure that your notebook doesn't break when tables get renamed, make sure to use the qualified reference for non-primary tables. For example:

Analyzing data

Redivis notebooks support the following kernels (programming languages). For more details and examples on how to use notebooks in each language, consult the language-specific documentation:

  • Python notebooks
  • R notebooks
  • Stata notebooks
  • SAS notebooks

Outputting tables

A notebook can generate an output table as the result of its execution. This output table is created programmatically, e.g.:

Storing files

As you perform your analysis, you may generate files that are stored on the notebook's hard disk. There are two locations that you should write files to: /out for persistent storage, and /scratch for temporary storage.

Any files written to persistent storage will be available when the notebook is stopped, and will be restored to the same state when the notebook is run again. Alternatively, any files written to temporary storage will only exist for the duration of the current notebook session.

To write files to these directories, use the standard tools of your programming language for writing files. E.g.,:

You can inspect and download these files at any time.

Notebook management

Creation

Create a notebook by clicking on a table node in a workflow and selecting + Notebook. This table will become the default source table for your new notebook, and the notebook will include pre-generated code that references the table's data.

Starting and stopping

Notebook nodes need to be started in order to edit or execute cells. Click the purple Start notebook button in the top right to start the notebook and provision compute resources. You can also elect to "Clear outputs and start", which will remove all outputs and reset any referenced tables in the notebook.

Run notebooks in the background

When starting a notebook, you can select the option to "Run in background". This will run the notebook similarly to a transform: all code is executed in series, and the notebook stops once all cells have run or an error occurs. This can be helpful for quickly re-running notebooks after upstream changes have been made.

Server-side execution (alpha)

When starting a notebook, you will see an option to enable server-side execution. This allows the Jupyter notebook to keep receiving outputs even if your browser is closed or disconnected, which can be particularly helpful for long-running operations. This is a new feature within Jupyter notebooks, and still has a few rough edges (e.g., progress bars often don't display), so it is currently an opt-in feature. In the future, this will become the default behavior for all notebooks.

Compute configuration

By default, notebooks are provisioned with 32GB memory and 2 CPU cores, with compute power comparable to most personal computers. You can view and alter the notebook's compute resources in the More menu.

Persistence

All notebooks are automatically saved as you go. Every time a notebook is stopped, all cell inputs are saved to the notebook version history, giving you a historical record of all code that was run. Additionally, all cell outputs from the last notebook session will be preserved, as will any files written to the /out directory.

Clearing outputs

When starting a notebook, you'll be presented with the option to "Clear all outputs and start". This can be helpful because it resets all access rules associated with the notebook, since no data from previous sessions remains associated with it.

Choosing this option will clear all output cells in your notebook, any files saved in the /out directory, and any output tables from the notebook.

Logs

You can click the three-dot More menu to open the logs for this notebook. Opening the logs when a notebook is stopped will show the logs from the notebook's previous run.

Lifecycle

The default notebooks have a maximum lifetime of 6 hours, and after 30 minutes of inactivity a running notebook will automatically be stopped. If you are using a notebook with paid custom compute, these values can be modified.

Activity is determined based on the Jupyter kernel; if you have a long-running computation, the notebook will be considered active for the entire time.

Collaboration

All Redivis notebooks support real-time collaboration, allowing multiple editors to edit and run cells in a running notebook. When another editor is active in a notebook, you will see a colored cursor associated with them. Workflow viewers will see a read-only version of the notebook.

Changing the source table

To change a notebook's primary source table, either right-click on the notebook or click the three-dot (ⵗ) icon and select the "change source table" option.

Limitations

Notebooks are subject to certain concurrency and duration limits.

Dependencies

All notebooks come with a number of common packages pre-installed. You can install additional packages by clicking the Edit dependencies button in the notebook start modal or toolbar.

For more detailed information about the default dependencies and adding new packages, consult the documentation for your notebook type:

  • Python base image & dependencies
  • R base image & dependencies
  • Stata base image & dependencies
  • SAS base image & dependencies

For notebooks that reference restricted data, internet access will be disabled while the notebook is running. This means that the dependencies interface is the only place from which you can install dependencies – e.g., running pip install for Python or devtools::install() for R within your notebook will fail.

Moreover, it is strongly recommended to always install your dependencies through the dependencies interface (regardless of whether your notebook has internet access), as this provides better reproducibility and documentation for future use.

Secrets

Secrets are simple key/value pairs that are securely stored within an organization or under your account. These secrets can then be loaded in a notebook – a common use case is for storing external API tokens that then enable you to interface with these APIs from within your notebook.

Secrets are accessed via the Python or R client libraries:

Files

Notebooks offer special capabilities for files written to specific directories on the notebook's hard disk. Any files you've stored in a notebook's /out and /scratch directories will be available in the files modal. This modal allows you to preview and download specific file outputs from your notebook.

Moreover, files written to the /out directory are always available, and will persist across notebook sessions. This allows for workflows where you can cache certain results between notebook sessions, avoiding the need to rerun time-intensive computations.

The files in the /scratch directory are only available when the notebook is running, and will be cleared once it is stopped. The default "working directory" of all notebooks is /scratch – this is where files will be written if you do not specify another location.

You can view the files in either directory by pressing the Files button at the top right of the notebook.

You can list the files in either directory by pressing the corresponding tab, and click on any file to view it. Redivis supports interactive previews for many file types in the file inspector, and you can also download the file for further inspection and analysis. To download all files in a directory, click the Download all button in the files modal.

Version history

Every time you stop your notebook, all cell inputs (your code and markdown) will be saved and associated with that notebook session. You can view the code from all previous sessions by pressing the History button at the top right of your notebook, allowing you to view and share your code as it was at any previous point in time.

Access rules

Determining notebook access

Your access to a notebook is determined by your corresponding access to all tables (and their antecedent datasets) referenced by the notebook. These linkages persist across notebook sessions, as a future session could reference data from a previous session. In order to reset the tables referenced by your notebook, which will also clear all outputs in the notebook, you can choose to Clear outputs and start when starting the notebook.

Access levels

In order to view a notebook, you must first have view access to the corresponding workflow, and in order to run and edit the notebook, you must also have edit access to that workflow.

Additionally, your access to a notebook is governed by your access to its source tables. In order to run a notebook and see its outputs, you must have data access to all source tables. If you have metadata access, you will be able to see cell inputs in a notebook (that is, the code), but not outputs. If you only have overview (or no) access to the source tables, you will not be able to see notebook contents.

External internet access

If a notebook contains data with export restrictions, access to the external internet will be disabled while the notebook is running.

When the internet is disabled in a notebook you can still specify packages and other startup scripts in the Dependencies modal that will be installed on notebook start. Additionally, if any of your packages require internet access to run, you'll need to attempt to "preload" any content using a post-install script. For example, if you're using the tidycensus package in R, you could preload content as follows:

Downloading files

Typically, you will be able to download any files written to the notebook's /out or /scratch directories. However, if a notebook references data with export restrictions, you will not be able to download these files unless the file size is smaller than the relevant size-based export restrictions specified on the source datasets.

Exporting notebooks

Notebooks can be downloaded as PDF, HTML, and .ipynb files by clicking the three-dot More button at the top right of the notebook.

You will be given the option of whether to include cell outputs in your export — it is important that you ensure the outputs displayed in your notebook do not contain sensitive data, and that your subsequent distribution is in compliance with any data use agreements.

DateTime function signatures and parameters

CURRENT_DATETIME([@time_zone])

  • @time_zone (enum; optional; defaults to UTC):

any of: ACDT, ACST, ACT, ACT, ACWST, ADT, AEDT, AEST, AFT, AKDT, AKST, AMST, AMT, AMT, ART, AST, AST, AWST, AZOST, AZOT, AZT, BDT, BIOT, BIT, BOT, BRST, BRT, BST, BST, BST, BTT, CAT, CCT, CDT, CDT, CEST, CET, CHADT, CHAST, CHOT, CHOST, CHST, CHUT, CIST, CIT, CKT, CLST, CLT, COST, COT, CST, CST, CST, CT, CVT, CWST, CXT, DAVT, DDUT, DFT, EASST, EAST, EAT, ECT, ECT, EDT, EEST, EET, EGST, EGT, EIT, EST, FET, FJT, FKST, FKT, FNT, GALT, GAMT, GET, GFT, GILT, GIT, GMT, GST, GST, GYT, HDT, HAEC, HST, HKT, HMT, HOVST, HOVT, ICT, IDLW, IDT, IOT, IRDT, IRKT, IRST, IST, IST, IST, JST, KALT, KGT, KOST, KRAT, KST, LHST, LHST, LINT, MAGT, MART, MAWT, MDT, MET, MEST, MHT, MIST, MIT, MMT, MSK, MST, MST, MUT, MVT, MYT, NCT, NDT, NFT, NPT, NST, NT, NUT, NZDT, NZST, OMST, ORAT, PDT, PET, PETT, PGT, PHOT, PHT, PKT, PMDT, PMST, PONT, PST, PST, PYST, PYT, RET, ROTT, SAKT, SAMT, SAST, SBT, SCT, SDT, SGT, SLST, SRET, SRT, SST, SST, SYOT, TAHT, THA, TFT, TJT, TKT, TLT, TMT, TRT, TOT, TVT, ULAST, ULAT, UTC, UYST, UYT, UZT, VET, VLAT, VOLT, VOST, VUT, WAKT, WAST, WAT, WEST, WET, WIT, WST, YAKT, YEKT

DATETIME(@year, @month, @day, @hour, @minute, @second)

  • @year (variable or literal; required): any integer
  • @month (variable or literal; required): any integer
  • @day (variable or literal; required): any integer
  • @hour (variable or literal; required): any integer
  • @minute (variable or literal; required): any integer
  • @second (variable or literal; required): any integer

DATETIME_ADD(@dateTime_expression, INTERVAL @integer_expression @dateTime_part)

  • @integer_expression (variable or literal; required): any integer
  • @dateTime_part (enum; required; placeholder: e.g., days): any of year, quarter, month, week, day, hour, minute, second, millisecond, microsecond
  • @dateTime_expression (variable or literal; required): any dateTime

DATETIME_DIFF(@dateTime_expression, @dateTime_expression_2, @dateTime_part)

  • @dateTime_expression_2 (variable or literal; required): any dateTime
  • @dateTime_expression (variable or literal; required): any dateTime
  • @dateTime_part (enum; required; placeholder: e.g., days): any of year, quarter, month, week, day, hour, minute, second, millisecond, microsecond

DATETIME_SUB(@dateTime_expression, INTERVAL @integer_expression @dateTime_part)

  • @integer_expression (variable or literal; required): any integer
  • @dateTime_part (enum; required; placeholder: e.g., days): any of year, quarter, month, week, day, hour, minute, second, millisecond, microsecond
  • @dateTime_expression (variable or literal; required): any dateTime

DATETIME_TRUNC(@dateTime_expression, @dateTime_truncate_part)

  • @dateTime_expression (variable or literal; required): any dateTime
  • @dateTime_truncate_part (enum; required; placeholder: e.g., minutes): any of hour, minute, second, millisecond, microsecond, day, month, quarter, year, week(sunday), week(monday), week(tuesday), week(wednesday), week(thursday), week(friday), week(saturday), ISOquarter, ISOyear

EXTRACT(@dateTime_part FROM @date_expression)

  • @dateTime_part (enum; required; placeholder: e.g., hour): any of YEAR, QUARTER, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND, DAYOFYEAR, DAYOFWEEK, ISOYEAR, ISOWEEK
  • @date_expression (variable or literal; required): any dateTime

FORMAT_DATETIME(@format_string, @dateTime_expression)

  • @dateTime_expression (variable or literal; required): any dateTime
  • @format_string (literal; required; placeholder: e.g., %Y-%m-%d %H:%M:%S): any string

[@safe]PARSE_DATETIME(@format_string, @dateTime_string)

  • @dateTime_string (variable or literal; required): any string
  • @format_string (literal; required; placeholder: e.g., %Y-%m-%d %H:%M:%S): any string
  • @safe (boolean; required): any boolean
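As with the date functions, here are a few illustrative dateTime expressions (example values only; results assume the BigQuery-style semantics implied by these signatures):

DATETIME(2020, 4, 17, 14, 35, 50)                                                       -- 2020-04-17 14:35:50
DATETIME_ADD(DATETIME(2020, 4, 17, 14, 35, 50), INTERVAL 30 minute)                     -- 2020-04-17 15:05:50
DATETIME_DIFF(DATETIME(2020, 4, 1, 12, 0, 0), DATETIME(2020, 4, 1, 9, 30, 0), minute)   -- 150
DATETIME_TRUNC(DATETIME(2020, 4, 17, 14, 35, 50), hour)                                 -- 2020-04-17 14:00:00
FORMAT_DATETIME("%Y-%m-%d %H:%M:%S", DATETIME(2020, 4, 17, 14, 35, 50))                 -- "2020-04-17 14:35:50"
PARSE_DATETIME("%Y-%m-%d %H:%M:%S", "2020-04-17 14:35:50")                              -- 2020-04-17 14:35:50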

# Reference the source table with the special "_source_" identifier:
df = redivis.table("_source_").to_pandas_dataframe()

# Reference any other table via its name:
# The last 4 characters are the reference id. This is optional, 
#     but recommended to ensure the notebook works as tables get renamed.
df2 = redivis.table("daily_observations:vdwn").to_pandas_dataframe()

# If our table is a file index table, we can load and process those files
for f in redivis.table("_source_").list_files():
    data = f.read()
# Reference the source table with the special "_source_" identifier:
df <- redivis$table("_source_")$to_tibble()

# Reference any other table via its name:
# The last 4 characters are the reference id. This is optional, 
#     but recommended to ensure the notebook works as tables get renamed.
df2 <- redivis$table("daily_observations:vdwn")$to_tibble()

# If our table is a file index table, we can load and process those files
for (f in redivis$table("_source_")$list_files()){
    data <- f$read()
}
# We first load the table via python, and then pass the dataframe into stata
df = redivis.table("_source_").to_pandas_dataframe(dtype_backend="numpy")
%%stata -d df -force
/* Run stata code! All stata cells must be prefixed with %%stata */
describe
import saspy
import redivis

sas_session = saspy.SASsession()

# We first load the table via python, and then pass it into SAS
df = redivis.table("_source_").to_pandas_dataframe(dtype_backend="numpy")

# Load the table into SAS, giving it the name "df"
sas_data = sas_session.df2sd(df, table="df")
%%SAS sas_session
/* 
    Run SAS code! All SAS cells must be prefixed with %%SAS,
    and reference the sas_session variable
*/
proc print data=df(obs=5);
run;
# Multiple types for "df" are supported
# Consult the language-specific docs for more info
df = get_dataframe_somehow()

redivis.current_notebook().create_output_table(df)
# Multiple types for "df" are supported
# Consult the language-specific docs for more info
df <- get_dataframe_somehow()

redivis$current_notebook()$create_output_table(df)
%%stata -doutd df2
/*
  Once this cell executes, the current dataset will be pushed 
  to the python variable df2
*/
rename v* newv*
# Via python, pass this dataframe to the output table
redivis.current_notebook().create_output_table(df2)
# Reference the table named "some_table" in SAS
sas_table = sas_session.sasdata("some_table")

# Convert the sas_table to a pandas dataframe
df = sas_table.to_df()

redivis.current_notebook().create_output_table(df)
df = get_dataframe_somehow()

# Write temporary files to /scratch, and files you want to persist to /out:
df.to_csv("/out/data.csv")
df <- get_dataframe_somehow()

write.csv(df, "/out/data.csv", na="")
%%stata
save "/out/my_dataset.dta"
%%SAS sas_session
proc export data=datasetname
  outfile='/out/filename.csv'
  dbms=csv
  replace;
run;
# secrets are case-sensitive
secret = redivis.user("my_username").secret("EXTERNAL_API_TOKEN")
# Or for organization secrets:
# secret = redivis.organization("organization_name").secret("...")

# Avoid printing secrets, but rather pass them to other methods as needed
make_http_request("https://example.com/api", auth_token=secret.get_value())
# secrets are case-sensitive
secret <- redivis$user("my_username")$secret("EXTERNAL_API_TOKEN")
# Or for organization secrets:
# secret <- redivis$organization("organization_name")$secret("...")

# Avoid printing secrets, but rather pass them to other methods as needed
make_http_request("https://example.com/api", auth_token=secret$get_value())
R -e '
  library(tidycensus)
  library(tidyverse)

  census_api_key("YOUR API KEY GOES HERE")
  get_decennial(geography = "state", 
                 variables = "P13_001N", 
                 year = 2020,
                 sumfile = "dhc")
'

Aggregate

Any value

Returns any value from the input or NULL if there are zero input rows –> learn more

ANY_VALUE(@variable)

Return type: dynamic (input-dependent)

  • @variable (variable; required): any Redivis type

Average

Returns the average of all non-null values –> learn more

AVG(@variable)

Return type: float

  • @variable (variable; required): any integer, float

Count

Returns the count of all non-null values –> learn more

COUNT([@distinct ]@variable)

Return type: integer

  • @variable (variable; optional; placeholder: *): any Redivis type
  • @distinct (boolean; required): any boolean

Logical and

Returns the logical AND of all non-NULL expressions –> learn more

LOGICAL_AND(@variable)

Return type: boolean

  • @variable (variable; required): any boolean

Logical or

Returns the logical OR of all non-NULL expressions –> learn more

LOGICAL_OR(@variable)

Return type: boolean

  • @variable (variable; required): any boolean

Max

Returns the maximum value of all non-null inputs –> learn more

MAX(@variable)

Return type: dynamic (input-dependent)

  • @variable (variable; required): any Redivis type

Min

Returns the minimum value of all non-null inputs –> learn more

MIN(@variable)

Return type: dynamic (input-dependent)

  • @variable (variable; required): any Redivis type

String aggregate

Returns a string obtained by concatenating all non-null values –> learn more

STRING_AGG([@distinct ]@variable[, @delimiter][ LIMIT @limit])

Return type: string

  • @variable (variable; required): any string
  • @delimiter (literal; optional; placeholder: ","): any Redivis type
  • @distinct (boolean; required): any boolean
  • @limit (literal; optional): any Redivis type

Sum

Returns the sum of all values, ignoring nulls –> learn more

SUM(@variable)

Return type: dynamic (input-dependent)

  • @variable (variable; required): any integer, float
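To make the reference concrete, here is a sketch of how several of these aggregations might appear together in a SQL-style query over a hypothetical scores table grouped by student (the table and variable names are placeholders, not part of the reference above):

SELECT
  student,
  COUNT(DISTINCT test)            AS tests_taken,
  AVG(score)                      AS average_score,
  SUM(score)                      AS total_score,
  STRING_AGG(test, ", " LIMIT 3)  AS test_list
FROM scores
GROUP BY student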

Step: Aggregate

Overview

The Aggregate step collapses rows that are identical across a set of variables, optionally creating new aggregate variables in the process.

Example starting data:

/*---------+------------+---------+--------*
 | test    | date       | student | score  |
 +---------+------------+---------+--------+
 | quiz    | 2020-04-01 | jane    | 83     |
 | quiz    | 2020-04-01 | pat     | 35     |
 | midterm | 2020-05-01 | jane    | 74     |
 | midterm | 2020-05-01 | pat     | 62     |
 *---------+------------+---------+--------*/

Example output data

Collapsing on variables test and date, and creating new variable average_score to aggregate data from the score variable.

/*---------+-------------+---------------*
 | test    | date        | average_score |
 +---------+-------------+---------------|
 | quiz    | 2020-04-01  | 59            |
 | midterm | 2020-05-01  | 68            |
 *---------+-------------+---------------*/

Step structure

  • There is one collapse block where we will define how the data will be reshaped. On execution:

    • Your data will be cut to only include the variables chosen in this block.

    • Duplicate records across the collapsed variables will be dropped.

  • There can be one or more aggregation blocks where we can capture aggregate information in a newly created variable.

    • You can successfully collapse your table without creating any new variables in aggregation blocks.

    • Each aggregation block in the step represents one new variable in your output data.

    • Aggregation blocks are how you can capture information about records dropped in your collapse block.

Field definitions

Collapse block:

Field
Description

Variables to collapse on

All variables you want to include in your output.

Aggregation block(s):

Field
Description

Name

The name of the new variable being created.

Aggregation method

How the new variable will be summarized (e.g. SUM or COUNT).

[Method fields]

After you select your aggregation method, you will be prompted to input the information your chosen method needs to execute. You can see more specifics in the Variable creation methods section.

Examples

Example 1: Basic collapse

We have test score data recorded per test, student, and date. However we want to know the average score on each test overall.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Variables to collapse on: Of the variables in our table, the only ones we want in our final output are test and date, so we select these here. We leave out student because that information doesn't matter to us anymore, and we leave out score because we are creating a new variable to replace it with aggregated information.

  • Name: We give our new variable a descriptive name average_score.

  • Aggregation method: We want to average all values in the score variable per test, so we choose Average.

  • Variable to aggregate: Here is where we choose score as the variable containing the data we want to Average. If we had chosen a different Aggregation method, we might have different input fields here to answer.

Execution:

All variables that aren't collapsed on or used in an aggregation are removed.

/*---------+-------+------------*
 | test    | score | date       |
 +---------+-------+------------+
 | quiz    | 83    | 2020-04-01 |
 | quiz    | 35    | 2020-04-01 |
 | quiz    | 89    | 2020-04-01 |
 | midterm | 74    | 2020-05-01 |
 | midterm | 62    | 2020-05-01 |
 | midterm | 93    | 2020-05-01 |
 | final   | 77    | 2020-06-01 |
 | final   | 59    | 2020-06-01 |
 | final   | 92    | 2020-06-01 |
 *---------+-------+------------*/

The Average of score is computed across the rows where test and date (our collapsed-on variables) have exactly the same values.

/*---------+-------+------------+---------------*
 | test    | score | date       | average_score |
 +---------+-------+------------+---------------+
 | quiz    | 83    | 2020-04-01 | 69            |
 | quiz    | 35    | 2020-04-01 | 69            |
 | quiz    | 89    | 2020-04-01 | 69            |
 | midterm | 74    | 2020-05-01 | 76.3333       |
 | midterm | 62    | 2020-05-01 | 76.3333       |
 | midterm | 93    | 2020-05-01 | 76.3333       |
 | final   | 77    | 2020-06-01 | 76            |
 | final   | 59    | 2020-06-01 | 76            |
 | final   | 92    | 2020-06-01 | 76            |
 *---------+-------+------------+---------------*/

The score variable is removed, since it was not collapsed on.

/*---------+------------+---------------*
 | test    | date       | average_score |
 +---------+------------+---------------+
 | quiz    | 2020-04-01 | 69            |
 | quiz    | 2020-04-01 | 69            |
 | quiz    | 2020-04-01 | 69            |
 | midterm | 2020-05-01 | 76.3333       |
 | midterm | 2020-05-01 | 76.3333       |
 | midterm | 2020-05-01 | 76.3333       |
 | final   | 2020-06-01 | 76            |
 | final   | 2020-06-01 | 76            |
 | final   | 2020-06-01 | 76            |
 *---------+------------+---------------*/

Then all exact duplicate records are dropped, to create the output.

Output data:

/*---------+-------------+---------------*
 | test    | date        | average_score |
 +---------+-------------+---------------|
 | quiz    | 2020-04-01  | 69            |
 | midterm | 2020-05-01  | 76.3333       |
 | final   | 2020-06-01  | 76            |
 *---------+-------------+---------------*/
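For readers who prefer SQL, the collapse in Example 1 is roughly equivalent to the following query (the table name scores is a placeholder for the source table):

SELECT
  test,
  date,
  AVG(score) AS average_score
FROM scores
GROUP BY test, date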

Example 2: Multiple aggregation variables

We can build on our previous example, but say we want to know additionally how many students took each test.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

2nd aggregation block

  • Name: We want to give the new variable a descriptive name unique from our other block (and other variables in our table), in this case test_count.

  • Aggregation method: We want to count all values in the test variable per date, so we choose Count.

  • Variable to count: We choose test as the variable containing the data we want to Count. Since none of our variables have null entries, we could choose any variable here and get the same result. If we did have nulls, they would not be included in the Count. Conceptually we do not want to count only distinct values, so we leave the Distinct option off.

Output data:

/*---------+-------------+---------------+-------------*
 | test    | date        | average_score | test_count  |
 +---------+-------------+---------------+-------------+
 | quiz    | 2020-04-01  | 69            | 3           |
 | midterm | 2020-05-01  | 76.3333       | 3           |
 | final   | 2020-06-01  | 76            | 3           |
 *---------+-------------+---------------+-------------*/

Example 3: Drop duplicates

This step can be used to drop duplicated records even in cases where no aggregation happens.

Let's say we have data that we know had duplicate records that we don't need.

Starting data:

/*---------+-----------+------------*
 | test    | questions | date       |
 +---------+-----------+------------+
 | quiz    | 10        | 2020-04-01 |
 | quiz    | 35        | 2020-04-01 |
 | quiz    | 10        | 2020-04-01 |
 | midterm | 20        | 2020-05-01 |
 | midterm | 20        | 2020-05-01 |
 | midterm | 20        | 2020-05-01 |
 | final   | 45        | 2020-06-01 |
 | final   | 45        | 2020-06-01 |
 | final   | 45        | 2020-06-01 |
 *---------+-----------+------------*/

Input fields:

Variables to collapse on: To drop all records that are an exact duplicate across all variables, we just need the collapse block with no aggregation blocks. We need to select all variables here and can do so by typing all of them out, or inputting *.

Output data:

/*---------+-----------+------------*
 | test    | questions | date       |
 +---------+-----------+------------+
 | quiz    | 10        | 2020-04-01 |
 | midterm | 20        | 2020-05-01 |
 | final   | 45        | 2020-06-01 |
 *---------+-----------+------------*/
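Expressed as SQL, this de-duplication is roughly a SELECT DISTINCT over every variable (again, the table name is a placeholder):

SELECT DISTINCT
  test,
  questions,
  date
FROM tests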

Step: Join

Overview

A Join step will combine data from two tables based on a join condition so they can be queried together.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      *---------+-------*/

Example output data:

Inner join where student = student

/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 *---------+----------*-------*/

Step structure

  • There will be at least one join block where you will define a table to join and a join condition for that table.

  • When multiple blocks exist, the tables will be joined in sequence.

Field descriptions

Field
Description

Join table

The table containing the data you wish to combine with this transform's data. Once selected, this table's variables will be marked t1 (or t2, etc., if other joins exist).

Join type

The way that the tables will be combined. (More information below.)

Source variable

The variable in your source table that you want to join on. All records in this variable will be matched against all records in the Joined table variable.

Joined table variable

The variable in your selected Join table that you want to join on. All records in this variable will be matched against all records in the Source variable.

Join types

Inner join

If a row in either the source or join table doesn’t have a match, it will be dropped. If a row is matched multiple times, it will be multiplied in the output.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      *---------+-------*/

INNER JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 *---------+----------*-------*/
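In SQL terms, the inner join above corresponds roughly to the query below. The joined table's variables are marked t1 in the transform interface; t0 here stands in for the source table, and the table names are placeholders:

SELECT
  t0.student,
  t0.absences,
  t1.score
FROM table_a AS t0
INNER JOIN table_b AS t1
  ON t0.student = t1.student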

Left join

If a row in the source doesn’t have a match, all joined variables will be null for that row. If a row in the join table doesn’t have a match, it will be dropped. If a row is matched multiple times, it will be multiplied in the output.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
 | zay     | 2        |       | toni    | 30    |
*----------+----------*/      *---------+-------*/

LEFT JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 | zay     | 2        | NULL  |
 *---------+----------*-------*/

Right join

If a row in the source doesn’t have a match, it will be dropped. If a row in the join table doesn’t have a match, all source variables will be null for that row. If a row is matched multiple times, it will be multiplied in the output. This is the same as a left join, but reversed.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
 | zay     | 2        |       | toni    | 30    |
*----------+----------*/      *---------+-------*/

RIGHT JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 | toni    | NULL     | 30    |
 *---------+----------*-------*/

Full join

If a row in either the source or join table doesn’t have a match, the variables from the other table will be null for that row, including the column being joined upon (which is why both student columns appear separately in the output below). If a row is matched multiple times, it will be multiplied in the output.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
 | zay     | 2        |       | toni    | 30    |
*----------+----------*/      *---------+-------*/

FULL JOIN on student

Output:
/*-----------+-----------+----------+--------*
 | student_A | student_B | absences | score  |
 +-----------+-----------+----------+--------+
 | jane      | jane      | 0        | 85     |
 | sam       | sam       | 6        | 64     |
 | pat       | pat       | 1        | 88     |
 | zay       | NULL      | 2        | NULL   |
 | NULL      | toni      | NULL     | 30     |
 *-----------+-----------+----------+--------*/

Cross join

Every row in the source table will be combined with every row in the joined table. This might be used to perform a join on a new variable that will be created downstream (such as in a geospatial join). You will almost always need to use a filter after this join for the query to successfully execute.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
*----------+----------*/      *---------+-------*/

CROSS JOIN

Output:
/*---------+----------+---------+---------*
 | student | absences | student | score   |
 +---------+----------+---------+---------+
 | jane    | 0        | jane    | 85      |
 | jane    | 0        | sam     | 64      |
 | sam     | 6        | jane    | 85      |
 | sam     | 6        | sam     | 64      |
 *---------+----------*---------+---------*/

It is strongly recommended to use a Filter step with a cross join to avoid a massively expanded table.

Geospatial joins

When querying geospatial data, you'll often want to match records where one polygon is contained in another, or otherwise overlaps. To perform a geospatial join, you'll typically perform the following steps:

  1. First, create an inner join with the table you'd like to join on.

  2. Next, create variable(s) using a geography method that will represent your join condition. For example, you might use the contains method to join all geometries in one table that exist within another.

  3. Finally, implement a row filter that tests against the newly created variable. In our example above, if we created an is_contained_by variable, we would filter on the condition is_contained_by = TRUE

There are a few performance pitfalls when performing geospatial joins. Most notably, combining other equality comparisons with the geospatial condition in a filter can prevent the query planner from leveraging geospatial indexes, leading to a severe performance penalty.

For example, executing a filter of the form t0.state = t1.state AND is_contained_by=TRUE would actually be significantly less performant than if the state equality comparison is removed, even though it seems that this would reduce the number of times the geospatial condition needs to be evaluated.

For more discussion on geospatial performance in BigQuery, Redivis's underlying querying engine, see here and here.
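For orientation, the overall pattern corresponds roughly to the following SQL sketch, where the table names counties and sites and their variables are hypothetical:

-- A minimal sketch of a geospatial join, assuming hypothetical tables
-- counties(county_id, geom) and sites(site_id, point)
SELECT
  counties.county_id,
  sites.site_id
FROM counties
INNER JOIN sites
  -- keeping the join predicate purely geospatial lets the engine use its spatial index
  ON ST_CONTAINS(counties.geom, sites.point)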

Join structures

Depending on the structure of your data, joins might multiply in rows.

1-to-1 joins

If the variable you are joining on has no duplicate values in either the source table or the joined table, then your output table will never have more rows than the row counts of the two tables added together.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      *---------+-------*/

INNER JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 *---------+----------*-------*/

1-many joins

If the variable you are joining on has duplicate values in either table, any row with a duplicated value will be matched once for each duplicate.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      | pat     | 86    |
                              *---------+-------*/

INNER JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 | pat     | 1        | 86    |
 *---------+----------*-------*/

Many-to-many joins

If the variable you are joining on has duplicate values in both tables, each row will be matched once against every duplicate in the other table, so the matches multiply.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
 | pat     | 3        |       | pat     | 68    |
*----------+----------*/      *---------+-------*/

INNER JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 | pat     | 1        | 68    |
 | pat     | 3        | 88    |
 | pat     | 3        | 68    |
 *---------+----------*-------*/

Examples

Example 1: Simple join condition

Let's say our information about student absences and test scores is stored in separate tables and we want to join them together.

Starting data:

Source table (t0)             Table 1 (t1)    
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      *---------+-------*/

Input fields:

  • Join table: The data we want to join is in Table 1 so we select it here.

  • Join type: Since we only care about students that have both absence and score information we choose inner join which will drop any rows without a match in both tables.

  • Source variable / Joined table variable: We want to match on the variable student which is present in both tables. So we select our source table (t0)'s student on the left, and our joined table (t1)'s student variable on the right.

    • Since this variable has the same name in both tables, they will be combined into one variable in the output table.

Output data:

/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 *---------+----------*-------*/

Example 2: More specific join condition

Let's say instead of a table with aggregated absences we have a daily timeliness chart. We want to join the scores onto the corresponding attendance information, matching on both the student and the date in question.

Starting data:

Source table (t0)                        Table 1 (t1)    
/*---------+------------+---------*      /*---------+------------+-------*
 | student | date       | on_time |       | student | date       | score |
 +---------+------------+---------+       +---------+------------+-------+
 | jane    | 2020-01-01 | TRUE    |       | jane    | 2020-01-01 | 85    |
 | sam     | 2020-01-01 | FALSE   |       | sam     | 2020-01-01 | 65    |
 | jane    | 2020-02-01 | FALSE   |       *---------+------------+-------*/
*----------+------------+--------*/     

Input fields:

Join type: Since we want to keep our absence data whether a test was taken or not, we do a Left join.

Join condition: We join rows where BOTH the student and date values match, by selecting both variables in our join condition.

Output data:

/*---------+------------+---------+-------*
 | student | date       | on_time | score |
 +---------+------------+---------+-------+
 | jane    | 2020-01-01 | TRUE    | 85    |
 | sam     | 2020-01-01 | FALSE   | 65    |
 | jane    | 2020-02-01 | FALSE   | NULL  |
*----------+------------+---------+-------*/     

You can use a SQL query step to build more complex join conditions
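For reference, Example 2's join corresponds roughly to the following SQL, where the table names attendance and tests are hypothetical stand-ins for t0 and t1:

SELECT
  t0.student,
  t0.date,
  t0.on_time,
  t1.score
FROM attendance AS t0        -- source table
LEFT JOIN tests AS t1        -- joined table
  ON t0.student = t1.student
  AND t0.date = t1.date      -- both variables must match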

Example 3: Complex join conditions with cross joins

In some cases you might need to do a more complex join, such as using an operator other than =, or joining on a condition that uses a variable from your joined table that needs to be retyped or created first. This is common in geospatial joins but might come up in any situation.

To use a more complex join condition you can use a cross join, followed by a filter. The cross join will combine every row of one table with every row of the other, and the filter will then limit the result to rows matching your condition. Effectively this executes the same as if you had done your initial join with a more complex condition.

Let's say in the example below we want to do an inner join where T0 student_id = T1 student_id but in the source table student_id is an integer type variable and in Table 1 it is a string. We want to retype the variable in Table 1 before we can join on it, but we can't retype it until we have joined in Table 1. So we will do a cross join, a retype, then a filter to narrow down to only records meeting our condition.

Starting data:

Source table (t0)                Table 1 (t1)    
/*------------+----------*      /*------------+-------*
 | student_id | absences |       | student_id | score |
 +------------+----------+       +------------+-------+
 | 01         | 0        |       | 01         | 85    |
 | 02         | 6        |       | 02         | 64    |
 | 03         | 1        |       | 03         | 88    |
*-------------+----------*/      *------------+-------*/

Input fields:

Output data:

/*------------+----------+-------*
 | student_id | absences | score |
 +------------+----------+-------+
 | 01         | 0        | 85    |
 | 02         | 6        | 64    |
 | 03         | 1        | 88    |
 *------------+----------*-------*/
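As a rough SQL sketch of this cross join / retype / filter sequence (the table names are hypothetical):

SELECT
  t0.student_id,
  t0.absences,
  t1.score
FROM source_table AS t0
CROSS JOIN table_1 AS t1                            -- every combination of rows
WHERE t0.student_id = CAST(t1.student_id AS INT64)  -- retype t1.student_id, then keep only matching rows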

Step: Retype

Overview

The Retype step converts a variable of a given type to another type.

Example starting data

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 | sam     | 74     |
 | pat     | 62     |
 *---------+--------*/

Example output data:

Retype score from integer to float.

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83.0   |
 | neal    | 35.0   |
 | sam     | 74.0   |
 | pat     | 62.0   |
 *---------+--------*/

Step structure

  • There will be at least one retype block where you will define a variable and a new type.

Field definitions

Field
Definition

Source variable

The variable that you want to retype. Note that you can see its current type by locating it in the variable selector at the bottom of the page, or hovering over the variable in this selection menu to see a tooltip with more information.

New type

The type that the Source variable will be converted to. Can be any of the supported variable types on Redivis. More on conversion rules in the reference below.

If invalid for type, set to null

Whether failing type conversions should be converted to a null value. By default, failed type conversions will throw an error. Note that this might significantly change the content of your data and we suggest using this option with full understanding of how it might affect your outcome and verifying the output results.

Specify custom format

Informs how the data will be read by the method. Only relevant for conversions of strings to date, time, and dateTime. More on format elements in the reference below.

Examples

Example 1: Basic usage

We can convert score data currently stored as an integer type into a float, to use in another formula elsewhere that only accepts the float type.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Source variable: The variable we want to convert is score so we select it here.

  • New type: We want this to be a float, and since the score variable is currently an integer it is compatible with conversion to the float type, so we can select Float here.

  • If invalid for type, set to null: Since there are no incompatible values in this variable it doesn't matter what we choose. We leave it unchecked to validate that we understand our data and to confirm that this transform will execute without failing.

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83.0  | jane    | 2020-04-01 |
 | quiz    | 35.0  | pat     | 2020-04-01 |
 | quiz    | 89.0  | sam     | 2020-04-01 |
 | midterm | 74.0  | jane    | 2020-05-01 |
 | midterm | 62.0  | pat     | 2020-05-01 |
 | midterm | 93.0  | sam     | 2020-05-01 |
 | final   | 77.0  | jane    | 2020-06-01 |
 | final   | 59.0  | pat     | 2020-06-01 |
 | final   | 92.0  | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

The output data looks as expected, and we can confirm the new type by clicking on the variable in the output table.

Example 2: Handling invalid conversions

Let's say instead that in our initial data, the score variable was stored as a string. Converting this to a float would be a bit trickier since the data entry wasn't as clean.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35%   | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Source variable: Same as above example.

  • New type: Same as above example.

  • If invalid for type, set to null: Since this data has 35% as a value, this can't be converted to a float. If we leave this box unchecked our transform will fail. Checking it will set that value to null.

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83.0  | jane    | 2020-04-01 |
 | quiz    | null  | pat     | 2020-04-01 |
 | quiz    | 89.0  | sam     | 2020-04-01 |
 | midterm | 74.0  | jane    | 2020-05-01 |
 | midterm | 62.0  | pat     | 2020-05-01 |
 | midterm | 93.0  | sam     | 2020-05-01 |
 | final   | 77.0  | jane    | 2020-06-01 |
 | final   | 59.0  | pat     | 2020-06-01 |
 | final   | 92.0  | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Note that while this retype was successful, the result might not be what we want; in this case it removes information.
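In SQL terms, checking this box behaves much like BigQuery's SAFE_CAST, which returns null instead of raising an error when a value can't be converted. A minimal sketch, assuming a hypothetical table named scores:

SELECT
  test,
  SAFE_CAST(score AS FLOAT64) AS score,  -- '35%' cannot be parsed as a float, so that row becomes null
  student,
  date
FROM scores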

Example 3: Parsing dates

Let's say we want to convert our date variable, which is currently a string, to the Date variable type, but the starting format does not cleanly translate.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 04/01/2020 |
 | quiz    | 35    | pat     | 04/01/2020 |
 | quiz    | 89    | sam     | 04/01/2020 |
 | midterm | 74    | jane    | 05/01/2020 |
 | midterm | 62    | pat     | 05/01/2020 |
 | midterm | 93    | sam     | 05/01/2020 |
 | final   | 77    | jane    | 06/01/2020 |
 | final   | 59    | pat     | 06/01/2020 |
 | final   | 92    | sam     | 06/01/2020 |
 *---------+-------+---------+------------*/

Input fields:

  • Source variable: The variable we want to convert is date so we select it here.

  • New type: We want this to be a date so we can select Date here.

  • If invalid for type, set to null: Since there are no incompatible values in this variable it doesn't matter what we choose. We leave it unchecked to validate that we understand our data and to confirm that this transform will execute without failing.

  • Specify custom format: Since our data does not fit the standard format (%Y-%m-%d, e.g. 2020-10-01) we need to specify what format it is in. We can use the reference table at the bottom of this page to specify our format: MM/DD/YYYY

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/
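Under the hood this is similar to BigQuery's CAST with a format clause. A minimal sketch, assuming a hypothetical table named scores:

SELECT
  test,
  score,
  student,
  CAST(date AS DATE FORMAT 'MM/DD/YYYY') AS date  -- parses '04/01/2020' as 2020-04-01
FROM scores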

Reference: Type conversion

Note you can see more information on Redivis variable types here.

Starting type
Possible destinations
Notes

String

Integer Float Boolean Date DateTime Time Geography

To integer: A hex string can be cast to an integer. For example, 0x123 to 291 or -0x123 to -291.

To float: Returns x as a floating point value, interpreting it as having the same form as a valid floating point literal.

To boolean: Returns TRUE if x is "true" and FALSE if x is "false". All other values of x are invalid and throw an error instead of casting to a boolean. Strings are case-insensitive when converting to a boolean.

To date, dateTime, or time: Uses the canonical format by default (see information below)

Integer

String Float Boolean

To float: Returns a close but potentially not exact floating point value.

To boolean: Returns FALSE if x is 0, TRUE otherwise.

Float

String Integer

To integer: Returns the closest integer value. Halfway cases such as 1.5 or -0.5 round away from zero.

Boolean

String Integer

To string: Returns true if x is true, false otherwise. To integer: Returns 1 if x is true, 0 otherwise.

Date

String

DateTime

String Date Time

To date, dateTime, or time: Uses the canonical format by default (see information below)

Time

String Date DateTime

To date, dateTime, or time: Uses the canonical format by default (see information below)

Geography

String

Reference: Canonical representation

When retyping between a String type variable and a Date, DateTime, or Time type variable it is presumed that the data will be in the format below.

Layout
Example

Date

(Four digit year)- (1 or 2 digit month)- (1 or 2 digit day)

2023-01-01 2023-1-1

Time

(1 or 2 digit hour): (1 or 2 digit minute): (1 or 2 digit second). (Up to 6 fractional seconds)

01:01:01.123456 6:2:9 22:19:3

DateTime

(Date specification) (space or T or t) (Time specification)

2023-01-01 01:01:01.123456 2023-1-1T6:2:9

If it is not in this canonical format you can click the Specify custom format field and use Format elements (below) to indicate otherwise.

Reference: Format elements

Since Date, DateTime, and Time variable types contain structured information you can use format strings to indicate how you want different pieces of date and time information translated to and from string format when retyping.

For example when converting a date to a string you can choose whether it will become JAN 1 2023 or 2023-01-01. When converting from a string to a DateTime you'll need to outline how the information is structured in your data so it can be read in correctly.

You can use these elements in the Specify custom format field.
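For example, in the SQL CAST method these elements work in both directions. A minimal sketch:

SELECT
  CAST('01/15/2023' AS DATE FORMAT 'MM/DD/YYYY')          AS parsed,    -- 2023-01-15
  CAST(DATE '2023-01-15' AS STRING FORMAT 'MON DD, YYYY') AS formatted  -- JAN 15, 2023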

Element
Return

YYYY

Four (or more) digit year

Input: 2023-01-01 Output: 2023

Input: 23-01-01 Output: 0023

Input: 20000-01-01 Output: 20000

YYY

Last three digit year

Input: 2023-01-01 Output: 023

Input: 23-01-01 Output: 023

YY

Two digit year

Input: 2023-01-01 Output: 23

Input: 2-01-30 Output: 02

Y

Last one digit of year

Input: 2023-01-01 Output: 3

MM

Two digit month

Input: 2023-01-01 Output: 01

MON

Three character month: JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, DEC

Input: 2023-01-01 Output: JAN

MONTH

Month name

Input: 2023-01-01 Output: JANUARY

DDD

Three digit day of the year

Input: 2023-01-01 Output: 001

DD

Two digit day of the month

Input: 2023-01-01 Output: 01

D

Day of the week (1-7) with Sunday being 1

Input: 2023-01-01 Output: 1

DAY

Day of the week. Spaces are padded on the right side to make the output size exactly 9.

Input: 2023-01-01 Output: SUNDAY

DY

Three character day: MON, TUE, WED, THU, FRI, SAT, SUN

Input: 2023-01-01 Output: SUN

HH

Two digit hour of the day (valid values from 00 to 12)

Input: 20:10:15 Output: 08

HH12

Hour of the day (valid values from 00 to 12)

Input: 20:10:15 Output: 08

HH24

Two digit hour (valid values from 00 to 24)

Input: 20:10:15 Output: 20

MI

Two digit minute

Input: 20:10:15 Output: 10

SS

Two digit second

Input: 20:10:15 Output: 15

SSSSS

Five digit second of the day

Input: 20:10:15 Output: 72615

FFn

(Replace n with a value from 1 to 9. For example, FF5.)

Fractional part of the second, n digits long. The fractional part of the second is rounded to fit the size of the output.

FF1

Input: 20:10:15 Output: 1

FF2 Input: 20:10:15 Output: 15

FF3 Input: 20:10:15 Output: 015

A.M. or AM P.M. or PM

A.M. (or AM) if the time is less than 12, otherwise P.M. (or PM). The letter case of the output is determined by the first letter case of the format element.

AM

Input: 09:10:15 Output: AM

A.M. Input: 20:10:15 Output: P.M.

PM Input: 09:10:15 Output: AM

PM Input: 20:10:15 Output: PM

TZH

Hour offset for a time zone. This includes the +/- sign and 2-digit hour.

Input: 2008-12-25 05:30:00+00 Output: −08

TZM

Minute offset for a time zone. This includes only the 2-digit minute.

Input: 2008-12-25 05:30:00+00 Output: 00

A space

Input: Output:

-./,'l;:

Same character in the output

Input: -./,'l;: Output: -./,'l;:

"Text"

Output is the value within the double quotes. To preserve a double quote or backslash character, use the \" or \\ escape sequence. Other escape sequences are not supported.

Input: "abc" Output: abc

Input: "a\"b\\c" Output: a"b\c

These format elements will only work in the Retype step (or CAST method in SQL). Format elements for using other date parsing or formatting methods are detailed elsewhere and might be useful if your data is not coercible using the format elements described here.

Math

Absolute value

Returns the absolute value of a variable –> learn more

ABS(@variable)

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any ,

true

-

Arithmetic

Compute simple arithmetic (+, -, *, /) –> learn more

(@expression @operator @expression_2)

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any ,

true

-

@operator

any of: +, -, *, /

true

(E.g., +)

@expression_2

or

any ,

true

-

Ceiling

Returns the smallest integral value that is not less than the provided value –> learn more

CEILING(@variable)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

Integer divide

Divide two integer values, rounding down any remainder –> learn more

DIV(@expression, @expression_2)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@expression_2

or

any

true

-

e ^ x

Compute the natural exponential of a value –> learn more

EXP(@expression)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any ,

true

-

Floor

Returns the largest integral value that is not greater than the provided value –> learn more

FLOOR(@variable)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

Greatest

Find the largest of several values –> learn more

GREATEST(@expression)

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

s or s

any , , , , , ,

true

-

Is infinity

Return true if the value is positive or negative infinity, false otherwise. Returns NULL for NULL inputs –> learn more

IS_INF(@variable)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any ,

true

-

Is NaN

Determines whether input value is not a number (NaN) –> learn more

IS_NAN(@variable)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any ,

true

-

Least

Find the smallest of several values –> learn more

LEAST(@expression)

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

s or s

any , , , , , ,

true

-

Log

Compute the logarithm of a value to a provided base; generates an error if the variable is <= 0. If no base is provided, defaults to natural log –> learn more

LOG(@expression[, @literal])

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any ,

true

-

@literal

any ,

false

-

Mod

Modulo: compute the remainder of the division of two integers –> learn more

MOD(@expression, @expression_2)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@expression_2

or

any

true

-

Power

Raises a value to a power –> learn more

POW(@expression, @expression_2)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any ,

true

-

@expression_2

or

any ,

true

-

Random

Generate a pseudo-random float between [0, 1) –> learn more

RAND()

Return type

float

Parameters

Round

Rounds a value to the nearest integer (or, if specified, to the provided number of decimal places) –> learn more

ROUND(@expression[, @literal])

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any ,

true

-

@literal

any

false

-

Safe divide

Equivalent to the division operator, but returns null if an error occurs, such as a division by zero error –> learn more

SAFE_DIVIDE(@expression, @expression_2)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any ,

true

-

@expression_2

or

any ,

true

-

Sign

Returns the sign (-1, 0, +1) of a numeric variable –> learn more

SIGN(@variable)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any ,

true

-

Sqrt

Compute the square root of a value; generates an error if the variable is less than 0 –> learn more

SQRT(@expression)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any ,

true

-

Truncate

Similar to round, but rounds to the nearest integer whose absolute value is not greater than the absolute value of the provided variable (always rounds towards zero) –> learn more

TRUNC(@expression[, @literal])

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any ,

true

-

@literal

any

false

-


String

Concat

Concatenates multiple strings into a single string. Treats NULL values as an empty string. –> learn more

CONCAT(@expression)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

s or s

any

true

-

Ends with

Determines whether a string is a suffix of another string –> learn more

ENDS_WITH(@expression, @expression_2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@expression_2

or

any

true

-

Format string

Creates a formatted string from input variables. Similar to the C printf function. –> learn more

FORMAT(@format_string, @expression)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

s or s

any , , , , ,

true

-

@format_string

any

true

-

Length

Returns the number of characters in a string variable –> learn more

LENGTH(@variable)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

Lower case

Returns the string value converted to lowercase –> learn more

LOWER(@variable)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

Pad left

Pad a string to the left with characters up to a certain total length –> learn more

LPAD(@variable, @return_length[, @pattern])

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

@pattern

or

any

false

-

@return_length

or

any

true

-

Trim left

Removes all leading characters that match the provided pattern –> learn more

LTRIM(@expression[, @pattern])

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@pattern

or

any

false

-

Normalize

Normalization is used to ensure that two strings are equivalent. Normalization is often used in situations in which two strings render the same on the screen but have different Unicode code points. –> learn more

NORMALIZE(@variable[, @normalization_mode])

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

@normalization_mode

any of: NFC, NFKC, NFD, NFKD

false

(NFC (default))

Regexp contains

Match a variable's values against a regular expression –> learn more

REGEXP_CONTAINS(@variable, @regex)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

@regex

any

true

-

Regexp extract

Returns the first substring that matches a regular expression –> learn more

REGEXP_EXTRACT(@variable, @regex)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

@regex

any

true

-

Regexp replace

Replaces all substrings that match a given regular expression with a new string –> learn more

REGEXP_REPLACE(@variable, @regex, @expression_2)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

@regex

any

true

-

@expression_2

or

any

false

-

Repeat

Return a string of a provided value repeated a set number of times –> learn more

REPEAT(@expression, @literal)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@literal

any

true

-

Replace

Replaces all substrings that match a given string with a replacement string –> learn more

REPLACE(@expression, @match, @expression_2)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@match

or

any

true

-

@expression_2

or

any

false

-

Reverse

Reverses a string –> learn more

REVERSE(@expression)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

Pad right

Pad a string to the right with characters up to a certain total length –> learn more

RPAD(@variable, @return_length[, @pattern])

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

@pattern

or

any

false

-

@return_length

or

any

true

-

Trim right

Removes all trailing characters that match the provided pattern –> learn more

RTRIM(@expression[, @pattern])

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@pattern

or

any

false

-

Starts with

Determines whether a string is a prefix of another string –> learn more

STARTS_WITH(@expression, @expression_2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@expression_2

or

any

true

-

String position

Returns the 1-based index of the first occurrence of a substring within a string. Returns 0 if not found. –> learn more

STRPOS(@expression, @match)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@match

or

any

true

-

Substring

Return the substring of a specified string –> learn more

SUBSTR(@expression, @position[, @length])

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@position

or

any

true

-

@length

or

any

false

-

Trim

Removes all leading and trailing characters that match the provided pattern –> learn more

TRIM(@expression[, @pattern])

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@pattern

or

any

false

-

Upper case

Returns the string value converted to uppercase –> learn more

UPPER(@variable)

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-


Continuous enrollment

The following example is illustrated on Redivis in the MarketScan Continuous Enrollment workflow – you'll need access to MarketScan data to view the details.‌

Many insurance claims datasets on Redivis contain information about enrollment, detailing the periods of time when an individual represented in a dataset was covered by an insurance plan. If you intend to characterize patients based on their insurance coverage (or lack thereof) during certain key events (procedures, diagnoses, etc), it's often important to identify periods of continuous enrollment for each individual – and capture each continuous enrollment period for each patient in a single row.‌

These claims datasets describe enrollment information in multiple discrete rows per patient, each corresponding to one patient for one month of coverage. However, an overall continuous enrollment period may be broken up across rows into months or other non-uniform chunks, so we'll employ the following process to combine multiple sequential rows into a single row with one start date and one end date, describing one continuous period.

In this example, we will process the MarketScan Enrollment Detail table to create a table in which each row describes a single period of continuous enrollment. We show an artificial snapshot of the dataset below, where patient 1 has multiple periods of continuous enrollment due to some gaps in coverage, and patient 2 has a single period of continuous enrollment.

patient_id

enrollment_start_date

enrollment_end_date

1

2012-01-01

2012-01-31

1

2012-02-01

2012-02-28

1

2012-04-01

2012-04-30

1

2012-06-01

2012-06-30

1

2012-07-01

2012-07-31

1

2012-08-01

2012-08-31

2

2012-01-01

2012-01-31

We want to create a final table with 3 rows for patient 1 to account for gaps in enrollment in March and May of 2012, and 1 row for patient 2. Our desired output has a row for each continuous period per patient, shown below:

patient_id

enrollment_start_date_continuous

enrollment_end_date_continuous

1

2012-01-01

2012-02-28

1

2012-04-01

2012-04-30

1

2012-06-01

2012-08-31

2

2012-01-01

2012-01-31

The variable names in this example are not actual MarketScan variable names. But, with appropriate data access, you can see the real variables used in the first transform of the Redivis example workflow by hovering over the (renamed) variables patient_id, enrollment_start_date, and enrollment_end_date in the Keep section.‌

(1) Add start and end of adjacent periods to each row

Throughout this example, we'll create variables that partition the dataset by patient identifier (here, patient_id) to ensure that each patient is processed individually. But, to account for the fact that a single patient may have many periods of continuous enrollment, each spanning many months, we need to identify the correct start (from enrollment_start_date) and end (from enrollment_end_date) of a continuous period out of multiple rows and capture them in a single row.‌

First, we create a partition variable lag_end_date using the lag method, which will order the rows by enrollment_start_date within a given patient_id partition and copy the previous row's enrollment_end_date value into each row. We also create lead_start_date using lead, to copy the following row's enrollment_start_date value into each row.

These methods generate values which tell us how close the preceding and following enrollment periods are with respect to each row's enrollment period.

patient_id

enrollment_start_date

enrollment_end_date

lag_end_date

lead_start_date

1

2012-01-01

2012-01-31

NULL

2012-02-01

1

2012-02-01

2012-02-28

2012-01-31

2012-04-01

1

2012-04-01

2012-04-30

2012-02-28

2012-06-01

1

2012-06-01

2012-06-30

2012-04-30

2012-07-01

1

2012-07-01

2012-07-31

2012-06-30

2012-08-01

1

2012-08-01

2012-08-31

2012-07-31

NULL

2

2012-01-01

2012-01-31

NULL

NULL
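For orientation, these two partition variables correspond roughly to SQL window functions. A minimal sketch, assuming a hypothetical table named enrollment:

SELECT
  patient_id,
  enrollment_start_date,
  enrollment_end_date,
  LAG(enrollment_end_date) OVER (
    PARTITION BY patient_id ORDER BY enrollment_start_date
  ) AS lag_end_date,      -- previous row's end date within each patient
  LEAD(enrollment_start_date) OVER (
    PARTITION BY patient_id ORDER BY enrollment_start_date
  ) AS lead_start_date    -- following row's start date within each patient
FROM enrollment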

(2) Select rows with start and end of continuous periods

In a second, downstream transform we'll create new variables, which will use the above lead and lag values to identify which rows correspond to the beginning and end of a continuous enrollment period.‌

First, we'll compare enrollment_start_date and lag_end_date to find the difference (in days) between the start of each period and the end of the previous period, and compare enrollment_end_date and lead_start_date to find the difference between the end of each period and the start of the following period.

We see via diff_lag_end_enrollment_start (created using the date diff method) which rows describe an enrollment period directly following the previous period, and which rows describe an enrollment period with a larger gap since the previous period. We also create diff_enrollment_end_lead_start to identify gaps between an enrollment period and the next period.

patient_id

enrollment_start_date

enrollment_end_date

lag_end_date

lead_start_date

diff_lag_end_enrollment_start

diff_enrollment_end_lead_start

1

2012-01-01

2012-01-31

NULL

2012-02-01

NULL

1

1

2012-02-01

2012-02-28

2012-01-31

2012-04-01

1

31

1

2012-04-01

2012-04-30

2012-02-28

2012-06-01

31

31

1

2012-06-01

2012-06-30

2012-04-30

2012-07-01

31

1

1

2012-07-01

2012-07-31

2012-06-30

2012-08-01

1

1

1

2012-08-01

2012-08-31

2012-07-31

NULL

1

NULL

2

2012-01-01

2012-01-31

NULL

NULL

NULL

NULL

Next, we'll encode booleans from our difference variables to simplify further filtering. We'll identify rows corresponding to the start of a continuous period as those with a diff_lag_end_enrollment_start value of either NULL (the row is the first period in the partition) or greater than 1 (the row comes after a gap in enrollment). And we identify rows corresponding to the end of a continuous period as those with a diff_enrollment_end_lead_start value of either NULL (the row is the last period in the partition) or greater than 1 (the row comes before a gap in enrollment).‌

We see via the boolean variables is_start_continuous_period and is_end_continuous_period (created with the case method) whether a given row corresponds to the start of a continuous period, the end of a continuous period, or both.

patient_id

enrollment_start_date

enrollment_end_date

lag_end_date

lead_start_date

diff_lag_end_enrollment_start

diff_enrollment_end_lead_start

is_start_continuous_period

is_end_continuous_period

1

2012-01-01

2012-01-31

NULL

2012-02-01

NULL

1

true

false

1

2012-02-01

2012-02-28

2012-01-31

2012-04-01

1

31

false

true

1

2012-04-01

2012-04-30

2012-02-28

2012-06-01

31

31

true

true

1

2012-06-01

2012-06-30

2012-04-30

2012-07-01

31

1

true

false

1

2012-07-01

2012-07-31

2012-06-30

2012-08-01

1

1

false

false

1

2012-08-01

2012-08-31

2012-07-31

NULL

1

NULL

false

true

2

2012-01-01

2012-01-31

NULL

NULL

NULL

NULL

true

true

Then, to capture only the start and end of a continuous period, we'll use a filter to keep only rows which are true for either is_start_continuous_period or is_end_continuous_period.‌

This leaves us with either 1 or 2 rows corresponding to a continuous enrollment period. If a continuous period spans multiple rows (months, in this case), we'll have 2 rows (a start row and an end row). But if a period only spans one row, we'll have both start and end captured by that 1 row. In our example, the row containing the middle (neither start nor end) of the 2012-06-01 to 2012-08-31 enrollment period for patient 1 was dropped. We can also ignore our intermediate lead..., lag..., and diff... variables, since our final processing step will only consider a row's is_start_continuous_period and is_end_continuous_period values.

patient_id

enrollment_start_date

enrollment_end_date

is_start_continuous_period

is_end_continuous_period

1

2012-01-01

2012-01-31

true

false

1

2012-02-01

2012-02-28

false

true

1

2012-04-01

2012-04-30

true

true

1

2012-06-01

2012-06-30

true

false

1

2012-08-01

2012-08-31

false

true

2

2012-01-01

2012-01-31

true

true
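A minimal SQL sketch of this step, assuming the output of step (1) is available as a table named step1:

SELECT *
FROM (
  SELECT
    patient_id,
    enrollment_start_date,
    enrollment_end_date,
    -- start of a continuous period: no previous period, or a gap of more than 1 day before it
    (lag_end_date IS NULL
      OR DATE_DIFF(enrollment_start_date, lag_end_date, DAY) > 1)  AS is_start_continuous_period,
    -- end of a continuous period: no following period, or a gap of more than 1 day after it
    (lead_start_date IS NULL
      OR DATE_DIFF(lead_start_date, enrollment_end_date, DAY) > 1) AS is_end_continuous_period
  FROM step1
)
WHERE is_start_continuous_period OR is_end_continuous_period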

(3) Add end date of following enrollment periods to each row and collapse

Finally, we want to collapse our table to ensure 1 row per continuous enrollment period per patient. Since we now have only rows corresponding to the start and end of continuous periods, we create another partition variable lead_end_date (creating it in the same transform is fine, since this step happens after the previous filter) which copies the enrollment_end_date value of the following row onto each row.

We can also use the partition row filter to keep only rows with is_start_continuous_period as true, since our lead_end_date has copied over the end date of the continuous enrollment period, contained in each row's following row.

We end up with a table where every row contains the start date of a continuous enrollment period and, in either enrollment_end_date or lead_end_date, the end date of that continuous enrollment period.

patient_id

enrollment_start_date

enrollment_end_date

lead_end_date

is_start_continuous_period

is_end_continuous_period

1

2012-01-01

2012-01-31

2012-02-28

true

false

1

2012-04-01

2012-04-30

2012-06-30

true

true

1

2012-06-01

2012-06-30

2012-08-31

true

false

2

2012-01-01

2012-01-31

NULL

true

true

A final processing step captures the correct end date of a continuous period. We now have only rows whose enrollment_start_date value contains the start of a continuous period, but these rows fall into two categories:‌

  • First, we have the rows that do not also correspond to the end of an enrollment period (where the is_end_continuous_period value is false). For these, we want to look at lead_end_date, the end date of the next row, which represents the final date of the continuous period, since we filtered out all the intermediate rows above.

  • Second, we have the rows which also correspond to the end of an enrollment period (where the is_end_continuous_period value is true) – in this example, when the continuous period lasted only 1 month. For these, the row defines a period (in this example, 1 month) that also contains the end date, so we just take the enrollment_end_date from that same row. Note that the lead_end_date value is incorrect in this case, since the next row contains the start of the next continuous period, or NULL if the period falls at the end of a partition.

We capture the above logic in a new variable enrollment_end_date_continuous, created in an additional transform, since our previous operation involved a partition.
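A minimal SQL sketch of this final step, assuming the output of step (3) is available as a table named step3 that already contains only rows where is_start_continuous_period is true:

SELECT
  patient_id,
  enrollment_start_date AS enrollment_start_date_continuous,
  CASE
    WHEN is_end_continuous_period THEN enrollment_end_date  -- the period starts and ends in this same row
    ELSE lead_end_date                                       -- otherwise the next kept row held the true end date
  END AS enrollment_end_date_continuous
FROM step3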

We end up with the final table below, containing the patient identifier, the start date (renamed to enrollment_start_date_continuous for consistency), and the end date of each continuous enrollment period.

patient_id

enrollment_start_date_continuous

enrollment_end_date_continuous

1

2012-01-01

2012-02-28

1

2012-04-01

2012-04-30

1

2012-06-01

2012-08-31

2

2012-01-01

2012-01-31


Geography

Angle

Takes three point GEOGRAPHY values, which represent two intersecting lines. Returns the angle between these lines. Point 2 and point 1 represent the first line and point 2 and point 3 represent the second line. The angle between these lines is in radians, in the range [0, 2pi). The angle is measured clockwise from the first line to the second line. –> learn more

ST_ANGLE(@geography, @geography_2, @geography_3)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

@geography_3

any

true

-

Area

Returns the area in square meters covered by the polygons in the input GEOGRAPHY –> learn more

ST_AREA(@geography)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

As GeoJSON

Returns the RFC 7946 compliant GeoJSON representation of the input GEOGRAPHY –> learn more

ST_ASGEOJSON(@geography)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

As text

Returns the WKT representation of an input GEOGRAPHY –> learn more

ST_ASTEXT(@geography)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Azimuth

Takes two point GEOGRAPHY values, and returns the azimuth of the line segment formed by points 1 and 2. The azimuth is the angle in radians measured between the line from point 1 facing true North to the line segment from point 1 to point 2. –> learn more

ST_AZIMUTH(@geography, @geography_2)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Boundary

Returns a single GEOGRAPHY that contains the union of the boundaries of each component in the given input GEOGRAPHY. –> learn more

ST_BOUNDARY(@geography)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Buffer

Returns a GEOGRAPHY that represents the buffer around the input GEOGRAPHY. This function is similar to ST_BUFFERWITHTOLERANCE, but you specify the number of segments instead of providing tolerance to determine how much the resulting geography can deviate from the ideal buffer radius. –> learn more

ST_BUFFER(@geography, @buffer_radius[, num_seg_quarter_circle => @num_seg_quarter_circle][, endcap => @endcap][, side => @side])

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@buffer_radius

or

any ,

true

-

@num_seg_quarter_circle

or

any ,

true

-

@endcap

any of: ROUND, FLAT

false

(Round (default))

@side

any of: BOTH, LEFT, RIGHT

false

(Both (default))

Buffer with tolerance

Returns a GEOGRAPHY that represents the buffer around the input GEOGRAPHY. This function is similar to ST_BUFFER, but you provide tolerance instead of segments to determine how much the resulting geography can deviate from the ideal buffer radius. –> learn more

ST_BUFFERWITHTOLERANCE(@geography, @buffer_radius, @tolerance_meters[, endcap => @endcap][, side => @side])

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@buffer_radius

or

any ,

true

-

@tolerance_meters

or

any ,

true

-

@endcap

any of: ROUND, FLAT

false

(Round (default))

@side

any of: BOTH, LEFT, RIGHT

false

(Both (default))

Centroid

Returns the centroid of the input GEOGRAPHY as a single point GEOGRAPHY. –> learn more

ST_CENTROID(@geography)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Closest point

Returns a GEOGRAPHY containing a point on Geography 1 with the smallest possible distance to Geography 2. This implies that the distance between the point returned by ST_CLOSESTPOINT and Geography 2 is less than or equal to the distance between any other point on Geography 1 and Geography 2. –> learn more

ST_CLOSESTPOINT(@geography, @geography_2)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Contains

Returns TRUE if no point of Geography 2 is outside Geography 1, and the interiors intersect; returns FALSE otherwise. –> learn more

ST_CONTAINS(@geography, @geography_2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Convex hull

Returns the convex hull for the input GEOGRAPHY. The convex hull is the smallest convex GEOGRAPHY that covers the input. A GEOGRAPHY is convex if for every pair of points in the GEOGRAPHY, the geodesic edge connecting the points are also contained in the same GEOGRAPHY. –> learn more

ST_CONVEXHULL(@geography)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Covered by

Returns FALSE if Geography 1 or Geography 2 is empty. Returns TRUE if no points of Geography 1 lie in the exterior of Geography 2. –> learn more

ST_COVEREDBY(@geography, @geography_2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Covers

Returns FALSE if Geography 1 or Geography 2 is empty. Returns TRUE if no points of Geography 2 lie in the exterior of Geography 1. –> learn more

ST_COVERS(@geography, @geography_2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Difference

Returns a GEOGRAPHY that represents the point set difference of Geography 1 and Geography 2. Therefore, the result consists of the part of Geography 1 that does not intersect with Geography 2. –> learn more

ST_DIFFERENCE(@geography, @geography_2)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Dimension

Returns the dimension of the highest-dimensional element in the input GEOGRAPHY. –> learn more

ST_DIMENSION(@geography)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Disjoint

Returns TRUE if the intersection of Geography 1 and Geography 2 is empty, that is, no point in Geography 1 also appears in Geography 2. –> learn more

ST_DISJOINT(@geography, @geography_2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Distance

Returns the shortest distance in meters between two non-empty GEOGRAPHYs. –> learn more

ST_DISTANCE(@geography, @geography_2)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

DWithin

Returns TRUE if the distance between at least one point in Geography 1 and one point in Geography 2 is less than or equal to the Distance argument, otherwise, returns FALSE –> learn more

ST_DWITHIN(@geography, @geography_2, @distance)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

@distance

or

any ,

true

-

Endpoint

Returns the last point of a linestring geography as a point geography. Returns an error if the input is not a linestring or if the input is empty. Use the SAFE prefix to obtain NULL for invalid input instead of an error. –> learn more

[@safe]ST_ENDPOINT(@geography)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@safe

any

true

-
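
A hedged SQL sketch of the SAFE behavior (the SAFE. prefix is the raw-SQL counterpart of the @safe option; aliases are arbitrary):

SELECT
  ST_ENDPOINT(ST_GEOGFROMTEXT('LINESTRING(0 0, 1 1, 2 2)')) AS last_vertex,  -- POINT(2 2)
  SAFE.ST_ENDPOINT(ST_GEOGPOINT(0, 0)) AS not_a_linestring                   -- NULL instead of an error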

Equals

Returns TRUE if Geography 1 and Geography 2 represent the same GEOGRAPHY value. More precisely, one of the following conditions holds: ST_COVERS(geography, geography_2) = TRUE and ST_COVERS(geography_2, geography) = TRUE; or both Geography 1 and Geography 2 are empty. –> learn more

ST_EQUALS(@geography, @geography_2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Exterior ring

Returns a linestring geography that corresponds to the outermost ring of a polygon geography. If the input geography is a polygon, gets the outermost ring of the polygon geography and returns the corresponding linestring. If the input is the full GEOGRAPHY, returns an empty geography. Returns an error if the input is not a single polygon. Use the SAFE prefix to obtain NULL for invalid input instead of an error. –> learn more

[@safe]ST_EXTERIORRING(@geography)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@safe

any

true

-

Geo from

Converts an expression for a STRING or BYTES value into a GEOGRAPHY value. If expression represents a STRING value, it must be a valid GEOGRAPHY representation in one of the following formats: WKT, WKB, GeoJSON –> learn more

ST_GEOGFROM(@expression)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

variable or literal

any

true

-

Geo from GeoJSON

Returns a GEOGRAPHY value that corresponds to the input GeoJSON representation. –> learn more

ST_GEOGFROMGEOJSON(@geojson[, @make_valid])

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geojson

variable or literal

any

true

-

@make_valid

any

true

-

Geo from text

Returns a GEOGRAPHY value that corresponds to the input WKT representation. –> learn more

ST_GEOGFROMTEXT(@wkt[, @oriented][, @planar][, @make_valid])

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@wkt

variable or literal

any

true

-

@oriented

any

true

-

@planar

any

true

-

@make_valid

any

true

-
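
A minimal SQL sketch converting a WKT string literal (longitude before latitude) into a GEOGRAPHY; the alias is arbitrary:

SELECT ST_GEOGFROMTEXT('LINESTRING(-122.40 37.80, -122.30 37.80)') AS line_geog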

Geo from wkb

Converts an expression for a hexadecimal-text STRING or BYTES value into a GEOGRAPHY value. The expression must be in WKB format –> learn more

ST_GEOGFROMWKB(@wkb)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@wkb

variable or literal

any

true

-

Geogpoint

Creates a GEOGRAPHY with a single point. ST_GEOGPOINT creates a point from the specified FLOAT64 longitude and latitude parameters and returns that point in a GEOGRAPHY value. –> learn more

ST_GEOGPOINT(@longitude, @latitude)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@longitude

variable or literal

any integer, float

true

-

@latitude

variable or literal

any integer, float

true

-
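
A minimal SQL sketch; note that longitude comes first, then latitude:

SELECT ST_GEOGPOINT(-122.4194, 37.7749) AS point_geog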

Geogpoint from Geohash

Returns a GEOGRAPHY value that corresponds to a point in the middle of a bounding box defined in the GeoHash –> learn more

ST_GEOGPOINTFROMGEOHASH(@geohash)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geohash

variable or literal

any

true

-

Geohash

Returns a GeoHash representation of the input GEOGRAPHY. The resulting GeoHash will contain at most Max chars characters. Fewer characters correspond to lower precision (or, described differently, to a bigger bounding box). –> learn more

ST_GEOHASH(@geography, @maxchars)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@maxchars

variable or literal

any

true

-
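
A minimal SQL sketch requesting a 6-character GeoHash; the alias is arbitrary:

SELECT ST_GEOHASH(ST_GEOGPOINT(-122.4194, 37.7749), 6) AS geohash_6
-- fewer characters correspond to a coarser (larger) bounding box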

Geometry type

Returns the Open Geospatial Consortium (OGC) geometry type that describes the input GEOGRAPHY as a STRING. The OGC geometry type matches the types that are used in WKT and GeoJSON formats and printed for ST_ASTEXT and ST_ASGEOJSON. ST_GEOMETRYTYPE returns the OGC geometry type with the "ST_" prefix. –> learn more

ST_GEOMETRYTYPE(@geography)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Intersection

Returns a GEOGRAPHY that represents the point set intersection of the two input GEOGRAPHYs. Thus, every point in the intersection appears in both Geography 1 and Geography 2 –> learn more

ST_INTERSECTION(@geography, @geography_2)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Intersects

Returns TRUE if the point set intersection of Geography 1 and Geography 2 is non-empty. Thus, this function returns TRUE if there is at least one point that appears in both input GEOGRAPHYs. If ST_INTERSECTS returns TRUE, it implies that ST_DISJOINT returns FALSE. –> learn more

ST_INTERSECTS(@geography, @geography_2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Intersects box

Returns TRUE if geography intersects the rectangle between [lng1, lng2] and [lat1, lat2]. The edges of the rectangle follow constant lines of longitude and latitude. lng1 and lng2 specify the westmost and eastmost constant longitude lines that bound the rectangle, and lat1 and lat2 specify the minimum and maximum constant latitude lines that bound the rectangle. –> learn more

ST_INTERSECTSBOX(@geography, @lng1, @lat1, @lng2, @lat2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@lng1

variable or literal

any integer, float

true

-

@lat1

variable or literal

any integer, float

true

-

@lng2

variable or literal

any integer, float

true

-

@lat2

variable or literal

any integer, float

true

-
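
A minimal SQL sketch; the rectangle spans longitudes -123 to -122 and latitudes 37 to 38, so the point falls inside it:

SELECT ST_INTERSECTSBOX(
  ST_GEOGPOINT(-122.4, 37.8),
  -123, 37,   -- lng1, lat1
  -122, 38    -- lng2, lat2
) AS intersects_box   -- TRUE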

Is collection

Returns TRUE if the total number of points, linestrings, and polygons is greater than one. An empty GEOGRAPHY is not a collection. –> learn more

ST_ISCOLLECTION(@geography)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Is empty

Returns TRUE if the given GEOGRAPHY is empty; that is, the GEOGRAPHY does not contain any points, lines, or polygons. –> learn more

ST_ISEMPTY(@geography)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Length

Returns the total length in meters of the lines in the input GEOGRAPHY. –> learn more

ST_LENGTH(@geography)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Make line

Creates a GEOGRAPHY with a single linestring by concatenating the point or line vertices of each of the input GEOGRAPHYs in the order they are given. –> learn more

ST_MAKELINE(@geography, @geography_2)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Make polygon

Creates a GEOGRAPHY containing a single polygon from a linestring input, where the input linestring is used to construct a polygon ring. –> learn more

ST_MAKEPOLYGON(@geography)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-
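
A minimal SQL sketch building a polygon from a closed linestring; the alias is arbitrary:

SELECT ST_MAKEPOLYGON(
  ST_GEOGFROMTEXT('LINESTRING(0 0, 0 1, 1 1, 1 0, 0 0)')
) AS square_polygon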

Max distance

Returns the longest distance in meters between two non-empty GEOGRAPHYs; that is, the distance between two vertices where the first vertex is in the first GEOGRAPHY, and the second vertex is in the second GEOGRAPHY. If Geography 1 and Geography 2 are the same GEOGRAPHY, the function returns the distance between the two most distant vertices in that GEOGRAPHY. –> learn more

ST_MAXDISTANCE(@geography, @geography_2)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Num geometries

Returns the number of geometries in the input GEOGRAPHY. For a single point, linestring, or polygon, ST_NUMGEOMETRIES returns 1. For any collection of geometries, ST_NUMGEOMETRIES returns the number of geometries making up the collection. ST_NUMGEOMETRIES returns 0 if the input is the empty GEOGRAPHY. –> learn more

ST_NUMGEOMETRIES(@geography)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Num points

Returns the number of vertices in the input GEOGRAPHY. This includes the number of points, the number of linestring vertices, and the number of polygon vertices. –> learn more

ST_NUMPOINTS(@geography)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Perimeter

Returns the length in meters of the boundary of the polygons in the input GEOGRAPHY. –> learn more

ST_PERIMETER(@geography)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Point N

Returns the Nth point of a linestring geography as a point geography, where N is the index. The index is 1-based. Negative values are counted backwards from the end of the linestring, so that -1 is the last point. Returns an error if the input is not a linestring, if the input is empty, or if there is no vertex at the given index. Use the SAFE prefix to obtain NULL for invalid input instead of an error. –> learn more

[@safe]ST_POINTN(@geography, @index)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@index

variable or literal

any

true

-

@geography

any

true

-

@safe

any

true

-
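
A minimal SQL sketch showing the 1-based index and a negative index counting back from the end; aliases are arbitrary:

SELECT
  ST_POINTN(ST_GEOGFROMTEXT('LINESTRING(0 0, 1 1, 2 2)'), 1)  AS first_vertex,  -- POINT(0 0)
  ST_POINTN(ST_GEOGFROMTEXT('LINESTRING(0 0, 1 1, 2 2)'), -1) AS last_vertex    -- POINT(2 2)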

Simplify

Returns a simplified version of the input GEOGRAPHY. The input GEOGRAPHY is simplified by replacing nearly straight chains of short edges with a single long edge. The input geography will not change by more than the tolerance specified by tolerance_meters. Thus, simplified edges are guaranteed to pass within tolerance_meters of the original positions of all vertices that were removed from that edge. The given tolerance_meters is in meters on the surface of the Earth. –> learn more

ST_SIMPLIFY(@geography, @tolerance_meters)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@tolerance_meters

variable or literal

any integer, float

true

-
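
A hedged SQL sketch, assuming a hypothetical table routes with a geography variable route_geog:

SELECT ST_SIMPLIFY(route_geog, 100) AS simplified_route   -- simplified edges stay within 100 meters of the removed vertices
FROM routes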

Snap to grid

Returns the input GEOGRAPHY, where each vertex has been snapped to a longitude/latitude grid. The grid size is determined by the grid_size parameter which is given in degrees. –> learn more

ST_SNAPTOGRID(@geography, @grid_size)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@grid_size

variable or literal

any integer, float

true

-

Start point

Returns the first point of a linestring geography as a point geography. Returns an error if the input is not a linestring or if the input is empty. Use the SAFE prefix to obtain NULL for invalid input instead of an error. –> learn more

[@safe]ST_STARTPOINT(@geography)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@safe

any

true

-

Touches

Returns TRUE provided the following two conditions are satisfied: (1) Geography 1 intersects Geography 2 and (2) the interior of Geography 1 and the interior of Geography 2 are disjoint. –> learn more

ST_TOUCHES(@geography, @geography_2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Union

Returns a GEOGRAPHY that represents the point set union of all input GEOGRAPHYs. –> learn more

ST_UNION(@geography, @geography_2)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Within

Returns TRUE if no point of Geography 1 is outside of Geography 2 and the interiors of Geography 1 and Geography 2 intersect. Given two geographies a and b, ST_WITHIN(a, b) returns the same result as ST_CONTAINS(b, a). Note the opposite order of arguments. –> learn more

ST_WITHIN(@geography, @geography_2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

X Min

Returns a float representing the west-most constant longitude line that bounds the geometry –> learn more

ST_BOUNDINGBOX(@geography).xmin

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

X Max

Returns a float representing the east-most constant longitude line that bounds the geometry –> learn more

ST_BOUNDINGBOX(@geography).xmax

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Point X

Returns the longitude in degrees of the single-point input GEOGRAPHY –> learn more

[@safe]ST_X(@geography)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@safe

any

true

-

Point Y

Returns the latitude in degrees of the single-point input GEOGRAPHY –> learn more

[@safe]ST_Y(@geography)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@safe

any

true

-

Y Min

Returns a float representing the minimum constant latitude line that bounds the geometry. –> learn more

ST_BOUNDINGBOX(@geography).ymin

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Y Max

Returns a float representing the maximum constant latitude line that bounds the geometry. –> learn more

ST_BOUNDINGBOX(@geography).ymax

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-
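
A hedged SQL sketch of the four bounding-box accessors above (X Min, X Max, Y Min, Y Max) applied to a literal geography; aliases are arbitrary:

SELECT
  ST_BOUNDINGBOX(geog).xmin AS west_longitude,
  ST_BOUNDINGBOX(geog).xmax AS east_longitude,
  ST_BOUNDINGBOX(geog).ymin AS south_latitude,
  ST_BOUNDINGBOX(geog).ymax AS north_latitude
FROM (SELECT ST_GEOGFROMTEXT('LINESTRING(-122.5 37.7, -122.3 37.9)') AS geog)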

Centroid aggregate

Computes the centroid of the set of input GEOGRAPHYs as a single point GEOGRAPHY. –> learn more

ST_CENTROID_AGG(@geography)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Geo aggregate

Returns a geography variable that represents the point set union of all input geographies. –> learn more

ST_UNION_AGG(@geography)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-
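
A hedged SQL sketch of the two aggregate functions above (Centroid aggregate and Geo aggregate), assuming a hypothetical table sites with a geography variable site_geog:

SELECT
  ST_CENTROID_AGG(site_geog) AS overall_centroid,
  ST_UNION_AGG(site_geog)    AS combined_area
FROM sites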

DBSCAN clustering

Identifies high-density geography clusters and marks outliers in low-density areas of noise –> learn more

ST_CLUSTERDBSCAN(@geography, @epsilon, @minimum_geographies)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@epsilon

any integer, float

true

-

@minimum_geographies

any

true

-
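
A hedged SQL sketch: in raw SQL this clustering function is evaluated with an OVER clause, here assuming a hypothetical table observations with a geography variable point_geog:

SELECT
  point_geog,
  ST_CLUSTERDBSCAN(point_geog, 1000, 5) OVER () AS cluster_id   -- epsilon = 1000 meters, at least 5 geographies per cluster
FROM observations
-- geographies classified as noise are expected to receive a NULL cluster_id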
