Redivis Documentation

Redivis for open science

Redivis is a platform for data distribution and analysis that prioritizes open access and FAIR data practices. We believe that supporting open science principles drives better research outcomes, collaboration, and scientific impact.

Redivis is creating tools not just for data distribution and analysis, but also to help drive forward standards for transparent research that is accessible to all.

Reproducibility

Every Redivis dataset has a versioned history that is automatically generated as the data owner uploads and edits data contents. This history is fully exposed to end users, and the dataset can be accessed at any version. Citations are generated for every version of a dataset and include the ORCID iDs of all creators. Data owners can opt to create a Digital Object Identifier (DOI) for each version of their dataset, which is available at a persistent URL along with the required metadata describing the content at that point in time.

As users analyze data in a Redivis workflow, each step they take is recorded in a visual tree that can be navigated sequentially. Every step has a full history which can be viewed at any point in time.

Availability

Datasets in any format can be deposited into Redivis and made available immediately to the world via open access licenses at no cost to the end user.

All Redivis datasets and analysis workflows are available to anyone on the internet without a Redivis account unless the owner decides to add restrictions. In this case someone wishing to view the work would need to create a free Redivis account to request access from the owner.

Longevity

The stable technical infrastructure and dedicated Redivis team are funded by a coalition of academic institutions paying license fees for large-scale data distribution. This diverse group of funders ensures the long-term availability of the datasets.

Example workflows

We've started building a library of common workflow types and analytic tasks that might be useful when you're getting started working with data.

Please let us know if there are other actions or concepts you would like us to provide examples for by reaching out or emailing [email protected]

Data retention policy

Overview

Retaining data is vital to establishing long-lasting data linkages and making research available. Redivis' data retention policy is aligned with the goal of keeping all data available and accessible whenever possible.

Owner-initiated deletion

Data owners always have the ability to remove their data from Redivis or make it inaccessible if they choose to do so.

Redivis also has policies in place in case of an accidental deletion:

  • All data is fully backed up, with point-in-time recovery, over a 7-day rolling window. Deletions and modifications to data are permanent after 7 days.

  • All metadata is backed up with point-in-time recovery over a 7-day rolling window, with daily backups of metadata extending to 1 year.

Redivis-initiated deletion

Data storage that exceeds the free tier must be paid for by the data owner or other sponsor. Upon payment failure, every effort will be made to contact the data owner and re-establish payment.

If unsuccessful, data may be destroyed after 30 days of non-payment, though all metadata and permanent landing pages will be persisted.

Example tasks

We've started building a library of common actions that you might take when administering an organization.

Please let us know if there are other actions or concepts you would like us to provide examples for by reaching out or emailing [email protected]

Introduction

What is Redivis?

Redivis is an online data platform for research.

Redivis offers data distributors – from research centers and institutions to labs and individual investigators – the tools to securely host and distribute data in alignment with FAIR data practices.

Redivis provides researchers the means to easily discover, access, and analyze data in a collaborative and reproducible manner.

Whether you are working with terabytes of high-risk data or lightweight public data files, Redivis endeavors to make data-driven research accessible to all.

Interested in bringing your organization on to Redivis? Contact us!

Who uses Redivis?

Research institutions provide a unified discovery and administrative layer for their organizations' data across Redivis.

Research organizations upload, catalogue, and grant access to datasets.

Researchers upload their own datasets, discover existing data, and execute flexible, high-performance analytic workflows in a collaborative and reproducible manner.

Redivis makes it easy for everyone to:

  • Discover datasets

  • Apply for access

  • Query large datasets

  • Analyze data

  • Collaborate

  • Host, brand, and distribute data within an organization

  • Version & document datasets

  • Integrate with existing data pipelines

Using this documentation

This documentation is broken up into Guides, Examples, and Reference sections.

We recommend that new users begin by following the relevant guides and examples, while the reference section can serve as a comprehensive overview of everything that Redivis can do.

Certain users may want to dive into the more technical API documentation for interfacing with Redivis via R, Python, JavaScript, and the HTTP REST API.

Variable selection

Overview

In the bottom pane of the transform you will need to select the variables you'd like to keep in your output table. One or more variables must be kept in order to run the transform.

Usage

Click on a variable and use the center buttons to move it between the two lists. You can use command-/control- and shift-click to select multiple variables at once.

For larger variable lists, clicking the search icon will allow you to quickly filter the available variable options down based on search terms.

Distinct

You can select the Distinct box to drop any records that are an exact copy of another record across values in all variables in your output table.

This can be useful in situations where you have created variables using analytic methods, or where your data's format produces duplicate records. Otherwise, it might slow down your transform run time unnecessarily.

Source indicators

Hovering on any variable will show a tooltip with information about where the variable came from, its type, and any assigned label.

To the left of each variable in the list you will also see an icon showing the variable's source. For variables originating in the source table, the icon will show t0; for variables from subsequently joined tables it will show t1, t2, etc. For variables created in this transform it will show v.

Selection options

Some steps will add or remove variables from the list available to keep or discard. Join steps (and occasionally Merge steps) will bring in variables from other tables. Create variables steps will make new variables. Aggregate steps will remove all variables that are not included in the collapse, but may add newly created aggregate variables.

Variable creation methods

When using a create variables step (or when creating aggregate variables in an aggregation or pivot step) you will need to select a method to specify how the variable will be created.

Every method is documented here with what information the Redivis interface needs, as well as a link to the underlying BigQuery documentation for more details.

Citations

Citing datasets

It's important to make sure that you fully cite any datasets used in your work. Each dataset page includes citation information in the provenance section in multiple format options.

For example, the Global Historical Climatology Network dataset hosted by the Stanford Center for Population Health Sciences would be cited in APA format as:

Stanford Center for Population Health Sciences. (2022). GHCN Daily Weather Data (Version 1.2) [Data set]. Redivis. https://doi.org/10.57761/h9ff-vy04

Citing workflows

Workflows across Redivis also include citation information in the provenance section in multiple format options.

Bibliography

Datasets and workflows on Redivis can be supplemented with related identifiers that situate a resource in its broader context. You can view the bibliography for a dataset or workflow from the corresponding provenance section.

Upload unstructured data as files

Overview

Any type of file can be uploaded to a Redivis dataset, and we support previews for the most common file types. These data can be analyzed in notebooks within workflows.

Note that if you have tabular data we strongly recommend uploading it as a table (rather than a file) so you and your researchers can take advantage of our extensive toolkits for previewing and manipulating tabular data.

This guide assumes you have already started by Creating a dataset.

1. Locate the data you want to upload

You can upload data directly from your computer, or import from a linked account.

If importing, you'll want to get the relevant external account configured to your Redivis account before getting started.

The import tools allow for multiple uploads, so it's helpful to have them all in the same place.

2. Upload files

When you're ready, click the Upload files button on the Files tab of the dataset page and select the location your files will be coming from.

If you select "Computer" then you will get a browser window where you can choose the file, files, or folder you want to upload from. Choosing a different location will bring up the corresponding mechanism to select your files.

Once selected, you will need to choose a folder to put them in. All files on a dataset live within a folder, and each folder has a corresponding index table to help researchers understand and navigate those files. You can change this later if needed. If this is your first time uploading files to this dataset, you will need to create your first folder.

Click the "Upload" button to start the process. If you are uploading from your computer you'll need to wait on this screen until it's complete. Closing your window will end the upload process. If uploading from a linked source, you can close this window to allow this process to continue in the background.

3. Manage files and folders

You can click on the file name in the Files tab to preview the file and see its information.

You can also go to the Tables tab to view the index table for the folder you've uploaded files to. View the cells and hover or click on the file ID variable to see the preview here. This table can be used in Transforms within a Workflow to work with files on a large scale.

If you want to rename a folder you can do so on the Files tab by right clicking on the folder name in the right bar.

You can also create new folders and move files between two existing folders from this bar. When moving between folders, you can use a conditional statement to only move some files that match your conditions.

Next steps

Continue uploading your dataset

Great metadata makes your dataset usable. Complete your metadata, along with configuring access, creating a sample, and releasing this version.

Learn more in the Create & manage datasets guide.

Studies

Overview

A study is a group of users working together with a common goal. Studies allow users to apply as a group for access to datasets that have study requirements. They also make sharing workflows and user-created datasets easier.

Studies

You can create a study by clicking the + New button on the studies tab of your workspace. Any study you are a part of will appear in your workspace. Any edits you make will appear for all study collaborators.

Edit

You can add collaborators to your study. Collaborators can edit the study and will by default have edit access to all workflows created in this study.

When you create a study, you are assigned as the study's PI in the collaborators list. Organization administrators use this field to better track studies applying for their data.

If you add more users to a study you can change the PI assignment. Hover over a user's name and click Make PI to transfer the role. Note that once you change the PI you will no longer have the ability to change the PI assignment.

You can also give a study a description, which will help administrators and other study members better understand its goals.

Workflow

Any workflow created in this study will share edit access with all other collaborators in this study by default.

Datasets

The list of datasets will be automatically populated as you use datasets in workflows in this study. You can also add datasets to it directly in order to make the access process easier later on.

Compute credits and billing

Overview

Compute credits can be used to provision advanced compute environments for notebooks.

The default notebook configuration is a free resource on Redivis, and has access to 2 CPUs and 32GB working memory, alongside a 60GB SSD disk and gigabit networking. This is similar to most personal computers, and for many analyses should be plenty of compute power!

If you are working with larger tables, creating an ML model, or need to use a GPU you can choose to configure more advanced compute resources when setting up your notebook.

Advanced compute environments will cost a number of compute credits per hour to run depending on the machine type. This amount will be clear when configuring and starting the notebook. In order to start a notebook with an advanced compute configuration you must have enough compute credits to cover at least 15 minutes of running time.

Purchase compute credits

Credits cost $0.10 each and can be purchased in increments of 100 ($10), 200 ($20), 500 ($50), or 1000 ($100). Click the Purchase credits button and choose the amount you'd like to purchase.

You can purchase credits immediately using a credit card or bank account, or through an invoice.

Credit card / Bank

When you click Checkout you will be redirected to a Stripe payment processing page. The card you enter there is processed by Stripe and is never seen by Redivis or stored in our databases.

Invoice

You can choose to generate an invoice for your credit purchase. Any information you enter in the custom fields section will appear on the invoice and might be required by the organization paying it. Once generated, the invoice will be downloaded as a PDF to your computer. You can also return to view the invoice and copy a link to pay it electronically. Once the invoice is paid, the credits will appear in your account.

Auto-purchase compute credits

You can set up your account to purchase credits every time your account dips below 10 compute credits. This ensures that your notebooks will never halt mid-session due to a lack of compute credits. Here, you can select the amount of credits you'd like to purchase every time this condition is met and the card you would like to be charged for it. If you'd like to change the amount or cancel auto-purchase you can do so by returning to this screen.

Refund compute credits

Credits on Redivis never expire and can be refunded at any time. If you would like to refund the compute credits on your account please contact us to initiate the refund process.

Billing dashboard

Redivis uses Stripe for all payment processing. If you'd like to view or edit your card on file or view previous purchases you can do so in the Stripe billing dashboard, which is linked from this page.

Usage limits

User accounts are currently limited to 10GB of storage. You can see your current usage on the Settings page of your Workspace, under the Compute Credits and Billing tab.

If you'd like to increase that limit and host larger datasets, Contact us about creating an Organization!

Tables

Overview

All datasets that contain data will have at least one table (and up to 1,000), displayed on the Tables tab of the dataset page. Tables are the "container" for all data on Redivis, including for geospatial and unstructured data types.

Tables are created by data editors when they upload data, and users' access to these tables will be governed by the dataset's access configuration.

Tables can be explored from the dataset page, including the ability to view variables and summary statistics, cell contents, and run one-off queries against the table.

These tables are then further utilized from within a workflow.

See here for full documentation concerning tables and their functionality.

Creating tables

New tables can be created on the Tables tab of the dataset editor page by clicking on + New table. Tables can only be created on the unreleased next version of the dataset – if the dataset doesn't have a next version, you'll need to create the next version first.
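
Tables can also be created programmatically. The snippet below is a minimal sketch that reuses the client library calls shown in the Programmatic uploads examples elsewhere in this documentation; the username, dataset name, and table name are placeholders, and the dataset must already have an unreleased next version.

import redivis

# Placeholders: substitute your own username and dataset/table names
dataset = redivis.user("your_username").dataset("Dataset name")

# Creates the table on the dataset's unreleased next version
table = dataset.table("New table name").create(description="Some description")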

Updating tables

Table metadata, such as the table name, description, and entity, can be edited at any time from the dataset editor page, even on released versions. The table's data can only be updated on an unreleased version; see the uploading data documentation for more information.

Deleting tables

Tables can only be deleted on the unreleased next version of the dataset – if the dataset doesn't have a next version, you'll need to create one first. Note that deleting a table only deletes it on the version you're editing. It will still exist on any historic versions of the dataset.

FAIR data practices

Overview

Redivis is built on FAIR data principles to support data discovery and reusability at all levels. Data practices adhering to these principles emphasize data that is:

Findable

Data is findable when (1) data and metadata are assigned a globally unique and persistent identifier, (2) data are described with rich metadata, (3) metadata clearly and explicitly include the identifier of the data they describe, and (4) metadata and data are registered or indexed in a searchable resource.

At Redivis:

  • DOIs can be issued for datasets, facilitating authoritative citation and attribution.

  • All datasets have a provenance section of documentation which auto-populates with administrator actions and allows for additional linking to other artifacts that were part of the dataset creation process.

  • Redivis has comprehensive search tools that index all aspects of a dataset, including the metadata, documentation, variable names, and variable documentation.

Accessible

Data is accessible when: (1) data and metadata are retrievable by their identifier using a standardized communications protocol, (2) the protocol is open, free, and universally implementable, (3) the protocol allows for an authentication and authorization procedure, where necessary, (4) metadata are accessible, even when the data are no longer available.

At Redivis:

  • Data can be explored through any web browser.

  • Public data and analyses can be explored without an account.

  • Researcher accounts to apply for data access or do analyses are always free.

  • Researcher accounts can be linked to institutional login credentials.

Interoperable

Data is interoperable when: (1) data and metadata use a formal, accessible, shared, and broadly applicable language for knowledge representation, (2) data and metadata use vocabularies that follow FAIR principles, (3) data and metadata include qualified references to other data and metadata.

At Redivis:

  • Robust APIs support interoperability with other tools.

  • Analysis tools use common languages such as SQL, Python, R, Stata, and SAS.

  • Data and metadata are available for download in multiple common formats.

Reusable

Data is reusable when: (1) data and metadata are richly described with a plurality of accurate and relevant attributes, (2) data and metadata are released with a clear and accessible data usage license, (3) data and metadata are associated with detailed provenance, and (4) data and metadata meet domain-relevant community standards

At Redivis:

  • Datasets are automatically version controlled.

  • Many dataset, table, and variable metadata fields are automatically populated based on user action, with the ability to be manually adjusted.

  • All datasets have a Usage tab with information on how they have been viewed and used across Redivis.

  • Redivis cloud-based analysis tools encourage data users to do analysis alongside the data rather than downloading it and breaking linkages.

  • Workflows containing analyses are self-documenting and capture the full analysis pipeline.

  • Anyone can fork a workflow they have access to in order to continue an analysis.

Create a study

Overview

Studies are a way for one or more collaborators to work with data on the same conceptual topic. You can add multiple collaborators, datasets, and workflows to a study to organize your workflow. In addition, some restricted data requires you to submit access applications as a study via study requirements.

1. Create a new study

On the studies tab of your workspace you can create a new study and see all studies you are a part of. This study might represent a topic or group you are working with, maybe to investigate a similar cluster of research questions. You can give the study a name and description to reflect that purpose.

2. Add collaborators

If you are working with anyone else on this study you can add them as a collaborator. Anyone who is a collaborator on this study will be able to view and edit the study just as you can. Once added, this study will appear in their workspace.

PI

One person on the study can be designated as the PI. This designation does not have meaning within Redivis interfaces; it is there to inform others in the group, or the administrators you are applying to for access, of the group structure.

3. Add datasets

You can add any dataset to this study that you plan on working with, and remove it later if your topic shifts. This space is intended for you to gather resources that you are using in one place to make it easier to create workflows or manage data access. Datasets can be added to any number of studies.

4. Add workflows

You can create a new workflow in this study or move an existing one in. A workflow can be in at most one study, and you can see which study a workflow is in on the workflow's overview page.

Granting workflow access to study members

One of the options for sharing this workflow is to grant view or edit access to all members of the study it's in. This can be an easy way to give everyone in a group access to a workflow you are working on.

As with all workflows, even though someone might have access to the workflow they will need to have independent access to all the datasets it contains in order to view or query them.

5. Apply for data access via a study

If you're working with restricted data you might come across a dataset that has a study requirement as part of its access requirements. You can identify this by the study icon next to the requirement name.

Study requirements are filled out and submitted on behalf of an entire study, rather than one person. An approved study requirement will be valid for all collaborators on the study, unlike member requirements where each individual needs to complete the requirement on their own account.

To submit a study requirement you can navigate to a restricted dataset and open the access modal. You will need to select your study from the dropdown menu, and a submit button will appear. One member of your study will fill this out and submit it on behalf of your group. Once approved, all study collaborators will see an approved requirement in their access modal.

Data usage

When restricted data has a study requirement, the data administrator can approve your application submission for use of the data within that study. In order to query data in a workflow, that workflow will need to be in the approved study. If it is not, you will see a badge noting Limited access and you will not be allowed to run transforms or query this data in a notebook.

Next steps

Start analyzing your data

Once you've gained access and set up a study it's time to add your datasets to a workflow to transform and analyze them leveraging lightning fast tools from your browser.

Learn more in the Analyze data in a workflow guide.

Export & publish your work

Overview

Redivis combines powerful functionality to reshape and analyze data on platform, with an easy export and publishing flow, to ensure the results of your work can be displayed in the format of your choice and shared with your collaborators and research community.

First working out the broader picture of the content you'd like to publish, and where you'd like to publish it, will help you then determine the desired shape and format of the assets to be exported, and finally the specific sources of the data in your Redivis workspace.

1. Develop a publishing strategy

Ask yourself: What story are you trying to tell? Who/where is the audience? What component pieces are crucial to building the narrative? Your ideal package of assets may be very different if you're hoping to share progress on a workflow with a principal investigator, take a snapshot of a single variable distribution for a colleague, publish results in a journal, or build a custom dashboard to highlight multiple trends.

You may publish a combination of tables, unstructured files, code snippets, graphs, and descriptive text, so sketching out these component pieces in increasing detail will help define your end product.

2. Choose your desired formats

With an initial strategy in mind, learning about the different types of Redivis exports will help define your list of assets to generate.

Tabular data containing rows and columns can be exported in a variety of formats, accessed from many environments via our client libraries, or embedded in a website.

Unstructured data files of any type can be previewed and downloaded in their original format, or accessed programmatically.

Notebooks containing code inputs and corresponding outputs can be exported as an .ipynb file, PDF, or HTML, or (coming soon!) embedded in your site.

Learn more in the Export to other environments guide.

3. Identify the data sources

In a workflow, you can generate output tables to capture the result of a set of data transformations, or use notebooks to show a line-by-line data analysis flow and relevant figures.

As a dataset creator, you can upload your own tabular data and unstructured files to create assets to use in your workflows or share with others.

Whether you're manipulating data in a workflow to showcase results or hosting your own dataset, you'll want to build a set of specific tables or notebooks you're trying to share. As you iteratively modify the shape and content of these assets – transforming data in a workflow to output new tables or using notebooks to build analysis flows in Python or R – you'll fine-tune each piece of your final publication.

Next steps

Upload your own datasets

Augment your data analysis in Redivis by uploading your own datasets, with the option to share with your collaborators (or even the broader research community).

Learn more in the Create & manage datasets guide.

Build a custom dashboard

For full customizability, you can publish a static site that accesses Redivis data to power an interactive visual dashboard – a more permanent, web-based approach to highlighting results of your data analysis or generated data.

Learn more in the Build your own site guide.

Video guides

For more extensive full-workflow video walkthroughs, see our Events and press page, or read through our Example workflows.

Datasets

Overview

Datasets are a core component of Redivis. Datasets contain various metadata and documentation, as well as one or more tables and/or files containing data.

All datasets have their own persistent URL and are uploaded by either a user or an organization. Datasets can be added to workflows to analyze and combine with other datasets across Redivis. Datasets are automatically versioned, and you can always update the version you are viewing or working with.

Some components of a dataset may not be available to you until you are granted access. In order to see the existence of a dataset, you must at least have overview access.

You can create your own datasets to use in your workflows and share with colleagues, or create datasets within any organization that you administer.

New to Redivis? Learn more in our Create & manage datasets guide.

Managing logins

Overview

In order to create a Redivis account, you must authenticate through your academic institution, through any Google account, or via passwordless login. Over 2,500 academic institutions from around the world are currently supported, with more being added regularly.

Redivis will request some basic information when you authenticate, such as your name, email, and any relevant affiliations; it will not have access to any other information through your Google or institutional account.

Adding emails / authentications to your account

It is strongly encouraged that each individual have only one Redivis account, and that your identity on Redivis map to your real-world identity. However, it is common to have multiple emails that you want to associate with your Redivis account: a personal email, a university email, a visiting-scholar university email.

In order to add an email to your account, navigate to the settings page of your workspace and click Add authentication in the "Authentications" tab. A new browser window will open requesting the relevant credentials, after which the new email will be associated with your account.

If you lose access to all of the emails associated with your account, you will no longer be able to log in to Redivis.

For this reason, it is strongly encouraged that you add a personal email to your account. This way, if you ever lose access to your institutional email, you will still be able to log in to your Redivis account (though you may not have access to certain datasets that required your previous institutional affiliation - see below).

Logins and organization membership

When you join an organization, your membership in that organization will be associated with a particular login. In order to work with that organization's restricted data, you must have recently authenticated with the login associated with your membership. If you sign in to Redivis using a different login (for example, you signed in with your personal email while your membership is associated with your academic institution's credentials), you'll be prompted to re-authenticate with the relevant login before you can work with restricted data.

If you lose access to the login originally associated with your membership — for example, if you change institutions — you can request that your membership be updated to associate it with a new login. An administrator of the organization will then need to approve this request before access through your membership will be restored.

Merging accounts

If you have more than one Redivis account, we strongly encourage you to merge them. In order to merge multiple accounts, you should perform the following steps:

  1. Identify which account should be your "primary" Redivis account once the process is complete.

  2. Log in to your non-primary account.

  3. Transfer ownership of all datasets, workflows, and studies to your primary account. Alternatively, you can delete any datasets, workflows, and/or studies that are no longer relevant.

  4. Delete the non-primary account by clicking "Delete account" within your workspace settings. WARNING: This action cannot be undone.

  5. Log in to your primary account, and add the authentication(s) that were previously associated with your non-primary account to your primary account.

Exporting content

Overview

Datasets provide a persistent, version-controlled store of data and their metadata. Datasets can be queried and analyzed within workflows, and in most cases it will make sense to first perform initial analyses on Redivis before downloading data (particularly for larger datasets).

However, in certain situations you may want to download some or all of the dataset contents, or its metadata, for future reference and analysis on external systems.

Download metadata

Top-level metadata about the dataset can be downloaded by clicking on the Download metadata link in the dataset overview. Metadata can be downloaded in the DataCite, Schema.org, and Redivis API schema specifications.

Download citations

From the dataset overview, click on the Bibliography button to view a full bibliography of your dataset and its data sources. You can copy or download this citation information in APA, CFF, or BibTex formats.

Download and Export data

Any table within a dataset can be downloaded or exported to a supported environment, or read into another environment through the Redivis API, pursuant to any export restrictions applied by the dataset's administrators. See the full documentation on exporting tables to learn more.

Extract content via the API

The REST API (and its Python and R wrappers) provides numerous methods for interfacing with a dataset and its contents, and in many cases will be the most flexible mechanism to extract dataset metadata and data to external systems.
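
As one example, reading a table into a pandas DataFrame with the redivis-python library might look like the sketch below. The owner, dataset, and table names are placeholders, and the to_pandas_dataframe method name is an assumption based on the redivis-python documentation – confirm the exact signature there. Any reads remain subject to export restrictions applied by the dataset's administrators.

import redivis

# Placeholders: substitute the dataset owner, dataset name, and table name
table = (
    redivis
    .user("dataset_owner")
    .dataset("Dataset name")
    .table("Table name")
)

# Assumed method per the redivis-python client library documentation
df = table.to_pandas_dataframe()
print(df.head())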

Your account

Overview

Your Redivis account establishes your identity on Redivis, and provides a home for various forms of data-driven investigation and collaboration.

Because researchers regularly move between institutions, and often want to be able to reference and access historic work, it is strongly encouraged that you:

  • Only have a single Redivis account, much as you would only have a single Google Scholar account.

  • Add multiple logins (e.g., university credentials, personal Google account) to your Redivis account, to ensure that you can always access it.

Communication and privacy

Redivis will never, ever distribute your personal information to a third party. We will send data access notifications to your contact email, though this can be disabled in your workspace settings. You may opt in to receive occasional product updates, though this is turned off by default.

When you apply for access to an organization's datasets, they will be able to see some information pertinent to your membership in their organization:

  • Full name

  • Contact email

  • Email / institutional identifier

  • Affiliation, if provided by your institution's login provider

Creating an account

Overview

To create a new account, click the Create account button at the top of any page. You can create an account using your academic institution's login credentials (e.g., [email protected]), by logging in with any Google account, or via passwordless authentication through any email. Your account can then be managed from your workspace.

It is strongly encouraged that each individual has only one Redivis account, and your identity on Redivis should map to your real-world identity. You can (and should) tie multiple emails to one account — e.g., personal email, university email, visiting-scholar university email.

By creating an account on Redivis, you can gain access to hundreds of restricted datasets and powerful querying tools. There is no cost to creating an account, and you may permanently delete your account at any time.

Your account tracks all of your workflows, queries, collaborations, and data approvals with different organizations. After creating an account, you will be able to apply for membership with various organizations on Redivis and request access to their restricted data.

Unstructured files

Overview

In order to upload unstructured data to a dataset you'll need to upload it as a file on the Files tab in the dataset uploader.

It is possible to upload tabular data as a file, but it will not be represented as a table. This means it is not possible to preview the variable statistics, cells, or query interfaces, use it in a transform node, or control access at different levels, so this is generally not recommended.

Uploading files

From the Files tab of the dataset editor, click the "Upload files" button to import one or many files. You can import files from a variety of data sources, or perform uploads programmatically.

Quotas & limits

Limits for upload file size and max files per dataset are specified here.

Folder management

When preparing your upload you will need to select a destination folder. All files must be in one folder. Each folder has a corresponding index table which will allow researchers to more easily work with the files in bulk. If you are uploading many files, your folder structure will likely be important – for example, a different folder for different categories of imaging files.

You can create folders on the grey bar on the right side of the Files tab in a dataset. Folder names must be unique. You can click on the ⋮ menu next to any folder name to manage that folder.

To move files between folders you can right click on an individual file to change its folder. If you have multiple folders you can click the Move files button below files in the right bar to move files between folders.

Workflows

Overview

Redivis workflows are high-performance, collaborative environments for analyzing data. In a workflow, you can combine datasets across Redivis, and build out your analysis in an iterative, reproducible manner. Workflows can contain:

  • Data sources, which bring data from across Redivis into your workflow. Data sources can either be a dataset that you have access to, or another workflow, allowing you to build on top of others' analyses.

  • Transforms, which join, filter, aggregate, and reshape tabular data. Transforms are particularly important when working with large tables, allowing you to query billions of records in seconds.

  • Notebooks, which execute Python, R, Stata, or SAS in a customizable, interactive compute environment.

  • Tables, which are created as outputs of transforms and notebooks. On any table, you can validate your results, build out further analyses, create new Redivis datasets, or export the data to another environment.

Workflows are owned by either a user or organization, and can be shared with other users and organizations as you work together in real time.

New to Redivis? Learn more in our Analyze data in a workflow guide.

Case (if/else)

This common method utilizes if-then-else logic, assigning a result when the corresponding condition evaluates to true, otherwise assigning the final ("else") result:

CASE
  WHEN @condition THEN @result
  [ ... ]
  [ ELSE @else_result ]
  END

Return type

dynamic (input-dependent)

Parameters

Name | Type | Allowed values | Required | Placeholder (in UI)
@condition | (nested) conditions – like those used in a Filter | any Redivis type | true | -
@result | variable or literal | any Redivis type | true | (Variable or value)
@else_result | variable or literal | any Redivis type | true | (Variable or value)

Programmatic uploads

Overview

In addition to uploading data through the browser interface, you can leverage the redivis-python and redivis-r libraries, as well as the generic REST API, to automate data ingest and data release pipelines. These libraries can be used for individual file uploads similar to the interface, as well as for streaming data ingest pipelines.

Basic examples are provided below. Consult the complete client library documentation (Python, R) for more details and additional examples.

redivis-r

library(redivis)

# Could also create a dataset under an organization:
# dataset <- redivis$organization("your_organization")$dataset("Dataset name")
dataset <- redivis$user("your_username")$dataset("Dataset name")

# public_access_level can be one of ('none', 'overview', 'metadata', 'sample', 'data')
dataset$create(public_access_level="overview")

# Create a table on the dataset. Datasets may have multiple tables
table <- (
    dataset
    $table("Table name")
    $create(description="Some description")
)

# Upload a file to the table.
# You can create multiple uploads per table, in which case they'll be appended together.
upload <- table$upload()$create(
    "./data.csv",           # Path to file, data.frame, raw vector, etc
    type="delimited",       # Inferred from file extension if not provided
    ...                     # See documentation for all parameters
)

# Optional: add more uploads to this table, or create other tables.
# Multiple uploads on the same table will be appended together.

# Release version
dataset$release()

redivis-python

import redivis

# Could also create a dataset under an organization:
# dataset = redivis.organization("your_organization").dataset("Dataset name")
dataset = redivis.user("your_username").dataset("Dataset name")

# public_access_level can be one of ('none', 'overview', 'metadata', 'sample', 'data')
dataset.create(public_access_level="overview")

# Create a table on the dataset. Datasets may have multiple tables
table = (
    dataset
    .table("Table name")
    .create(description="Some description")
)

# Upload a file to the table:
upload = table.upload().create(
    "./data.csv",           # Path to file, data frame, raw bytes, etc
    type="delimited",       # Inferred from file extension if not provided
    ...                     # See documentation for all parameters
)

# Optional: add more uploads to this table, or create other tables.
# Multiple uploads on the same table will be appended together.

# Release version
dataset.release()

Uploading data

Overview

Getting data into Redivis is the first and most important step in making it accessible for your research community. Redivis is designed to make it easy for you to securely ingest data, at scale. To begin the data upload process, first navigate to the dataset editor.

For a guided walkthrough of uploading data to a dataset, please see the Creating a dataset guide.

For a guided walkthrough of how to clean tabular data, please see the Cleaning tabular data guide.

Redivis supports a wide variety of data types, sources, and upload methodologies. When beginning to upload data, you should first ask:

What type of data is it?

  • Tabular data

  • Geospatial data

  • Unstructured data

Where is the data?

  • A computer

  • Google Drive

  • Box

  • Google Cloud

  • AWS

  • Another table on Redivis

  • ...etc

Where is the metadata?

  • Embedded in the file

  • In another file

How do I want to perform the upload?

  • Through the browser (default)

  • Programmatically, via Python or the API (see the sketch below)
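
If you choose the programmatic route, the core upload call is sketched below, reusing the API demonstrated in the Programmatic uploads examples; the username, dataset, and table names are placeholders, and the dataset must have an unreleased next version. See that section for complete end-to-end examples.

import redivis

# Placeholders: substitute your own username and dataset/table names
table = redivis.user("your_username").dataset("Dataset name").table("Table name")

# Append a local file to the table; type is inferred from the file extension if omitted
table.upload().create("./data.csv", type="delimited")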

Archival & deletion

Overview

Datasets on Redivis are intended as a persistent store of information, and this persistence is critical to ensure the reproducibility of analyses using the datasets (and specific versions thereof).

In some cases, it may be necessary to archive a dataset, so as to prevent future usage and save on cloud costs. In other scenarios, you may need to delete a specific version (e.g., to reduce storage costs, or permanently retract inappropriately released data). Finally, in some situations it may be necessary to fully delete a dataset, for example, when a license to that dataset has expired.

Archival operations are always reversible, whereas deletion of versions and datasets becomes permanent after 7 days.

When a version or dataset is deleted or archived, all derivative tables in researchers' workflows that correspond to that version (or dataset) will be archived, meaning that the underlying data in these derivative tables is deleted. Note that researchers will no longer be able to read from or query these derivative tables.

This is often helpful to ensure compliance with data deletion requirements, in that all derivative data is expunged at the time of deletion or archival. However, you should be aware of the potential impact this may cause to your researchers, and communicate with them appropriately.

Dataset archival

When a dataset is marked as archived, its data can no longer be queried or modified (though its metadata can still be edited). All derivative tables in workflows will be marked as archived, which prevents reading or querying the data. This can help ensure that a dataset remains inactive and limit storage costs, though the dataset may be unarchived at any time.

To archive a dataset, either right-click on the dataset when viewing the list of your datasets, or click the Archive button on the dataset settings.

Version deletion

Any version of a dataset can be deleted, as long as it is not the currently released version. Deleting this version will delete all metadata and data associated with it.

This version will no longer be available in any workflows, and researchers will lose access to any tables referencing it in their workflows. In order to continue working with the dataset, researchers will need to change their analyses to a non-deleted version.

If you delete an unreleased next version, this deletion is permanent and cannot be undone. If you delete a released version, there is a 7-day window within which the version can be undeleted, after which the deletion becomes permanent.

If you are deleting versions to reduce storage costs, be aware that Redivis stores data efficiently across versions – the storage used by a particular record will be deleted only if it is unique to the deleted version (or, if deleting a series of versions, if that record doesn't exist in any non-deleted version).

The total storage savings of deleting a version will be shown to you prior to deletion confirmation.

Dataset deletion

Once deleted, the dataset will no longer be discoverable, though it will still show up in users' workflows that reference the dataset, and bookmarked URLs and DOIs will still resolve to the dataset's landing page. You can also view a list of deleted datasets by navigating to your workspace or organization administrator panel, and filtering datasets by status: deleted.

To ensure future reproducibility, dataset metadata and documentation are preserved upon deletion. However, all data will be fully expunged, and the dataset will no longer be queryable.

The dataset's public access level will be persisted in its deleted state – meaning that if the dataset was previously visible, it will still be visible (but not discoverable) once deleted. Additionally, any users who explicitly had overview or metadata access to the dataset prior to deletion will have their access persisted upon deletion. These default access rules can be modified by navigating to the deleted dataset and reconfiguring access accordingly.

Datasets can be undeleted for 7 days after deletion by navigating to the dataset settings and clicking the Undelete button. After 7 days, deletion becomes permanent and the dataset's data will be unrecoverable.


Open access

Overview

The ability to view, reproduce, and build upon other works is a core tenet of the scientific process that Redivis is built to uphold and facilitate. Redivis tools for data storage and analysis comply with federal mandates for open access and automatically generate artifacts documenting all steps in the data process, ensuring reproducibility.

Background

In August 2022, the United States Office of Science and Technology Policy (OSTP) released a memorandum on public access to federally funded research (known as the “Nelson Memo”), which outlined new requirements affecting both faculty and students who conduct research using federal funding:

  1. Make publications and their supporting data resulting from federally funded research publicly accessible without an embargo on their free and public release.

  2. Enact transparent procedures that ensure scientific and research integrity is maintained in public access policies.

  3. Ensure equitable delivery of federally funded research results and data.

1. Public accessibility

Redivis is built to make data sharing as easy and safe as possible, defaulting to fully accessible data, metadata, and documentation. However, some data must be restricted for privacy and to comply with data licensing agreements. In these cases Redivis follows an "as much as securely possible" sharing philosophy.

  • Multiple levels of access allow data owners to grant access to documentation, metadata, samples, and data separately, so that as much access as possible can be granted.

  • When research is done using restricted data, analysis code is made available when possible.

  • Viewing public data and analysis workflows on Redivis is completely public and does not require a Redivis account.

  • If any piece of analysis or data is restricted, a user will need a Redivis account to apply for access. Redivis accounts are completely free with no restrictions.

2. Transparent procedures

Redivis systems are built to automatically capture and document any work done on datasets and analyses in standardized formats, while giving researchers and administrators the ability to supplement or override these when necessary.

  • Datasets are automatically versioned. All changes to variables and rows are recorded and made available to administrators and data viewers.

  • Redivis offers a no-code interface to build transform queries that compiles to SQL code. This code is available to the analyst and any viewers.

  • Workflows are self-documenting. Every step taken in an analysis workflow is recorded sequentially in a visual format that is easy to follow.

  • Workflows are version controlled. Every step is versioned and time-stamped, allowing users to revert to a previous iteration, or a viewer to understand how queries might have changed over time.

  • Workflows can be forked, preserving relationships between workflows and datasets.

3. Equitable delivery

Redivis is free and accessible to any researcher, reviewer, or casual data browser. By centralizing multiple complex data storage and analysis systems behind a clear UI, Redivis makes it easier to understand full workflows without specific technical knowledge.

  • Redivis uses transparent, open-source formats, allowing data, analyses, and workflows to be exported and used within other systems.

  • APIs make data and workflows available to researchers and administrators using their own systems.

  • No-code interfaces allow researchers at any skill level to build and understand research steps.

  • Common, open-source coding languages (SQL, Python, R) are foundational to Redivis tools and are supported in analysis workflows.

  • There is no cost to researchers to store personal data, or run data queries.

Grant access to data

Overview

One of the strengths of Redivis is that researchers can apply for, and gain access to, data on the same page where they discover the dataset. Moreover, all access applications are tracked, both for administrators' reference and to provide transparency and timely notifications to researchers.

Responding to access requests in a timely manner creates a great environment for researchers and takes full advantage of these automatic systems.

1. Configure notifications

You can configure multiple email addresses to receive an email when a request for access is made to your organization, as well as the frequency at which you receive them. Especially if you are a smaller organization that isn't getting access requests every day, we very strongly recommend that you have at least one administrator receiving emails so you know when to respond to a request.

Learn more in the Settings reference section.

2. Approve membership requests

Users must be approved as members of your organization before any other access requests can be approved.

When a membership request is pending, you will see an alert (!) on the Members tab of your administrator panel. Click on any member with an alert to see their pending submission awaiting approval.

Pending members will have an Approve / Reject button as soon as you click on their page, alongside the authentication information they submitted when requesting to be a member. It is important to verify that this person has a genuine email that seems consistent with who they say they are.

3. Approve requirement submissions

As users ask for approval you will receive an alert (!) in the administrator panel where your attention is needed.

Member requirements and export restriction exceptions will appear on the Member tab.

Study requirements will appear on the Studies tab.

Direct access requests will appear on the Datasets tab.

When all of the pending requests are resolved, the alert will disappear.

For each request you will have the option to approve or reject the submission. You can always come in later to revoke the approval if needed. You might also see that the member has submitted updates to their existing approval. In that case the approval is still valid while updates are pending, unless it is otherwise revoked or expired.

Learn more in the Approving access reference section.

4. Leave comments for clarity

Both administrators and members have the option to leave a comment on any requirement submission. The act of leaving a comment will trigger an alert for administrators, or a notification for members.

Especially when you are rejecting or revoking a requirement submission, we recommend leaving a comment to let the member know what they did wrong, or what they still need to do to gain access.

5. Audit access

It's often a good idea to double check which researchers have access to your datasets.

In order to audit a particular member's access to your organization's datasets, double click on that member, and navigate to the Access overview tab. This will list all of your organization's datasets, and the member's corresponding access to each.

In order to audit all members that have access to a particular dataset, click the filter icon to the right of the member search bar. This will allow you to filter all members by their access to, and usage of, your organization's datasets.

Next steps

Expand your reach

If you are part of a larger institution, you can contact us about getting an institution-wide Redivis page to help users discover new organizations and datasets.

Generate a report

Overview

All activity surrounding your organization and its data usage is logged and accessible to organization administrators. This can be helpful for year-end reports or any potential data security concerns!

You can easily organize, visualize, and download any of your organization's information by generating a report in your administrator panel.

1. Create a report

Navigate to the Reports tab of your administrator panel and click the + New report button.

In the future you can return to this page to see your existing reports and create duplicates to further edit.

2. Choose the entity of the report

Start by choosing the entity, or topic, of the report. Whatever you choose here will define what each entry in your report represents.

You can choose a Resource such as Members, Workflows, Studies, or Datasets which will allow you to get a sense of the named entities your organization owns or has interacted with.

Or you can choose a Usage event such as Queries, Notebooks, Exports (or all of these combined) which will allow you to see how your organization's data has been used.

3. Define constraints

From here, define the other criteria of your report:

Time frame

If you choose a usage event as your entity you will need to set a time frame to define which events will be included. (Resources don't have a time associated with them since they are not events.)

Aggregation

If you chose a usage event you can optionally choose to aggregate events based on a resource or period of time. Perhaps you want to see all queries grouped by member, or want to see all notebooks per day. You can also create custom aggregation groups by using member labels and dataset labels with colons. E.g. labeling members with school:medicine and school:business would allow you to aggregate a report by school.

Filter

You can additionally choose a filter that will restrict your report to only events or resources that meet your chosen criteria. For example, if you want to see all usage events that referenced a particular dataset.

Learn more in the Reports reference section.

4. Define the report's fields

Finally, you'll need to select what fields or topics are present in the report. Every field selected here will become a column in your resulting report table. Different fields are available based on what entity you have selected.

Some helpful fields are selected by default, but you can make your own selections. For example, maybe you have a report of data usage and you want to include the average number of tables referenced in each notebook session, or the max amount of compute used in queries per day.

If there are any fields or other report information you would like but don't see available in the interface, please contact us to let us know!

5. View the report results

Click the Create button to generate this report! This will generate a table with entries that match all the criteria you've defined above. You can browse this table to understand the data and make sure it looks how you'd expect.

If your report contains any numeric data you can click the Visualize tab to see a visual representation of 1-2 of your numeric fields. The type of graph will be chosen by default based on the shape of your data. You can hover on this graphic to learn more about each entry.

6. Iterate and export

You can update this report by clicking the Edit button. Changing any criteria will refresh the table and visual with new contents based on your new criteria.

This report is now saved in your organization's reports and you can access it in the future.

Any time you edit the report or come back to it after more than one hour has elapsed, the table will be automatically regenerated. Based on the time frame you've selected this might drastically change the contents.

If you'd like to work with the data in a different environment you can download the report table as a CSV by clicking the Download button.

Next steps

Expand your reach

If you are part of a larger institution, you can contact us about getting an institution-wide Redivis page to help users discover new organizations and datasets.

External sources

By default, you may upload data from your local computer, a public URL, or from another dataset or workflow on Redivis. However, Redivis supports numerous integrations for data ingest across common sources. You'll need to enable data sources in your account workspace settings in order to import the data they contain.

When you enable a data source, you'll be prompted to log into the corresponding account. Redivis will only ever read data from these sources when explicitly requested, and it will never modify or overwrite content.

Once configured, you'll see any added data sources appear as an option when uploading data.

Available data sources

Google Cloud Storage

You may import any object that you have read access to in GCS by specifying a bucket name and path to that object, in the form /my-bucket/path/to/file. You may import multiple objects at once by providing a prefix followed by wildcard characters, e.g.: /my-bucket/my-folder/* .

The following wildcard characters are supported:

  • * : Match any number of characters within the current directory level. For example, /my-bucket/my-folder/d* matches my-folder/data.csv , but not my-folder/data/text.csv

  • ** : Match any number of characters across directory boundaries. For example, my-folder/d** will match both examples provided above

  • ? : Match a single character. For example, /my-bucket/da??.csv matches /my-bucket/data.csv

  • [chars] : Match any of the specified characters once. For example, /my-bucket/[aeiou].csv matches any of the vowel characters followed by .csv

  • [char range] : Match any of the range of characters once. For example, /my-bucket/[0-9].csv matches any number followed by .csv
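As an illustration of these rules (not the importer's actual implementation), the sketch below converts a path pattern into a regular expression with the same *, **, ?, and [...] semantics, so you can check which object paths a pattern would cover before running an import. The glob_to_regex helper is hypothetical and exists only for this demonstration.

import re

def glob_to_regex(pattern):
    """Illustrative conversion of the wildcard syntax above into a regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "*":
            if pattern[i : i + 2] == "**":
                out.append(".*")       # '**' crosses directory boundaries
                i += 2
                continue
            out.append("[^/]*")        # '*' stays within the current level
        elif ch == "?":
            out.append("[^/]")         # '?' matches exactly one character
        elif ch == "[":
            end = pattern.index("]", i)
            out.append(pattern[i : end + 1])   # character sets/ranges pass through
            i = end + 1
            continue
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("^" + "".join(out) + "$")

matcher = glob_to_regex("/my-bucket/my-folder/d*")
print(bool(matcher.match("/my-bucket/my-folder/data.csv")))       # True
print(bool(matcher.match("/my-bucket/my-folder/data/text.csv")))  # False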

Amazon S3

You may import any object that you have read access to in S3 by specifying a bucket name and path to that object, in the form /my-bucket/path/to/file. You may import multiple objects at once by providing a prefix followed by a wildcard character, following the same syntax and rules as outlined for Google Cloud Storage above.

Google Drive

You may import any file of valid format that you have stored within your Drive, including Google Sheets. Upon choosing Google Drive as your import source, a modal will open that allows you to browse and select files from your Google Drive.

Google BigQuery

You may import any table that you have read access to in BigQuery, including views, materialized views, and external tables. You must specify the table in the form project.dataset.table . To import multiple tables within a dataset, you may use wildcards. E.g., project.dataset.* or project.dataset.prefix* .

Box

You may import any file of valid format that you have stored within Box. Upon choosing Box as your import source, a modal will open that allows you to browse and select files from your Box account.

OneDrive

Coming soon. Please contact [email protected] if this integration would be helpful for your use case so that we can prioritize.

Redivis

You can import any table, which can be particularly helpful with ETL workloads where you want to import a cleaned version of your data (example). You can also import any uploaded files into your table, supporting workflows where tabular data is initially loaded as a file before being loaded into a table.

You must both have data access and the ability to export any table or file that you import.

Importing tables

You can reference any table on Redivis using the form user|organization.dataset|workflow.table. That is, specify its owner (a user or organization), its containing entity (a dataset or workflow), as well as the table name, separated by periods.

Importing files

All files on Redivis belong to a file index table. To import a file, first specify the index table, followed by a forward slash (/), and then the file name. E.g.:

user_name.dataset_name.table_name/file_name.csv

To import multiple files at once, you can use wildcard characters, following the same pattern rules as specified for Google Cloud Storage above. E.g.:

user.dataset.table/prefix*.csv

Sampling

Overview

For datasets with large tables, it is often a good idea to include a 1% sample of the data, supporting faster exploratory queries as new researchers work to understand your data. Moreover, if a sample is configured, you will have the ability to control access to that sample separately from the full dataset.

Sampling is applied independently to each version of a dataset. You may modify the sampling methodology on a version at any time — even after it's been released — though keep in mind that this may affect researchers that are currently working with the dataset sample. As a best practice, it's good to configure and validate your sample before releasing a new version.

To configure sampling on your dataset, click the Configure sample button on the Tables tab of a dataset page.

Random sample

The simplest form of sampling, this will create a corresponding sample for every table in the dataset (including file index tables). Every record / file will have a 1% chance of occurring in the sample.

As a general rule, you should only use random samples if:

  • You have one table in your dataset, or

  • Researchers won't be joining multiple tables in your dataset together

If this isn't the case, consider sampling on a specific variable. Otherwise, as researchers join different tables together, they will start getting samples of a sample, since there is no consistent cohort of records between tables.

Sampling on a variable

For situations when you want researchers to be able to join tables within your dataset, consider generating a sample on a variable that exists in at least some of the tables in your dataset. Every value for this variable will have a 1% chance of being in the output set.

Importantly, this sampling is deterministic. This guarantees that the same values that fall in the 1% sample for one table will also occur in the 1% sample for another table in the same dataset. In fact, these sampled values will be consistent across Redivis, allowing researchers to even merge samples across datasets.

Note that the sample will be computed on the string representation of the variable. For example, if the value '1234' falls in the 1% sample, then we are guaranteed that the integer value 1234 will also fall within the sample. However, if this value is stored as a float (1234.0), it is unlikely to also fall in the sample, as the string representation of this float is '1234.0', which for the purposes of sampling is entirely different than the string '1234'.
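To make the determinism concrete, here is a minimal Python sketch of the general idea behind hash-based sampling. This is purely an illustration of the concept, not Redivis's actual sampling algorithm: the value's string representation is hashed, and a value is kept whenever its hash falls into a fixed 1% bucket, so the same string always lands on the same side of the cutoff in every table.

import hashlib

def in_one_percent_sample(value):
    """Illustrative deterministic 1% bucket test on a value's string form."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 == 0  # same string -> same bucket, every time

print(in_one_percent_sample("1234"))   # always the same result for '1234'
print(in_one_percent_sample(1234))     # identical, since str(1234) == '1234'
print(in_one_percent_sample(1234.0))   # may differ: str(1234.0) == '1234.0'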

When sampling on a variable, only tables with that variable will be sampled. This is useful for the case when some tables contain supplementary information to your primary cohort. For example, consider the case when your dataset has a "Patients" table, a "Hospitalizations" table, and a "Hospitals" table. We'd likely want to create a sample on the patient_id variable, which would create a 1% subset of patients and the corresponding hospitalizations for those patients. However, this wouldn't create a sample on the "Hospitals" table — which is what we want, given that the sample of patients is still distributed across a large number of hospitals.

If your dataset contains unstructured data files, you probably want to sample on either the file_name or file_id variables.

If only some of the dataset's tables are sampled, users with sample access to the dataset will have sample access to the sampled tables and full data access to the unsampled tables. While this is likely necessary for researchers to meaningfully work with the dataset sample (see paragraph above), it may have ramifications for how you configure your access rules.

Learn more about controlling sample access in the data access reference.

Step: Limit

Overview

The Limit step reduces the table to a set number of rows.

Example starting data:

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 | sam     | 74     |
 | pat     | 62     |
 *---------+--------*/

Example output data:

Limit to 2 rows

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 *---------+--------*/

Step structure

  • There will be one limit block where you will define your limit.

Field definitions

Limit

The number of rows you would like your output data to contain.

The limit step is non-deterministic if the result set isn't ordered. Different rows may be returned from subsequent queries.

Example

Let's say we are doing initial exploratory work on data. We did a number of complex steps and want to quickly execute the query so we can iterate.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields: Limit to 4 rows

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Numbering

Cumulative distribution

Return the relative rank of a row defined as NP/NR. NP is defined to be the number of rows that either precede or are peers with the current row. NR is the number of rows in the partition. –> learn more

CUME_DIST()

Return type

float

Parameters

Dense rank

Returns the ordinal (1-based) rank of each row within the ordered partition. All peer rows receive the same rank value, and the subsequent rank value is incremented by one. –> learn more

DENSE_RANK()

Return type

integer

Parameters

N-tile

Divides the rows into a set number of buckets based on row ordering and returns the 1-based bucket number that is assigned to each row. –> learn more

NTILE(@literal)

Return type

integer

Parameters

Name: @literal | Type: literal | Allowed values: any integer | Required: true | Placeholder (in UI): (integer)

Percent rank

Return the percentile rank of a row defined as (RK-1)/(NR-1), where RK is the RANK of the row and NR is the number of rows in the partition. –> learn more

PERCENT_RANK()

Return type

float

Parameters

Rank

Returns the ordinal (1-based) rank of each row within the ordered partition. All peer rows receive the same rank value. The next row or set of peer rows receives a rank value which increments by the number of peers with the previous rank value –> learn more

RANK()

Return type

integer

Parameters

Row number

Returns the sequential row ordinal (1-based) of each row for each ordered partition. If no order condition is specified, then the result is non-deterministic. –> learn more

ROW_NUMBER()

Return type

integer

Parameters
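If it helps to see how these functions differ on the same input, the sketch below computes rough pandas analogues for a single ordered partition. It assumes a Python environment with pandas installed, simply mirrors the definitions above, and is not how Redivis evaluates these methods.

import pandas as pd

scores = pd.Series([35, 62, 74, 74, 83], name="score")
n = len(scores)

rank = scores.rank(method="min")            # RANK(): peers share the lowest rank
dense_rank = scores.rank(method="dense")    # DENSE_RANK(): no gaps after ties
row_number = scores.rank(method="first")    # ROW_NUMBER(): ties broken by position
cume_dist = scores.rank(method="max") / n   # CUME_DIST(): NP / NR
percent_rank = (rank - 1) / (n - 1)         # PERCENT_RANK(): (RK - 1) / (NR - 1)

print(pd.DataFrame({
    "score": scores, "rank": rank, "dense_rank": dense_rank,
    "row_number": row_number, "cume_dist": cume_dist, "percent_rank": percent_rank,
}))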

Emailing subsets of members

Redivis has some built-in methods for helping you find and manage all of your organization's members, workflows, datasets, and access systems that you can use to ease your workflows.

In this example we will imagine that we need to get in touch with everyone working with a specific dataset.

1. Locate the dataset in the administrator panel

We can go to the Datasets tab of the panel and search or filter to locate our dataset.

2. Find relevant workflows

We can right click on this dataset and hover on View related to see all of the options relating to this dataset. In this case we want to see all workflows that have used this dataset.

Clicking this option navigates us to the Workflows tab, with a filter active for this dataset. This list is showing all workflows that contain this dataset.

3. See related members

Click the Select all checkbox to select all of these workflows. With these selected we can right click on any of them (or click the Actions button that appears in the top right) to View related activity.

In this menu we should select Members in workflows. This will navigate us to the Members tab filtered by all members who are owners or collaborators on the workflows we had selected.

4. Show emails

Now that we have narrowed it down to the members we are interested in, we can right click on the header (or the options menu in the top right of the list) to edit the columns shown. In this menu we can add the Contact email column.

The quickest way to get this information out of Redivis is to go back to that same menu and click Download CSV to begin downloading a CSV of the list exactly as we've configured it (all columns, filters, and sorting will be preserved).

Next steps

We can use these similar steps to view:

  • Queries run on a specific set of datasets

  • Exports made from a workflow

  • Requirements a member has filled out

And many more!

JSON

JSON query

Extracts a JSON value and converts it to a SQL JSON-formatted STRING. –> learn more

JSON_QUERY(@expression, @expression_2)

Return type

string

Parameters

Name: @expression | Type: variable or literal | Allowed values: any string | Required: true | Placeholder (in UI): -

Name: @expression_2 | Type: literal | Allowed values: any string | Required: true | Placeholder (in UI): -

JSON value

Extracts a JSON scalar value and converts it to a SQL JSON-formatted STRING, removing outermost quotes and un-escaping return values –> learn more

JSON_VALUE(@expression, @expression_2)

Return type

string

Parameters

Name: @expression | Type: variable or literal | Allowed values: any string | Required: true | Placeholder (in UI): -

Name: @expression_2 | Type: literal | Allowed values: any string | Required: false | Placeholder (in UI): ($)
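The practical difference between the two functions is whether the result stays JSON-formatted or comes back as a bare scalar. The sketch below mimics that distinction with Python's json module purely as an illustration; it is not how the SQL functions are implemented.

import json

payload = '{"name": "jane", "scores": {"final": 92}}'
doc = json.loads(payload)

# JSON_QUERY-style result: the matched value, still JSON-formatted
# (objects keep their braces, strings keep their quotes).
query_like = json.dumps(doc["scores"])   # '{"final": 92}'

# JSON_VALUE-style result: a scalar with the outermost quotes removed
# and escape sequences resolved.
value_like = doc["name"]                 # 'jane'

print(query_like, value_like)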


Citing & Describing Redivis

Citing Redivis

Publications and other outputs should cite Redivis as research software, using the appropriate conventions for your publication medium.

  • The Digital Object Identifier (DOI) for Redivis is: https://doi.org/10.71778/V2DW-7A53

  • The canonical URL for Redivis is: https://redivis.com

If referencing Redivis inline, the abbreviated citation can be used:

Redivis [https://doi.org/10.71778/V2DW-7A53]

The following BibTeX entry can be used:

@misc{https://doi.org/10.71778/V2DW-7A53,
  doi = {10.71778/V2DW-7A53},
  url = {https://redivis.com},
  author = {Redivis},
  keywords = {FOS: Software},
  language = {en},
  title = {Redivis},
  publisher = {Redivis},
  year = {2025}
}

These identifiers are also associated with Redivis, though do not need to be included in all citations:

  • Re3data identifier: https://www.re3data.org/repository/r3d100014398

  • RRID: SCR_023111

Describing Redivis in grants and publications

We provide the language below to help you describe Redivis in your publication or grant application. It may be adapted and modified as needed. No attribution is necessary when using or modifying this content.

Overview

Redivis is a secure, scalable, cloud-based data platform developed to meet the needs of academic research. Redivis was first developed in collaboration with the Stanford Center for Population Health Sciences and is now deployed across a number of leading research institutions [1], supporting the distribution and analysis of large, sensitive datasets across multiple disciplines in alignment with FAIR data practices.

Users of Redivis are able to upload large-scale numeric, text, structured, and unstructured data directly through the browser or via APIs and other integrations. Built-in tools are available to curate and tag rich metadata, maximizing the shareability of datasets. The platform allows for robust search and exploration across multiple datasets and their metadata and variables.

Additionally, Redivis provides a rich toolset for analysis and exploration. Users can filter, merge, analyze and visualize billions of records in real-time, and can easily bring together disparate datasets to answer novel questions. They can leverage a massively-parallelized architecture to execute SQL queries (either composed as code or through a graphical user interface), in addition to customizable Jupyter notebooks running Python, R, Stata, and SAS. REST APIs give users and applications additional programmatic interfaces to the data, ensuring interoperability with other tools and ecosystems.

Technical infrastructure

The platform is built on Google Cloud Platform infrastructure using open-source software. The main application runs in containerized services orchestrated with Kubernetes. It integrates high-performance tools including Google BigQuery and its ANSI-SQL interface for large-scale tabular data processing and JupyterLab for interactive analytics. Researchers may provision environments preconfigured for R, Python, SAS, and Stata, with support for customizable environments. Compute capacity scales dynamically to meet workload demands, with configurations available up to 416 CPUs, 11.5 TB of RAM, and 16 NVIDIA A100 GPUs, facilitating complex statistical and machine learning workflows on terabyte-scale datasets.

Reproducibility

Redivis is designed such that reproducibility is an automatic byproduct of researchers' use of the platform. A novel version control system for datasets enables efficient data updates without duplication, supporting full reproducibility and cost-efficient storage management. All analytical activity — including code, queries, and derivative outputs — is tracked and recoverable, ensuring transparency and compliance with NIH data-sharing policies.

All datasets, workflows, and versions thereof can be assigned a unique Digital Object Identifier (DOI), allowing for researchers to persistently link to a fully-reproducible artifact of their research. Future investigators, assuming they have appropriate access to the underlying data, can then re-run these analyses and produce identical results, in turn modifying and building upon prior works.

Security

Redivis supports rigorous data governance with a tiered attribute-based access control (ABAC) system, customizable user agreements, egress restrictions, and an intuitive, searchable audit trail.

Redivis has been audited and approved for the use of FERPA, PII, PHI, and HIPAA data. The platform is SOC2 and NIST 800-171 (rev 3) compliant and undergoes regular audits and penetration testing. All data and metadata are stored in a multi-redundant, AES256 encrypted datastore. All connections to the platform are over an encrypted TLS 1.2 or greater protocol. User login is managed by single sign-on via eduGAIN and InCommon, allowing users and collaborators to use their institution credentials to authenticate. The platform supports HTTP strict transport security (HSTS) and is on the HSTS preload list for all major browsers, preventing users from establishing an unencrypted connection.

Data administrators on Redivis have access to detailed, searchable audit logs, and reports can easily be generated for auditability and traceability. The platform allows for the restriction of data exports and downloads as well as the automatic expiration of dataset access. For highly sensitive data, multi-layer protection exists to prevent accidental data sharing, viewing, and downloading. Redivis also allows for the complete deletion of data, including the automatic and instantaneous deletion of all data derivatives, as needed.

—

[1] Stanford, Columbia, UCLA Libraries, Kellogg Business School, Duke Libraries, Mass General Brigham CSPH

Discover & access data

Overview

Most data on Redivis is uploaded through an Organization, and that is the best place to find Datasets to work with. Some datasets will be public, while others will require certain steps before you can get full access to the data.

1. Find datasets

From an Organization's home page, click on the Datasets tab to browse all datasets and filter by their metadata.

You can use filters on the left bar to narrow down your results. All searches perform a full-text search across a dataset, its documentation, tables, variables, and rich metadata content.

If your organization is part of an institution (such as Stanford or Columbia) you can also go to the institution page to search all datasets in your institution.

Not finding what you need? You can also upload your own data to Redivis through your workspace to augment a research project you're working on and to share with other researchers. Or if you represent a research group, center, or institution, contact us about setting up your own Organization page.

2. Get to know a dataset

Click on any dataset title to go to that Dataset page.

You can view the metadata (including an abstract, documentation, citation information, and a version history) and dig into the data directly on the Tables and/or Files tab.

Each table contains a data browser to view cells, generate summary statistics, and display metadata for this table and each variable. Each file can be previewed and downloaded. Both tables and files can be added to workflows for analysis.

Many datasets on Redivis are public, while others have requirements for certain levels of access enforced by the data owner.

On the right side of the dataset, you can see your current access level. If you have Metadata access you can see the variable information and summary statistics, but you'll need to gain Data access in order to view the data contents or work with this data.

Learn more in the Discover datasets guide.

3. Apply for access

On any restricted dataset you can click the Apply for access button in the top right of the page to see a list of steps required for you to gain different access levels to this dataset.

The first step is to become a member of this organization. When you go to apply you will be prompted to become a member of this organization using the credentials on your account.

Once you're a member of this organization, you might need to fill out some requirements that administrators have set up, or request access directly from them.

Requirements are forms that you will need to complete on this page and then submit. Once these are approved by your organization's administrators you will gain access to the dataset.

Learn more in the Apply to access restricted data guide.

Next steps

Start working with your data

Once you have data you're interested in, bring it into a workflow to transform and analyze it leveraging lightning fast tools from your browser.

Learn more in the Analyze data in a workflow guide.

Upload your own datasets

Augment your data analysis in Redivis by uploading your own datasets, with the option to share with your collaborators (or even the broader research community).

Learn more in the Create & manage datasets guide.

Apply to access restricted data

Overview

Many datasets on Redivis are public, while others have requirements for certain levels of access enforced by the data owner. If the dataset is public, you can skip over this section, otherwise you'll need to gain access before you can fully utilize the data.

1. View access rules

On the right side of all restricted dataset pages is a box which displays your current access level. If this doesn't say Data access, then your next step will be to apply for access.

Click the Apply for access button on the top of the page to open the access modal. This modal allows you to manage your access to this particular dataset throughout the platform.

Access levels

Dataset access has five levels:

  1. Overview: the ability to see a dataset and its documentation.

  2. Metadata: the ability to view variable names and summary statistics.

  3. Sample: the ability to view and query a dataset's 1% sample. This will only exist for datasets that have a sample configured.

  4. Data: the ability to view and query a dataset's tables, and work with them in workflows.

  5. Edit: the ability to edit the dataset and release new versions.

Access levels are cumulative. For example, in order to gain data access you will need to have gained metadata access as well.

Usage rules

Usage rules are restrictions placed on the way you can use this data once you've gained access. For example, data exports may be restricted to certain export environments, and/or require administrator approval.

While no action is required when you first apply for access to a dataset, it's helpful to understand these restrictions to better understand what limitations, if any, they might pose to your research.

2. Apply for access

The access modal for the dataset will include all steps designated by this dataset's administrators to gain access to work with data.

Membership

To apply for access to datasets hosted by a Redivis organization, you will first need to Join the organization.

To join an organization, you will need to provide information about your identity via an email authentication, usually the email associated with your Redivis account.

After submitting your membership application, it will either be automatically approved or will be marked as pending while an administrator reviews your submission. You'll receive a notification if your submission is approved or rejected.

Requirements

Some datasets have requirements as part of their access applications. Requirements are global to an organization and usually contain some type of form. Once all requirements for a particular level of access are approved, you will gain access to the relevant dataset(s).

In the access modal, click the Apply button on any requirement to fill out and submit the form.

When you submit a requirement, an administrator of the organization will be alerted and can review your submission for approval (in some cases, the requirement may be configured for auto-approval upon submission). You will receive a notification if your submission is rejected or approved, or in the future when it's about to expire. You can also leave a comment for an organization's administrators here, and they can reply to you.

There are two types of requirements - those you fill out on behalf of yourself (member requirements) and those you fill out on behalf of a study (study requirements). If the dataset you are interested in has a study requirement, you'll need to create a study and select it in the access modal in order to submit the requirement. Additionally, any workflows that utilize this dataset will need to be a part of the approved study.

Direct access

Sometimes you don't need to fill out any requirements and need to simply Request access at a specific level. You can do so in this modal, and will receive a notification when you are granted access by the dataset owner.

3. Wait for data owner approval

For some membership and requirement submissions (and all direct access requests), you will need to wait for the data owner to respond to your submission. If the dataset is owned by a user, they will need to approve your request; if it is owned by an organization, any administrator at that organization can approve it.

The data owner will receive a notification as soon as the requirement is submitted, and you will receive a notification when it is approved, rejected, or for any requests for additional information. By default, you will receive an email as well as a notification within Redivis. You can customize this notification behavior in your workspace settings.

4. Re-authenticate (if necessary)

If you are accessing restricted data hosted by an organization, you may have to re-authenticate after a period of inactivity anytime you perform an action governed by the "Data" access level.

After a period of inactivity (set between 15 minutes and 3 days by the organization administrator), a popup will ask you to re-authenticate anytime you view data, query tables in a transform, or export data from Redivis.

Next steps

Start working with your data

Add this dataset to a workflow to transform and analyze it leveraging lightning fast tools from your browser.

Learn more in the Work with data in a workflow guide.

Metadata

Overview

Redivis determines variable names and types during data upload. Additionally, it will automatically parse certain metadata based on the uploaded file format:

  • SAS (.sas7bdat): labels

  • Stata (.dta): labels and value labels

  • SPSS (.sav): labels and value labels

For other file types (e.g., csv), you will need to augment the metadata directly. To apply metadata in bulk, you can upload a file containing metadata information directly from your computer. This file can either be a CSV or JSON.

Is your metadata stuck in a PDF? We're truly sorry — if you can, please let the data provider know that it is essential that they provide metadata in a machine-readable format; hopefully in time this will change.

While you can just upload the PDF to the dataset's documentation, you'll be doing your researchers a huge service if you can add structured metadata to the variables. That might mean some manual copying and pasting from the PDF, or you could consider the various (and imperfect) online PDF to CSV conversion tools, or this Python library.

If you don't have the bandwidth, consider asking your researchers to contribute by making them a dataset editor.

CSV metadata format

The CSV should be formatted without a header, with each row corresponding to a variable, with column 1 as the name, 2 as the label, 3 as the description. If the variable doesn't have a label or description, leave these columns empty.

For example:

variable1_name,variable1_label,variable1_description
variable2_name,variable2_label,variable2_description
sex,patient sex,patient's recorded sex
id,patient identifier,unique patient identifier

JSON metadata format

When uploading a JSON file, specify the name, label, description, and valueLabels using the appropriately named attributes in the object corresponding to each variable. If the variable doesn't have a label, description, or value labels, you don't need to include these attributes.

For example:

// JSON format is an array of objects, with each object representing a variable
[
    {
        "name": "sex",
        "label": "patient sex",
        "description": "patient's recorded sex",
        "valueLabels": [
            {
                "value": 1,
                "label": "Male"
            },
            {
                "value": 2,
                "label": "Female"
            }
        ]
    }
]

To upload value labels in bulk, you must use the JSON format. We no longer support bulk upload of value labels via CSV.
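If your metadata starts out as a CSV but you also need value labels, one option is to generate the JSON file programmatically. Below is a minimal sketch; the file names and the value-label mapping are placeholders for illustration.

import csv
import json

# Hypothetical value labels to merge in, keyed by variable name.
value_labels = {
    "sex": [{"value": 1, "label": "Male"}, {"value": 2, "label": "Female"}],
}

variables = []
with open("metadata.csv", newline="") as f:          # name,label,description per row
    for name, label, description in csv.reader(f):
        entry = {"name": name, "label": label, "description": description}
        if name in value_labels:
            entry["valueLabels"] = value_labels[name]
        variables.append(entry)

with open("metadata.json", "w") as f:
    json.dump(variables, f, indent=2)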

Discover datasets

Overview

Datasets are the core entity on Redivis, and finding the dataset(s) you want to work with is generally one of your first steps. All datasets are either uploaded by an organization or a user. If a user has uploaded a dataset they will need to share it with you directly, but you can browse datasets uploaded by any organization on their home page.

1. Find your organization

There are many ways to find your organization. If you know what you're looking for you can type it into the "Go to" bar on the top of the page. To browse all organizations you can go to the Organizations tab of your workspace where you can Find organization.

If your organization is part of an institution (such as Stanford or Columbia) you can also go to the institution page to browse all organizations in that institution.

Note that some organizations may be hidden from public view, in which case you will need to contact someone at the organization to get the direct link.

2. Browse available datasets

On the organization home page you can see all of the organization's datasets that you have at least Overview access to. If you're looking for a dataset you expect to be here and you don't see it, you might need to contact the organization administrators to get access.

You can see the datasets that this organization has featured on the left section of the page, type a term into the search bar, or click the Datasets tab to browse everything. By default these datasets are sorted by popularity but you can reorder them however you'd like to find what you want.

You can also click the Workflows tab to see workflows featured by this organization which showcase how their data can be used. More on workflows later!

3. Search for data

Once on the dataset tab you can use multiple filters along with search terms to find what you'd like. All searches take a comprehensive look at information from the dataset documentation as well as variable metadata, including value labels.

You can also use the filters to find specific information you're looking for. If you know the name of the variable or year of the data you're interested in, narrow your search here.

4. Preview tables

Each dataset that appears here matches all search and filter configurations. You can click on the dataset tile to see more information, including a preview of all of the tables this dataset contains.

Click on any of these tables to dig into it further. As long as you have metadata access you can see the variable information and summary statistics.

Once you find a dataset you want to explore further, click on the title to go to the dataset page.

5. Explore the dataset page

This dataset page has all of the information you'll need to familiarize yourself with the data and get started working with it further. Add this dataset to your dataset library to easily find it later on by clicking the button alongside the dataset title.

Overview

This tab has a short summary of the dataset along with different sections filled out by the administrators such as the Methodology, Documentation, and Provenance information. This Provenance section will also have citation information and a DOI if one has been created. You can also see your current access level on the right side of the page.

Tables

This tab contains all of the dataset's tabular data. You can see each table alongside information about its contents. Click on each one to dig deeper into the data.

Files

This tab contains all of the dataset's unstructured data. You can preview each file by clicking on it.

Usage

This tab has information about how other people have used this dataset on Redivis. You can see the most popular variables across all tables to get a sense of which ones might be a good starting place for understanding the structure of the data.

Next steps

Apply for access to restricted data

If you don't have data access to this dataset you'll need to apply for access before you can work with it further

Learn more in the Apply to access restricted data guide.

Start working with your data

Add this dataset to a workflow to transform and analyze it leveraging lightning fast tools from your browser.

Learn more in the Work with data in a workflow guide.

Tables

Overview

Tables are brought into a workflow via its data sources, and are then created as the output of various transforms and notebooks in the workflow. Tables often serve to materialize intermediate results, sanity-check an output, and represent the final data derivative that is the culmination of your analysis pipeline.

Workflow tables follow the behavior and functionality of tables across Redivis.

Usage in workflows

All table nodes have one upstream parent, either a data source, a transform, or a notebook.

On each table, you can create any number of transforms or notebooks, which then reference that table as their "source". Tables can also be joined into transforms and notebooks that do not have the table as their primary source.

Table node states

As you work in a workflow, node colors and symbols will change on the tree view to help you keep track of your work progress.

Empty (white background): A table node will be empty when it contains no data because the upstream node has not been executed.

Executed (grey background): A table node will be grey when it has data that aligns with the contents of its upstream node.

Stale (yellow background): A table node will be stale when an upstream change has been made, meaning the content of the node does not match the content of the node above it.

Sampled (black circle with 1% icon): Only relevant to dataset tables; this means that you are using a 1% sample of the table. When a dataset has a sample, it will automatically default to it when added to a workflow. You can switch between the sample and the full table at any time in the dataset node.

Incomplete access (all black background, or dashed borders): You don't have full access to the table. Click on the Incomplete access button in the top bar to begin applying for access to the relevant datasets.

Archived ("archived" pill): The table has been archived, and only its metadata is available. You'll need to re-run the upstream node in order to unarchive the table.

Single Sign-On (SSO)

Overview

When working with non-public data on Redivis, it's often important for users to be able to authoritatively attest to their affiliation with a given institution or other entity. To this end, Redivis supports Single Sign-On (SSO) through most academic institutions, as well as the ability to establish identity through a Google account or validated email address.

Institution SSO via SAML

Redivis is a registered service provider within the US-based InCommon federation, which in turn is part of the eduGAIN federation, enabling secure, authoritative SSO across thousands of universities around the world, via the SAML 2.0 protocol. If you are a member of an academic institution (as well as certain other research enterprises), you can search for your institution by name and log in to Redivis through your institution's sign-in page.

Troubleshooting institution SSO

In most cases, logging in with your institution will "just work". However, due to inconsistencies in how certain standards are applied around the world, you may run into issues when logging in through your institution. These issues can often be resolved with a quick ticket with your IT support desk – we recommend that you direct them to this page and copy [email protected] so that we may provide further technical information if needed.

Some common issues are outlined below:

Your institution does not support the Redivis service provider

If, when choosing your institution to log in, you are immediately presented with an error page (before you can type in your password), this likely means that your institution needs to add Redivis to some sort of "service provider allowlist". As a registered service provider within InCommon / eduGAIN, most institutions will automatically accept login requests from Redivis – but some require manual configuration. In this case, your IT desk will need to take a quick action to enable Redivis – it will likely be helpful to direct them to Redivis's SAML metadata, found here: https://redivis.com/auth/saml/metadata

Redivis was unable to determine identity

This error will occur after you've logged in with your institution, upon being redirected back to Redivis. In this case, the authentication request completed successfully, but your institution didn't provide enough information for Redivis to know who you are (which is important in order for you to apply for restricted data, so that the data distributor can be confident of who they're granting access to!).

Some institutions allow you to configure privacy options associated with your login. If this is the case, navigate to the appropriate settings page within your institutional account, and make sure that your name, email, and institutional identifier / username are released.

Redivis requires all institution identity providers to provide some minimal information about the individual, such as name, email, and a persistent identifier. These are codified as the "research and scholarship attribute bundle". If your institution uses OpenAthens for SSO, you can view their documentation to learn more about releasing these attributes.

For identity provider administrators

Redivis requires the following attributes:

  • eduPersonPrincipalName (urn:oid:1.3.6.1.4.1.5923.1.1.1.6)

  • email (urn:oid:0.9.2342.19200300.100.1.3)

  • name (urn:oid:2.16.840.1.113730.3.1.241)

The following attributes are optional but encouraged if available:

  • affiliation (urn:oid:1.3.6.1.4.1.5923.1.1.1.1) or scopedAffiliation (urn:oid:1.3.6.1.4.1.5923.1.1.1.9)

  • orcid (urn:oid:1.3.6.1.4.1.5923.1.1.1.16)

  • pairwiseId (urn:oasis:names:tc:SAML:attribute:pairwise-id)

  • eduPersonTargetedId (urn:oid:1.3.6.1.4.1.5923.1.1.1.10)

Other error messages

While uncommon, it's certainly possible that other errors might occur when logging in through your institutional credentials. If you encounter one, please contact [email protected] and we'd be happy to help you troubleshoot.

SSO via Google

Redivis also supports the ability to sign in via any Google account. This can be a personal gmail account, or via your organization if it supports Google single sign-on. When you sign in with Google, your name, email, and an opaque persistent identifier will be shared with Redivis.

If your institution supports Google sign-on, but is also listed as a SAML identity provider (see above), the SAML SSO will be preferred. If you try logging in via Google, you will be redirected to your institution's login page.

Email sign-on

If your institution isn't listed and doesn't support SSO through Google (e.g., many @.gov emails), you can also sign in via any email address.

Redivis will send a unique code to this email every time you log in, making it such that the account owner continuously "proves" their ownership of the given email address.

For security purposes, you must enter the code sent to your email in the same window from which it was initially requested. If you want to log in from a new window / device, you can request a new code.

Export to other environments

Overview

Redivis workflows contain powerful tools to reshape and analyze data, but if you prefer to export data into a different environment you can easily do so. Redivis systems use open source tools and common formats to make the transition as easy as possible.

1. Check export restrictions

Some datasets on Redivis have export restrictions which prohibit or limit removing data from the system.

You can check this by going to the dataset page and clicking the top right button Manage access, or by right-clicking on a dataset in a workflow and selecting the View access option.

The bottom section of this access modal defines any export restrictions in place. It might not have any restrictions, might be completely restricted, or might have some limited options for export.

If there are limited options for export, this section will detail the available locations and any restrictions on using those (such as the size of the table being exported). These restrictions are enforced automatically when you try to take actions within the system.

If there are no options for export, you still have the option to work with your data in a Redivis workflow, where you can reshape and clean your data using transforms, and then analyze it in notebooks.

Learn more in the Work with data in a workflow guide.

2. Prepare your data

If you would like to export the entire table, you can skip this step.

Otherwise, we recommend that you use transforms to cut your data down to a smaller size and reshape it into the table format you need for analysis before exporting it. Especially for large datasets, Redivis tools are created specifically for these purposes and work seamlessly in the browser with no additional setup.

Learn more in the Reshape tables in transform guide.

3. Export data

You can open the Export modal to initiate an export from any table. You can do this on the dataset page by right clicking a table and selecting the Export menu option, or by right clicking on any table in a workflow and selecting the same option.

Here you can see all the options available for your table to export.

If your table has export restrictions set by the data owner, options will be disabled here. In the case where your table does not meet export requirements but the data owner allows exception requests, you will see the option to do so here.

Download

The first tab of this modal gives you different options for formats to download your data. Most common data formats are supported, including csv, json, avro, parquet, SAS, Stata, and SPSS. Select the format you'd like and click the Download button to start downloading the file.

Learn more in the Downloads reference section.

Programmatic reference

You can reference this table from your computer or another computational environment using the Redivis Python and R libraries. This modal gives specific information on how to reference it, and you can reference our docs for other options.
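As a minimal illustration, referencing a table with the redivis-python library might look like the sketch below. The owner, dataset, and table names are placeholders; the modal shows the exact reference string for your table, and the client library docs cover additional options.

import redivis

# Reference a table by owner, dataset (or workflow), and table name,
# then pull it into a pandas DataFrame for local analysis.
table = (
    redivis.user("user_name")
    .dataset("dataset_name")
    .table("table_name")
)
df = table.to_pandas_dataframe()
print(df.head())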

Integrations: Google Looker Studio

This is a free Google dashboard visualization program you can directly link your table to. Use their point and click interface to build common visuals that will update as the underlying table updates.

Learn more in the Google Looker Studio reference section.

Integrations: Google Cloud Storage

You can export your table directly into a Google Cloud Storage bucket that you have access to.

Learn more in the Google Cloud Storage reference section.

Integrations: Google BigQuery

You can export your table directly into a Google BigQuery project that you have access to.

Learn more in the Google BigQuery reference section.

Next steps

Cite datasets in your publications

If the work you're doing leads to a publication, make sure to reference the dataset pages from datasets you've used for information from the data administrators on how to correctly cite it.


Optimization and errors

Query optimization

The Redivis transform connects to a highly performant, parallelized data store. Queries on terabytes of data can complete in seconds, often utilizing the resources of thousands of compute nodes. These are some best practices that can help you increase performance.

Limit output table size

Table writes are generally much slower than table reads — if your output table is exceptionally large, it may take the querying engine several minutes to materialize the output. Try restricting the number of rows returned by applying row filters to your transforms when possible, and be cognizant of joins that may substantially increase your row count. Avoid keeping variables that aren't needed.

When you are performing initial exploration, you may consider using a limit step in your transform to reduce the output table size.

Reduce the number of new variables

Each new variable adds to the computational complexity of the query; the new variable must be computed for every row in the table.

A common anti-pattern is to construct numerous new boolean variables via CASE statements, and then use the result of these new variables in the row filter(s). If possible, it is far more efficient to inline the CASE logic within the row filters, or within fewer new variables, as this allows for improved efficiency in logical short-circuiting.

Optimize join patterns

When your query utilizes a join step, consider the order in which you are joining the data. The best practice is to place the largest table first, followed by the smallest, and then by decreasing size.

While the query optimizer can determine which table should be on which side of the join, it is still recommended to order your joined tables appropriately.

If all of your joins are INNER joins, the join order will have no impact on the final output. If your query leverages combinations of left / right / inner joins, the join order may affect your output; be careful in these cases.

Common errors

The Redivis transform will prevent you from initiating a run when any steps are invalid, and errors should be rare. However, some errors can only be detected as the query is run — in these cases the job will fail as soon as it encounters an error, logging the error message to the top of the transform.

If you come across an error message not on this list, please email [email protected] for further assistance.

Resources exceeded during query execution

This error occurs when a query utilizes too much memory, yet it is often easily resolvable and due to unintended behavior within the transform. This error is caused by a certain component of the query not being parallelizable, which often occurs when combining and / or ordering many distinct values. We recommend investigating the following culprits:

  1. Investigate any order clauses in your transform — either at the bottom of the transform or in a partitioned query. Often, attempts to order on hundreds of millions or billions of distinct values will fail. Note that ordering rows at the bottom of the transform does not affect your output, and is only useful in preparing your data for export.

  2. Confirm that none of your aggregation methods are creating massive cells. For example, using the String aggregate method on an exceptionally large partition can collapse and concatenate many values into one record — if the cell becomes too big, this error will be thrown.

Cast / type conversion errors

When converting between variable types, all values must be appropriately formatted for conversion to the new type. For example, the value "1,000" is not a valid integer and will throw an error when being converted from a string to an integer.

There are several options for getting around cast errors:

  1. Choose the "If invalid for type, set to null" option in your retype blocks. Note that this will set all invalid values to NULL, potentially causing unintended side effects. Use with caution.

  2. Filter out all records that have incompatible values.

  3. Create a new variable, using the Case method to convert any invalid values to something that can be appropriately cast.

Maximum table size

If an output table is more than 1TB, it cannot exceed the combined size of all source tables by more than 10%. Very large output tables that substantially exceed their inputs are typically the result of a misconfigured join that generates a cross-product between a one-to-many or many-to-many relationship between multiple tables. If you encounter this error, try to apply filter and aggregation steps first, and also validate that your join conditions are appropriately specific.

Too many tables, views and user-defined functions for query: Max: 1000

This error may occur when running queries on tables belonging to an unreleased version, particularly when these tables are made up of hundreds of independent uploads. Under the hood, these unreleased tables are represented as a logical view that stitches the various uploads together into a single table. If your query references multiple unreleased tables, with each approaching the 500 per-table upload limit, it's possible to exceed the total allowed number of tables referenced by a query.

To work around this issue, you can create a transform that simply selects all variables from the unreleased table, materializing the result in your workflow. This output table will now only count as a single table in your query, avoiding this error.

This error will also no longer be an issue once the version is released, as the table is materialized shortly after a version is released.

Other errors

If you come across any other errors or issues while using the transform please contact us directly at [email protected]


Getting started

Overview

Redivis is a platform that allows researchers to seamlessly discover, access, and analyze data. This brief guide will walk you through the basics to get up and running and provide a launching point for exploring other resources in this documentation.

1. Create your account

Many datasets on Redivis are public, and you can browse them without creating an account. However, you'll need an account in order to analyze data, as well as to apply for access to restricted datasets.

Click the Create account button in the top right of any page to sign up. You can use your academic institution's login credentials or any Google account.

Once you create an account, you'll be navigated to your Workspace, which is your private area for creating workflows to work with data, uploading your own datasets, and managing your Redivis account.

2. Join your organization

The main entity on Redivis is a dataset. Most datasets on Redivis are uploaded by organizations.

You might already know the organization that you’ll be working with, or you can browse organizations by clicking the Find organizations button on the Organizations tab of your Workspace. You might know your organization's URL already, or see it listed in the Recent section of the Dashboard if you were already viewing that page.

Once you join an organization, you'll be able to apply for any of their restricted datasets, and also see a link on the left bar of your workspace to quickly get back to the organization's home page.

3. Find datasets

The best place to find datasets on Redivis is to search on an organization's home page. Here you can click on the Datasets tab to browse all datasets and filter by their metadata. All searches perform a full-text search across a dataset, its documentation, tables, variables, and rich metadata content.

If your organization is part of an institution (such as Stanford or Columbia) you can also go to the institution page to search all datasets in your institution.

Click on any dataset title to go to that Dataset page where you can view the data and metadata it contains. The data for this dataset will be available on the Tables tab, which you can explore further.

You will also probably come across restricted datasets. For these you will need to click the Apply for access button on the top right of the Dataset page and complete the requirements to gain approval from the dataset's administrators.

Learn more in the Discover & access data guide.

4. Analyze data

Once you've found a dataset that you want to work with, you can add it to a Workflow. Workflows are the fundamental analysis interface on Redivis, where you can query, merge, reshape, and analyze any dataset that you have access to — all from within your web browser.

Add a dataset to a workflow by clicking the Add to workflow button on the top right of the dataset page.

In a workflow we can create a Transform by selecting any table and clicking the +Transform button, which allows us to combine and reshape our data into a final output table that best serves our analysis. These transforms use a powerful SQL engine under the hood, allowing us to query incredibly large tables - even billions of records - in seconds.

After creating output tables for analysis, we can create a computational Notebook in R, Python, Stata, or SAS to further analyze our data and develop our final figures. Select any table and click the +Notebook button to get started. The notebook will initialize and pull in the table you've selected (or a sample of the table if it is a large table).

We can also export and query this table from external environments by clicking the Export table button on the right side of any table, allowing us to use whatever analytical tools best suit our research question. Here we can download the table in a number of common formats, interface with it programmatically via the API, or export the table to supported integrations.
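
As a sketch of the programmatic route, the Redivis Python client lets you query a table directly from an external environment. The table reference and variable names below are illustrative; consult apidocs.redivis.com for the authoritative client methods.

import redivis

# Query a table via the API and pull the result into a pandas dataframe
df = redivis.query("""
    SELECT id, name, latitude, longitude
    FROM `demo.ghcn_daily_weather_data.stations`
    LIMIT 100
""").to_pandas_dataframe()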

Learn more in the Analyze data in a workflow guide.

Next steps

Upload your own datasets

Augment your data analysis in Redivis by uploading your own datasets, with the option to share with your collaborators (or even the broader research community).

Learn more in the Create & manage datasets guide.

Administer your organization

Organizations allow for research groups and centers to securely distribute data to their research community. Organization administrators can create datasets, manage access, review logs, and create customized reports of their data utilization.

Contact an existing administrator to add you to their organization, or contact us to set up a new organization. Learn more in the Administer an organization guide.

Running ML workloads

Overview

Notebooks on Redivis offer a performant and highly flexible environment for doing data analysis. This includes the ability to run state-of-the-art machine learning (ML) models: training new models, fine-tuning existing ones, and using them to perform inference and generate novel outputs.

This guide covers a number of common use cases when running ML workloads on Redivis. We generally focus on the Hugging Face + PyTorch ecosystem in Python, though these examples are broadly applicable to other ML libraries and languages.

For a detailed example of using Redivis to fine-tune a large language model, see the complementary example:

1. Create a notebook with appropriate computation capacity

Training ML models and running inference can require a substantial amount of compute capacity, depending on various factors such as your model and dataset size, usage parameters, and performance goals.

The default, free notebook on Redivis offers 2 CPUs and 32GB of RAM. While this may work for initial exploration, running practical machine learning workflows typically requires a GPU. When creating your notebook, you can choose a custom compute configuration to match your needs.

Redivis offers a number of custom compute configurations, mapping to the various machine types available on Google Cloud. We recommend starting with a more modest GPU for initial exploration, and upgrading as needed when computational or performance bottlenecks are reached. For this example, we'll use the NVIDIA L4 GPU, which provides solid performance at a reasonable cost.
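
Once the notebook starts, a quick way to confirm that PyTorch can see the GPU is:

import torch

# Confirm that a CUDA-capable GPU is attached and visible to PyTorch
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA L4"
else:
    print("No GPU detected -- check the notebook's compute configuration")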

2. Define dependencies

The Redivis python notebook is based on the jupyter-pytorch notebook image, with PyTorch, CUDA bindings, and various common data science libraries pre-installed. However, if you require additional dependencies for your work, you can specify them under the "dependencies" section of your notebook.

2a. [Optional]: Pre-load external models when internet is disabled

Models can also be stored within a dataset on Redivis, in which case you won't need to pre-load the model, but can rather import it directly from the Redivis dataset (see step 3 below)

If your notebook references export-restricted data, for security reasons, internet will be disabled while the notebook is running. This can present a challenge for some common approaches to ML in python, such as downloading a model or dataset from Hugging Face. For example, we might reference a model as follows:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")

This code typically downloads the model from Hugging Face and caches it to our local disk. However, if internet is disabled, this command will hang and ultimately fail. Instead, we need to download the model during notebook startup, before the internet is disabled, as part of the post_install.sh script under the notebook's dependencies:

python -c '
from huggingface_hub import snapshot_download
snapshot_download(repo_id="sentence-transformers/all-MiniLM-L6-v2")
'

This will download the model weights and other files to the default Hugging Face cache directory, ~/.cache/huggingface/hub.

Now, within our notebook, we can load the cached model. Make sure to set local_files_only=True, so that Hugging Face doesn't try to connect to the internet to check for a newer version of the model.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2", local_files_only=True)

Some libraries may require you to provide a full path to the model files. For the model mentioned above, this would be ~/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2/snapshots/<snapshot_id> .

To find the appropriate path, list the contents of the ~/.cache/huggingface/hub directory and its descendants.
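
For example, a small snippet like the following (a sketch using only the Python standard library) will print the snapshot directory for every cached model:

from pathlib import Path

# Each cached model lives at models--<org>--<name>/snapshots/<snapshot_id>
for snapshot in Path.home().glob(".cache/huggingface/hub/models--*/snapshots/*"):
    print(snapshot)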

3. Load a model

3a. Load a model from Redivis

Machine learning models can be stored directly within a Redivis dataset as unstructured files. For example, this dataset contains the various files that make up the bert-base-cased model on Hugging Face. We can then download the model to our notebook's local filesystem:

import redivis
table = redivis.organization("demo").dataset("huggingface_models").table("bert_base_cased")
table.download_files("/scratch/bert-base-cased")

And then reference this as a local model. E.g.:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("/scratch/bert-base-cased", num_labels=5)
tokenizer = AutoTokenizer.from_pretrained("/scratch/bert-base-cased")

When using models stored on Redivis, we don't have to worry about whether our notebook has internet access, nor do we need to rely on the future availability of that particular model on Hugging Face.

To download a model from Hugging Face and save it to a Redivis dataset, you can either download the files from Hugging Face and re-upload them to Redivis, or you can use a notebook to programmatically upload the files. E.g.:

from huggingface_hub import snapshot_download
import redivis

# Download the model files from Hugging Face
snapshot_download(repo_id="google-bert/bert-base-cased")

# Specify an existing dataset and table on Redivis. 
# Consult the python docs for how to programmatically create datasets (apidocs.redivis.com)
table = redivis.organization("demo").dataset("huggingface_models").table("bert_base_cased")

# Add the downloaded model files to the table
table.add_files(directory='/home/root/.cache/huggingface/hub/models--google-bert--bert-base-cased/snapshots/cd5ef92a9fb2f889e972770a36d4ed042daf221e')

3b. Load a model from an external source

If your notebook has internet access, you can also use any other models that may be available on the internet. For example, we can load the same bert-base-cased model directly from Hugging Face:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

# Do cool things!

Note that if you are working in a notebook with disabled internet, this approach won't work, and you'll need to use the methods mentioned in either 2a or 3a above.

4. Load data

As a final step, you'll likely want to load data into your notebook to either fine-tune the model, perform inference, or otherwise experiment. There are thousands of datasets on Redivis, and you can upload your own data as well. You can learn more about loading data into a notebook in the python notebooks documentation. As a quick example:

import redivis

# Load tabular data as a pandas data frame, or a number of other formats (arrow, polars, iterator, etc)
table = redivis.organization("demo").dataset("ghcn_daily_weather_data").table("stations")
df = table.to_pandas_dataframe()

# Download unstructured data as files to your local disk
files_table = redivis.organization("demo").dataset("chest_x_ray_8").table("images")
files_table.download_files("/scratch/xray_images")

Of course, assuming your notebook has access to the external internet, you can also call various APIs to load external data sources.

from datasets import load_dataset

# load a dataset from Hugging Face
hf_dataset = load_dataset("Yelp/yelp_review_full") 

Next steps

At this point, you have all the tools at your disposal to perform cutting edge ML research. But of course, what you do next is totally up to you. We do recommend further familiarizing yourself with the examples and detailed documentation to take full advantage of the capabilities of Redivis notebooks:

  • Example workflow, Fine tuning a Large Language Model

  • Python notebook reference

  • R notebook reference

Fine tuning a Large Language Model (LLM)

This guide demonstrates using a Redivis workflow to import an existing LLM, fine-tune it on relevant data, and then run it on another, similar set of data we are interested in.

Workflow objective

Here, we want to fine-tune a pre-trained "foundational" LLM so that it can be used to score reviews. We will leverage an existing dataset that contains a collection of Yelp reviews and their scores to perform the fine-tuning, and then apply this classification model to other reviews (from Reddit) that do not contain an accompanying score. The goal here is to demonstrate how Redivis can be used to leverage, modify, and ultimately apply state-of-the-art LLMs to novel data.

This workflow is on Redivis! We suggest recreating it as you follow along to best learn the process.

1. Choose and explore data

For this workflow we'll need our initial data to train the model on (in this case Yelp reviews) and the data we want to apply the model to (Reddit posts). These data are already on Redivis, split across two datasets uploaded to the Redivis Demo organization: Yelp Reviews (Hugging Face) and Reddit.

Yelp reviews

To get started we want to understand this dataset and what information is in each table. We can look at the dataset page to learn more about it, including its overview information, metadata, and variable summary statistics. Since this dataset is public we can also look directly at the data to confirm it has the information we need.

It looks like there are two tables, one with reviews for testing a model and another with reviews for training a model. Clicking on each table in this interface shows that they both have two variables (label and text) and that the Train table has 650,000 records while the Test table has 50,000 records.

This data seems to be formatted exactly how we'll want to use it, so we don't need to do additional cleaning.

Reddit

This dataset contains over 150 million Reddit posts plus subreddit information, split across two tables. We can look more closely at the 33 variables in the Reddit posts table, including univariate statistics.

For this workflow, we just want to look at reviews from one specific subreddit dedicated to reviewing computer mice: MouseReview. If we click on the Subreddit variable name, we can see a searchable frequency table with all of this variable's values. Searching for MouseReview shows that this dataset contains 26,801 posts from that subreddit.

To move forward with this workflow we'll want to train a model on the Yelp dataset, and filter and clean the Reddit table to make it more usable with our model. In order to clean or transform data and do our analysis, we'll need to create a workflow.

2. Identify a base model to fine-tune

We want to leverage an existing model that understands language and can generally be used for language classification. There are many open-source models that might meet our needs here; in this example, we'll use Google's BERT-base-cased model.

This model is hosted on Hugging Face, so we could load it directly into our notebook at runtime. However, if our notebook uses restricted data, it might not have access to the external internet, in which case we'll need to load the model into a dataset on Redivis.

The Redivis dataset for this model can be found here. You can also learn more about loading ML models into Redivis datasets in our accompanying guide.

3. Create a workflow

At the top of any dataset page, we can click the Analyze in workflow button to get started working with this data.

You can add this dataset to an existing workflow you already have access to, or create a new workflow to start from scratch.

Add the additional datasets by clicking the + Add data button in the top left corner of the workflow and searching for the dataset by name.

4. Create a notebook and load model + data

Once we've added all our datasets to the workflow, we can get started. To begin, we'll create a python notebook based on the Yelp reviews training data, by selecting that table and clicking the + Notebook button.

To enable GPU acceleration, before starting the notebook, we'll choose a custom compute configuration with an NVIDIA-L4 GPU, which costs about $0.75 per hour to run (we could use the default, free notebook for this analysis, but it would take substantially longer to execute).

Configure our notebook to include a GPU

We'll also need to install a few additional dependencies to perform training and inference via Hugging Face python packages:

Specify dependencies by clicking the "Dependencies" button

With that, we can start our notebook! The full annotated notebook is embedded below, and also viewable on Redivis.

The general steps here are as follows:

  1. Load the training and test data from the Yelp reviews dataset

  2. Load the pretrained BERT model

  3. Train this base model on the Yelp data to create a fine-tuned model that can classify text reviews with a score of 0-4.
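
Condensed, those three steps look roughly like the sketch below. The table reference and checkpoint paths are illustrative, and the full annotated notebook on Redivis remains the authoritative version.

import redivis
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# 1. Load the Yelp training reviews (columns: "label" 0-4 and "text") from the workflow
train_df = redivis.table("yelp_train").to_pandas_dataframe()
train_ds = Dataset.from_pandas(train_df)

tokenizer = AutoTokenizer.from_pretrained("/scratch/bert-base-cased")
train_ds = train_ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

# 2. Load the pretrained BERT model with a 5-class classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "/scratch/bert-base-cased", num_labels=5
)

# 3. Fine-tune on the Yelp data
training_args = TrainingArguments(output_dir="/scratch/checkpoints",
                                  per_device_train_batch_size=16,
                                  num_train_epochs=1)
Trainer(model=model, args=training_args, train_dataset=train_ds).train()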

5. Prepare the Reddit data for inference

The Yelp data was ready to go as-is, with a simple text field for the review and integer value for the score. For the Reddit data, we just need to run a quick filter to choose posts from the appropriate sub-reddit.

We will use a transform to clean the data, as transforms are best suited for reshaping data at scale. Even though we might be more comfortable with Python or R, this dataset table is 83GB and it will be much easier and faster to filter it in a transform rather than a notebook.

Create a transform

Click on the Posts table in the Reddit dataset and press the + Transform button. This is the interface we will use to build our query.

Add a Filter step. Conceptually we want to keep records that are part of the subreddit we are interested in, and are not empty, deleted, or removed.

The final step in a transform is selecting which variables we would like to populate the resulting output table. In this case we just need the variables title and selftext.
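
For reference, the same logic could also be written as a single SQL query step. The variable names follow the Reddit posts table, while the exact placeholder values for empty or deleted posts are illustrative.

SELECT title, selftext
FROM _source_
WHERE subreddit = 'MouseReview'
    AND selftext IS NOT NULL
    AND selftext NOT IN ('', '[deleted]', '[removed]')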

With everything in place we will run this transform to create a new table, by pressing the Run button in the top right corner.

6. Use the fine-tuned model to classify Reddit reviews

Finally, we can apply our fine-tuned model to the subset of Reddit posts that we want to analyze. Ultimately, we produce a single output table from the notebook, containing each Reddit post and the associated score generated by our model.
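
A minimal sketch of this inference step is shown below. It assumes the fine-tuned model and its tokenizer were saved to the notebook's scratch disk, the filtered Reddit table is referenced by an illustrative name, and the output-table helper follows the pattern in the Python notebook reference.

import redivis
from transformers import pipeline

# Load the filtered Reddit posts produced by the transform (table name is illustrative)
posts = redivis.table("posts_output").to_pandas_dataframe()

# Score each post with the fine-tuned classifier (labels 0-4)
classifier = pipeline("text-classification", model="/scratch/fine_tuned_model", device=0)
posts["predicted_score"] = [
    result["label"]
    for result in classifier(posts["selftext"].fillna("").tolist(), truncation=True)
]

# Materialize the scored posts as the notebook's output table
redivis.current_notebook().create_output_table(posts)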

Next steps

Workflows are iterative, and at any point you can go back and change the source data, transform configurations, or notebooks and rerun them. Perhaps you want to look at other subreddits, or run the model on a larger sample of the Yelp data.

You can also fork this workflow to work on a similar analysis, or export any table in this workflow to analyze elsewhere.

We do recommend further familiarizing yourself with the examples and detailed documentation to take full advantage of the capabilities of Redivis notebooks:

  • Guide to running ML Workloads on Redivis

  • Python notebook reference

  • R notebook reference

Create & manage datasets

Overview

Datasets are a core component of Redivis. They are a versioned collection of tables containing data, alongside rich documentation and metadata.

Datasets can be hosted by organizations or individual users, and every dataset has its own Dataset page. Datasets can be shared with other users on Redivis according to their access configuration.

1. Create a dataset

Administrators can create datasets for their organization from the Datasets tab of their organization's Administrator panel. These datasets can be seen and managed by any administrator in the organization. When released, they will be visible on the organization’s home page to anyone who has overview access to the dataset.

Alternatively, anyone with a Redivis account can create a dataset on the Datasets tab of their Workspace. These datasets are by default only visible to their owner, and have simplified options to support sharing with your collaborators.

When you first create a dataset, it will be unpublished and only visible to other editors. This means you can edit the dataset, validate everything, and configure its access rules before releasing it.

2. Import data

At the core of every dataset is the data it contains, so we recommend starting here.

All data in a dataset is stored in tables. You can create a new table on the Tables tab of your dataset and start importing data. Redivis can upload data from your computer or another location you’ve linked, such as Box, Google Drive, AWS, and Google Cloud.

Once your data is finished importing you can validate that this table looks as you expect it to.

You can create more tables here if this dataset has multiple separate tables.

However, if your data is split across multiple files that all follow the same structure (such as a different file for each state, or each year of data but with generally the same variables) you will want to import all of these files to the same table, where they will be automatically appended together.

Learn more in the Upload tabular data as tables guide.

3. Populate metadata

Metadata is essential to helping your researchers find and utilize your dataset. While some metadata will be generated automatically, such as variable summary statistics and counts, other metadata will require additional input.

Dataset metadata

On the Overview tab of the dataset there are multiple suggested sections to help break down information. You can fill out any of these that apply, such as the methodology, tags, contact information, etc.

Redivis will automatically generate citation and provenance information based on what we know about the dataset, but you can update this information with anything more specific.

If you have additional information that you want to include that doesn't fit one of these headers, you can create a custom section. Custom sections can also be set to be visible only to certain access levels if you have sensitive information.

Table metadata

You should also populate the metadata on each table. Tables can have a description, as well as an entity field that defines what each row in the table represents. You can also define the temporal and geographic range on the table, when relevant.

Variable metadata

Each variable within a table has its own metadata. The variable name and type will be pre-determined from your data, but you should add a short label and longer description to each variable to help researchers understand what that variable measures.

Additionally, some variables will contain coded values, in which case you should provide value labels that represent the human-readable term for each code.

Learn more in the Create and populate a dataset guide.

4. Release

Once you are ready to make your dataset available to others, you'll need to release it. You can click the Review and publish or Review and release button in the top right of the page.

You'll want to double check all of your data before moving forward. While you can continue to edit the documentation and metadata after the version is released, the data in a version cannot be changed.

You can unrelease a version for up to 7 days after release, though this should generally be avoided. When updating your data to correct for mistakes, you’ll need to release a new version.

You should also confirm the access settings you set up when creating this dataset.

Once this dataset is released, it will become visible and available to anyone who would qualify for access.

Next steps

Edit data

You can use tools right on Redivis to create new versions of your dataset.

Learn more in the Edit data in a dataset guide.

Administer your organization

Organizations allow for groups, centers, and institutions to more easily work with data by providing administrators with tools to effectively version and distribute data from a central location.

Contact us to set up a new organization, and learn more in the Administer an organization guide.

Cleaning tabular data

Overview

Redivis datasets are a great place to host data for interrogation in a workflow, but you can also edit the underlying data.

Perhaps you found an issue with the source, or want to restructure it, before making it available to others. Redivis has all the tools you'll need to do this in a versioned, step-by-step process with transparency into the changes you've made.

1. Upload the raw data

If you haven't yet, upload the data you want to work with to a Redivis dataset and release the version. A personal dataset or one belonging to an organization works just as well.

2. Add this dataset to a workflow

Create a new workflow and add this dataset to it.

If you want to share the data transformation process with others for transparency, you can make this workflow public in the share modal. (Even though the workflow is public, only people with access to the underlying data will be able to see the data within it.)

For this guide, you can follow along in the Demo tables edits workflow.

3. Transform data

Select the table you want to make changes to and create a new transform.

Tip: you can reference the names of specific uploads within your table using upload pseudo-variables.

Use this transform to edit this table. Some common actions include:

Example edit 1: Rename and retype a variable

Create new steps to rename and retype any variables you'd like to update.

Example edit 2: Recoding the values of a variable

Create a new variable with the same name. "Keep" this variable and "Discard" the original variable. Select the method as "Case (if/else)" and create the conditions you want to recode to.

Example edit 3: Add a new variable

This example shows a new variable with today's date as "date_uploaded", but you could create any variable you want from the data in the table. Perhaps an aggregation that would be helpful to see with this data, or the sum of multiple other variables?

You can transform this table however you want, using the Graphical interface or SQL code. Make sure to move all variables in this table from the "Discard" section to the "Keep" section except for any variables that have been replaced and will be left behind (in this case the store_and_fwd_flag from the source table).
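
If you prefer code over the graphical interface, the three example edits above can be expressed together in one SQL query step. The variable names below are illustrative, apart from store_and_fwd_flag and date_uploaded.

SELECT
    * EXCEPT (store_and_fwd_flag, pickup_datetime),
    -- Example edit 1: rename and retype a variable
    CAST(pickup_datetime AS DATE) AS pickup_date,
    -- Example edit 2: recode the values of a variable, keeping its name
    CASE WHEN store_and_fwd_flag = 'Y' THEN 'yes' ELSE 'no' END AS store_and_fwd_flag,
    -- Example edit 3: add a new variable with today's date
    CURRENT_DATE() AS date_uploaded
FROM _source_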

Validate that this new table looks correct by looking at the output table below this transform!

You can make changes to any other tables in this dataset in this same workflow.

4. Re-upload the finished tables

Go back to the original dataset, and create a new version.

Open a table that you made changes to in your workflow, and click to "Import data."

Choose your merge strategy as "Replace" since you want to replace the existing table with the new one you created.

Select the data source as "Redivis" and type in the information from the workflow where you made your changes. The sequence here will be:

  1. Your user name

  2. The name of the workflow (underscores replace spaces)

  3. The table name (underscores replace spaces)

For example: username.testworkflow.table_name

Once it's done uploading you can validate that the data looks like it's supposed to.

Now you can close this table and repeat this process for any other tables in this dataset you've edited.

Once this is finished, you can add a link to the workflow in this dataset's documentation or release notes, along with a note on the changes. Remember that if you made the workflow public, anyone who has data access to this dataset can view the changes you made.

Now release this version! This new version of the data contains the edited data. Anyone using this data in a workflow will see that this dataset has a new version next time they open a workflow. You can see the release notes and the updated table of our Demo tables updates live.

Redivis stores data as compactly as possible, so if you are concerned about storage costs, only newly created records will add to your storage footprint. If you would like to delete the first version of the data, you can do so by opening the version modal in the dataset editor and clicking "Delete version".

Next steps

Start working with your data

Once your dataset is released, bring it into a workflow to transform and analyze it leveraging lightning fast tools from your browser.

Learn more in the Analyze data in a workflow guide.

Notebooks

Overview

Notebooks provide a flexible computation environment for data analysis, connecting Redivis tables to the scientific compute stack in Python, R, Stata, and SAS. Redivis notebooks are built on top of Jupyter notebooks and the .ipynb notebook format.

From within a notebook, you will be able to query and analyze any table within your workflow, and (optionally) generate an output table from the notebook. Because data referenced in a notebook never leaves Redivis, you can securely analyze data that would otherwise be bound by export restrictions.

See our Analyze data in notebooks guide for a step-by-step walkthrough of using notebooks on Redivis.

Notebook node states

As you work in a workflow, node colors and symbols will change on the tree view to help you keep track of your work progress.

State
Display
Details

Starting

Spinning circular icon

The notebook is being provisioned and will be available for use once it has started.

Running

Purple dot on the node corner

The notebook has started and cells can be executed. It will continue running until it is stopped or times out.

Stopped

Grey background

The notebook has stopped and only a read-only view is available.

Errored

Red exclamation icon

An error has occurred during the startup process or an issue has happened requiring a restart.

Stale

Yellow background

An upstream table has been changed since the notebook was last started.

Incomplete access

All black background, or dashed borders

You don't have full access to the node. Click on the Incomplete access button in the top bar to begin applying for access to the relevant datasets.

Step: Rename

Overview

The Rename step changes the name of an existing variable.

Example starting data:

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 | sam     | 74     |
 | pat     | 62     |
 *---------+--------*/

Example output data:

Rename score to final_score

/*---------+--------------*
 | student | final_score  |
 +---------+--------------+
 | jane    | 83           |
 | neal    | 35           |
 | sam     | 74           |
 | pat     | 62           |
 *---------+--------------*/

Step structure

  • There will be at least one rename block where you will define a variable and a new name.

Input field definitions

Field
Definition

Source variable

The variable that you want to rename.

New name

The name that your selected source variable will be given. Note that this input needs to follow Redivis variable naming rules.

Example

We can change the name of a variable to give more meaning to the data.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 100   | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 100   | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Source variable: The variable we want to rename is date so we select that here.

  • New name: date_taken would be a better name for this variable, so we enter it here.

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date_taken |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 100   | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 100   | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Transforms

Overview

A transform node in a Redivis workflow provides a powerful querying tool to reshape one or more tables into a new output table.

You will build your transform by choosing steps one at a time based on the actions you want to take on your data. At any point you can run the transform to execute each step sequentially and view the resulting table.

The most commonly used transform steps can:

  • Aggregate a table

  • Create variables based on existing data

  • Filter out rows you no longer need

  • Join additional tables

The transform interface translates your interface inputs to SQL code that you can view and edit. You can also intermix SQL code you've written between interface steps, or build an entire transform of your own SQL code if you wish.

See our Reshape tables in transforms guide for a step-by-step walkthrough of using transforms on Redivis.

Transform node states

As you work in a workflow, node colors and symbols will change on the tree view to help you keep track of your work progress.

State
Display
Details

Empty

White background

A run has never been attempted.

Executed

Grey background

A run has successfully completed and no changes have been made since then.

Invalid

Black exclamation icon

A run is not possible. This might be because you haven't finished building the steps or have input invalid information.

Errored

Red exclamation icon

A run has finished unsuccessfully. This might be due to an incorrect input you've set that our validator can't catch. Or something might have gone wrong while executing and you'll just need to rerun it.

Edited

Grey hash marks

Changes have been made to the contents of the transform since it was last run. You can either Run this transform or Revert to its previously run state to resolve it. Editing a transform makes the downstream table stale.

Stale

Yellow background

An upstream change has been made. This means the content of the node does not match the content of the node above it.

Queued

Static circular icon

Nodes that are waiting for an upstream run to finish before running themselves. When you run multiple nodes they will be executed in a logical order. Once upstream dependencies are finished, queued nodes will automatically start running.

Running

Spinning circular icon

A run is currently in progress. You can cancel running (or queued) transforms by clicking the Run menu in the top bar and selecting Cancel. Depending on how far along a running node is, it might not be possible to cancel it.

Incomplete access

All black background, or dashed borders

You don't have full access to the node. Click on the Incomplete access button in the top bar to begin applying for access to the relevant datasets.

Upload tabular data as tables

Overview

Redivis offers extensive tools for previewing tabular data and transforming it in workflows, but the data needs to be uploaded correctly in a tabular format for researchers to utilize these tools.

This guide assumes you have already started by Creating a dataset.

1. Locate the data you want to upload

You can upload data directly from your computer, or import from a linked account.

If importing, you'll want to get the relevant external account configured to your Redivis account before getting started.

The import tools allow for multiple uploads, so there's no need to combine files before importing them, but it's helpful to have them all in the same place.

2. Create tables

On your newly created dataset, the first step to uploading data is to create one or more tables that the data will be uploaded to.

The data files you currently have may or may not be how you want to store them on this dataset, so it's important to think about your data's structure before getting started.

For example, if you have multiple files that all follow the same schema, we strongly recommend uploading them as one table (for example, if you have a separate table for each year, or a separate table for each state, but the structure of each is the same). In the example of one file per state, this would allow researchers to query across all states, skipping the first step of doing up to 50 joins. Additionally, you generally shouldn't split out tables for performance reasons — even when querying billions of records, Redivis will execute in seconds.

When uploading files to tables, remember that every row in a table should represent the same "thing", or entity; we wouldn't want to combine county-level and state-level observations in one table.

If you haven't already, we very strongly recommend experimenting with the reshaping and analytic tools in a workflow which researchers will use to work with your dataset. Knowing how they will work with it might inform how you structure it during this setup process, and can save time for everyone. You can even add your unreleased dataset to a workflow for testing — click on "View dataset page" from the dataset overview, and then add the unreleased dataset to your workflow.

When you're ready click the Create new table button on the Tables tab of the dataset page and name your table to get started.

3. Upload tabular file(s) to create a table

To get started uploading, choose the data source. By default this is your computer, but you can choose any option from the dropdown menu.

Next, choose the file(s) or enter the paths of the file(s) you want to import.

If you select multiple files here, they will be automatically appended in this single table on upload based on common variable names. If a variable is missing in some of the files, that's ok, it will just be recorded as null for all records in that file.

For a full list of supported file types, as well as advanced functionality (such as wildcard imports) and error handling techniques, consult the Uploading data reference.

Once your files are selected, click the Import button. If the files are coming from your computer, you might need to wait until they are finished uploading to the browser before they can be imported into Redivis.

Learn more in the Uploading data reference section.

4. Verify uploads

As you upload files, you will see an overview of any files' progress and can click to view each file's data and additional information.

Once all uploads have completed, you can inspect the table (representing the concatenation of all of your uploads). Make sure to check the summary statistics and other analytical information to validate that the data are as you expected.

If you have more files to upload you can click the Manage imports button on the right side of the table at any time (up until releasing this version of the dataset).

Next steps

Continue uploading your dataset

Great metadata makes your dataset usable. Complete your metadata, along with configuring access, creating a sample, and releasing this version.

Learn more in the Create & manage datasets guide.

Step: SQL query

Overview

Redivis supports direct SQL queries for advanced functionality and for users who prefer to work in a programmatic interface. You can add one or more SQL query steps to any transform.

This step accepts SQL code and will execute in the order it has been placed within the rest of the steps.

Query syntax

SQL Dialect

Redivis supports the BigQuery Standard SQL Syntax and features, with certain limitations.

Referencing the source table

All queries must reference the SQL step's source, represented as _source_ in the query. If the SQL step is the first step in the transform, the source represents the transform's source table. Otherwise, the source represents the output of the previous step in the transform.

For example:

SELECT * 
FROM _source_
ORDER BY mean_drg_cost
LIMIT 1000

Joining tables

You can also reference any other table in the current workflow from your SQL step, following the same reference rules as those in the Redivis API. After typing the backtick character (`), available tables will be presented as autocomplete options.

For example:

SELECT t0.id, t0.name, t1.latitude, t1.longitude
FROM _source_ AS t0
INNER JOIN `demo.ghcn_daily_weather_data:7br5:v1_1.stations:g2q3` AS t1
    ON t0.id = t1.id

The query editor will automatically qualify referenced tables and value lists as you reference them, which ensures your query will continue to work if tables get renamed. In most cases, you can just reference the table by name, and the fully qualified reference will be inserted for you.

Referencing upstream variables

You can mix and match SQL steps with other step types within your transform. If you choose to do so, this will introduce some additional language semantics, since transforms allow variables with the same name to co-exist at any given step, delineated in the UI with the t1, t2, ... and v prefixes. The v prefix denotes any newly created variable, while the tN prefix denotes variables that have been joined from other tables.

In order to reference these variables, you must include the prefix in your SQL code before the variable name. The only exception to this rule is for variables prefixed by t0 — these should be referenced normally, allowing you to ignore these rules if you aren't mixing the SQL interface with other step types.

For example:

SELECT 
-- These variables are from the transform's source table, denoted by `t0` in the UI
    id, name, 
-- These variables were joined upstream, denoted by `t1` in the UI
    t1_latitude, t1_longitude, 
-- This variable was created upstream, denoted by `v` in the UI
    v_extracted_year
FROM _source_

Referencing value lists

You may also use value lists in your query — they can be referenced as a SQL parameter, prefixed with the @ symbol. Note that all value lists are explicitly typed, and you will need to cast the value list if there is a type mismatch.

For example:

SELECT provider_id 
FROM _source_
WHERE provider_state IN UNNEST(@`states:8a2h`) 
    -- If a value list is in the incorrect type, use the following code to convert it
    OR provider_fips IN UNNEST(
        (
            SELECT ARRAY_AGG(CAST(val AS STRING))
            FROM UNNEST(@`fips_codes:134a`)val
        )
    )
ORDER BY mean_drg_cost
LIMIT 1000

-- @states = ['CA', 'TX', etc...]
-- @fips_codes = [6071, 2999, 3242]

Limitations

  • DDL / DML queries are not supported

  • All queries must return fields of a valid Redivis type. The following BigQuery types are not supported in the result set, though they may be used internally within the query:

    • TIMESTAMP

    • NUMERIC

    • BIGNUMERIC

    • BYTES

    • INTERVAL

    • JSON

    • ARRAY

    • STRUCT

  • BigQuery ML syntax is not supported

  • BigQuery scripting language is not supported

  • Statements may only reference tables that exist in your workflow

  • Unnamed variables in the result are not supported

  • Duplicate variable names in the result are not supported

Errors

If your query is invalid, you will see an invalid icon and the error message displayed above your code. Transforms can't be run while any step is invalid.

Compute resources

Notebooks on Redivis provide a highly flexible computational environment. Notebooks can be used for anything from quick visualizations to training sophisticated ML models on a large corpus of data.

Understanding the compute resources available, and when to modify which parameters, can help you take full and efficient advantage of the high-performance computing resources on Redivis.

Default (free) notebooks

The default notebook configuration on Redivis is always free, and provides a performant environment for working with most datasets. The computational resources in the default notebook are comparable to a typical personal computer, though likely with substantially better network performance.

The default free notebook configuration offers:

  • 2 vCPUs (Intel Ice Lake or Cascade Lake)

  • 32GB RAM

  • 100GB SSD:

    • IOPS: 170,000 read | 90,000 write

    • Throughput: 660MB/s read | 350MB/s write

  • 16Gbps networking

  • No GPU (see custom environments below)

  • 6 hr max duration

  • 30min idle timeout (no code is being written or executed)

Custom compute configurations

For scenarios where you need additional computational resources, you can choose a custom compute configuration for your notebook. This enables you to specify CPU, memory, GPU, and hard disk resources, while also giving you control over the notebook's max duration and idle timeout.

In order to customize the compute configuration for your notebook, click the Edit compute configuration button in the notebook start modal or toolbar.

Choose from hundreds of machine types to allocate additional resources to your notebook

Custom machine types

Redivis offers nearly every machine type available on Google Cloud. These machines can scale from small servers all the way to massively powerful VMs with thousands of cores, terabytes of memory, and dozens of state-of-the-art GPUs.

These machine types are classified by four high-level compute platforms: general purpose, memory optimized, compute optimized, and GPU. Choose the platform, and the machine type therein, that is most appropriate for your workload.

When your notebook is running, keep an eye on the utilization widgets to see how much CPU / RAM / GPU your notebook is actually using. This can help inform whether your code is actually taking advantage of all of the compute resources.

For example, if your code is single-threaded, adding more CPUs won't do much to improve performance. Similarly, you might need to adjust your model training and inference workflows to take advantage of more than one GPU.

If you find that you've under- or over-provisioned resources, you can simply update your machine configuration and restart the notebook.

Custom machine costs

All custom machines have an associated hourly cost (charged by the second). This cost is determined by the then-current price for that machine configuration on Google Cloud.

In order to run a custom machine, you must first purchase compute credits, and have enough credits to run the notebook for at least 15 minutes. If you run low on credits and don't have credit auto-purchase configured, you will receive various alerts as your credits run low, and ultimately the notebook will shut down when you are out of credits.

Maximizing notebook performance

All notebooks on Redivis use either Python, R, Stata, or SAS. While Redivis notebooks are highly performant and scalable, the coding paradigms in these languages can introduce bottlenecks when working with very large tabular data. If you are running into issues with performance, we suggest:

  • Use transforms to clean and reduce the size of your data before analyzing them further in a notebook. When possible, this will often be the most performant and cost-efficient approach.

  • Adjust your programming model to load data lazily or on-disk to avoid exceeding memory limits. See our suggestions for working with larger tables in Python, R, Stata, and SAS, as well as the sketch following this list.

  • Adjust the compute resources in your notebook. This may help to resolve these bottlenecks depending on what is causing them!
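
As a sketch of the lazy-loading suggestion above, the Redivis Python client can hand tabular data to polars as a lazy frame, so that filters are applied before the full table is pulled into memory. The method, table, and column names here are assumptions to verify against the Python notebook reference.

import polars as pl
import redivis

# Reference a large table (identifiers here are illustrative)
table = redivis.organization("demo").dataset("ghcn_daily_weather_data").table("daily_observations")

# Build a lazy query; only the filtered result is collected into memory
rainfall = (
    table.to_polars_lazyframe()           # assumed method name; see the notebook reference
    .filter(pl.col("element") == "PRCP")  # illustrative column and value
    .collect()
)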

Some quick rules of thumb:

  • < 1GB: probably doesn't matter, use what suits you!

  • 1-10GB: probably fine for a notebook, though a transform might be faster.

  • 10-100GB: maybe doable in a notebook, but you'll want to make sure to apply the right programming methodologies. Try to pre-cut your data if you can.

  • >100GB: You should probably cut the data first in a transform, unless you really know what you're doing.

Data sources

Overview

A data source node contains data from across Redivis that you want to work with and have access to. These are usually datasets, but can also be other workflows. These nodes display overview information about the dataset or workflow they represent, and a list of the tables it contains.

You can click on any table to view its contents, or click "Transform" to build a transform on it.

Adding data source nodes to a workflow

Adding a data source to a workflow will make a copy of that dataset or workflow in the form of a circular node at the top of your workflow tree. You can add data to a workflow in any of the following ways:

  • Click the + Add data button in the top left of the toolbar within a workflow.

  • Click the Analyze in workflow button on a dataset page.

  • Click the Fork button in the toolbar of any workflow.

Restrictions

A workflow cannot contain two copies of the same data source, or two copies of the same version of a dataset. You can add a different version of a dataset to your workflow by right-clicking on the dataset name in the + Add data modal.

Datasets as a data source

Datasets are the most common data source you will add to your workflow. They contain the data you want to work with in the original state curated by the data owner. Usually datasets will contain one or more tables you can choose to transform or analyze in a notebook.

Dataset samples

Some large datasets have 1% samples, which are useful for quickly testing querying strategies before running transforms against the full dataset.

If a 1% sample is available for a dataset, it will automatically be added to your workflow by default instead of the full dataset. Samples are indicated by the dark circle icon to the top left of a dataset node in the left panel and in the list of the dataset's tables.

All sampled tables in the same dataset will be sampled on the same variable with the same group of values (so joining two tables in the same dataset with 1% samples will still result in a 1% sample).

To switch to the full dataset, click the "Sample" button in the top right of the menu bar when you have a dataset selected.

Your downstream transforms and tables will become stale, since an upstream change has been made. You can run these nodes individually to update their contents, or use the run all functionality by clicking on the workflow's name in the top menu bar.

Dataset versions

When a new version of a dataset is released by an administrator, the corresponding dataset node on your workflow tree will become purple. To upgrade the dataset's version, click the "Version" button in the top right of the menu bar when you have a dataset selected.

You can view version diffs and select whichever version you want to use here.

After updating, your downstream transforms and tables will become stale. You can run these nodes individually to update their contents, or use the run all functionality by clicking on the workflow's name in the top menu bar.

Workflows as a data source

Workflows can be added to another workflow in order to build off existing analysis. You might want to continue an analytical pipeline that you've built elsewhere, or elaborate on someone else's analysis. You will have access to all tables this workflow contains.

All workflow data sources are linked to their original workflow and will automatically update with any changes made to the original.

Data source node states

As you work in a workflow, node colors and symbols will change on the tree view to help you keep track of your work progress.


Administer an organization

Overview

Organizations are where administrators centrally manage datasets and who can access them. All organization administration happens from the Administrator panel which administrators can access from the Organization home page.

If you're interested in using Redivis for your research group, center, or institution, contact us to get an organization set up.

1. Add other administrators

You can add any Redivis account to be an administrator of your organization in the Administrator panel from the Settings tab.

Note that all administrators will have full access to the organization and its datasets, including the ability to modify access rules, approve access requests, and add or remove other administrators, including you.

2. Build your organization's presence

To help brand your organization and best communicate the organization’s purpose to others, you'll want to fully populate and brand the Organization home page.

In the Administrator panel on the Settings tab / Public profile section you can customize:

  • The organization's full name

  • An organization description which will appear at the top of your about section, and in any institution searches

  • A custom brand color

  • The organization's logo

  • The data portal's cover photo

  • A header image that will appear in the top left of users' windows when on your data portal

  • Header links to other resources, such as organization-specific documentation, events, and other resources.

  • Rich text and images that will display in the main content section of the data portal. This is a place to provide high-level information about your organization and how researchers should use your data portal.

Want a custom URL for your data portal (e.g., data.university.edu)? Contact us and we'll work with your IT team to get it configured.

3. Create datasets

You'll need to create datasets to upload and distribute data to your researchers. You can create datasets from the Datasets tab of your Administrator panel.

Here are some pieces to consider before getting started:

Dataset structure

It's worth thinking through your broader data management strategy. Assess the data that you currently have — what format are the files in? Where are they located? How are they organized?

Consider what datasets you'll want to create. Datasets on Redivis are made up of one or more semantically related tables, and each table can contain data from one or more uploaded files. Moreover, all permissions happen at the dataset level — if you grant a user access to a dataset, they will have access to all of its tables.

Citations and DOIs

If you’d like to issue DOIs for your organization’s datasets, you can configure your organization to do so in the Advanced section of the Settings tab. DOIs will make it easier for researchers to cite your data when they publish, and allow you to track publications using your data as well as their broader impact.

Learn more in the guide.

4. Build access systems

Before you can configure access to a dataset, you'll want to consider and build out the relevant access systems for your organization as a whole.

Direct access & Member access

If you have a smaller organization where you know everyone working with your data, you might want to consider directly granting access to datasets to specific members, or granting access to everyone who is a member.

This can be set up on any dataset individually, or you can create a Permission group on the Permission groups tab to save the entire configuration of access levels and assign it to multiple datasets.

Requirement-based access

In many cases, however, you’ll want to develop more process-driven access control systems and gather additional information from your members. To do so, you can set up requirements that must be completed by members or their studies, and approved by an administrator, before gaining access.

You can build requirements on the Requirements tab of the administrator panel, and save groups of them together as Permission groups on the Permission groups tab which you can assign to multiple datasets.

Export restrictions

Finally, you may want to control how your data is exported from Redivis, if at all. While members with data access will always be able to work with data in a Redivis workflow, you can limit their ability to export data. For example, you may choose to allow exports only to certain environments, or only upon administrator approval, or not at all.

Learn more in the guide.

5. Manage members and studies

As members and studies apply to access data, you may need to take action to approve (or reject) their access requests. You can find all of your pending access requests wherever you see an alert on the administrator panel.

Members

As users join your organization and apply to work with data, they will appear in the Members tab of your administrator panel. You can also invite anyone to become a member by adding them to the members list.

Studies

Studies allow for researchers to organize their work around a conceptual study, or research project, and for you to grant access in the context of that study. Any studies that are working with your organization's data will appear in the Studies tab of the administrator panel automatically, and you can approve any requirements for studies here.

To be alerted to new access requests, you should configure a notification email (and frequency) from the Settings tab of the administrator panel.

Learn more in the guide.

6. Gain insight into usage

Redivis provides tools to help you keep an eye on your organization and how its data is being utilized.

You can filter, sort, and change the columns for any list in your administrator panel, and download the corresponding report for a snapshot of current information. You can also dig into the logs and generate analytical reports for more information.

Next steps

Expand your reach

If you are part of a larger institution, you can contact us about getting an institution-wide Redivis page to help users discover new organizations and datasets.

Geospatial data

Overview

Geospatial data on Redivis behaves quite similarly to tabular data: each feature is ingested as a single row within a table, alongside any metadata for that feature. This approach mirrors tools like PostGIS, R spatial features, and geopandas, allowing you to query and join your geospatial data at scale.

Supported file types

Redivis supports importing geospatial data from several common GIS formats: geojson, shp, shp.zip, kml, and parquet. Parquet files with geospatial metadata (often referred to as GeoParquet) are the most performant and robust option, though as a newer standard, these files are less common. For other geospatial file types, Redivis first converts the file to a geojson representation (using the relevant conversion tooling), and then imports the geojson into a table.

Each feature will be imported as one row, with the geometry column containing the WKT representation for that feature. Additional feature properties will be mapped to variables in your table, with any nested properties flattened using the . separator. Note that Redivis only supports 2-dimensional, unprojected (WGS 84) geometries. Other projections might cause the import to fail, and any extra dimensions will be stripped during ingest. If you are uploading a .shp.zip that contains projection information, the geometries will automatically be reprojected as part of the import process.


Geography data in text-delimited files

In addition to uploading geospatial data using one of the formats listed above, you can also import geographic data encoded within a text-delimited file (e.g., a csv). In this case, the geographic data should be encoded as strings using the Well-Known Text (WKT) representation. This is the same format used when exporting geography variables as a CSV. WKT in CSVs will be auto-detected during data ingest.
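As a minimal sketch, a WKT string column could also be converted to a geography value in a SQL step; the table and variable names below (county_centroids, geo_wkt) are hypothetical.

    -- Parse WKT strings such as 'POINT(-122.17 37.43)' into a geography value
    SELECT
      county_code,
      ST_GEOGFROMTEXT(geo_wkt) AS geometry
    FROM county_centroids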

Quotas & limits

Limits for upload file size, maximum number of variables, and other parameters are specified here.

Version control

Overview

All datasets on Redivis are automatically versioned. Any change to a dataset's data, whether in its tables or its files, will require a new version. A version is a locked copy of the data, supporting future reproducibility and confidence in the persistence of researchers' data workflows.

Changes to documentation and metadata do not create a new version, though different versions do have independent documentation and metadata. For example, if a new version contains a new table, or new data, you will likely want to document this information separately from the previous version. However, if you only want to enrich the existing metadata on the current version, you can do so without creating a new version.

If you are a dataset editor and have new data, find a mistake to correct, or would otherwise like to modify the existing content of a dataset, you can create and release a new version.

Semantic version tags

To help researchers better understand the differences across versions, Redivis uses semantic versioning, of the form v[major].[minor]. The first version of every dataset is v1.0. For subsequent versions, the tag will be incremented automatically depending on the changes being released.

  • Major update: Existing code may not run.

    • Triggered when variables in the new version are renamed, deleted, or retyped. Also occurs if any tables from the previous version were deleted.

  • Minor update: Existing code will generally run.

    • Changes are limited to adding / removing records, recoding variables, adding variables, and adding tables.

Version history

On any dataset page, you can view the current version tag next to the dataset title, and click on this tag to view the full version history and switch to a different version of this dataset.

Within a workflow, you can change the version of a dataset by selecting the dataset node and clicking the Version button at top right. If there is a new version available, the dataset node will be highlighted to indicate that you might want to upgrade.

Creating a new version

When it's time to update a dataset's data, you'll want to create a new version. To do this, navigate to the dataset editor and click Create next version.

Before this version is released, it will be tagged as next. Only dataset editors will be able to see the next version on the dataset page and use it in their workflows.

A dataset can have up to 1,000 versions. If your use case exceeds this limit, consider creating a new dataset that imports the previous dataset's tables once this limit has been reached.

Version storage

All versions of a dataset contribute to that dataset's total size, which in turn will count towards your usage quotas or organization billing (depending on whether the dataset is owned by you or an organization).

This total size will be displayed in the dataset editor, alongside the size for the current version. For datasets with one version, this total size may be slightly larger than the current version, as Redivis stores certain metadata behind the scenes to support future versioning.

As new versions are created, Redivis will efficiently compute a row-level difference between the versions — only additions and updates to existing data will contribute to the dataset's total storage size, preventing data that is consistent across versions from being double-counted.

Adding or removing columns (or changing column types) won’t affect row uniqueness, as the underlying storage represents all values as strings. Only the storage size of the new column would be added.

Version unrelease

If the most recent version of a dataset has been released in the last 7 days, and there is no Next version already created, you'll have the option to unrelease it.

This will revert the dataset to the exact state it was in before the version was released. If anyone who is not a dataset editor has this version in a workflow, they will lose access to the data, though they can revert to a previous version if one exists.

Unreleased and published datasets

When a dataset is first created it will be marked as unreleased.

As you work on the initial version of the dataset, you can see the changes you make to the data in the version history modal. In this unreleased state you can add the dataset to a workflow and analyze the data it contains. Note that the only people who can see it in the workflow are dataset editors. If you make changes to the data or delete it, your workflow will change instantly to reflect those changes.

To add an unreleased or unpublished dataset to a workflow, click the View dataset tab from the edit interface, and click the Analyze in workflow button. Since this dataset is not published it will not appear in the Add dataset interface in workflows.

Once you are ready to make your data available to non-editors, you will need to publish it. This will simultaneously release your version as well as make your dataset discoverable. As soon as it is published it will be accessible to all who meet its access rules.

If this is an organization dataset, it can be unpublished at any time. This will "unlist" the dataset so that it will not be listed on your organization home page and even users who meet the access requirements will not be able to see it. Anywhere it is used in a workflow it will instantly be made unavailable.

Unpublishing might be used if you need to temporarily halt usage of the dataset but don't want to disrupt all of the access rules.

Working with previous versions

When analyzing data in a workflow you can change any dataset to a previous version to instantly change every table to the corresponding table in that version. To do this, click on the dataset and click the Version button in the toolbar. Select the version you'd like to analyze and confirm. You can always switch versions at any point.

Normally you can only have one copy of a dataset in a workflow, but it's possible to add a second version if you'd like to compare them. In any workflow, click the + Add data button and locate the dataset. Right click on the dataset and choose to add another version to the workflow.

.parquet

GeoParquet

The GeoParquet specification is a modern standard for working with column-oriented geospatial data. If available, this format is the most robust and performant way to ingest geospatial features into Redivis.

.geojson

GeoJSON

Assumes an object with a "features" property, containing an array of valid geojson features. Each feature will be imported as one row, with additional properties mapped to columns in the table. Nested properties will be flattened using the . separator. Note that Redivis only supports 2-dimensional, unprojected (WGS84) geometries. Other projections might cause the import to fail, and any extra dimensions will be stripped during ingest. See working with geospatial data for more information.

.geojsonl, .ndgeojson, .geojsons

Newline-delimited GeoJSON

Same as the .geojson specification outlined above, except each feature is given its own line. Importing .geojsonl (as opposed to .geojson) will be significantly faster.

.kml

Keyhole Markup Language

Will be internally converted to .geojson (via ogr2ogr), and then imported as specified above.

.shp

Shapefile

Will be internally converted to .geojson (via ogr2ogr), and then imported as specified above. Note that the shapefile must use the WGS84 (aka EPSG:4326) projection. If you have additional files associated with your shapefile (e.g., .shx, .prj, .dbf), create a .zip of this folder and import according to the .shp.zip specification below.

.shp.zip

Zipped ESRI shapefile directory

Many shapefiles will be collocated with additional files containing metadata and projection information. These files are often essential to parsing the shapefile correctly, and should be uploaded together. To do so, create a .zip directory of the folder containing your .shp file and supplemental files. The zip file must end in .shp.zip. These will then be converted to .geojson (via ogr2ogr), and imported as specified for the .geojson format.

If projection information is available, the source geometries will be reprojected into WGS84. If no projection information is available, your data must be projected as WGS84, or the import will fail. Note that only one layer can be imported at a time. If you have a directory containing multiple shapefiles, create a separate .shp.zip for each layer.


Dataset type

Dataset icon in the middle of the circle

This data source is a copy of a dataset on Redivis.

Workflow type

Workflow icon in the middle of the circle

This data source is a copy of another workflow on Redivis.

Sampled

Black circle with 1% icon

Only possible for dataset source nodes. This means that you are using a 1% sample of the data. When a dataset has a sample, it will automatically default to it when added to a workflow. You can change this to the full data and back at any time in the dataset node.

Outdated version

Purple background on version number

Only possible for dataset source nodes. This means that you are not using the latest version of the dataset: either you have intentionally switched to an older version, or this dataset's administrator has released a new version that you can switch to.

Incomplete access

All black background, or dashed borders

You don't have full access to the node. Click on the Incomplete access button in the top bar to begin applying for access to the relevant datasets.


Transform concepts

Overview

A transform is a series of steps detailing a data transformation that are executed in sequential order when the transform is run. These steps and a final variable selection create an output table that can be inspected and further transformed.

Transforms vs. notebooks?

There are two mechanisms for working with data in workflows: transforms and notebooks. Understanding when to use each tool is key to taking full advantage of the capabilities of Redivis, particularly when working with big datasets.

Transforms are better for:

  • Reshaping + combining tabular and geospatial data

  • Working with large tables, especially at the many GB to TB scale

  • Preference for a no-code interface, or preference for programming in SQL

  • Declarative, easily documented data operations

Notebooks are better for:

  • Interactive exploration of any data type, including unstructured data files

  • Working with smaller tables (though working with bigger data is possible)

  • Preference for Python, R, Stata, or SAS

  • Interactive visualizations and figure generation

Steps

The majority of building a transform is choosing and completing your data transformation steps. Steps represent one specific action (e.g. filter rows, join tables) and are completed in the Redivis transform interface. Every step is fundamentally a piece of SQL code that can be examined and reproduced by running it in any environment with the same data.

When running the transform, steps will be executed in order, building off of what directly came before them.
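To make this concrete, here is a rough sketch (with hypothetical table and variable names) of how two sequential steps, a row filter followed by a create-variables step, can be expressed in SQL, with each step reading from the result of the one before it:

    -- Step 1: filter rows (hypothetical source table and variables)
    WITH step_1 AS (
      SELECT *
      FROM encounters
      WHERE encounter_date >= DATE '2015-01-01'
    ),
    -- Step 2: create a new variable from the result of step 1
    step_2 AS (
      SELECT
        *,
        EXTRACT(YEAR FROM encounter_date) AS encounter_year
      FROM step_1
    )
    -- Final output variable selection
    SELECT patient_id, encounter_date, encounter_year
    FROM step_2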

Steps have additional options in the ⋮ menu to help in the building process:

  • Collapse / expand

  • Annotate

  • Disable / enable

  • Reorder (this might change the outcome)

Output variables

The final task in constructing a transform is to select which variables you want to have in your output table. You will use the variable selector in the bottom pane of the transform to choose which to keep.

Running transforms

Transforms are iterative and are designed to easily check your progress as you work. As long as a transform is valid, you can press the Run button to generate your output table. Click on this output table node in the workflow tree to see your results once the run is complete.

Invalid state

The transform interface will prevent you from running a transform that is invalid. This will be displayed with an invalid icon on the step or the part of a step where the issue is. Usually when you have an invalid alert it means some crucial information is missing and you can follow the invalid icons to find the locations you'll need to update.

History and revert

Every time you run the transform, a snapshot of the transform at that point in time is saved. You can click on History to view the log of these runs, and click on any snapshot to revert the transform to that point in time.

Your transform will then be in the Edited state and no changes will be made to the output table. Once you run it, this configuration will become a new snapshot at the top of the transform history list.

Checking the output

Output tables don't just store data; you can use the variable statistics generated in the table node to make sure your transform did what you expected it to do.

Common things to check might be total number of rows and variables in your output table, distinct and null counts of a key variable, or even running a query on the Query tab of the table to filter by a particular value.
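For example, a quick sanity-check query might count rows, distinct values of a key variable, and nulls. This is only a sketch; the table and variable names are placeholders.

    SELECT
      COUNT(*) AS row_count,
      COUNT(DISTINCT patient_id) AS distinct_patients,
      COUNTIF(encounter_date IS NULL) AS null_encounter_dates
    FROM my_output_table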

If you find the table isn't what you want, you can go back to the transform to make changes and rerun it. Transforms were created to be iterative!

Additional tools for working with transforms

Name and rename

By default transforms are named with a number based on the order they were created. We suggest renaming your transforms with descriptive names so you and your collaborators can quickly find them later on. You can rename a transform by clicking on the title of the node, or on the Rename option in the transform ⋮ More menu.

If you have not specifically renamed the connected output table, renaming a transform will also rename this output table.

Change the source table

All transform nodes have one source table. You can join in multiple tables, but the source table designation will affect how the joins execute. You can change the source table from the ⋮ More menu in the transform. Note that if the new source table has different variables your transform might become invalid until you update it.

Split and combine transforms

All transforms can be split at the step level into two different transforms by clicking Split in any step's menu. Additionally, two transforms can be combined into one by right clicking on a table to Remove it.

You might want to split a transform above a tricky step to see what the output table would look like at that point in the process. This can be a key tool in troubleshooting any issues and understanding what might be going wrong.

After splitting a transform to check an output table, the next logical step might be to combine these two transforms back into one again. Or perhaps you have a string of transforms which you no longer need the output tables for and want to reduce the size of your workflow.

Copy, paste and insert

As you go, you might find yourself wanting to take certain data cleaning actions multiple times or move them around in your process. You can right click on any transform in the workflow tree to see options for copying it or deleting it. Right click on any table node to paste the copied transform. You can also insert transforms above other transforms.

Steps and parts of steps can also be copied and pasted within the same transform or across multiple transforms.

View SQL code

All transforms generate SQL, which will be executed when the transform is run. To view the code for a particular step, click on View SQL in the step menu. You can also see the SQL for the entire transform from within the ⋮ More menu of the transform.

You can convert this code into a SQL query step and run it to achieve the same outcome, or edit the code directly.

SQL Concepts

Redivis uses SQL as the basis for data transformation for many reasons: it is designed for working quickly and efficiently with large tables, its core concepts and basic operations are easy to learn, and it is highly reproducible.

While Python, R, Stata, and SAS are commonly used in data analysis, they are not well suited to working with very large tables. Executing queries on millions of records can take hours and in many cases fail. We have provided a point and click interface for building transforms to make working in a perhaps unfamiliar query language easier.

Fundamentals

  • Tables are data structures containing rows and variables

  • Variables represent fields in the table that all rows contain values for

  • Rows represent an individual observation in the table, with values across all variables

  • A SQL query will always operate on data from one or more tables and output a single data table.

Building a query

When you are using the interface or SQL code, it is best to start the process by figuring out what you would like your output table to look like. What are the variables? What do you want each row of the table to represent?

Once you figure out what you want your output to look like, you can work backwards and formulate the steps of your query to build the shape you'd like.
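For example, if the desired output is one row per patient with a count of their visits, the query can be built backwards from that shape. This is a sketch with hypothetical table and variable names:

    -- Desired shape: one row per patient, with a visit count
    SELECT
      patient_id,
      COUNT(*) AS visit_count
    FROM encounters
    GROUP BY patient_id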


Configure access systems

Overview

One of the main tasks of organization administrators is to manage who has access to datasets. Redivis has tools to support any level of complexity in data systems for your restricted data and makes it easy for organization administrators to define, manage, and audit access rules across these various use cases.

1. Consider your organization's access needs

Do you mostly have public datasets? Will you have a small group of researchers with access to pretty much every dataset? Will you need to gather information from researchers before they use your datasets? Do you have high risk data that can't ever leave the system? Or some combination of the above?

Redivis supports access systems for all of these configurations – for organizations of varying sizes – but you'll want to use different tools to achieve these very different goals.

For example, if you have few datasets, you might find that permission groups for bulk managing access rules aren't particularly helpful. Or if you have a small group of researchers who you all know closely, you might not need to set up process-driven access rules through requirements, and can instead grant access to certain researchers on an individual basis.

We'll walk through all the tools available to organizations for managing access, but you'll likely want to pick and choose from what's available to meet your organization's specific needs.

2. Understand access levels

All interactions with data on Redivis require the user to have the appropriate access level to the data for a given action. Ideally your data would be as permissive as possible to allow for the greatest exploration from researchers before they need to start applying for access.

Dataset access has five levels:

  1. Overview: the ability to see a dataset and its documentation.

  2. Metadata: the ability to view variable names and univariate summary statistics, but not to retrieve any identifiable information or multivariate relationships from the data.

  3. Sample: the ability to view and query a dataset's 1% sample. This will only exist for datasets that have a sample configured.

  4. Data: the ability to view and query a dataset's tables, and work with them in workflows.

  5. Edit: the ability to edit the dataset and release new versions.

Access levels are cumulative. For example, in order to gain data access you will need to have gained metadata access as well.

We strongly recommend making your dataset's metadata as open as possible. This will reveal variable names and aggregate summary statistics, but will not allow researchers to view, query, or export the raw data in any way.

Being able to see metadata greatly improves researchers' discovery experience, allows them to better assess a dataset's utility upfront, and can even reduce your administrative workload. If researchers can understand a dataset before applying for access, they'll submit fewer access applications for datasets that are ultimately a dead end.

Learn more in the Access levels reference section.

3. Understand available access tools

Membership

Anyone wanting to apply for access to restricted datasets in your organization must first be a member. You can configure whether memberships are restricted to certain identity providers (such as your institutional login), and whether they are approved automatically or require administrator review. You also have the option to configure access to datasets to "All members."

Direct access

Permission granted directly to a researcher to instantly gain access to a dataset at a specific level. Researchers can also request access to datasets with this configuration.

Example usage: a dataset that will only be shared with a small number of people who are already known to administrators

Member requirements

A form for members to fill out. This can be set to require approval from an administrator or be automatically approved. It can also have an expiration date. These are global to your organization, and when assigned to multiple datasets a user will only need to fill it out once.

Example usage: a demographic form gathering researcher personal information, or a data use agreement signed PDF

Study requirements

Similar to requirements, but instead of each user needing to fill them out individually, only one requirement needs to be completed for the entire study, which can include multiple researchers. A single researcher may also have multiple studies. Each study working with the dataset will need to fill out its own study requirement, and any queries or exports of the data will be tied to that study.

Example usage: a funding proposal for a research project

Data export restrictions

A rule defining that a dataset can only be exported to a specific export environment, as configured on the Administrator panel Settings tab.

Example usage: limiting exports to a specific server environment

Learn more in the Configuring access reference section.

4. Create requirements

If you want to work with requirements, you'll want to get started making them and planning out how they will work across datasets.

Perhaps you want one requirement for all members to fill out about their field of research, which is necessary to gain access to any of your datasets, but another 4 requirements with different data use agreements that will apply only to their specific datasets.

To get started, go to the Requirements tab of the administrator panel and click the New requirement button. You will need to start by selecting if this will be a Member requirement or a Study requirement.

You can use the form builder to collect different information, including standard form responses, file uploads, and e-signatures.

Learn more in the Requirements reference section.

5. Create permission groups

A permission group is an access configuration that can be assigned to multiple datasets and managed centrally.

You don't need to use permission groups, and it might not make sense to do so if each of your datasets has a different access configuration and you aren't using requirements. But if you have any overlap between datasets and want to enforce consistency, or want to use requirements, you'll want to make one.

To get started, go to the Permission groups tab of the administrator panel and click the New permission group button.

This interface requires you to set an access paradigm for each access level of the dataset.

Perhaps you will set the overview access level to be public, the metadata access level to be available to all members, and data access level to be direct access (meaning you will have to directly grant access to users, or respond to their requests for access).

Or perhaps you will set the overview access level to public, and assign multiple requirements to the metadata and data access levels (meaning that anyone who is approved for all of the requirements will automatically gain that access level).

For any case you can assign data export restrictions here and choose whether you want to manage access to the dataset's sample (if it exists) differently than the full data.

If overview access to a dataset isn't public, non-approved users will not be able to see the dataset or its name in any way. In some cases, this may be the intended behavior, but remember there will be no way for researchers to apply for these datasets.

Instead, for these hidden datasets, an administrator will need to first explicitly grant overview access before researchers can view the dataset and request further access.

Learn more in the Permission groups reference section.

6. Assign access permissions to datasets

Finally you'll need to apply these access permissions to actual data!

Open any of your datasets and click the Configure access button on the top of the page. This configuration setup will look very similar to configuring the permission group.

You can either create a custom configuration here, or you can assign this dataset to one of the Permission groups by clicking the dropdown menu in the top right corner of this modal.

This is also where you will manage any direct access requests for the dataset.

7. Verify your setup

As an administrator of this organization, you will have access to all datasets no matter what your access configuration is or what requirements you have filled out.

We recommend checking that your access system works as you expect by either looking at a dataset while you are logged out or in your browser's incognito mode, or by making a second Redivis account using a different email address that is not linked to your administrator account.

Next steps

Grant access to data

You have some shiny access systems, but they won't work if you don't approve user requests for access.

Learn more in the Grant access to data guide.

Expand your reach

If you are part of a larger institution, you can contact us about getting an institution-wide Redivis page to help users discover new organizations and datasets.

Workspace

Overview

The workspace is the home of your content on Redivis. It is created when you make a Redivis account and is accessible by clicking the My workspace button at the right of the header bar from anywhere on the site. Your workspace is only visible to you and is where you will manage your work on Redivis.

Your workspace dashboard has links and information to help you orient yourself in the work you are doing on Redivis. Beyond the dashboard, you will see pages listed on the left bar with different aspects of your account.

Workflows

Any workflows that you've created, or that have been shared with you, will show up here. You can create a workflow by clicking the + New workflow button on this page in order to start working with data. Clicking on a workflow here will open up the workflow interface.

Datasets

Any datasets that you've uploaded or can edit will appear here. You can create a dataset by clicking the + New dataset button on this page in order to start uploading and sharing your data. Clicking on a dataset in this list will open the dataset editor which will allow you to work on the dataset.

Organizations

This page lists all organizations that you're a member of. If you have applied for access to any restricted data, a copy of your submissions will be listed under the organization you applied to. You can search for new organizations to join by clicking the + Find new organization button.

Studies

Studies allow users to more easily work in groups. You can create a study by clicking the + New study button on the studies tab of your workspace. Any study you are a part of will appear in your workspace. Any edits you make will appear for all study collaborators.

Dataset library

This section contains any dataset that you have bookmarked. You can manually bookmark a dataset by clicking on the bookmark button next to the dataset title on any dataset page. Redivis will also automatically bookmark datasets that you add to workflows, or where you have applied for access. You can remove these at any time by right clicking on a dataset in this list and selecting the Remove from library option.

Logs

All activity you take when working with data on Redivis is logged. You can visit the Logs page of your account to view the details of your Query, Notebook, and Export actions.

Settings

Public profile

You can specify the following information, public to all users and organizations on Redivis:

  • Name: This is your full name displayed across Redivis. This will appear when collaborators search for you to share their workflows, and the name administrators will see when interacting with any access steps you've submitted. It does not need to be unique.

  • Username: A unique handle for your account.

  • Profile picture: A picture visible next to your name throughout the platform.

  • ORCID iD: Link your ORCID iD so organizations can see research you are working on. See orcid.org for more information on setting up an account.

  • Disciplines: Areas of interest and study that best represent your work and research.

Changing your username

If you would like to change your username, you may do so at any time. Any URLs or API references to your old username will automatically redirect. You may change your username up to 10 times.

Authentications

You can use any email listed here to log in to your account. If you have more than one email address, the one marked as 'Primary' will be the one to receive any emails. You can authenticate with additional institutional or Google emails by clicking the Add authentication button. If the email you add is associated with an institution or organization, the authenticated information we receive from them will be listed here. A single authentication will be used by organizations to verify your identity when you apply for membership.

Data sources

To allow you to import your data from wherever it resides, Redivis supports multiple integrations with outside services.

Redivis will only ever read data from these sources; it will never modify or overwrite content. Additionally, Redivis will only access and transfer content when you specify a location in one of those sources as part of the data upload process.

API tokens

You can create and manage tokens here for working with the Redivis API. See the API documentation for more information.

Compute credits and billing

You can use compute credits to purchase advanced compute environments for notebook nodes within workflows.

Secrets

Secrets allow you to define simple key/value pairs that are securely stored and can later be referenced from within a Redivis notebook.

Common use cases for secrets include storing external API tokens and other credentials for systems outside of Redivis.

Secrets are only accessible to you from within a Redivis notebook.

Learn more about working with secrets in notebooks >

Advanced

This tab allows configuration of the following:

  • Contact email: Specify a contact email to receive email notifications for any update to your data access applications. If you apply to access data hosted by an organization, that organization will be able to see your contact email. By default, this is the email you used initially to create your Redivis account.

  • Communications: Allows you to configure email notifications for any update to your data access applications, as well as occasional product update emails regarding changes to Redivis. Note that, regardless of your email notification settings, you will still see a comprehensive list of notifications in the web platform.

  • Security: Shows all current sessions of Redivis across devices and browsers, and provides the ability to log out of any of these sessions.

  • Delete account: Allows permanent deletion of your account, including all workflows you have created.

Note you cannot delete your account if you are the only remaining administrator of an organization, or if you still have any non-deleted datasets.

Stata license

In order to enable Stata notebooks for your account, you can provide your Stata license information here (alternatively, you'll be able to use Stata if one of your organizations provides a license). Specifically, you'll need the license "code", "authorization", and "serial number", which should all have been provided as part of purchasing Stata.

SAS license

In order to enable SAS notebooks for your account, you can provide your license information here (alternatively, you'll be able to use SAS if one of your organizations provides a license). Because of how SAS handles cloud deployments, the steps to enable SAS are a bit more complicated than for Stata. Specifically, you'll need to complete the following steps:

  1. Run SAS Deployment Wizard to install SAS Studio on a supported Linux 64-bit operating system. During the installation, change the default location for the SAS Studio installation to /usr/local/SASHome.

  2. For details about how to install SAS Studio, refer to the installation and configuration instructions provided by the product.

  3. Create a TAR file that includes the SASHome directory:

    tar -czvf SASHome.tar.gz /usr/local/SASHome

Once this tar file has been created, please reach out to [email protected] to share the file so that we can enable SAS for your account.

Select first/last encounter

E.g.: I want the date of the first and last hospital admission for every patient in my table

Let's take an example table which contains all hospital visits in one facility over many years. Each record contains information about what happened in the visit, including a patient id to identify who was present. An individual patient might be represented in one or many records in this table since each record represents a separate hospital visit.

/*------------+----------------*
 | patient_id | encounter_date |
 +------------+----------------+
 | 1          | 2012-01-01     |
 | 1          | 2011-01-01     |
 | 1          | 2010-01-01     |
 | 1          | 2013-01-01     |
 | 2          | 2009-01-01     |
 | 2          | 2008-01-01     |
 | 2          | 2015-01-01     |
 | 3          | 2014-02-01     |
 *------------+----------------*/

Let's say that for our research we want to find the first and last encounter — or more formally, the min and max encounter date for each patient.

Computing a min or max value is an aggregate operation that scans all records and chooses the highest or lowest value. To compute the min or max value in a variable we will want to use an analytic new variable method. We can do this either by aggregating our table (which would drop records and/or variables) or by using a partition to calculate a new variable without changing the rest of the data.

In this example we will use a partition to create a new variable since we want the rest of the data to remain unchanged.

Variable concepts

Conceptually we will define our aggregation variable as the date of the encounter since that is where we want to look for the min and max values.

We will define our partition as the patient id since we want to find the min or max for each patient. This will limit the min or max value scan to all values of our aggregation variable across each unique value of our partitioned variable (patient id).

You can see more examples of using partitions with analytic methods on the Create variables step page.

Defining the new variable

For this particular case, we want to find the MIN and MAX of the encounter_date of each patient; in other words, partitioned on patient_id. This will create two new variables, min_encounter_date and max_encounter_date, for each patient.
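Expressed as SQL, this analytic method corresponds roughly to window functions partitioned on patient_id. This is a sketch; the table name is hypothetical.

    SELECT
      patient_id,
      encounter_date,
      MIN(encounter_date) OVER (PARTITION BY patient_id) AS min_encounter_date,
      MAX(encounter_date) OVER (PARTITION BY patient_id) AS max_encounter_date
    FROM encounters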

When we run this query, note that no records will be dropped; rather, these new variables will have consistent values for any given patient_id.

/*------------+----------------+--------------------+--------------------*
 | patient_id | encounter_date | min_encounter_date | max_encounter_date |
 +------------+----------------+--------------------+--------------------+
 | 1          | 2012-01-01     | 2010-01-01         | 2013-01-01         |
 | 1          | 2011-01-01     | 2010-01-01         | 2013-01-01         |
 | 1          | 2010-01-01     | 2010-01-01         | 2013-01-01         |
 | 1          | 2013-01-01     | 2010-01-01         | 2013-01-01         |
 | 2          | 2009-01-01     | 2008-01-01         | 2015-01-01         |
 | 2          | 2008-01-01     | 2008-01-01         | 2015-01-01         |
 | 2          | 2015-01-01     | 2008-01-01         | 2015-01-01         |
 | 3          | 2014-02-01     | 2014-02-01         | 2014-02-01         |
 *------------+----------------+--------------------+--------------------*/

Now, let's say we only want to keep records which contain each patient's first and last encounter. We can do this easily by creating a new row filter:
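In SQL terms, that row filter is roughly equivalent to the following sketch, using this example's variable names and a hypothetical name for the table produced above:

    -- Keep only each patient's first and last encounters
    SELECT *
    FROM encounters_with_min_max
    WHERE encounter_date = min_encounter_date
       OR encounter_date = max_encounter_date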

We can also generalize this approach to find the Nth encounter for each patient — take a look at the RANK, DENSE_RANK, and ROW_NUMBER methods.

And if we want to be more specific in our partition (e.g., for the first / last encounter in a given calendar year), we can always add additional variables (e.g., year) when defining our partition.
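A sketch of that generalization, ranking encounters within each patient and calendar year via ROW_NUMBER (the table name is hypothetical; a rank of 1 marks the first encounter of that year):

    SELECT
      patient_id,
      encounter_date,
      ROW_NUMBER() OVER (
        PARTITION BY patient_id, EXTRACT(YEAR FROM encounter_date)
        ORDER BY encounter_date
      ) AS encounter_rank_in_year
    FROM encounters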

Step: Order

Overview

The Order step sorts the table based on one or more variables.

Example starting data:

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 | sam     | 74     |
 | pat     | 62     |
 *---------+--------*/

Example output data:

Order by score descending

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | sam     | 74     |
 | pat     | 62     |
 | neal    | 35     |
 *---------+--------*/

Ordering large tables (>1GB) across variables with a large number of unique values requires substantial memory and isn't parallelizable. Order clauses in such cases may significantly slow your transform or cause it to fail.

Step structure

  • There will be at least one order block where you will define a variable and a sort order.

  • When multiple blocks exist, the variables will be ordered and then sub-ordered in sequence.

Input field definitions

Field
Definition

Order by

The variable containing the values that will be sorted.

Sort

A choice of how all values in the Order by variable will be sorted: ASC (nulls first), ASC (nulls last), DESC (nulls first), or DESC (nulls last). Note that variables of the geography data type aren't sortable.
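Like other steps, an Order step compiles to SQL. Two order blocks translate roughly to the following clause; this is a sketch with hypothetical table and variable names.

    SELECT *
    FROM scores
    ORDER BY
      score DESC NULLS LAST,        -- first order block
      test_date ASC NULLS FIRST     -- second order block (sub-ordering)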

Examples

Example 1: Basic order

We can sort a table to quickly see the highest scores.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 100   | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 100   | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields

  • Order by: The score variable has the data we want to sort on, so we select it here

  • Sort: We want the highest scores at the top, so we choose DESC. There are no null values in this table, so we can choose either nulls first or nulls last and get the same result.

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | midterm | 100   | sam     | 2020-05-01 |
 | final   | 100   | sam     | 2020-06-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | quiz    | 83    | jane    | 2020-04-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 *---------+-------+---------+------------*/

Example 2: Ordering on multiple variables

Let's say instead we want to sort first by score, and then break any ties using the most recent test date.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 100   | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 100   | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

First block

  • Order by: The first variable we want the data sorted on is score, so we choose it in the first block.

  • Sort: We want the highest scores first, so we choose descending. There are no null values in this variable, so it doesn't matter whether nulls appear first or last; we choose DESC (nulls last).

Second block

  • Order by: The second variable we want to sort on is date so we put it here.

  • Sort: Since we want the most recent (highest) values first, we want it to be descending. There are no null values in this variable so where we put the nulls does not matter. We choose DESC (nulls first).

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | final   | 100   | sam     | 2020-06-01 |
 | midterm | 100   | sam     | 2020-05-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | quiz    | 83    | jane    | 2020-04-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 *---------+-------+---------+------------*/

Analyze data in a workflow

Overview

Workflows are where you work with data on Redivis. In a workflow you can query, merge, reshape, and analyze any data that you have access to, all from within your web browser.

In a workflow, you can construct reproducible data transformations and analyses, and share and collaborate with your peers in real time.

1. Create a workflow

Add a dataset to a new or existing workflow from any Dataset page where you have "Data access" by clicking the Analyze in workflow button.

Within workflows you can navigate between entities on the left side of the screen, and inspect them further on the right panel. You can inspect your dataset further by clicking on any table to see its cells and summary statistics.

To add more data to this workflow you can click the Add data button in the top left of the workflow toolbar. It is also possible to add other linked workflows to this workflow. This is useful as you develop more complex analyses that you want to segment into discrete pieces of work that you can link together.

You can find this workflow later by going back to your workspace.

2. Transform data

Transforming tables is a crucial step in working with data on Redivis. Conceptually, transforms execute a query on source table(s), whose results are materialized in a new output table. In most cases you'll want to use transforms to reshape your data to contain the information you're interested in, before analyzing that table in a notebook or exporting it for further use.

To create a Transform, select a table in this dataset and click the +Transform button. You can get started here building a query through the point and click interface or writing SQL code by adding a SQL step.
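As a minimal sketch, a SQL step might look something like the following; the table and variable names are placeholders, and you should consult the SQL step documentation for how to reference your source table:

    -- Keep recent records and rename a variable (placeholder names)
    SELECT
      patient_id,
      encounter_date AS visit_date
    FROM my_source_table
    WHERE encounter_date >= DATE '2018-01-01'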

For all transforms you will need to select which variables you want to keep in your output table. The rest of the steps are up to you. Some common operations you can get started with include:

  • Joining in any other dataset table or output table in this workflow

  • Creating variables

  • Filtering records to match defined parameters

  • Renaming variables or changing their type

  • Aggregating data

Once you've built your query, execute it by clicking the Run button in the top right of the transform. This will create a new output table, where you can inspect the output of your query by clicking on the table beneath the transform in the tree and making sure it contains the data you would expect.

From here you can create a new transform from this output table to continue reshaping your data, or go back to your original transform to make changes and rerun it.

As you become more familiar with transforms, you can start doing more advanced work such as geospatial joins, complex aggregations, and statistical analyses.

Learn more in the Reshape data in transform guide.

3. Analyze data in a notebook

Once you have a table you're ready to analyze, you can select any table and click the + Notebook button to create a notebook that references this table.

Notebooks are available in Python and R, as well as Stata or SAS (with a corresponding license). Notebooks come pre-installed with common libraries in the data science toolkit, but you can also customize the notebook’s dependencies and startup script to create a custom, reproducible analysis environment that meets your needs.

The default notebook configuration is free, and provides access to 2 CPUs and 32GB working memory, alongside a 60GB (SSD) disk and gigabit network. The computational power of these default notebooks is comparable to most personal computers, and will be more than enough for many analyses.

If you're working with larger tables, creating an ML model, or performing other particularly intensive tasks, you may choose to configure additional compute resources for the notebook. This will cost an hourly rate to run based on your chosen environment, and require you to purchase compute credits on your account.

Notebooks come pre-populated with some starter code you can use to import data, and the API docs contain comprehensive documentation and further examples.

From here it’s all up to you in how you want to analyze and visualize your data. Once you’ve finalized your notebook, you can easily export it in different formats to share your findings!

To learn more about analyzing data, see our Work with data in notebooks guide.

4. Share and collaborate

You can share your in-progress work or finished results with collaborators by sharing this workflow.

Researchers can work side by side in this workflow in real-time. Leave comments to communicate, and see a visual cue for what each person is working on. You can even collaborate within a running notebook at the same time.

If any of the data in your workflow is restricted, your collaborator must also have access to those datasets in order to view their derivatives within your workflow.

Next steps

Share and collaborate

Redivis workflows are built for collaboration and include real-time visuals to see where collaborators with edit access are working in the workflow, and a comments interface to discuss changes asynchronously.

Share your workflow to work with collaborators in real time, and make it public so that others can fork off of and build upon your work.

Export data

If you'd like to export data to a different system, you can download it in various file formats, reference in Python / R, or visualize in tools such as Google Data Studio.

Learn more in the Export to other environments guide.

Browse our example workflows

Redivis workflows excel at working with large tables, whether it's filtering and joining, complex aggregation and date manipulation, or visualization and analysis.

Learn more in the Example workflows guide.

Other

Cast

Converts the type of a variable. Consider using the "retype" step for additional functionality. –> learn more

CASE WHEN @expression IS NULL THEN NULL ELSE COALESCE(SAFE_CAST(@expression AS @castType), ERROR(FORMAT('Could not cast @expression to @castType, encountered value: %t', @expression))) END

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

@castType

any of: INT64, FLOAT64, STRING, TIME, DATETIME, DATE, BOOLEAN

true

(Choose a type)

@safe

any

true

-

Coalesce

Takes the first non-null value of a set of values. –> learn more

COALESCE(@expression)

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

variables or literals

any

true

-

Constant

Create a constant value as a variable –> learn more

undefined

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@literal

variable or literal

any

true

-

Hash

Returns an MD5 hash of all values as a base64 encoded string. Non-string values will be first coerced to strings. Note that order of inputs will affect the hash. –> learn more

TO_BASE64(MD5(CONCAT(@expression)))

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

variables or literals

any

true

-

Sample

Create a random, deterministic value in the range of [0, 1) based on a specific set of variable(s). –> learn more

(FARM_FINGERPRINT(CONCAT(@variable)) + POW(2, 63)) / POW(2, 64)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

variables

any

true

-
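As a usage sketch, the generated value can feed a row filter to keep a deterministic subsample; the table and variable names below are hypothetical.

    -- Keep a reproducible ~10% sample keyed on patient_id
    SELECT *
    FROM (
      SELECT
        *,
        (FARM_FINGERPRINT(CONCAT(CAST(patient_id AS STRING))) + POW(2, 63)) / POW(2, 64) AS sample_value
      FROM encounters
    )
    WHERE sample_value < 0.1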

UUID

Generate a random universally unique identifier (UUID) –> learn more

GENERATE_UUID()

Return type

string

Parameters

Build your own site with Observable

This guide outlines an example static site hosted by the Redivis team. Follow along at the Observable on Redivis Github repository to see all the details and build your own!

Overview

It's easy to make use of Redivis data anywhere on the web, whether by interfacing programmatically via our client libraries, visualizing it via one of our export integrations, or embedding a table directly in another webpage.

To present Redivis data in a fully customizable way, you may want to build your own dashboard-style site. The following example shows a static site built with Observable Framework, deployed via Github Pages, that connects to Redivis data using our client libraries.

To try a lighter-weight "dashboard" for analyzing and visualizing your data, you can use integrated Redivis notebooks. With notebooks, you can work directly in your Redivis workflow alongside relevant source data, leverage a breadth of techniques and libraries available in the Python, R, Stata or SAS ecosystems, and easily maintain a central asset to annotate, collaborate, and share.

1. Identify your Redivis data

Identify the data sources on Redivis that will provide the content to build your site. You may be interested in presenting facets of a dataset already hosted on Redivis, highlighting tables you uploaded yourself, or showing outputs generated in one of your workflows. In our example, we'll use weather station locations contained in the GHCN Daily Weather dataset, and outputs from a public Redivis workflow that uses the same source data, to visualize locations and aggregate precipitation measurements from around the world.

To reference a Redivis table, choose the "Export table" option and navigate to the "Programmatic" tab (e.g., for the GHCN Daily Weather "Stations" table), and see the code snippet for unique identifiers for the owner, dataset/workflow, and table. More details about referencing Redivis resources can be found in the API documentation.

2. Build your Observable project

With chosen data content in mind, the next step is to choose a set of tools to build our custom dashboard and deploy it to the web. Of the numerous choices, Observable Framework provides an approachable option with support for working with data in many programming languages (not just javascript), powerful visualization libraries built in, and excellent documentation. To start with a copy of our example dashboard, you can clone our public Github repo and follow the README.md instructions to develop a similar site in a local environment. To create a brand new project, and to reference additional details and development strategies, see the Observable Getting started guide.

3. Load data with a client library of choice

You'll need to generate a Redivis API access token with appropriate permissions. See the Authorization section in our API documentation to create your own token to read data from Redivis.

To successfully authorize data fetching functionality in your development environment, be sure to export a REDIVIS_API_TOKEN environment variable, with the following terminal command: export REDIVIS_API_TOKEN='MY_API_TOKEN'

IMPORTANT: API access tokens operate like passwords, and can allow another party to access resources on your behalf.

You should never share tokens, and you should avoid committing them to source control where collaborators may have access (either now or in the future). See the Deploy and share your site section below for details on how to use secrets to store the Redivis API token necessary for your project.

Writing a "data loader"

In Observable, a data loader is a file that accesses data and writes it to a standard output channel for consumption by renderers, to be ultimately displayed on your site.

In our example, we'll illustrate the use of both javascript and python to pull data from Redivis into our dashboard, using the Redivis API via the js and python client libraries.

Using the redivis-js client library

In a web environment, the redivis-js client library allows for fetching and manipulating data via simple javascript functions.

In our example, we first install the library by specifying the latest version in our package.json and running npm install in our development environment to install or update all specified packages.

Then, in the Observable data loader file redivis-js-precipitation.json.js, we authorize access, and then listVariables and listRows from our specified precipitation table.

Using the redivis-python client library

In a python environment, the redivis-python client library allows for fetching and manipulating data via simple python functions.

Using the redivis-python client library requires that you develop your web application in a virtual python environment. Observable documents this process – essentially, create an environment with uv, activate it on your machine, and install the relevant packages to the environment.

In our example, we'll create a requirements.txt file that specifies the latest version of the client library, both for development and future deployment. Running the following commands in your virtual python environment will upgrade pip, and then install any package specified in the requirements.txt file:

    python3 -m pip install --upgrade pip
    pip install -r requirements.txt

Then, in the Observable data loader file redivis-python-geodata.json.py, we create a pandas dataframe from our specified stations table, and output the records.

4. Visualize data

With data loaded and parsed into JSON format, it can now be manipulated and presented as you see fit. Among a wealth of modern web visualization tools is Observable Plot, which provides a polished selection of simple visualizations and corresponding functions to group and filter data, and pairs nicely with our development framework.

In our example, we use several Observable Plot functions to display our data on a map – using Plot.plot() with Plot.geo() to show a globe – and in a hex-binned histogram – using Plot.dot() and Plot.hexgrid() for data aggregation. We also add a range input to allow viewers to rotate the globe visualization.

Then, we use the built-in display command to print the JSON-shaped payloads, to give the viewer a quick look at the raw data.

5. Deploy and share your site

With our dashboard finalized, you can deploy your project anywhere on the web. Observable provides a quick command (npm run build) to generate static files to be hosted on a server of your choice, as well as a deployment service if you'd like to host with Observable.

Deploying to Observable

You can run npm run deploy in the command line, and follow the prompts to deploy your project to an address on Observable. Further configuration and permissioning is available through your Observable account.

Deploying to Github Pages

Github Pages provides a hosting service that's easy to integrate with your Github repository and any Github Actions needed to deploy your project.

In our example, we deployed our project to Github Pages in a few simple steps.

First, we need to specify a REDIVIS_API_TOKEN to support Redivis client library authorization in the live deployment. We used an "action secret" within the Settings of the Github repo, which sets the environment variable REDIVIS_API_TOKEN to our appropriately scoped token and remains completely private while our repository stays public.

Next, we write a Github Action to build and deploy the project, which will run anytime we push to the main branch of our repository. The action is specified via the deploy.yaml file, and relies on a set of open-source actions and our own custom code to do the following steps:

  • Checkout the code

  • Set up a Node.js environment, with a specific node version (20.6)

  • Install our python dependencies (via pip, specified in requirements.txt)

  • Build the static files for the site

  • Deploy the files (in the generated dist folder) to Github Pages via a specific branch (gh-pages)

After adding this file to your Github repository, any push to the main branch will trigger a deployment following the steps above, the status of which can be viewed under the Actions tab of the repository.

Finally, Redivis has hosted a number of different team projects at a custom endpoint specified in our organization's Github Pages settings, which you can see at labs.redivis.com.


No-code visualization

This guide demonstrates using a Redivis workflow to gather key variables from different tables, clean them, and consolidate them into a single table for analysis in Google Looker, all using point and click interfaces.

Workflow objective

We'll use Gini coefficient data (a common measure of income distribution) from the American Community Survey (ACS) to build a map that shows the range of income inequality by US county in 2018.

This workflow is on Redivis! We also suggest you recreate this workflow as we go to best learn the process.

1. Explore data and start a workflow

We have already uploaded all relevant data for this workflow into a single dataset in the Redivis Demo organization called Example Workflow Dataset, which contains three tables:

  • ACS Gini coefficient table

  • ACS table that measures population by county

  • National Weather Service table that maps US counties to lat/long coordinates.

Add this dataset to your workflow and click on the dataset node to see all of the available tables. Select Gini Index by County and look at the data format by clicking Cells.

2. Clean data

We want to map the Gini estimate and population by US county in our final data visualization. To map data points by county, we will need a variable that represents the latitude, longitude coordinates of that county. We also want the state, county name, and five-digit county code.

Therefore, we want our final dataset to include six distinct variables:‌

  • Gini Estimate

  • State

  • County Name

  • County Code

  • Latitude, Longitude

  • Population

We will get Lat/Long and Population by joining our table with other tables later on in the workflow. For now, we can create a transform to start reshaping our data.

Rename: Gini Estimate

We already have this variable in our dataset: B19083_001E. We will create a Rename step to rename this to gini_estimate so we can keep better track of it.

New variables: County and State Names

We want to break up our NAME variable into two separate columns, county and state. So we will add the Create variables step and use the Regexp extract method.

This method allows users to create a new variable from the characters of an existing variable that match a specific pattern. For example, to create a new variable for county, we can select all characters in NAME preceding the comma. We can then add a new block and do a similar process on the NAME variable to get the State.

New variable: County Code

The last five digits in the GEO_ID variable represent the county code. We create another new variable block and name our new variable County_code. Select the Substring method and select all characters starting at index 10, up to a max length of five.
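
To make these two new-variable steps concrete, here is roughly what they do to a single ACS-style row (a quick Python sketch purely for illustration; the NAME and GEO_ID values are hypothetical examples of the ACS format, and the transform itself requires no code):

import re

# Hypothetical ACS-style values for one county row
NAME = "Autauga County, Alabama"
GEO_ID = "0500000US01001"

# Regexp extract: characters before / after the comma
county = re.match(r"^(.*?),", NAME).group(1)      # "Autauga County"
state = re.search(r",\s*(.*)$", NAME).group(1)    # "Alabama"

# Substring: 5 characters starting at index 10 (1-based), i.e. the county code
county_code = GEO_ID[9:14]                        # "01001"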

Variable selection

Finally, we choose which variables to keep or discard in our transform. We’ll keep all variables except for GEO_ID, NAME (county and state combined), and B19083_001M (the margin of error for the Gini estimate).

Click the run button in the upper right-hand corner to run this transform.

3. Sanity check the output table

Click on the newly generated output table to investigate the output of the transform. We can see that this table has the same number of records as our initial table and has the four variables we selected. If we click on the County and State variables we can see in the frequency tables that they look like we wanted them to.

Since this table looks like we expect we can move on to the next step! Otherwise we'd need to go back to the initial transform to change our inputs.

4. ‌Join geographic information

We now have a table with the Gini estimate, state, county name, and county code, but we need the latitude and longitude information in order to map each county and we also want to account for the size of each county with a population variable. To do this, we'll perform two joins.

We could continue to work in our initial transform but we are choosing to create new transforms for each join to keep steps separated conceptually and provide output tables to reference along the way.

Matching county to latitude, longitude data

First, we will match each county code to its lat/long coordinates by joining with the "County to Lat/Long Coordinates" table.

Create a new transform and add a Join step.

A Left Join returns the complete source table (Gini Index by County) and all matching records from the right table (County to Lat/Long). We set the County_code variable in the source table equal to the County_code variable in the right table, which matches a set of lat/long coordinates to each county code (if there is no coordinate, it will return Null).

When you set two variables equal to each other in a Join, they must be of the same type. If, for example, you set a string type variable equal to an integer type variable, you will have to retype the string as an int (or vice versa).

Some county codes, however, map to more than one set of latitude, longitude coordinates so we must create two new variables in this transform: unique_lat and unique_long.

For unique_lat, we will partition on all kept variables except LAT and take the average of the LAT variable. This says: for each county, compute the average of all possible latitudes and store that average in the variable unique_lat.

We will select all variables from the source table as well as unique_lat and unique_long in our output table. Finally, we select "distinct" in our variable selector so that duplicate records are dropped and only distinct rows remain.

Running this transform outputs the following table, where each county now corresponds to a unique set of latitude and longitude variables.

Transform #2: Output‌

Matching county to population data

We now have all the desired variables in our dataset except for population per county, which we will need when we create the final visualization.‌

Create a new transform and add a Join step.

Using the same steps from the previous join, we will perform a second join to match each county to its total population by incorporating the Population by County table (also in the original dataset). We select a Left Join and set the County_code variable in the source table (Gini Index by County) equal to the County_code variable in the right table (Population by County).

Selecting all variables to remain in our output table and running this transform outputs the following table.

5. Final data cleaning

We will be using Google Looker Studio to visualize this data so we will need to update our data to make it work smoothly with their specified format for geographic data. Using their reference sheet, we can see that we will need our latitude and longitude information in the format:

Comma separated latitude and longitude decimal values (e.g., "51.5074,-0.1278" specifies London, England)

We will need to combine our two separate values and add a comma between them.

We can do that using the Concat method for creating a new variable, but that method only accepts string inputs, so we'll first need to retype the unique_lat and unique_long variables from float to string using the Retype step.

Then we use the Create new variables step to make a new variable named latitude_longitude and use the Concat method to concatenate the unique_lat string, a string containing a comma, and the unique_long string.

We can now discard unique_lat and unique_long and keep the combined variable, latitude_longitude.

Running this transform yields the final table. From here, we can edit variable metadata and/or download our new dataset in the format that we'd like.

6. Export data to Looker Studio

The next step would normally be to create a notebook in this workflow and use Python, R, or Stata to analyze this table. However, if we want to create a quick visualization and aren't familiar with any of those coding languages, we have easy options to export data to other systems such as Google Looker Studio.

For this example we will link this table directly to Google Looker Studio by clicking Export table on this table.

We could download our final table and then re-upload it into Looker Studio, but we choose to link the table through the Redivis connector so that if we come back to this original workflow and make changes, they will be reflected in the visual we're about to make there.

Follow the prompts of the connector to log in and authorize access, then we'll need to indicate the table we'd like to use. For that we'll need:

  • The owner of the workflow's ID (shown in the top menu bar of the workflow)

  • The name of the workflow (shown in the middle of the black workflow toolbar)

  • The name of the table (shown on the title of the table node)

Note that the connector won't recognize some symbols (such as :) so you might need to update one of them if there is an issue connecting.

When the table is being imported we will have the option to change the type of our variables. We'll need to change the type of our latitude_longitude variable from string to the Latitude, Longitude geography type.

Then click Create report to get started!

7. Build a visualization

For this example we are going to build a bubble map that shows the size of population against the intensity of gini disparity in certain regions.

Get started by adding a chart and selecting the bubble map type. We will need to select Redivis as the data source for this map.

Then we will need to define which of our variables map to which parts of the visual. We will also want to change how they are aggregated, from SUM to AVG.

  • Location: latitude_longitude

  • Tooltip: County

  • Size: (AVG) population

  • Color metric: (SUM) gini_estimate

The resulting map is automatically interactive. Users can hover over each bubble and view the county name, population, and Gini coefficient. The bubble size is determined by the county population and the bubble color is determined by the Gini value. View the interactive report:

You can continue to create a variety of visualizations in this report, including scatter plots, Google maps, and stacked bar charts.

Refer to the Looker Studio Help pages for additional specific guidance!

Next steps

Perhaps we see something in this workflow we want to tweak, or we want to go back and change some of our data decisions. Workflows are iterative, and at any point you can go back and change your source data, your transform configurations, or your notebooks and rerun them.

Notebooks can also create output tables, which allow you to sanity check the work you did in the notebook or create a table to use in another notebook or transform. You can also fork this workflow to work on a similar analysis, or export any table in this workflow for work elsewhere.

Step: Stack

Overview

The Stack step adds or removes rows based on another table whose variables align with your existing variable structure.

Example starting data:

Source (t0)             table1 (t1)
/*---------+--------*   /*---------+--------+
 | student | score  |    | student | score  |
 +---------+--------+    +---------+--------+
 | jane    | 83     |    | tom     | 83     |
 | kim     | 35     |    | sherri  | 92     |
 | sam     | 74     |    | pat     | 48     |
 | zay     | 62     |    | jade    | 87     |
 *---------+--------*/   *---------+--------*/

Example output data:

Union rows from table1.

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | kim     | 35     |
 | sam     | 74     |
 | zay     | 62     |
 | tom     | 83     |
 | sherri  | 92     |
 | pat     | 48     |
 | jade    | 87     |
 *---------+--------*/

Step structure

Field descriptions

Field
Description

Stacked table

The table containing the rows you wish to add.

Stack type

How the rows will be added. (More information below.)

Variable alignment

How you will define the variable structure between the source and the Stacked table.

  • All matching variables: Will automatically match variables and only include variables with a matching name; the rest will be dropped.

  • All variables: Will automatically match variables and keep all variables in the output. Values will be marked null where there is no match for an existing variable.

  • Manually match variables: Will require you to align variables between the two tables.

Retype variables

Variables will be retyped in order to be aligned.

Only keep distinct rows

If any added rows are an exact duplicate of an existing row, they will be dropped from the output.

Create variable for table name (Union only)

To record where each row came from, you can opt to create a new variable containing the name of the table it originated in.

Stack types

The most common stack type is a Union, which can be particularly useful when combining data which has been broken up into multiple tables with the same structure.

Union

Appends all of the stacked table’s rows to the source table.

Except

Keep only those distinct rows in the source that do not match any row in the stacked table.

Intersect

Keep only those distinct rows that appear in both the source and the stacked table.
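
For intuition only, here is a rough pandas analogy of the three stack types, using the example tables above (this is not how the transform runs; it simply mirrors the row logic):

import pandas as pd

source = pd.DataFrame({"student": ["jane", "kim", "sam", "zay"], "score": [83, 35, 74, 62]})
table1 = pd.DataFrame({"student": ["tom", "sherri", "pat", "jade"], "score": [83, 92, 48, 87]})

# Union: append all of the stacked table's rows to the source table
union = pd.concat([source, table1], ignore_index=True)

# Except: keep distinct source rows with no exact match in the stacked table
marked = source.merge(table1, how="left", indicator=True)
except_rows = marked[marked["_merge"] == "left_only"].drop(columns="_merge").drop_duplicates()

# Intersect: keep distinct rows that appear in both tables
intersect = source.merge(table1, how="inner").drop_duplicates()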

Examples

Example 1: Basic union

Let's say I have data broken out into two tables with the same structure. I want to add all rows together as they are.

Starting data:

Source (t0)             table1 (t1)
/*---------+--------*   /*---------+--------+
 | student | score  |    | student | score  |
 +---------+--------+    +---------+--------+
 | jane    | 83     |    | tom     | 83     |
 | kim     | 35     |    | sherri  | 92     |
 | sam     | 74     |    | pat     | 48     |
 | zay     | 62     |    | jade    | 87     |
 *---------+--------*/   *---------+--------*/

Input fields:

  • Stacked table: Our data is in table1, so we select it here.

  • Stack type: We want to add all rows so we select Union here.

  • Variable alignment: Somehow our variable names shifted between tables so we can't automatically align tables. We choose Manually match variables here and then fill in the variable names under Source variable and Stacked table variable.

    • Where the variable names are different, the output will keep the name of the Source variable when it exists.

Output data:

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | kim     | 35     |
 | sam     | 74     |
 | zay     | 62     |
 | tom     | 83     |
 | sherri  | 92     |
 | pat     | 48     |
 | jade    | 87     |
 *---------+--------*/   

Example 2: Bulk union and variable misalignment

To continue the previous example, let's say we have a third table of data, and it has additional information (a date variable) that was not present in the first two. We want to keep this information and also keep track of which rows came from which table.

Starting data:

Source (t0)             table1 (t1)             table2 (t2)
/*---------+--------*   /*---------+--------*   /*---------+--------+------------*
 | student | score  |    | student | score  |    | student | score  | date       |
 +---------+--------+    +---------+--------+    +---------+--------+------------+
 | jane    | 83     |    | tom     | 83     |    | barb    | 46     | 2020-01-01 |
 | kim     | 35     |    | sherri  | 92     |    | mitch   | 79     | 2020-01-01 |
 | sam     | 74     |    | pat     | 48     |    | oleg    | 68     | 2020-01-01 |
 | zay     | 62     |    | jade    | 87     |    | maria   | 85     | 2020-01-01 |
 *---------+--------*/   *---------+--------*/   *---------+--------+------------*/

Input fields:

  • We create one block for the table1 union, and a second block for the table2 union

  • Variable alignment: Since we want to keep all variables no matter if they have matches, we select All variables.

  • Create variable for table name: We check this box since we want to keep track of which table each row came from.

    • Note that Source table name is automatically populated with the name of this transform's source table.

Output data:

/*---------+--------+------------+---------------*
 | student | score  | date       | source_table  |
 +---------+--------+------------+---------------+
 | jane    | 83     | null       | cohort        |
 | kim     | 35     | null       | cohort        |
 | sam     | 74     | null       | cohort        |
 | zay     | 62     | null       | cohort        |
 | tom     | 83     | null       | table1        |
 | sherri  | 92     | null       | table1        |
 | pat     | 48     | null       | table1        |
 | jade    | 87     | null       | table1        |
 | barb    | 46     | 2020-01-01 | table2        |
 | mitch   | 79     | 2020-01-01 | table2        |
 | oleg    | 68     | 2020-01-01 | table2        |
 | maria   | 85     | 2020-01-01 | table2        |
 *---------+--------+------------+---------------*/  

Example 3: Except

Let's say we have a table with information about students and their test scores, and we have identified some students that have dropped the course and that we no longer need in the original table. We have gathered those students in table1.

Starting data:

Source (t0)             table1 (t1)
/*--------+-------*      /*------+-------*
 | name   | score |       | name | score |
 +--------+-------+       +------+-------+
 | jane   | 83    |       | kim  | 35    |
 | kim    | 35    |       | pat  | 62    |
 | sam    | 74    |       *------+-------*/
 | pat    | 62    | 
 | tom    | 83    |
 | sherri | 92    |
 *--------+-------*/       

Input fields:

Output data:

/*--------+-------*
 | name   | score |
 +--------+-------+
 | jane   | 83    |
 | sam    | 74    |
 | tom    | 83    |
 | sherri | 92    |
 *--------+-------*/       

The rows that were in both tables are now removed.

Example 4: Intersect

Let's say we have two tables full of students, one with students who took our fall semester class and one with students from the spring semester class. We want to only keep rows for students who took both classes, i.e. those who are present in both tables.

Source (t0)             table1 (t1)
/*--------+-------*      /*------+-------*
 | name   | id    |       | name | id    |
 +--------+-------+       +------+-------+
 | jane   | 101   |       | kim  | 104   |
 | kim    | 104   |       | pat  | 108   |
 | sam    | 105   |       | mae  | 109   |       
 | pat    | 108   |       | zay  | 110   | 
 | tom    | 112   |       | jade | 111   |
 | sherri | 117   |       | tom  | 112   |
 *--------+-------*/      *------+-------*/      

Input fields

Output data:

/*------+-------*
 | name | id    |
 +------+-------+
 | kim  | 104   |
 | pat  | 108   |
 | tom  | 112   |
 *------+-------*/     

Only rows that were in both tables now remain.

Analyzing large tabular data

This guide demonstrates using a Redivis workflow to gather key variables from different tables, clean them, and consolidate them into a single table for analysis in a notebook.

Workflow objective

We want to take weather data collected around the world and use it to understand how precipitation trends in the US have changed over time.

This workflow is on Redivis! We also suggest you recreate this workflow as we go to best learn the process.

1. Explore data

All the weather data we need is contained in the Demo organization dataset GHCN Daily Weather Data.

To get started we want to understand this dataset and what information is in each table. We can look at the dataset page to learn more about it, including its overview information, metadata, and variable summary statistics. Since this dataset is public we can also look directly at the data to confirm it has the information we need. Some tables jump out as ones we will want to work with:

Daily observations

This table has nearly 3 billion records and seems to be the table in the dataset with the most information so we will start here. There is a variable that looks like it will be very helpful for our goal: element

If we click on the variable name, we can see a searchable frequency table with all this variable's values. One value that jumps out is PRCP which we can see on the value label represents the precipitation. Paired with this is the variable value which contains a numeric value recording the magnitude of the element. We can learn even more about this by clicking on the data dictionary in the dataset's overview page to see that this value is measured in either mm or inches.

We can see that this table doesn't contain any information about location though, so we'll need to find that somewhere else.

Stations

This table contains latitude and longitude variables as well as a linking variable id which will allow us to link in precipitation information from the Daily observations table, since it has the same id variable.

Since there are two tables we want information from we know that we will need to do a join as part of our data reshaping.

2. Create a workflow

At the top of this dataset page we can click the Analyze in workflow button to get started working with this data.

You can add this dataset to an existing workflow you already have access to, or create a new workflow to start from scratch.

3. Clean and reshape data

We will use transforms to clean the data, as they are best suited for reshaping data and will quickly output new tables we can continue to work with. Even though we might be more comfortable with Python or R, the table we want to work with is 132GB and we would not be able to work with it in a notebook without specialized equipment.

Filter for precipitation

If we click on the element variable we can see that there are many types of elements recorded in this table. Since we only want information about precipitation (PRCP), we will start by filtering out any records with a different value in the element variable.

Start by creating a new transform on the Daily observations table and add a Filter step. Configure the filter to select records where element = PRCP.

Compute annual precipitation

Looking at the data, it's clear there are many observations of precipitation at each location over time. We know we want to look at annual trends so we can start by aggregating this data to only have one value per year per location.

Conceptually, we will need to aggregate on the year of each record, but if we look at the date variable in this table it also contains the month and day. For aggregation we will need a field which contains only the year which we can collapse on.

In this same transform we will add a Create variables step and make a new variable named year using the date extract method, which will pull the year out of the date value.

Then we will add an Aggregate step to aggregate the values based on our new year variable. The first thing we will do is select the variables to aggregate on, that is, the variables whose duplicate value combinations we want grouped together. Since we want information on precipitation per station per year, we should choose to aggregate on the id variable (station) and the year variable. When executed, this will group all records with the same combination of values in id and year, drop all other variables from the table, and then collapse duplicate records down to just one record per unique combination of those two variables.

But the most important step here is that we want to gather information about the duplicate records that were dropped! If there were 400 records of precipitation in a year for a particular station, we want to know what those records all add up to. To do this we will Create a new aggregate variable within this aggregation step named annual_precip. We want to aggregate the information in the value column, since that contains the numeric amount of the corresponding element variable. Since we want the total amount across all dropped records we will use the Sum method.
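
For intuition, the filter, year extraction, and aggregation described above are roughly equivalent to the following pandas operations (a sketch only; the transform performs this at database scale, and df stands in for the Daily observations table):

import pandas as pd

# df is assumed to contain the Daily observations table (id, date, element, value, ...)
prcp = df[df["element"] == "PRCP"].copy()

# Create a year variable from the date, then sum precipitation per station per year
prcp["year"] = pd.to_datetime(prcp["date"]).dt.year
annual = (
    prcp.groupby(["id", "year"], as_index=False)["value"]
        .sum()
        .rename(columns={"value": "annual_precip"})
)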

Select variables and run

The final step in a transform is selecting which variables we would like to populate the resulting output table. Since we did an aggregation step, this list becomes limited to only the variables that we aggregated on. Since we want both of these variables in our output we make sure both are selected and visible in the right side list of the footer.

With everything in place we will run this transform to create a new table, by pressing the Run button in the top right corner!

4. Sanity check the output table

Now that we have created a new table, we can inspect it to make sure our steps accomplished what we expected them to.

Table characteristics

Click on the output table below the transform to view it. We can see that it contains the two variables we expected it to, based on our variable selection. The table has almost 3 million records, down from the almost 3 billion in our original table, which makes sense given our aggregation step.

Variable characteristics

We can also inspect each variable further. If we click on the year variable we can see that all values are four digit years, and the min and max values in the summary statistics show that the years range from 1781 - 2021, which makes sense. Looking at the frequency table we can see that each year is represented many times in the records, which is what we expect since there are multiple ids (locations) per year.

If we click on the annual precipitation variable we can see a max value of 792,402 (mm or in) which seems very high but possible for a given year. The minimum value is -148,917 which doesn't seem right. When we look at the data dictionary it doesn't indicate that there should be negative values.

We can investigate our data further by clicking on the Query tab of the table, choosing "Sort on variable" from the templates menu, and sorting on our annual_precip variable ascending (ASC). It looks like there are relatively few stations with negative values.

If we were doing a rigorous analysis of this data we might dig deeper into why these values exist in this dataset, or decide to exclude these outliers. However, since this is an example, we can decide to leave them in and see how they affect our outputs. We can easily go back and exclude them from the workflow at a later point.

5. Join geographic information

Now that we have a table we are satisfied with that contains annual precipitation information, we want to join in the latitude and longitude information about each station so we can make a geographical visual as part of our analysis.

This geographic information is included in the Stations table, so we will need to do a join which connects records from two tables.

While we could do this transformation step in our previous transform and rerun it, let's create a new transform to make this join so that we will have our previous output table saved to compare the output to.

In the new transform, add a Join step. Set the joined table to the Stations table from our original dataset. The join type will depend on which records we want to keep based on how the matching process goes. Since we only want to keep records that have both information from our source table (annual precipitation) AND information from our joined table (station latitude and longitude), we set the join type to Inner join.

Since our identifier in both tables is the variable id, set id from the source table (t0) as the left side of the join condition and id from the joined table (t1) as the right side of the join condition. This means that for every record in our source table, we look for a corresponding value in the Stations table and join in the values from additional columns in that table.

When you set two variables equal to each other in a Join, they must be of the same type (string, integer, etc). In this example both id variables are string type so it works, but if one had been an integer we would have needed to retype one of them first to match.

There are a number of additional variables in the Stations table, but we are only interested in the latitude and longitude variables. In our variable selector we keep all variables from our source table (year, id, annual_precip) as well as latitude and longitude from the joined table.
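
Conceptually, this inner join behaves like the following pandas merge (illustrative only; annual and stations stand in for our output table and the Stations table):

# Keep only stations that appear in both tables, carrying over their coordinates
joined = annual.merge(
    stations[["id", "latitude", "longitude"]],
    on="id",
    how="inner",
)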

With everything set up, run this transform and sanity check the output table! We started with a 132GB table and have narrowed it down to 134MB containing only the data we want in the format we will need.

6. Analyze outputs in a notebook

Next we want to generate an interactive visual to understand annual precipitation in different locations over time. To do this we will create a Notebook node on our output table.

For this workflow we will use Python but you can also use R, Stata, or SAS if you'd like. When you create the notebook for the first time it will start up. Notebooks must be running to execute code.

Redivis notebooks come with many common packages preinstalled and we will use those in this example. If you'd like you can install additional packages by clicking the Dependencies button.

Since this notebook contains only public data we can install packages at any time, but for restricted data notebooks do not have internet access and packages can only be installed when they are stopped.

Reference data

Newly created notebooks come with standard code to import the Redivis library and reference the source table in a pandas dataframe within the notebook.

You can use this pandas code or replace it to use the dataframe of your choice. To use the standard code, click inside this cell and press the run button, or press Shift + Enter.

Select data from the United States

Now we will create a new cell to organize our work by pressing the + button in the top bar.

In this cell we want to limit our records to only ones whose latitude and longitude values fall in a specific range of the continental United States.
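
us_df = df[
    (df.latitude >= 24.396308)
    & (df.latitude <= 49.384358)
    & (df.longitude >= -124.848974)
    & (df.longitude <= -66.885444)
]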

Compute average change

We want to see how our annual precipitation variable has changed at stations in the United States, so for each station we will group by the id variable and compute the average annual_precip for two multi-decade periods, then calculate the percent change between them.
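
# Compute avg precip for each station between 1980-2010
df_1980_2010_average = us_df[(us_df.year > 1980) & (us_df.year < 2010)].drop(columns=['year'])
df_1980_2010_average = df_1980_2010_average.groupby(['id']).mean()
df_1980_2010_average.rename(columns={"annual_precip": "_1980_2010_avg"}, inplace=True)

# Compute avg precip for each station between 1990-2020
df_1990_2020_average = us_df[(us_df.year > 1990) & (us_df.year < 2020)].drop(columns=['year'])
df_1990_2020_average = df_1990_2020_average.groupby(['id']).mean()
df_1990_2020_average.rename(columns={"annual_precip": "_1990_2020_avg"}, inplace=True)

# Compute the percent change between the two periods and drop extreme outliers
diff_df = df_1980_2010_average.join(df_1990_2020_average["_1990_2020_avg"])
diff_df['deviation'] = (diff_df._1990_2020_avg - diff_df._1980_2010_avg) / diff_df._1980_2010_avg * 100
diff_df = diff_df[(diff_df.deviation < 25) & (diff_df.deviation > -25)]

diff_df = diff_df.dropna()
diff_df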

Plot the results

Now that we have the information calculated, we can plot it! We'll use Plotly to do the heavy lifting, and point it at our relevant variables.
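
import plotly.figure_factory as ff

import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
px.set_mapbox_access_token("pk.eyJ1IjoiaW1hdGhld3MiLCJhIjoiY2thdnl2cGVsMGtldTJ6cGl3c2tvM2NweSJ9.TXtG4gARAf4bUbnPVxk6uA")

# Hex-binned map of the percent change in precipitation between the two periods
fig = ff.create_hexbin_mapbox(
    data_frame=diff_df, lat="latitude", lon="longitude",
    color="deviation",
    agg_func=np.mean,
    title="% Change precipitation, 1981-2010 vs 1991-2020",
    range_color=[-15,15],
    nx_hexagon=50, opacity=0.4, labels={"color": "Percent change"}, color_continuous_scale="Icefire_r",
)

fig.update_layout(margin=dict(b=0, t=0, l=0, r=0))
fig.show("nteract")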

Now we've created an interactive figure showing the change in precipitation by location in the United States.

Next steps

Perhaps we see something in this plot we want to investigate further, or we want to go back and change some of our assumptions. Workflows are iterative, and at any point you can go back and change your source data, your transform configurations, or your notebooks and rerun them.

Notebooks can also create output tables, which allow you to sanity check the work done in the notebook or create a table to use in another notebook or transform. You can also fork this workflow to work on a similar analysis, or export any table in this workflow for work elsewhere.

Create and populate a dataset

Overview

Datasets are a core component of Redivis. Consisting of documentation, metadata, and tables, datasets allow you to store, version, and distribute a wide variety of data.

Anyone with a Redivis account can create a dataset in their workspace, and organization administrators can upload one to an organization via the administrator panel.

1. Create the dataset

You can create a new dataset by navigating to the Datasets tab of your workspace or your organization's administrator panel and clicking the New dataset button.

All datasets must have a name (unique to datasets for the user / organization).

You can set up your dataset in whatever order you'd like, but we recommend following the order below when getting started.

2. Upload data

This data might be in a tabular format (.csv, .tsv, .sas, etc.) or, more rarely, unstructured data such as images and text files.

Tabular data

All tabular data is associated with a table, and each dataset can have one or more tables. While you may release a dataset without any tables, this will be of limited use to other researchers, as Redivis provides numerous tools for understanding, querying, and generally working with tabular data.

If you haven't already worked with data in a workflow, we strongly recommend exploring that before creating a dataset so you can understand how researchers will work with your data.

When you're ready to upload data, we have broken that out into a separate guide.

Learn more in the Upload tabular data as tables guide.

Unstructured data

For unstructured data, go to the Files tab, where you can upload files from your computer or from another location where your data is stored, via an integration. You can put these into folders and create index tables to better keep track of them.

Note that any files uploaded here can't be transformed in the workflow tool or queried across Redivis (which require the table format).

Make sure any files you upload here contain this dataset's data. Any files with information about the data (such as data dictionaries or usage guides) should be uploaded as documentation on the Overview tab.

Learn more in the Upload unstructured data as files guide.

3. Edit metadata

It's easy to feel "done" after uploading your data, but documentation and metadata are essential to the usability of your dataset. Moreover, rich metadata will improve the discoverability of your dataset by providing more information and terms to the Redivis search engine.

Metadata can always be updated after your dataset has been released. While good metadata is essential, producing it can be a time consuming and iterative process, so you might prefer to provide some basic content initially and then improve it over time.

Dataset metadata

On the overview tab of the dataset, you can provide an abstract, detailed documentation blocks, supporting files and links, and subject tags for the dataset.

The abstract should be a brief overview of the dataset, while the rest of the documentation can be as thorough as you'd like. Each documentation block has a header for suggested content, and any you don't fill out won't be shown on the dataset page. These blocks contain a rich text editor, complete with support for embedded images. Most of this information will be visible to anyone with overview access, though you can also create custom documentation sections that require a higher level of access.

Make sure to audit your dataset's provenance information to give attribution to everyone who worked on the data. If this dataset is part of an organization, you can configure a DataCite account to issue a DOI for each dataset. Note that if your organization is configured to issue DOIs, one will automatically be issued for this dataset when you first publish it.

Table metadata

To help users understand what each table represents, you should update the description, entity, and temporal range for each table in the dataset. The entity should define what each row in a table represents: is it a person? an event? a charge? The temporal range can be tied to a specific variable (using the min/max of that variable) or defined explicitly.

Variable metadata

The tables in your dataset are made up of named variables, though rarely is this name enough to understand what the variable measures. On any table, click "Edit variable metadata" in order to populate the variable metadata.

On each variable, Redivis supports a label, a description, and value labels. The label is the most essential item; think of it as a more human-readable variable name. The description should contain more detailed information, everything from caveats and notes to collection methodology. Value labels are only applicable when the variable is encoded with keys (often integers or short strings) that map to the actual value. For example, a survey response might be encoded as 0: "No", 1: "Yes", 2: "Don't know", 3: "Declined to answer".

Editing variable metadata can be a tedious process, but Redivis does support the ability to import metadata from a file, and it will also automatically extract metadata when it's present in the uploaded data files (e.g., Stata or SAS upload types).

Learn more in the Documentation reference section.

4. Create a sample

If your dataset is particularly large, or if you want to control access to a sample of the data separately from the whole dataset, you should configure sampling on your dataset. This will allow researchers to work with a 1% sample of the data during initial exploration, and allow you to grant access to the sample independently of the full dataset.

To update the dataset's sample configuration, click on any table, and then click "Configure sample". When configuring the sample, you can generate a random sample for each table, or sample on a particular variable that is common across tables. If researchers will be joining tables across your dataset, it is highly recommended that you sample on that common join variable so that researchers can obtain a consistent 1% sample as they work with your data.

Learn more in the Dataset sampling reference section.

5. Configure access

Before releasing your dataset, it is important to define who can access the dataset and what the procedures are for applying and gaining access. Click the Configure access button on the top of the page to set up the access configuration.

Datasets owned by organizations have more options for access than datasets owned by users.

Access levels

Dataset access has five levels:

  1. Overview: the ability to see a dataset and its documentation.

  2. Metadata: the ability to view variable names and summary statistics.

  3. Sample: the ability to view and query a dataset's 1% sample. This will only exist for datasets that have a sample configured.

  4. Data: the ability to view and query a dataset's tables, and work with them in workflows.

  5. Edit: the ability to edit the dataset and release new versions.

Access levels are cumulative. For example, in order to gain data access you will need to have gained metadata access as well.

Usage rules

Even with data access, you may want to limit what other users can do with your dataset. Currently, you can configure export restrictions that limit:

  • The download location (e.g., to prevent researchers from downloading to their personal computer)

  • The download size, in bytes and/or rows

  • Whether admin approval is required before any export

Editors

You may also add additional dataset editors to help upload data and provide metadata content. These editors will be able to create and release new versions, and will have full access to the underlying data, though they cannot add other users, modify the access configuration, or bypass the dataset usage rules.

If the dataset is hosted by an organization, all administrators of the organization will be able to edit the dataset as well as its access configuration.

Permission groups

If the dataset is hosted by an organization, you will have additional options for configuring access to the dataset. The dataset can be assigned to a permission group to help standardize access procedures, and this permission group can contain requirements that help data managers fulfill contractual obligations and gather relevant information about the research being done on the dataset.

Learn more in the Configure access systems guide.

6. Release the dataset

Congratulations! Your dataset is ready to be released and utilized by the research community. But first, it is highly recommended that you validate and audit your dataset beforehand. Take a look at the number of rows, variables, and uploads in each table. Validate some of the variable summary statistics against what you expect. And to be truly thorough, add the dataset to a workflow and run some queries as if you were a researcher. Catching a mistake now will prevent headaches down the line if researchers uncover unexpected discrepancies in the data.

Once a version has been released, the data can no longer be edited. While you can unrelease a version within 7 days, this should generally be avoided; you'll need to release a new version to modify the data.

When you're confident that you're ready to go, click the "Release" button on the top of the page. If the button is disabled, hover over it to understand what issues are currently preventing you from releasing.

After clicking the button, you'll be presented with a final checklist of tasks. When you click the Release version button, the dataset will be immediately released and available to all users with access.

This dataset is now also considered Published. If you need to pause all activity and access to this dataset, you can return to this page in the future and Unpublish it temporarily.

7. Make updates as new versions

Once a dataset is released, you can return to it to make changes at any time. Changes to datasets are tracked in Redivis as versions. Anyone with access to a dataset can view and work with any of its versions.

How to work with versions when updating a dataset:

  • Any edits to the data content in tables will need to be released as a new version.

  • Edits to the dataset information, table information, or variable metadata can be made on the current version (or historic versions) and will be live as soon as they're saved.

  • Edits to the dataset name and access configuration will always affect all versions.

Creating the next version

All data within a dataset is encapsulated in discrete, immutable versions. Every part of the dataset except for the name and access settings are versioned. All tables in a dataset are versioned together.

After releasing the first version of the dataset, you can choose to create a new version at any time by clicking the button in the top right "Create next version". This version will be created as vNext, and you may toggle between this and historic versions at any time.

Subsequent versions always build on the previous version of the dataset, and changes made in the next version will have no effect on previous versions. Alongside modifications to the dataset's metadata, you may create, update, or delete any of the previous version's tables.

Replacing vs appending data

When uploading data to an existing table, you can choose whether to append the new uploads to your existing data or replace the entire table with the new data.

Version storage costs

Redivis computes row-level diffs for each version, efficiently storing the complete version history in one master table. This allows you to regularly release new versions and maintain a robust version history without ballooning storage costs.

Learn more in the Usage and limits for users and Billing for organizations reference sections.

Next steps

Start working with your data

Once your dataset is released, bring it into a workflow to transform and analyze it leveraging lightning fast tools from your browser.

Learn more in the Analyze data in a workflow guide.

Navigation

First value

Returns the value of a variable for the first row in a given analytic window.

Return type

dynamic (input-dependent)

Parameters

FIRST_VALUE(@variable[ @ignore_null NULLS])

Name          Type      Allowed values           Required  Placeholder (in UI)
@variable     variable  any Redivis type         true      -
@ignore_null  enum      any of: IGNORE, RESPECT  false     (Ignore nulls (default))

Lag

Returns the value of a variable on a preceding row within the analytic window.

Return type

dynamic (input-dependent)

Parameters

LAG(@variable, @literal[, @default_value])

Name            Type      Allowed values    Required  Placeholder (in UI)
@variable       variable  any Redivis type  true      -
@literal        literal   any integer       false     (1)
@default_value  literal   any Redivis type  false     (NULL)

Last value

Returns the value of a variable for the last row in a given analytic window.

Return type

dynamic (input-dependent)

Parameters

LAST_VALUE(@variable[ @ignore_null NULLS])

Name          Type      Allowed values           Required  Placeholder (in UI)
@variable     variable  any Redivis type         true      -
@ignore_null  enum      any of: IGNORE, RESPECT  false     (Ignore nulls (default))

Lead

Returns the value of a variable on a subsequent row within the analytic window.

Return type

dynamic (input-dependent)

Parameters

LEAD(@variable, @literal[, @default_value])

Name            Type      Allowed values    Required  Placeholder (in UI)
@variable       variable  any Redivis type  true      -
@literal        literal   any integer       false     (1)
@default_value  literal   any Redivis type  false     (NULL)

Nth value

Returns the value at the Nth row of a given window frame.

Return type

dynamic (input-dependent)

Parameters

NTH_VALUE(@variable, @literal[ @ignore_null NULLS])

Name          Type      Allowed values           Required  Placeholder (in UI)
@variable     variable  any Redivis type         true      -
@literal      literal   any integer              true      -
@ignore_null  enum      any of: IGNORE, RESPECT  false     (Ignore nulls (default))

Percentile (continuous)

Computes the specified percentile value for a variable within an ordered partition, with linear interpolation.

Return type

float

Parameters

PERCENTILE_CONT(@variable, @literal[ @ignore_null NULLS])

Name          Type      Allowed values           Required  Placeholder (in UI)
@variable     variable  any integer, float       true      (variable)
@literal      literal   any float                true      (Value between [0, 1])
@ignore_null  enum      any of: IGNORE, RESPECT  false     (Ignore nulls (default))

Percentile (discrete)

Computes the specified percentile value for a variable within an ordered partition. Returns the first sorted value with cumulative distribution greater than or equal to the percentile.

Return type

float

Parameters

PERCENTILE_DISC(@variable, @literal[ @ignore_null NULLS])

Name          Type      Allowed values           Required  Placeholder (in UI)
@variable     variable  any integer, float       true      (variable)
@literal      literal   any float                true      (Value between [0, 1])
@ignore_null  enum      any of: IGNORE, RESPECT  false     (Ignore nulls (default))

Files

Overview

Files are data entities uploaded to datasets on Redivis, used to store non-tabular (a.k.a. unstructured) data of any file type. You can view a dataset's files by clicking on the Files tab of any dataset, or on a dataset node in a workflow.

Inspecting files

You can inspect an individual file by clicking on its name to launch the file viewer. Any file can be downloaded or referenced within a notebook, and many file types can be previewed directly within Redivis, including:

  • 3D models

  • Audio files

  • CIF + PDB files (molecular + protein structures)

  • FITS files (common in astronomy)

  • DICOM

  • HDF5

  • HTML

  • Images

  • PDFs

  • Videos

  • Text/code

  • TIFFs

  • ZIPs

  • TEI

If you have a file type that you think should be supported that isn't, please let us know!

You can view example files on the Redivis Demo organization.

Folders and index tables

All files are assigned to a "folder" within the dataset. You can click on the folder name in the right grey bar to filter this dataset's files by the folder they are in.

Each folder has a corresponding index table that is present on the Tables tab of the dataset. These will match the folder name and have a File index label.

In these index tables, each row of the table represents a file in that folder.

Variable name
Description

file_id
A unique, system generated identifier for the file. Use the file_id to reference and download specific file(s) in your analysis.

file_name
The name of the file

size
Size of the file in bytes

added_at
Timestamp for when the file was added to this folder

md5_hash
MD5 checksum of the file contents, encoded as a base64 string

Working with files

By representing files within an index table, we can query and subset the files within a workflow, while joining them with other tabular metadata. In this way, file index tables behave like any other tabular data on Redivis.

To do deeper analysis, we can load these files in a notebook. Consult the documentation for your preferred programming language to learn more:

  • Working with unstructured data files in Python

  • Working with unstructured data files in R
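
For example, here is a minimal Python sketch of reading a file index table and downloading its files with the redivis-python client (the organization, dataset, and table names are placeholders, and you should confirm the exact listing and download helpers against the client documentation):

import redivis

# Reference a file index table (placeholder identifiers)
table = redivis.organization("demo").dataset("example_dataset").table("images_file_index")

# The index table can be queried like any other table
index_df = table.to_pandas_dataframe()
print(index_df[["file_id", "file_name", "size"]].head())

# Download the underlying files for deeper analysis (assumed helper methods)
for f in table.list_files(max_results=10):
    f.download("./data/")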


Common elements

Parameters

When creating a new variable, you'll use a variety of inputs to specify the parameters required to fully define a given method. Each parameter is one of the following types:

variable

Refers to a Redivis variable, often limited to a specific subset of types. When using variable parameters in a given transform, you'll be able to select variables from the source table, variables in tables referenced by joins, or new variables created upstream of the parameter.

literal

Refers to a string, integer, boolean, or floating point value. When using literal parameters, you'll be able to type any constant value (e.g., false, 1.0, 'test'), or use a value list to reference the same set of values anywhere in your workflow. If using a literal parameter alongside another parameter, e.g., in a filter comparison, you may have to match the type of the literal with that parameter.

enum

Refers to a set of distinct values, usually of a homogenous type. For example, if a required parameter used a "timezone" enum with 3 options (PST, MST, or EST), you would have to select one of the three options (e.g., PST) as the value of that parameter.

boolean

Refers to values true or false.

Null

In many places throughout the transform you will have an option in the system menu to set a cell value(s) to NULL. Conceptually this means that the cell is empty and contains no value. If you look at the Cells view of a table, you will see NULL values shown in a grey color to indicate that these cells are empty.

Note that this is different from a cell containing a string with the characters NULL. In that case the cell does have contents and will be treated as such.

You will see throughout the transform references to how nulls are included or excluded for steps (such as the Order step) and in calculating summary statistics.

Format elements for dates and time methods

Some new variable methods (PARSE... and FORMAT...) allow for manipulation of Date, Time, or DateTime variable types. To do so, you'll need to define how the data is formatted using format elements.

For example, to work with dates in a mm/dd/yy format (common in the U.S.; e.g. 03/22/89), we would specify the format string %m/%d/%y. For a DateTime displayed as Mon Oct 17 2016 17:32:56, we would specify %a %b %d %Y %H:%M:%S.
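
For intuition, many of these elements behave like the strftime/strptime-style codes found in other tools; here is a quick illustration using Python's standard library (Python is purely illustrative here; in a transform you simply supply the format string to the PARSE.../FORMAT... method):

from datetime import datetime

# Parse a U.S.-style date using the %m/%d/%y format string
d = datetime.strptime("03/22/89", "%m/%d/%y")                              # 1989-03-22
dt = datetime.strptime("Mon Oct 17 2016 17:32:56", "%a %b %d %Y %H:%M:%S")

# Format back out with a different set of elements
print(d.strftime("%Y-%m-%d"))   # "1989-03-22"
print(dt.strftime("%H:%M"))     # "17:32"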

Note that different format elements are used for the Retype step (CAST method) which are detailed elsewhere.

Below is a complete list of format elements and descriptions:

Format element

Description

%A

The full weekday name.

%a

The abbreviated weekday name.

%B

The full month name.

%b or %h

The abbreviated month name.

%C

The century (a year divided by 100 and truncated to an integer) as a decimal number (00-99).

%c

The date and time representation.

%D

The date in the format %m/%d/%y.

%d

The day of the month as a decimal number (01-31).

%e

The day of month as a decimal number (1-31); single digits are preceded by a space.

%F

The date in the format %Y-%m-%d.

%G

The ISO 8601 year with century as a decimal number. Each ISO year begins on the Monday before the first Thursday of the Gregorian calendar year. Note that %G and %Y may produce different results near Gregorian year boundaries, where the Gregorian year and ISO year can diverge.

%g

The ISO 8601 year without century as a decimal number (00-99). Each ISO year begins on the Monday before the first Thursday of the Gregorian calendar year. Note that %g and %y may produce different results near Gregorian year boundaries, where the Gregorian year and ISO year can diverge.

%H

The hour (24-hour clock) as a decimal number (00-23).

%I

The hour (12-hour clock) as a decimal number (01-12).

%j

The day of the year as a decimal number (001-366).

%k

The hour (24-hour clock) as a decimal number (0-23); single digits are preceded by a space.

%l

The hour (12-hour clock) as a decimal number (1-12); single digits are preceded by a space.

%M

The minute as a decimal number (00-59).

%m

The month as a decimal number (01-12).

%n

A newline character.

%P

Either am or pm.

%p

Either AM or PM.

%R

The time in the format %H:%M.

%r

The 12-hour clock time using AM/PM notation.

%S

The second as a decimal number (00-60).

%s

The number of seconds since 1970-01-01 00:00:00. Always overrides all other format elements, independent of where %s appears in the string. If multiple %s elements appear, then the last one takes precedence.

%T

The time in the format %H:%M:%S.

%t

A tab character.

%U

The week number of the year (Sunday as the first day of the week) as a decimal number (00-53).

%u

The weekday (Monday as the first day of the week) as a decimal number (1-7).

%V

The week number of the year (Monday as the first day of the week) as a decimal number (01-53). If the week containing January 1 has four or more days in the new year, then it is week 1; otherwise it is week 53 of the previous year, and the next week is week 1.

%W

The week number of the year (Monday as the first day of the week) as a decimal number (00-53).

%w

The weekday (Sunday as the first day of the week) as a decimal number (0-6).

%X

The time representation in HH:MM:SS format.

%x

The date representation in MM/DD/YY format.

%Y

The year with century as a decimal number.

%y

The year without century as a decimal number (00-99), with an optional leading zero. Can be mixed with %C. If %C is not specified, years 00-68 are 2000s, while years 69-99 are 1900s.

%%

A single % character.

%E#S

Seconds with # digits of fractional precision.

%E*S

Seconds with full fractional precision (a literal '*').

%E4Y

Four-character years (0001 ... 9999). Note that %Y produces as many characters as it takes to fully render the year.

Accessor elements for JSON methods

Some new variable methods (JSON extract..., JSON scalar, etc) allow for accessing subsets of the data contained within JSON-formatted strings. To do so, you'll need to define how the data is accessed using a JSONPath element.

For example, to access the firstName attribute within a JSON-formatted string { "firstName": "John", "lastName": "Doe" }, you'd specify a JSONPath element $.firstName.

More information is available in BigQuery's JSONPath format documentation.
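For example, the accessor above picks out the same value as a plain dictionary lookup after parsing the string (a quick Python illustration, not the transform's implementation):

import json

payload = '{ "firstName": "John", "lastName": "Doe" }'

# The JSONPath element $.firstName selects the top-level firstName attribute
value = json.loads(payload)["firstName"]   # "John"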

us_df = df[
    (df.latitude >= 24.396308)
    & (df.latitude <= 49.384358)
    & (df.longitude >= -124.848974)
    & (df.longitude <= -66.885444)
# Compute avg precip for each station between 1980-2010
df_1980_2010_average = us_df[(us_df.year > 1980) & (us_df.year < 2010)].drop(columns=['year'])
df_1980_2010_average = df_1980_2010_average.groupby(['id']).mean()
df_1980_2010_average.rename(columns={"annual_precip": "_1980_2010_avg"}, inplace=True)

# Compute avg precip for each station between 1990-2020
df_1990_2020_average = us_df[(us_df.year > 1990) & (us_df.year < 2020)].drop(columns=['year'])
df_1990_2020_average = df_1990_2020_average.groupby(['id']).mean()
df_1990_2020_average.rename(columns={"annual_precip": "_1990_2020_avg"}, inplace=True)

diff_df = df_1980_2010_average.join(df_1990_2020_average["_1990_2020_avg"])
diff_df['deviation'] = (diff_df._1990_2020_avg - diff_df._1980_2010_avg) / diff_df._1980_2010_avg * 100
diff_df = diff_df[(diff_df.deviation < 25) & (diff_df.deviation > -25)]

diff_df = diff_df.dropna()
diff_df
import plotly.figure_factory as ff

import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
px.set_mapbox_access_token("pk.eyJ1IjoiaW1hdGhld3MiLCJhIjoiY2thdnl2cGVsMGtldTJ6cGl3c2tvM2NweSJ9.TXtG4gARAf4bUbnPVxk6uA")

fig = ff.create_hexbin_mapbox(
    data_frame=diff_df, lat="latitude", lon="longitude",
    color="deviation",
    agg_func=np.mean,
    title="% Change precipitation, 1981-2010 vs 1991-2020",
    range_color=[-15,15],
    nx_hexagon=50, opacity=0.4, labels={"color": "Percent change"}, color_continuous_scale="Icefire_r",
)

fig.update_layout(margin=dict(b=0, t=0, l=0, r=0))
fig.show("nteract")
This workflow is available on Redivis: it brings in the GHCN Daily Weather Data dataset, processes it with Filter, Create variables, Aggregate, and Join steps, and loads the result into a notebook as a dataframe for plotting with Plotly. From the workflow page you can rerun it, create output tables, fork it, or export results.

Step: Pivot

Overview

The Pivot step rotates data values into new variables using aggregation.

Example starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 *---------+-------+---------+------------*/

Example output data:

Pivot on the test variable.

/*---------+------+---------+-------*
 | student | quiz | midterm | final |
 +---------+------+---------+-------+
 | jane    | 83   | 74      | 77    |
 | pat     | 35   | 62      | 59    |
 | sam     | 89   | 93      | NULL  |
 *---------+------+---------+-------*/

Step structure

  • There is one pivot block where we will define the pivot variable and values.

  • There are one or more aggregation blocks where we will define how the table will reshape and what data populates the newly created variables.

    • There must always be at least one aggregate block.

    • The total number of new variables in your output table will be the number of Pivot values defined multiplied by the number of aggregation blocks.

Input field definitions

Pivot block:

Field
Definition

Pivot variable

The variable containing the values that will become new variables.

Variables to collapse on

All variables you want to include in your output, except for variables defined elsewhere in the pivot operation (pivot variable, variable to aggregate on).

Pivot value

A value from your pivot variable which will become a new variable in your output table. You can add multiple values. You do not need to include every value of this variable.

New variable name

An option to rename any of the new variables created in the Pivot value field. Leaving this field blank will use the pivot value as the new variable's name.

Aggregation block:

Field
Definition

Alias

A value that will be prefixed to your new variable names. If you are aggregating one variable this field is optional, but if you are aggregating multiple then you'll need to specify a value for each one.

Aggregation method

How the variable to aggregate will be summarized (e.g. SUM, COUNT).

Variable to aggregate

Which variable's data will populate the new variables you are creating in the pivot value field.

Examples

Example 1: Basic pivot

Let's say we've recorded data about tests given and student scores and we want to pivot in order to see information sorted by student.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | quiz    | 74    | jane    | 2020-04-15 |
 | quiz    | 83    | sam     | 2020-04-15 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 91    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Pivot variable: Since we want to create new variables based on values in the student variable, we select that as our Pivot variable.

  • Variables to collapse on: We want to include all variables here that will form the final shape of our pivoted table and that are not already defined as our pivot variable or variable to aggregate. We include test and date in this example.

  • Pivot value: We need to choose one or more values from our selected pivot variable (student) to become new variables in our output table. This does not have to include every value in this variable, even though we use all of them in this example.

  • New variable name: We don't want to rename any of these new variables, so we can leave this blank.

  • Alias (optional): Since we only have one variable to aggregate, we can leave this blank.

  • Aggregation method: We need to choose how our variable to aggregate (score) will be aggregated. In this example data we don't have duplicated values so what we choose here won't matter. So we choose SUM to validate the process.

  • Variable to aggregate: This is the variable (score) that will be removed from this table, and its values will be redistributed to our newly created variables (jane, pat, and sam) with the aggregation method we selected handling any duplicate values (in this example, SUMming them).

Output data:

/*---------+------------+------+------+------*
 | test    | date       | jane | pat  | sam  |
 +---------+------------+------+------+------+
 | quiz    | 2020-04-01 | 83   | 35   | 89   |
 | quiz    | 2020-04-15 | 74   | NULL | 83   |
 | midterm | 2020-05-01 | 74   | 62   | 93   |
 | final   | 2020-06-01 | 77   | NULL | 91   |
 *---------+------------+------+------+------*/
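
For intuition, here is a hedged pandas sketch that reproduces this output. Redivis executes the Pivot step as a SQL query, not pandas; the sketch only illustrates how the input fields map onto the reshaping:

import pandas as pd

scores = pd.DataFrame({
    "test":    ["quiz"] * 5 + ["midterm"] * 3 + ["final"] * 2,
    "score":   [83, 35, 89, 74, 83, 74, 62, 93, 77, 91],
    "student": ["jane", "pat", "sam", "jane", "sam", "jane", "pat", "sam", "jane", "sam"],
    "date":    ["2020-04-01"] * 3 + ["2020-04-15"] * 2 + ["2020-05-01"] * 3 + ["2020-06-01"] * 2,
})

pivoted = scores.pivot_table(
    index=["test", "date"],   # variables to collapse on
    columns="student",        # pivot variable
    values="score",           # variable to aggregate
    aggfunc="sum",            # aggregation method
).reset_index()
print(pivoted)

Swapping aggfunc to "mean" and dropping date from the index mirrors Example 2 below, where duplicate quiz rows collapse into an average.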

Example 2: Pivot with collapse

Continuing this example, let's say we want to average scores from the two quizzes. We can collapse the table by not including the date variable. Without that variable there would be two identical values in test (quiz) which will be collapsed into one.

Starting table:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | quiz    | 74    | jane    | 2020-04-15 |
 | quiz    | 83    | sam     | 2020-04-15 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 91    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

In this case it does matter what we select for Aggregation method, since the duplicate values for quiz will get aggregated. We select Average here.

Pivoted output table:

/*---------+------+------+------*
 | test    | jane | pat  | sam  |
 +---------+------+------+------+
 | quiz    | 78.5 | 35   | 86   |
 | midterm | 74   | 62   | 93   |
 | final   | 77   | NULL | 91   |
 *---------+------+------+------*/

Example 3: Multiple aggregation variables

We can add another variable aggregation block to gather more information about the variables being aggregated. Continuing this example, let's say we not only want to average the quiz scores, but count the number of averaged quizzes.

Starting table:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | quiz    | 74    | jane    | 2020-04-15 |
 | quiz    | 83    | sam     | 2020-04-15 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 91    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

Note that the Alias field is no longer optional since we have two aggregation blocks.

We make sure to select score for our Variable to count since we specifically want to know how many scores were included in our average. Since we have some null scores, selecting a different variable here might give us a different result.

Pivoted output table:

/*---------+---------------+--------------------+--------------+----------------*
 | test    | jane_scoreavg | jane_scorecount    | pat_scoreavg | pat_scorecount |
 +---------+---------------+--------------------+--------------+----------------+
 | quiz    | 78.5          | 2                  | 35           | 1              |
 | midterm | 74            | 1                  | 62           | 1              |
 | final   | 77            | 1                  | NULL         | 1              |
 *---------+---------------+--------------------+--------------+----------------*/

The jane_scoreavg variable contains data from the scoreavg aggregation block's method of Average for the values of student Jane, while the jane_scorecount variable contains data from the scorecount aggregation block.

Dataset concepts

On Redivis, datasets are the fundamental container for persistent data storage. They are created by users or organizations, and can be distributed via simple sharing or more complex access configurations.

All data in datasets are stored within Tables (including geospatial features and unstructured files). Dataset editors can upload data to the dataset, modify metadata, and then release a new version for use in researchers' workflows.

A sample dataset landing page: https://redivis.com/datasets/7br5-41440fjzk

The dataset page

The dataset page is the user-facing view of the dataset. This page combines various narrative and provenance information alongside structured metadata, the actual data content, and usage information.

At the top of the page is the dataset title, as well as a version indicator and bookmark button. Clicking the version indicator will open the version history and allow you to change the dataset's version; clicking the bookmark icon will add or remove the dataset from your dataset library.

To the top-right are two buttons: View / Apply for access and Analyze in workflow. The first will allow you to view the access rules for the dataset and apply as needed, whereas the latter will allow you to add the dataset to a new or existing workflow for analysis.

The rest of the page is organized into four tabs: Overview, Tables, Files, and Usage. If you are a dataset editor, you will also see an Edit dataset link to the right of the usage tab.

Overview

The Overview tab contains the dataset's documentation, provenance information, and top level metadata. The following information can be populated on a dataset:

Abstract

The abstract is limited to 256 characters and will show up in previews and search results for the dataset. This should be a concise, high-level summary of this dataset.

Provenance

This section is intended to display information about where this dataset came from and how it came to be in its current form. Redivis will auto-populate fields where possible but you can add additional information or override it.

Provenance

→ Creator

This field should be the individual(s) or organization(s) responsible for creating the content of this dataset. This will be linked with the appropriate ORCID iD or ROR if the individual or organization has attached them to their Redivis account. You can also include individuals and organizations that don't have Redivis accounts and include their identifier. If you have multiple creators you can edit the order they are shown, which will also be reflected in the citation.

Provenance

→ Contributor

This field is to attribute the work that different individuals have done to get the dataset into its current state. Redivis will automatically add anyone who edits the dataset to this field. If they have attached their ORCID iD to their Redivis account, that linkage will be shown as well. The contributor type options are aligned with DataCite standards.

Provenance

→ DOI

If your dataset belongs to an organization, you can issue a DOI (Digital Object Identifier) by configuring your organization to issue DOIs. Any DOI issued for this dataset will remain in a draft status for seven days to allow for version unrelease. After seven days the DOI will become permanent.

Provenance

→ Citation

This section shows the automatically generated citation for this dataset in your chosen format. This can be copied or downloaded for use elsewhere.

Changes made to the "Creators" field will be reflected in this citation. Any DOI issued for this dataset will automatically be included in this citation.

Provenance

→ Citation instructions

If this dataset must be cited in a specific way, you can make that clear in this field. Citation instructions will be included beneath the citation wherever it is shown.

Provenance

→ Related identifiers

This section is for linking other identifiers (URLs or DOIs) that are relevant to this dataset. The related identifier type options are aligned with DataCite standards. You might want to include identifiers for DMPs, papers, and more.

Provenance

→ Bibliography

You can launch a bibliography which displays the citation of this dataset and every related identifier listed above.

Methodology and Usage notes

These documentation sections are intended for more extensive information and can contain formatted text and images. These will be indexed for the Redivis dataset search.

Supporting files

Files of any type and up to 100MB can be uploaded to the dataset page where anyone with access can download them. These should not contain any data for this dataset, as access to them is managed separately.

Links

Links can be added with display names to direct someone to another URL with more information.

License

This is where you can add the license information about your dataset's redistribution policies. If this data is governed by a common redistribution license you can select it here from the menu of standard licenses. If you want to reference a license that isn't listed here you can include the link, or upload a custom license. This will be displayed on the dataset front page to let others know how they can use your data. This information will be included on the dataset's DOI.

Do you think a common license is missing? Contact us to let us know what you'd like to see here.

Funding

If this dataset was funded by an institution you'd like to recognize, this is the section where you can include information about funder(s). You'll need the funding organization's name and ROR, as well as an award number if applicable. You can add multiple funders to each dataset. This information will be included on the dataset's DOI.

Contact

This section should be used to let someone viewing this dataset know how to get in touch if there is any issue or question.

Custom sections

You can create documentation sections with their own titles and assign them custom access levels.

By default, all dataset documentation is visible to anyone with overview access to the dataset. However, there may be some content in the documentation that is sensitive, for example information about named variables that would require metadata access.

To protect this information you can create a custom documentation section with a more restrictive access level. Users without the appropriate level of access will only see a placeholder for that section of the documentation.

Tags

In addition to documentation, you may add up to 25 tags to your dataset, which will help researchers discover and understand the dataset.

Other metadata

Additionally, information about the dataset's size and temporal range will be automatically computed from the metadata on its tables. Additional table documentation, as well as the variable metadata, will be indexed and surfaced as part of the dataset discovery process.

Tables

The Tables tab contains a list of all tables within the current version of the dataset. All datasets that contain data will have at least one table, as tables are used to represent all data types stored on Redivis, including geospatial and unstructured data. Learn more about dataset tables.

Files

The Files tab contains a list of all the unstructured files that have been uploaded to the dataset. These files will also be mirrored in corresponding file index tables, though this interface can provide a more familiar view as a directory structure. Learn more about dataset files.

Usage

The information on the Usage tab of the dataset page is automatically generated based on how researchers are working with the dataset on Redivis, including:

  • Views: How many times this dataset page was loaded. These are only counted for logged-in users, and a user is only recounted after 1hr of not viewing the dataset.

  • Workflows: How many workflows include this dataset.

  • Featured workflows: How many workflows featured by organizations include this dataset.

  • Variable usage: This list shows all variables across all tables in the dataset, sorted by their popularity. This popularity is calculated based on the number of times that variable is referenced when this table is used in workflows.

Creating a dataset

All datasets on Redivis are owned by either an individual user or an organization.

To create a dataset owned by you, navigate to your workspace, and under Datasets, click the + New dataset button. To create a dataset owned by an organization that you administer, navigate to the organization's administrator panel, and under Datasets, choose to create a new dataset.

For step-by-step guides for creating and editing datasets, see the accompanying Create & manage datasets guide

Editing datasets

To edit a dataset, navigate to your workspace (or organization administrator panel), and click on the dataset that you would like to edit. You will also see an Edit dataset link at the top of the page for all datasets that you are an editor of.

From within the dataset editor, you will be able to update data, metadata, and release a new version of the dataset.

On the Overview tab, you can populate the various metadata fields of the dataset. Note that any fields that you leave blank will not be shown to users on the dataset's page.

On the Table and Files tabs, you can upload the various data contents associated with the dataset. Learn more about uploading data here.

Assigning editors

If a dataset is owned by a user, that user has full ability to edit and modify the dataset. If a dataset is owned by an organization, all administrators of the organization will have such rights.

You can also add other editors to the dataset by selecting Configure access at the top right of the dataset, and adding the specific individuals as editors. Editors will be able to upload data, modify metadata, and release new versions of the dataset, but they cannot modify any of the dataset's access rules or change the dataset's published status.

Dataset editors will be able to find this dataset in their workspace on the Datasets tab.

Dataset settings

On the dataset editor page, you will see an additional Settings tab, where various options for the dataset can be configured:

Dataset name

You can rename the dataset here (or by clicking on the dataset title within the dataset editor).

Administrator notes

These notes are only visible to editors of the dataset, and may provide helpful documentation for administrative processes around maintaining the dataset.

Published status

Control whether the dataset is published. If a dataset is unpublished, only editors will have access to the dataset, though you can configure access for other users such that they'll gain access when the dataset becomes published.

All datasets are initially unpublished when created, and become published when the first version is released. If you ever need to quickly remove a dataset from circulation, you can unpublish the dataset.

Featured status

[Organization datasets only]. Whether the dataset should be featured on your organization's landing page.

DOI status

[Organization datasets only]. If your organization has configured a DOI provider, you can enable DOI issuance for your organization's datasets. Based on your organization's settings, this will either be default-enabled or default-disabled for new datasets.

If you enable DOI issuance on an existing dataset, DOIs will be back-issued for all non-deleted versions of the dataset. These DOIs will become permanent after 7 days.

If you disable DOI issuance on an existing dataset, any draft DOIs (less than 7 days old, or on an unreleased version) will be deleted. Any future versions will not be issued a DOI.

Learn more about dataset DOIs.

Transferring datasets

Within the dataset settings, you can transfer a dataset to a new owner. The following transfers are currently supported:

  • User -> Organization: the individual performing the transfer must be the owner or an editor of the dataset, and an administrator of the receiving organization.

  • Organization -> Organization: the individual performing the transfer must be an administrator of both organizations.

Access & visibility

When first creating a dataset, you will be able to specify certain access rules for the dataset. However, these access rules will only take effect once the dataset is published, and can be modified before the dataset's publication.

While in the unpublished state, only dataset editors will be able to view the dataset, and the dataset won't be accessible to other users nor will it show up in any search results. Once a dataset has been published, visibility to the dataset will be governed by its access configuration. Datasets can be unpublished at any time to immediately cut off access for all non-editors.

Once published, and pursuant to the dataset's access configuration, all other users on Redivis will have one of the following access levels to the dataset: none, overview, metadata, sample, data. Learn more about dataset access levels.

About dataset DOIs

When enabled, DOIs are issued for all released versions of a dataset. For datasets with more than one version, a "canonical" DOI for the dataset will be issued as well.

DOI lifecycle

When issued, DOIs remain in a draft state for 7 days, and will be deleted if the dataset's DOI configuration is disabled while still a draft. After 7 days, DOIs become permanent (this 7 day counter only begins once a version is released; unreleased versions always have a draft DOI).

If a dataset is not publicly visible, its DOI will be registered, but it won't be findable – meaning that web crawlers won't index the DOI, and it won't be part of search results on platforms like DataCite. In such a case, the dataset's metadata will be redacted as well (see below).

When a dataset or version is deleted, any draft DOIs will be deleted. All other DOIs will be moved to the registered (non-findable) state, and they will continue to resolve to an appropriate "tombstone" page for the dataset.

DOI metadata

DOIs are more than just persistent identifiers — they are accompanied by rich metadata that allows other tools to surface your datasets and link them to the broader body of academic work. Every field on the dataset page maps to corresponding entries in the DataCite metadata; you can view these metadata by clicking Metadata -> DataCite on the right of the "Overview" section of the dataset page. Redivis will automatically sync the DataCite metadata whenever your dataset is updated.

A note on metadata and access:

Redivis will only ever publish metadata to DataCite that is publicly visible. This means that if your dataset is completely hidden from public view, no metadata will be propagated to DataCite (including the dataset's name). Instead, a placeholder mentioning that "This dataset has been hidden from public view" will be used in its place, and only the name of the dataset owner and the dataset's contact information (if present) will be published.

While this situation isn't ideal, it allows DOIs to still be used for disambiguation when discussing the dataset, and provides a potential pathway for future investigators to contact the data owner.

Note that metadata associated with your dataset will automatically be updated if you change the public access level of your dataset.

DOIs and dataset versions

Redivis creates a DOI for every version of a dataset, allowing for authoritative references and reproducibility of data used in researchers' work. Once a dataset has more than one version, Redivis will also issue a DOI for the dataset as a whole, whose metadata will reflect the latest version of the dataset. All version DOIs will point to this canonical DOI, and also link to each other (as previous / next versions) in the DataCite metadata. This structure improves discoverability and disambiguation when referencing versioned datasets.

Discontinue or prevent DOIs

If you've uploaded your DOI issuing credentials in the Settings tab, all new datasets will by default be issued a DOI upon publishing and on every version release. If you would like to disable issuing DOIs for a specific dataset you can do so in the Settings tab of that dataset, where you'll see a switch you can turn off to stop issuing DOIs. If the dataset is already published, the DOIs already issued for the dataset and its published versions will not be removed. If you turn this switch back on at any point in the future, DOIs will be back-issued for all versions.


Step: Filter

Overview

The Filter step will select for rows that meet a certain set of conditions.

Example starting data:

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 | sam     | 74     |
 | pat     | 62     |
 *---------+--------*/

Example output data:

Filter out rows with scores less than 70.

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | sam     | 74     |
 *---------+--------*/

Step structure:

Basic state

  • A filter step will be made up of one or more filter blocks with a completed condition.

  • When there are multiple filter blocks in a step, conditions in all blocks must be met for rows to be kept.

  • If you have a more complex filter statement that is dependent on multiple nested conditions you can press the + button to expand the filter block.

Expanded state

  • When multiple conditions are needed in a block, you must specify how they relate to each other (AND vs OR)

  • Any nested conditions are resolved before higher level conditions.

Field descriptions

Variable(s)

The variable whose values will be evaluated.

Depending on the operator selected, it may contain multiple variables. Use wildcard characters (e.g. *) for bulk selection.

[Operator]

How the two sides of the condition will be evaluated.

Value(s) or Variable(s)

The variable or value which the previously chosen Variable(s) will be evaluated against. Depending on the operator selected, it may contain multiple variables, values, or a value list. Use wildcard characters (e.g. *) for bulk selection.

Examples

Example 1: Basic filter

Let's say we want to reduce our table to contain only information about results from the final test.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Variable(s): The variable test contains the value (final) that we want to evaluate on, so we select it here.

  • [Operator]: We want fields that exactly match, so we choose =.

  • Value(s) or variable(s): We want to only keep rows where final is present, so we put that here.

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Example 2: Multiple conditions

Let's say we don't just want values from the final but only those from the final with a score above 60.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • We input the data as in the above example, but since we now have two conditions, we have to decide how they relate to each other. In this case we want data that meets all the conditions, so we select All conditions must be satisfied (AND).

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Example 3: Nested conditions

Let's say we want to keep all data from the final with a score greater than 60, or any scores above 85.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • We want to keep all rows where scores are over 85 OR are from the final and over 60. So we set up the final and over 60 conditions under an AND umbrella, and nest that under the OR umbrella alongside our condition about scores over 85.

  • When executing, the nested conditions (scores from the final over 60) will be evaluated first as true or false. Then the higher-level condition (scores over 85 OR (scores from the final and over 60)) will be evaluated. Any rows that meet this higher-level condition will be kept in the output, and any that do not will be discarded.

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/
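
For intuition, the same nested logic expressed as a hedged pandas sketch (Redivis runs this as SQL; the sketch only mirrors the boolean structure):

import pandas as pd

grades = pd.DataFrame({
    "test":    ["quiz", "quiz", "quiz", "midterm", "midterm", "midterm", "final", "final", "final"],
    "score":   [83, 35, 89, 74, 62, 93, 77, 59, 92],
    "student": ["jane", "pat", "sam", "jane", "pat", "sam", "jane", "pat", "sam"],
})

# Keep rows where score > 85 OR (test == "final" AND score > 60);
# the nested AND resolves before the outer OR.
kept = grades[(grades.score > 85) | ((grades.test == "final") & (grades.score > 60))]
print(kept)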

Example 4: Comparing variables and value lists

Let's say we only want to keep rows where scores are greater than the average score, and that are for our selected students. Right now our selected students are Jane and Pat but we know that might change in the future.

Starting data:

/*---------+-------+---------+------------+---------------*
 | test    | score | student | date       | score_average |
 +---------+-------+---------+------------+---------------+
 | quiz    | 83    | jane    | 2020-04-01 | 69            |
 | quiz    | 35    | pat     | 2020-04-01 | 69            |
 | quiz    | 89    | sam     | 2020-04-01 | 69            |
 | midterm | 74    | jane    | 2020-05-01 | 76.3333       |
 | midterm | 62    | pat     | 2020-05-01 | 76.3333       |
 | midterm | 93    | sam     | 2020-05-01 | 76.3333       |
 | final   | 77    | jane    | 2020-06-01 | 76            |
 | final   | 59    | pat     | 2020-06-01 | 76            |
 | final   | 92    | sam     | 2020-06-01 | 76            |
 *---------+-------+---------+------------+---------------*/

Input fields:

  • First, we create a value list called selected_students with the values jane and pat on it. Then we can select it to compare against student.

    • With multiple inputs on either side of a condition, the condition is evaluated as an OR. So this condition will evaluate to true if the value in student equals any value on the list (jane or pat).

    • While we could have input jane and pat into the right side of this condition and gotten the same result, using a list makes it easy in the future to change our selected students centrally and update it everywhere the list is used.

  • For our second condition, we want to evaluate how two values in the same row compare to each other.

    • When executed, this condition will look at each row to check if the value in score is greater than the value in score_average. If so, then the condition is met.

Output data:

/*---------+-------+---------+------------+---------------*
 | test    | score | student | date       | score_average |
 +---------+-------+---------+------------+---------------+
 | quiz    | 83    | jane    | 2020-04-01 | 69            |
 | midterm | 74    | jane    | 2020-05-01 | 76.3333       |
 | final   | 77    | jane    | 2020-06-01 | 76            |
 *---------+-------+---------+------------+---------------*/
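
As a hedged pandas analogue of this example, the value list behaves like membership in a Python list, and the second condition is a row-wise comparison of two columns:

import pandas as pd

df = pd.DataFrame({
    "test":          ["quiz", "quiz", "quiz", "final", "final", "final"],
    "score":         [83, 35, 89, 77, 59, 92],
    "student":       ["jane", "pat", "sam", "jane", "pat", "sam"],
    "score_average": [69, 69, 69, 76, 76, 76],
})

selected_students = ["jane", "pat"]          # the value list
kept = df[
    df["student"].isin(selected_students)    # student equals any value on the list
    & (df["score"] > df["score_average"])    # compare two variables within each row
]
print(kept)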

Reference: Comparison statements

A comparison always evaluates to either TRUE or FALSE.

Comparisons are made up of one or more comparison rows. A comparison row always evaluates to TRUE or FALSE; multiple rows can be nested together with a logical AND/OR.

A comparison row is made up of three components:

Left expression

The left expression can contain variable parameters from the source table, joined tables, as well as any newly created variables. All referenced new variables must be "upstream" from the current comparison, with the exception of joins, which may reference new variables that are constructed by any variables upstream of that join.

Depending on the operator selected, the left expression may contain multiple variables. In this case, each left expression will be evaluated against the right expression(s), and logically joined via an OR. If you want to take the logical AND of multiple variables, create a separate comparison row for each variable.

Multiple left hand values are only supported for = and like operators.

Where multiple variables can be entered in a comparison, you can use * in the interface to select all matching results.

For example, typing DIAG* and pressing enter will add all variables beginning with "DIAG" to this field.

Operator

Redivis supports the following operators:

=, !=

Checks if any of the value(s) in the left expression are (not) equal to any values in the right expression. NULLs are treated as equivalent (NULL == NULL -> TRUE and NULL != NULL -> FALSE).

>, >=, <, <=

Checks if the value in the left expression is less than, greater than, etc. the right expression. String comparisons are ordered lexicographically; other data types are compared based on their numeric / temporal order. Comparisons between NULL values will always be false.

like / !like

Checks if the string(s) in the left hand expression matches specified pattern(s) in the right hand expression. The pattern may contain the following characters:

  • A percent sign "%" matches any number of characters

  • An underscore "_" matches a single character

  • You can escape "\", "_", or "%" using one backslash. For example, "\%"
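
The pattern semantics can be illustrated with a small, hedged Python sketch (not Redivis's implementation; backslash escapes are ignored here for brevity):

import re

def sql_like(value: str, pattern: str) -> bool:
    # "%" matches any run of characters, "_" matches exactly one character
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value) is not None

print(sql_like("DIAG01", "DIAG%"))   # True
print(sql_like("DIAG01", "DIAG__"))  # True  (exactly two trailing characters)
print(sql_like("DIAG", "DIAG_"))     # False (no character for the underscore to match)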

Right expression

The right expression can contain any variables allowed in the left expression, as well as literal values and lists. The comparison row will evaluate to TRUE when the left hand expression matches any of the right hand expressions, except for the != and !like comparators, where the comparison will evaluate to true if all values are not equal to / not like the left expression.

To match against a null datum (empty cell), you must specify the special literal value NULL here.

Tabular data

Overview

In order to create new tables, or update existing tables with new data, you will need to ingest a valid tabular data file from a supported data source. Redivis supports numerous tabular data formats, with robust error handling. You can also perform scripted imports via the API.

Tables on Redivis can be made up of one or more tabular file uploads. This use case applies when data with the same general schema is broken up across files; for example, a dataset where there is a separate file for each year. In general, it's best to combine such files into a single table, as it is easier for researchers to query a single "tall" table than multiple tables (and Redivis's high-performance query engine keeps these queries quick, even at terabyte scale).

Supported tabular file types

Redivis will try to detect the file type based on the file extension, though you can manually specify the type as needed.

Type
Description
Notes

Uploading compressed (gzipped) files:

Generally, you should upload uncompressed data files to Redivis, as uncompressed files can be read in parallel and thus upload substantially faster.

If you prefer to store your source data in a compressed format, Avro, Parquet, and ORC are the preferred data formats, as these support parallelized compressed data ingestion at the row level.

Redivis will decompress text-delimited files, though the data ingest process may be substantially slower. If your file is compressed, it must have the .gz file extension if you're uploading locally (e.g., my_data.csv.gz) or have its header set to Content-Encoding: gzip if served from a URL or cloud storage location.

Quotas & limits

Most upload types are limited to 5TB per upload. Stata, SPSS, XLS, and Shapefile ZIP directories are limited to 100GB.

Full documentation of the limits for upload file size, max variables, and other parameters is available in the quotas & limits reference.

Working with delimited files

A text-delimited file is a file that uses a specific character (the delimiter) to separate columns, with newlines separating rows.

Delimited file requirements

  • Must be UTF-8 encoded (ASCII is a valid subset of UTF-8)

  • Quote characters in cells must be properly escaped (see the sketch after this list). For example, if a cell contains the content: Jane said, "Why hasn't this been figured out by now?" it must be encoded as: "Jane said, ""Why hasn't this been figured out by now?"""

  • The quote character must be used to escape the quote character. For example, the sequence \" is not valid for an escaped quote; it must be ""

  • Empty strings will be converted to null values
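
The quote-escaping requirement above can be produced automatically; for example, Python's csv module doubles quote characters by default (a minimal sketch with a hypothetical file name):

import csv

with open("example.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)   # doublequote=True is the default
    writer.writerow(["quote_text"])
    writer.writerow(['Jane said, "Why hasn\'t this been figured out by now?"'])

# example.csv now contains:
# quote_text
# "Jane said, ""Why hasn't this been figured out by now?"""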

Advanced options for delimited files

Has header row

Specifies whether the first row is a header containing the variable names. This will cause data to be read beginning on the 2nd row. If you don't provide a header in your file, variables will be automatically created as var1, var2, var3, etc...

Skip invalid records

By default, an upload will fail if an invalid record is encountered. This includes a record that has a mismatched number of columns, or is otherwise not parsable. If this box is checked, invalid records will be skipped instead, and the number of skipped records will be displayed on each upload once it has been imported.

Allow jagged rows

Whether to allow rows that contain fewer or more columns than the first row of your file. It is recommended to leave this option unchecked, as jagged rows are generally a sign of a parsing error that should be remedied by changing other options or fixing the file.

Cells contain line breaks

Whether newlines exist within specific data cells (e.g., paragraphs of text). If set to Auto, Redivis will determine the value based on analysis of the beginning of the file.

It is best to only set this value to "Yes" if you know your data contain line breaks, as this will slow down the import and may cause incorrect error reporting.

Delimiter

The delimiter will be auto-inferred based upon an analysis of the file being uploaded. In rare cases, this inference may fail; you can specify the delimiter to override this inference.

Quote character

Specify the character used to escape delimiters. Generally ", though some files may not have a quote character (in which case, they must not include the delimiter within any cells).

Escape character

Cells containing a quote character must have that character escaped.

Typically, the escape sequence character is the same as the quote character, but some files may use a different value, such as a backslash (\).

Null markers

A list of up to 10 values (case-sensitive) that should be interpreted as NULL on import. For example, if your file contains the value NA to represent nulls, you can specify this value here such that these values are read as nulls, rather than the string literal "NA".

Variable names and types

Naming variables

Variable names are automatically inferred from the source data. They can only contain alphanumeric or underscore characters, and must start with a letter or underscore. Any invalid characters will be replaced with an underscore (_).

Variable names must be unique within the table. If the same variable is found more than once in any given file, it will automatically have a counter appended to it (e.g., "variable_2").

The max number of characters for a variable name is 60. Any names with more characters will be truncated.

Variable type inference

All values of a variable must be compatible with its type. Redivis will automatically choose the most specific type for a variable, with string being the default type.

Please note the following rules:

  • If all values of a variable are null, its type will be string

  • Numeric values with leading zeros will be stored as string in order to preserve the leading zeros (e.g., 000583)

  • Data stored with decimal values will be stored as a float, even if that value is a valid integer (e.g., 1.0).

  • Temporal data types should be formatted using the canonical types below. Redivis will attempt to parse other common date(time) formats, though this will only be successful when the format is unambiguous and internally consistent.

    • Date: YYYY-[M]M-[D]D

    • DateTime: YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.DDDDDD]]

    • Time: [H]H:[M]M:[S]S[.DDDDDD]

Working with multiple uploads

You can create up to 500 uploads per table, per version. Files will automatically be appended to each other based on their variable names (case insensitive), with the goal of creating one continuous table with a consistent schema.

Missing variables

If a variable is missing in some of the files you uploaded, the values for the missing variable will be set to null for all rows in the upload.

Conflicting variable types

If files have conflicting types for a given variable, the lowest-denominator type for that variable is chosen when the files are combined.

Pseudo-variables associated with uploads

In certain situations, there may be additional metadata associated with an upload that isn't otherwise present in your data (e.g., a date that is encoded in the file name). To allow you to add such information within your transforms, Redivis exposes two "pseudo" variables on all unreleased tables:

  • _IMPORT_ID: The unique identifier associated with the upload. This will most often be used when initiating uploads programmatically.

  • _IMPORT_NAME: The name of the upload, typically the associated file name.

These pseudo-variables won't be shown in the table's list of variables and won't be present when listing rows, but they can be queried via SQL and will show up alongside other variables when cleaning unreleased data in transforms. After a table is released, these pseudo-variables will no longer be available.

Error handling

A file may fail to import due to several reasons; in each case, Redivis endeavors to provide a clear error message for you to fix the error.

In order to view full error information, including a snapshot of where the error occurred in your source file (when applicable), click on the failed upload in the upload manager.

Network issues

When transferring a file from your computer (or more rarely, from other import sources), there may be an interruption to the internet connection that prevents the file from being fully uploaded. In these cases, you should simply try uploading the file again.

Invalid or corrupted source data

Data invalidity is most common when uploading text-delimited files, though it can happen with any file format. While some data invalidity errors may require further investigation outside of Redivis, others may be due to incorrect options provided in the file upload process. When possible, Redivis will display ~1000 characters near the error in the source file, allowing you to identify the potential source of failure.

Common import errors

The Redivis data import tool has been built to gracefully handle a wide range of data formats and encodings. However, errors can still occur if the source data is "invalid"; some common problems (and their solutions) are outlined below.

If you're still unable to resolve the issue, please don't hesitate to reach out to [email protected]; we'd be happy to assist!

Bad CSV dump from SQL database

Some SQL databases and tutorials will generate invalid CSV escape sequences by default. Specifically, they escape quote characters within a cell with a backslash (\") rather than by doubling the quote character; the "proper" escape sequence is a doubling of the quote character (""). For example, MySQL's SELECT ... INTO OUTFILE uses a backslash escape character by default.

If you only have access to an invalid file generated by a previous database dump, you can specify a custom Escape character of \ in the advanced import options, and Redivis will reformat the file as part of the ingest process (Redivis will also auto-detect this custom escape sequence in many scenarios). Using a custom escape sequence may cause data import processing to take a bit longer.

Line breaks within cells

If your data has paragraphs of text within a particular data cell, and the "Cells contain line breaks" advanced option isn't set, the data import may fail. Redivis will automatically set this option to true if it identifies a quoted newline in the top ~1000 records of the file, but if quoted newlines don't occur until later, you'll need to make sure to set this option manually for the import to succeed.

Connectivity and timeout errors

While rare, it is always possible that data transfers will be interrupted by the vagaries of networking. If this happens, we recommend simply retrying your upload. If the problem persists, please reach out to [email protected].

Step: Create variables

Overview

A Create variables block uses methods to make new variables based on existing data. The method selected will dictate how the block operates.

Example starting data:

Example output data:

Creating a new variable for what letter grade each student got on their test.

Step structure

  • There will be at least one new variable block where you will define and complete a new variable method.

  • When multiple blocks exist, the variables will be created in sequence and can reference each other.

Field definitions

Field
Definition

Some methods are only available for certain variable types, so you might need to retype variables before you can use them in the method you've chosen.

Analytic methods

Analytic methods are a special category of method that allow a value to be computed for each row individually. When using an analytic new variable method, new tools become available:

  • A partition segments data based on values in the selected partition variables, and computes the analytic method within those segments separately.

  • A window defines which records are used to compute the analytic method. Usually this would be accompanied with an order clause.

Examples 3 and 4 below go into more detail for analytic methods.

Examples

Example 1: Date extract

A simple new variable format is extracting one part of a variable into a new one. In our data, we have a full date including year, month, and day, but we want to extract the month for use elsewhere.

Starting data:

Input fields:

  • Name: We name our new variable test_month.

  • Method: Once we choose Date extract as our method, new fields appear:

    • Variable to extract: The variable with the data we want to extract from (date).

    • Date part: The part of the variable we want to extract. Since date is a Date type variable the information about date part is stored in the format and can be easily extracted. We choose Month since that is what we want to create.

  • More methods to choose from can be found in the variable methods reference.
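
As a rough pandas analogue (a sketch only; the transform itself runs as SQL), Date extract with the Month date part corresponds to:

import pandas as pd

scores = pd.DataFrame({"date": ["2020-04-01", "2020-05-01", "2020-06-01"]})
scores["test_month"] = pd.to_datetime(scores["date"]).dt.month   # 4, 5, 6
print(scores)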

Output data:

Example 2: Case method

The Case method (if/else) allows us to specify one or more conditions to create the values of the new variable.

For example, we can create a variable capturing the grade of each test in our data.

Starting data:

Input fields:

  • When run, each row of the table will be evaluated in each section of the case statement (If, Else if, Else set to) until it matches something and is set to the corresponding value.

    • After the first section where a row meets the criteria, the new variable value will be set and no other sections will be evaluated.

  • Comparison statements used here operate the same as they do in the Filter step and can be nested in the same way.
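
A hedged sketch of the same if/else logic in Python (the letter-grade cutoffs here are hypothetical, since they depend on the conditions you configure):

import numpy as np
import pandas as pd

scores = pd.DataFrame({"score": [83, 35, 89, 74]})

# Conditions are checked in order and the first match wins
conditions = [scores["score"] >= 90, scores["score"] >= 80, scores["score"] >= 70]
choices = ["A", "B", "C"]
scores["grade"] = np.select(conditions, choices, default="F")
print(scores)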

Output data:

Example 3: Partitioned analytic methods

We can use an analytic method to compute a value for each row (rather than the entire table). By using a partition we can define groups of rows to calculate across.

For example, in our grades data we can calculate the average score of each test.

Starting data:

Input fields:

  • Method: We want to calculate the average so select this analytic method.

  • Partition: This is where we define the groups in which the average will be calculated.

    • If one or more variables are selected here, the average for our new score_average variable will be computed across all rows that are the same in the selected variable(s).

    • If no variables are entered here, then the average will be computed across the entire table.

    • We want to average scores from the same test so we select test here.

  • Variable to aggregate: This variable contains the data we want to average.

  • Window: We don't want to define a window in this average so we leave it on All rows in the partition.

Output data:

To create the new variable, for each row with the same value in test, the values in score will be averaged together. So there are 3 rows with the value quiz in test which average to 69. So all rows with a quiz will show the same average of 69 in our new variable.
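
A hedged pandas equivalent of this partitioned Average (for intuition only):

import pandas as pd

scores = pd.DataFrame({
    "test":  ["quiz", "quiz", "quiz", "midterm", "midterm", "midterm"],
    "score": [83, 35, 89, 74, 62, 93],
})

# Compute the mean within each test (the partition) and broadcast it back to every row
scores["score_average"] = scores.groupby("test")["score"].transform("mean")
print(scores)   # quiz rows all get 69.0, midterm rows all get 76.33...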

Example 4: Windowed analytic methods

We can use a window to calculate a moving average. A window will define how many rows before or after the current row to use when calculating the average.

Example data:

Input fields:

  • Partition: We choose not to use a partition in this example since our data does not need to be segmented.

  • Variable to aggregate: The numbers we want to average are in the score variable so we choose that here.

  • Window: We want to create a moving average based on one entry before and after the current row, so we select Rows here.

  • Order by: Our data is already ordered by date in this table, but if it wasn't we would definitely need to order on the relevant variable here.

  • Rows preceding / Rows following: This is where we define how many rows to include in the average.

Output data:

For each row, one row preceding and following is used to compute the score_average. So for the first row we average 10 and 10 (since no preceding rows exist, it is excluded). For the second row 10, 10, and 40 are averaged. This process repeats until the end of the table is reached.

Note that we could also use Range instead of Rows for our window if our question was time based (e.g. average score over 1 month preceding and 1 month following).
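
A hedged pandas sketch of this Rows window (the dates and trailing scores are hypothetical; only the first few values follow the example above):

import pandas as pd

scores = pd.DataFrame({
    "date":  pd.to_datetime(["2020-01-01", "2020-01-08", "2020-01-15", "2020-01-22"]),
    "score": [10, 10, 40, 40],
})

# 1 row preceding and 1 row following the current row, ordered by date
scores = scores.sort_values("date")
scores["score_average"] = scores["score"].rolling(window=3, center=True, min_periods=1).mean()
print(scores)   # 10.0, 20.0, 30.0, 40.0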

Example 5: Windowed & partitioned analytic methods

Continuing the previous example, if we had included a partition then this same process would be completed separately for the values in each partition. So if we had partitioned on test in this example, our outcome would look different.

Output data:

Since quiz2 is in a separate partition, those rows are averaged separately.

Time

Current time

Returns the current time.

Return type

time

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

Format time

Returns a formatted string from a time.

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

More details about format strings

Parse time

Parses a time from a string.

Return type

time

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

More details about format strings

New time

Constructs a time from an hour, minute, and second.

Return type

time

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

Time add

Add a period of time to a time.

Return type

time

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

Time diff

Calculate distance between two times.

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

Time subtract

Subtract a period of time from a time.

Return type

time

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

Time truncate

Truncates a time to the nearest boundary.

Return type

time

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

Create an image classification model

This guide demonstrates using a Redivis workflow to train an ML model on a set of images stored in a Redivis dataset.

Workflow objective

This is an example workflow demonstrating image classification via Convolutional Neural Networks. It imports an example dataset containing several thousand test and training images of cats and dogs, with which we can train and evaluate our model.

This workflow is heavily adapted from its initial publication at: https://gsurma.medium.com/image-classifier-cats-vs-dogs-with-convolutional-neural-networks-cnns-and-google-colabs-4e9af21ae7a8

We also suggest you recreate this workflow as you go, to best learn the process.

1. Explore data

All the image data we need is contained in an example dataset published by the Redivis Demo organization.

We can go to this dataset and browse its tables to understand the structure of the data it contains.

We see three tables here, and all of them are file index tables. That means that each table contains an index of the files (unstructured data) this dataset contains, sorted by the folder the administrator uploaded them into. We can click on the Files tab of the dataset to see each file individually, and click on it to see a preview.

This dataset has three groupings of files:

  • Training images (we will use these to build the model)

  • Test images (images not included in the training set that we can verify the model with)

  • Example file types (unrelated to this workflow)

If we click on the Tables tab, and click on the training images table, we can see high level information about this set of files. We can see that there are 25,000 files, and when we click the Cells tab, all of the file names we can see end in .jpg. We can hover on these to see a preview of the image, and we can click on the file_id variable to see a preview of the image with more information.

2. Create a workflow

At the top of this dataset page we can click the Analyze in workflow button to get started working with this data.

You can add this dataset to an existing workflow you already have access to, or create a new workflow to start from scratch.

3. Define a training set of images

We will use transforms to clean the data, as they are best suited for reshaping data and will quickly output new tables we can continue to work with.

Define training set

We need to start by defining the training set, which conceptually means the set of images we know are cats and know are dogs to train the model on. Information about whether an image is a cat or dog is in the file name, so we need to pull it out into a new variable we can more easily sort on.

Click on the table Training images and create a transform. This interface is where we will define a query which will run against our source table and create an output table. You can choose to write the query in SQL, but we will use the interface for this example since it is faster and easier to use.

Add a Create variables step and name the new variable is_cat. The method will be Regexp contains, which allows us to easily identify the presence of the string cat in the file_name variable. This new variable will be a boolean variable where true means the image contains a cat and false means it does not.

We only want to include some of our images in the training set used to train the model, since we want to leave some aside to validate the model. So here we want to include exactly 5000 cat images and 5000 dog images. To do this we will create a new variable rank and filter on it so that we only keep the first 5000 images of each type.

To do this, click + Add block in the Create variables step and use the Rank method. This is an analytic method, which means you will use the partition ability to partition on true and false values. For each partitioned value (true and false) a rank will be assigned.

Create a new Filter step. Conceptually we will keep records with a rank of up to 5000, which means the output will include 5000 true values and 5000 false values.

The final step in the transform is deciding which variables we want in our output table. We will keep our new boolean variable is_cat to use later, along with the file_id and file_name variables.
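For intuition, the logic of this transform is roughly equivalent to the following pandas sketch (illustrative only: the actual transform compiles to SQL and runs on Redivis, and the table reference name here is a placeholder):

import redivis

# Illustrative sketch of the transform's logic (the real work happens in SQL on Redivis)
files = redivis.table("training_images").to_pandas_dataframe()  # placeholder table reference

files["is_cat"] = files["file_name"].str.contains("cat")   # Create variables: Regexp contains
files["rank"] = files.groupby("is_cat").cumcount() + 1     # Create variables: Rank, partitioned on is_cat
training = files[files["rank"] <= 5000]                    # Filter: keep the first 5000 of each class
training = training[["file_id", "file_name", "is_cat"]]    # Variable selection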

With everything in place we can run this transform by clicking the Run button in the top right corner.

4. Sanity check the output table

Now that we have created a new table, we can inspect it to make sure our steps accomplished what we expected them to.

Click on the output table below the transform to view it. We can see that it contains 10,000 records, which is exactly what we expected.

We can also inspect each variable further. If we click on the is_cat variable we can see that there are 5000 true values and 5000 false values, which shows that our filtering was successful. We can also validate that the method we used to determine whether an image is a cat or a dog worked by clicking on the Cells tab. Here we can see that records marked true have "cat" in their file name, and when we hover on the file_id value to see a preview, the image clearly contains a cat.

Since this table looks like we expect we can move on to the next step! Otherwise we'd need to go back to the initial transform to change our inputs.

5. Define validation set of images

We need to create a set of image files separate from our training set where we know if the image contains a cat or dog. This will be used to validate the model training.

Create a new transform and take all the same steps as we did in the previous transform, but we will change the filter to keep images ranked 5001-7500, rather than 1-5000.

We will keep the same variables as we did in our training model, and then run this transform.

When we run this transform and inspect the output table we see what we expect here as well. There are 5000 total files and we can validate a few of them visually on the Cells tab.

6. Training the model in a notebook

Next we want to train and test a model using Python code, with the help of various Python libraries. Transforms are highly performant for reshaping data, but they are based on SQL and operate linearly, with only a single output table allowed. In order to work in Python, R, Stata, or SAS to generate visuals and other outputs, we will create a Notebook node on our training data output table.

When you create the notebook for the first time it will start up. Notebooks must be running to execute code.

Install packages

Redivis notebooks come with many common packages preinstalled, and you can install additional packages by clicking the Dependencies button and importing libraries in the code.

Since this notebook contains only public data we can install packages at any time, but for restricted data notebooks do not have internet access and packages can only be installed when they are stopped.

The main libraries used to create this model are Keras and TensorFlow. You can view their documentation for further details.

Load training and validation sets

Newly created notebooks come with standard code to import the Redivis library and reference the source table as a pandas dataframe within the notebook. For this example we will replace this sample code so that the data is imported in the form our chosen libraries expect.
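As a minimal sketch of what that might look like (the table name is a placeholder for the training output table in this workflow; the folder layout mirrors what the model code in this guide expects, since Keras infers class labels from subdirectory names):

import os
import redivis

# Placeholder table name; columns per this guide: file_id, file_name, is_cat
df = redivis.table("training_output").to_pandas_dataframe()

for row in df.itertuples():
    label = "cat" if row.is_cat else "dog"
    target_dir = f"{os.getcwd()}/cats_and_dogs/training/{label}"
    os.makedirs(target_dir, exist_ok=True)
    # Download each image file by its globally unique file_id
    redivis.file(row.file_id).download(target_dir)

The same approach can be repeated for the validation set, downloading into cats_and_dogs/validation/.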

Define model parameters

This is where we will heavily rely on our selected libraries to build the model.

Model training

This is where we will train the model we just built using the image data we cleaned.

Evaluate model results

Now we will use the validation set to see how well our model works.
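For a quick numeric summary, something like the following reports overall loss and accuracy on the validation generator defined in the training code (recent Keras versions accept a generator directly in evaluate; older versions expose an equivalent evaluate_generator method):

loss, accuracy = model.evaluate(
    validation_generator,
    steps=len(validation_generator.filenames) // BATCH_SIZE)
print(f"Validation loss: {loss:.3f}, accuracy: {accuracy:.3f}")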

Next steps

Perhaps we see something in this model we want to tweak, or we want to go back and change some of our underlying data. Workflows are iterative, and at any point you can go back and change the source data, transform configurations, or notebooks, and then rerun them.

Notebooks can also create output tables, which allow you to sanity check the work done in the notebook or create a table to use in another notebook or transform. You can also fork this workflow to work on a similar analysis, or export any table in this workflow for work elsewhere.
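For example, a hypothetical dataframe of per-image predictions could be materialized as an output table using the same pattern shown in the notebooks guide (the dataframe construction here is illustrative):

import pandas as pd
import redivis

# Hypothetical: collect the test-set probabilities computed in the notebook into a dataframe
predictions = pd.DataFrame({
    "file_name": test_generator.filenames[:len(probabilities)],
    "probability_dog": probabilities.flatten(),
})

redivis.current_notebook().create_output_table(predictions)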

Statistical

Correlation

Returns the Pearson coefficient of correlation of a set of number pairs.

Return type: float

Parameters: see the CORR signature below.

Covariance

Returns the covariance of a set of number pairs.

Return type: float

Parameters: see the COVAR_@sample_type signature below.

Standard deviation

Returns the standard deviation of all values.

Return type: float

Parameters: see the STDDEV_@sample_type signature below.

Variance

Returns the variance of all values.

Return type: float

Parameters: see the VAR_@sample_type signature below.

CURRENT_TIME([@time_zone])

  @time_zone: enum; required: false; placeholder: (UTC (default))
  Allowed values: ACDT, ACST, ACT, ACT, ACWST, ADT, AEDT, AEST, AFT, AKDT, AKST, AMST, AMT, AMT, ART, AST, AST, AWST, AZOST, AZOT, AZT, BDT, BIOT, BIT, BOT, BRST, BRT, BST, BST, BST, BTT, CAT, CCT, CDT, CDT, CEST, CET, CHADT, CHAST, CHOT, CHOST, CHST, CHUT, CIST, CIT, CKT, CLST, CLT, COST, COT, CST, CST, CST, CT, CVT, CWST, CXT, DAVT, DDUT, DFT, EASST, EAST, EAT, ECT, ECT, EDT, EEST, EET, EGST, EGT, EIT, EST, FET, FJT, FKST, FKT, FNT, GALT, GAMT, GET, GFT, GILT, GIT, GMT, GST, GST, GYT, HDT, HAEC, HST, HKT, HMT, HOVST, HOVT, ICT, IDLW, IDT, IOT, IRDT, IRKT, IRST, IST, IST, IST, JST, KALT, KGT, KOST, KRAT, KST, LHST, LHST, LINT, MAGT, MART, MAWT, MDT, MET, MEST, MHT, MIST, MIT, MMT, MSK, MST, MST, MUT, MVT, MYT, NCT, NDT, NFT, NPT, NST, NT, NUT, NZDT, NZST, OMST, ORAT, PDT, PET, PETT, PGT, PHOT, PHT, PKT, PMDT, PMST, PONT, PST, PST, PYST, PYT, RET, ROTT, SAKT, SAMT, SAST, SBT, SCT, SDT, SGT, SLST, SRET, SRT, SST, SST, SYOT, TAHT, THA, TFT, TJT, TKT, TLT, TMT, TRT, TOT, TVT, ULAST, ULAT, UTC, UYST, UYT, UZT, VET, VLAT, VOLT, VOST, VUT, WAKT, WAST, WAT, WEST, WET, WIT, WST, YAKT, YEKT

FORMAT_TIME(@format_string, @time_expression)

  @time_expression: variable or literal; allowed values: any time; required: true
  @format_string: literal; allowed values: any string; required: true; placeholder: (e.g., %H:%M:%S)

[@safe]PARSE_TIME(@format_string, @time_string)

  @time_string: variable or literal; allowed values: any string; required: true
  @format_string: literal; allowed values: any string; required: true; placeholder: (e.g., %H:%M:%S)
  @safe: boolean; allowed values: any boolean; required: true

TIME(@hour, @minute, @second)

  @hour: variable or literal; allowed values: any integer; required: true
  @minute: variable or literal; allowed values: any integer; required: true
  @second: variable or literal; allowed values: any integer; required: true

TIME_ADD(@time_expression, INTERVAL @integer_expression @time_part)

  @integer_expression: variable or literal; allowed values: any integer; required: true
  @time_part: enum; allowed values: hour, minute, second, millisecond, microsecond; required: true; placeholder: (e.g., hours)
  @time_expression: variable or literal; allowed values: any time; required: true

TIME_DIFF(@time_expression, @time_expression_2, @time_part)

  @time_expression_2: variable or literal; allowed values: any time; required: true
  @time_expression: variable or literal; allowed values: any time; required: true
  @time_part: enum; allowed values: hour, minute, second, millisecond, microsecond; required: true; placeholder: (e.g., hours)

TIME_SUB(@time_expression, INTERVAL @integer_expression @time_part)

  @integer_expression: variable or literal; allowed values: any integer; required: true
  @time_part: enum; allowed values: hour, minute, second, millisecond, microsecond; required: true; placeholder: (e.g., days)
  @time_expression: variable or literal; allowed values: any time; required: true

TIME_TRUNC(@time_expression, @time_truncate_part)

  @time_expression: variable or literal; allowed values: any time; required: true
  @time_truncate_part: enum; allowed values: hour, minute, second, millisecond, microsecond; required: true; placeholder: (e.g., hours)

CORR(@expression, @expression_2)

  @expression: variable or literal; allowed values: any float; required: true
  @expression_2: variable or literal; allowed values: any float; required: true

COVAR_@sample_type(@expression, @expression_2)

  @sample_type: enum; allowed values: SAMP, POP; required: false; placeholder: (Sample (default))
  @expression: variable or literal; allowed values: any float; required: true
  @expression_2: variable or literal; allowed values: any float; required: true

STDDEV_@sample_type(@expression)

  @sample_type: enum; allowed values: SAMP, POP; required: false; placeholder: (Sample (default))
  @expression: variable or literal; allowed values: any float; required: true

VAR_@sample_type(@expression)

  @sample_type: enum; allowed values: SAMP, POP; required: false; placeholder: (Sample (default))
  @expression: variable or literal; allowed values: any float; required: true

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 | sam     | 74     |
 | pat     | 62     |
 *---------+--------*/
/*---------+--------+-------*
 | student | score  | grade |
 +---------+--------+-------+
 | jane    | 83     | B     |
 | neal    | 35     | F     |
 | sam     | 74     | C     |
 | pat     | 62     | D     |
 *---------+--------+-------*/

Name

The name of the variable being created. This must follow all naming standards

Method

The way that the new variable will be created. Choosing this will bring up additional fields to complete specific to the chosen method. See all methods here.

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/
/*---------+-------+---------+------------+--------------*
 | test    | score | student | date       | test_month   |
 +---------+-------+---------+------------+--------------+
 | quiz    | 83    | jane    | 2020-04-01 | April        |
 | quiz    | 35    | pat     | 2020-04-01 | April        |
 | quiz    | 89    | sam     | 2020-04-01 | April        |
 | midterm | 74    | jane    | 2020-05-01 | May          |
 | midterm | 62    | pat     | 2020-05-01 | May          |
 | midterm | 93    | sam     | 2020-05-01 | May          |
 | final   | 77    | jane    | 2020-06-01 | June         |
 | final   | 59    | pat     | 2020-06-01 | June         |
 | final   | 92    | sam     | 2020-06-01 | June         |
 *---------+-------+---------+------------+--------------*/
/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/
/*---------+-------+---------+------------+----------------*
 | test    | score | student | date       | letter_grade   |
 +---------+-------+---------+------------+----------------+
 | quiz    | 83    | jane    | 2020-04-01 | B              |
 | quiz    | 35    | pat     | 2020-04-01 | F              |
 | quiz    | 89    | sam     | 2020-04-01 | B              |
 | midterm | 74    | jane    | 2020-05-01 | C              |
 | midterm | 62    | pat     | 2020-05-01 | F              |
 | midterm | 93    | sam     | 2020-05-01 | A              |
 | final   | 77    | jane    | 2020-06-01 | C              |
 | final   | 59    | pat     | 2020-06-01 | F              |
 | final   | 92    | sam     | 2020-06-01 | A              |
 *---------+-------+---------+------------+----------------*/
/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/
/*---------+-------+---------+------------+---------------*
 | test    | score | student | date       | score_average |
 +---------+-------+---------+------------+---------------+
 | quiz    | 83    | jane    | 2020-04-01 | 69            |
 | quiz    | 35    | pat     | 2020-04-01 | 69            |
 | quiz    | 89    | sam     | 2020-04-01 | 69            |
 | midterm | 74    | jane    | 2020-05-01 | 76.3333       |
 | midterm | 62    | pat     | 2020-05-01 | 76.3333       |
 | midterm | 93    | sam     | 2020-05-01 | 76.3333       |
 | final   | 77    | jane    | 2020-06-01 | 76            |
 | final   | 59    | pat     | 2020-06-01 | 76            |
 | final   | 92    | sam     | 2020-06-01 | 76            |
 *---------+-------+---------+------------+---------------*/
/*-------+---------+------------*
 | score | test    | date       |
 +-------+---------+------------+
 | 10    | quiz1   | 2020-01-01 |
 | 10    | quiz1   | 2020-02-01 |
 | 40    | quiz1   | 2020-03-01 |
 | 30    | quiz2   | 2020-04-01 |
 *-------+---------+------------*/
/*-------+---------+------------+---------------*
 | score | test    | date       | score_average |
 +-------+---------+------------+---------------+
 | 10    | quiz1   | 2020-01-01 | 10            |
 | 10    | quiz1   | 2020-02-01 | 20            |
 | 40    | quiz1   | 2020-03-01 | 26.6667       |
 | 30    | quiz2   | 2020-04-01 | 35            |
 *-------+---------+------------+---------------*/
/*-------+---------+------------+---------------*
 | score | test    | date       | score_average |
 +-------+---------+------------+---------------+
 | 10    | quiz1   | 2020-01-01 | 10            |
 | 10    | quiz1   | 2020-02-01 | 20            |
 | 40    | quiz1   | 2020-03-01 | 25            |
 | 30    | quiz2   | 2020-04-01 | 30            |
 *-------+---------+------------+---------------*/

Data repository characteristics

Overview

Redivis is designed to support data-driven research throughout the research lifecycle, and can serve as a permanent repository for your data, analytical workflows, and data derivatives.

In May 2022, the Subcommittee on Open Science (SOS) of the United States Office of Science and Technology Policy (OSTP) released a document outlining the "desirable characteristics of data repositories".

These characteristics are intended to help agencies direct Federally funded researchers toward repositories that enable management and sharing of research data consistent with the principles of FAIR data practices. Various agencies have adopted these guidelines, including the NIH.

Redivis is specifically designed to meet these desirable characteristics of a data repository, outlined below:

Desirable Characteristics for All Data Repositories

Unique Persistent Identifiers

✅ Assigns a persistent identifier (PID), such as a DOI

✅ Identifier points to a persistent landing page

Long-Term Sustainability

✅ Plan for long-term data management

✅ Maintain integrity, authenticity, and availability of datasets

✅ Stable technical infrastructure

✅ Stable funding plans

✅ Contingency plans to ensure data are available and maintained

Metadata

✅ Datasets accompanied by metadata

✅ Aids in the easy discovery, reuse, and citation of datasets

✅ Schema appropriate to relevant data communities

Curation and Quality Assurance

✅ Provide or allow others to provide expert curation

✅ Quality assurance for accuracy and integrity of datasets and metadata

Free and Easy Access

✅ Broad, equitable, and maximally open access to datasets and metadata

✅ Access is free of charge in a timely manner consistent with privacy

Broad and Measured Reuse

✅ Makes datasets and metadata available to reuse

✅ Provides ability to measure attribution, citation, and reuse of data

Clear Use Guidance

✅ Provides documentation for access and use

Security and Integrity

✅ Prevents unauthorized access, modification, and release of data

Confidentiality

✅ Ensures administrative, technical, and physical safeguards

✅ Continuous monitoring of requirements

Common Format

✅ Download, access, and export available in non-proprietary formats

Provenance

✅ Ability to record the origin, chain of custody, and modification of data or metadata

Retention Policy

✅ Provides policy for data retention

Additional Considerations for Human Data

Fidelity to Consent

✅ Utilizes consistent consent

Restricted Use Compliant

✅ Enforces data use restrictions

Privacy

✅ Implements measures to protect data from inappropriate access.

Plan for Breach

✅ Has a response plan for detected data breaches.

Download Control

✅ Controls and audits access to and download of datasets.

Violations

✅ Has procedures for addressing violations and data mismanagement.

Request Review

✅ Process for reviewing data access requests.


Detailed information

Unique Persistent Identifiers

Assigns datasets a citable, unique persistent identifier, such as a digital object identifier (DOI) or accession number, to support data discovery, reporting, and research assessment. The identifier points to a persistent landing page that remains accessible even if the dataset is de-accessioned or no longer available.

Data can be uploaded to datasets within an organization, where every version of that dataset is assigned a DOI through the organization's DOI-issuing credentials. DOIs will always resolve to the URL of the dataset page. In the case when a dataset is restricted or deleted, base metadata will remain available.

Long-Term Sustainability

Has a plan for long-term management of data, including maintaining integrity, authenticity, and availability of datasets; building on a stable technical infrastructure and funding plans; and having contingency plans to ensure data are available and maintained during and after unforeseen events.

Redivis uses highly-available and redundant Google Cloud infrastructure to ensure data is stored securely and to the highest technical standards. Redivis maintains a formal disaster recovery and business continuity plan that is regularly exercised to ensure our ability to maintain availability and data durability during unforeseen events.

Redivis undergoes annual security audits and penetration testing by an external firm, and utilizes a formalized software development and review process to maintain the soundness of its technical infrastructure.

Funding for Redivis is provided by recurring annual subscriptions from its member academic institutions. This model provides consistent annual revenue to support the ongoing maintenance of the platform. Redivis is an employee-owned company without any external investors with an equity stake, allowing us to focus solely on the needs of our customers, our employees, and our mission of improving accessibility in research data science.

Metadata

Ensures datasets are accompanied by metadata to enable discovery, reuse, and citation of datasets, using schema that are appropriate to, and ideally widely used across, the community(ies) the repository serves. Domain-specific repositories would generally have more detailed metadata than generalist repositories.

All Redivis datasets contain extensive metadata and documentation. Some metadata fields (including summary statistics for all variables) are automatically generated. Some fields are optional depending on the editor's insight and preference. Every dataset has space for short and long-form text, supporting files, and links, alongside variable labels and descriptions.

Metadata is available in various machine readable formats, such as schema.org and DataCite JSON, and can be viewed through the interface or downloaded via the API.

As a generalist repository, the metadata schema is intentionally broad and flexible, but specific groups on Redivis can choose to enforce more specific metadata standards within their datasets.

Curation and Quality Assurance

Provides, or has a mechanism for others to provide, expert curation and quality assurance to improve the accuracy and integrity of datasets and metadata.

All datasets on Redivis are owned either by an Organization (curated by any administrator) or an individual User. Additional users may be added as editors to a dataset, so as to provide further curation and quality assurance.

It is ultimately up to the editors of a dataset to provide curation, though Redivis is designed to support this process as much as possible. Redivis automatically computes checksums and runs fixity checks on all uploaded files, and computes univariate summary statistics of all variables to aid in the quality assurance process. Metadata completeness is also reported to editors, encouraging them to provide as much information as possible.

Free and Easy Access

Provides broad, equitable, and maximally open access to datasets and their metadata free of charge in a timely manner after submission, consistent with legal and ethical limits required to maintain privacy and confidentiality, Tribal sovereignty, and protection of other sensitive data.

Datasets and workflows can be explored on Redivis without any requirement to have an account or log in. To apply for access to restricted data, or to analyze and download data, a user will need to create an account. All individual accounts are completely free and require no specific affiliation. Data access restrictions are set and maintained by the data owner.

Broad and Measured Reuse

Makes datasets and their metadata available with broadest possible terms of reuse; and provides the ability to measure attribution, citation, and reuse of data (i.e., through assignment of adequate metadata and unique PIDs).

The data owner can choose to publish any dataset publicly or set appropriate access restrictions based on the sensitivity of the data. Redivis imposes no additional limits on the availability and reuse of data. Any dataset that has a DOI can be cited and tracked in publications, and any reuse on Redivis is displayed on the dataset's usage tab.

Clear Use Guidance

Provides accompanying documentation describing terms of dataset access and use (e.g., particular licenses, need for approval by a data use committee).

Dataset owners can describe any usage agreements in their access requirements and have space to document any additional usage rules. Redivis imposes no additional limits on the availability and use of data.

Security and Integrity

Has documented measures in place to meet generally accepted criteria for preventing unauthorized access to, modification of, or release of data, with levels of security that are appropriate to the sensitivity of data.

Redivis is SOC2 certified and prioritizes technical security. There are multiple layers of administrative controls to make it clear what actions data owners are taking, and all actions, whether taken by administrators or researchers, are automatically logged for review.

Redivis is well-designed to handle workflows around sharing of sensitive and high-risk data. It provides technical mechanisms to support the reuse of sensitive data when allowed, while enabling the enforcement of appropriate guardrails and access controls defined by data administrators.

Confidentiality

Has documented capabilities for ensuring that administrative, technical, and physical safeguards are employed to comply with applicable confidentiality, risk management, and continuous monitoring requirements for sensitive data.

Redivis is SOC2 certified and prioritizes technical security. All data is encrypted in transit and at rest, and stored on Google Cloud infrastructure that maintains robust technical and physical access controls.

In addition to an annual security audit, Redivis also undergoes annual penetration testing by an outside firm to further ensure the soundness of its security posture.

Common Format

Allows datasets and metadata downloaded, accessed, or exported from the repository to be in widely used, preferably non-proprietary, formats consistent with those used in the community(ies) the repository serves.

Data and metadata on Redivis can be imported and exported in multiple common formats. Data analysis is performed in common, generally open-source programming languages (SAS and Stata being available exceptions). Redivis does not introduce any of its own proprietary formats or programming languages.

Provenance

Has mechanisms in place to record the origin, chain of custody, and any modifications to submitted datasets and metadata.

All datasets contain Provenance documentation, which is automatically populated based on administrator actions. This information can be further edited or supplemented with additional related identifiers.

All modifications to a dataset are tracked in the administrative logs.

Retention Policy

Provides documentation on policies for data retention within the repository.

Redivis publishes a data retention policy. Data is always owned by the user or organization who uploads the dataset, and they have control over a dataset's presence and availability. The dataset owner may apply additional policies towards data retention.


Fidelity to Consent

Uses documented procedures to restrict dataset access and use to those that are consistent with participant consent and changes in consent.

Redivis provides extensive options to limit access on sensitive or restricted data, including allowing access on different levels (e.g. metadata, sample, data). Access is granted and revoked on Redivis instantly, allowing for immediate changes in access based on changing circumstances.

Access rules on Redivis are defined and enforced by the dataset owner.

Restricted Use Compliant

Uses documented procedures to communicate and enforce data use restrictions, such as preventing reidentification or redistribution to unauthorized users.

Redivis has built in data restrictions available to administrators. Data download or export can be restricted completely or limited only to administrator approval in order to prevent redistribution.

Data administrators can also communicate and collect formal acknowledgement of other use restrictions through access requirements. Moreover, data administrators can easily audit the use of restricted data in the audit logs to further check for and limit any non-compliance or other misuse.

Privacy

Implements and provides documentation of measures (for example, tiered access, credentialing of data users, security safeguards against potential breaches) to protect human subjects’ data from inappropriate access.

Redivis has a built-in tiered access system that allows administrators to restrict access to sensitive data. These controls apply to data derivatives as well, where any data output inherits the access rules of the source dataset(s) used to create that output. These controls on derivative data allow researchers to share analyses with colleagues without worrying about accidentally leaking sensitive information, since all collaborators will need to comply with the access rules in order to view those outputs.

Additional technical safeguards ensure data privacy. Data users can establish their identity through their institutional identity provider, and Redivis undergoes regular security audits to ensure the soundness of its security posture.

Plan for Breach

Has security measures that include a response plan for detected data breaches.

Redivis has detailed internal security protocols and documented security breach plans which are regularly exercised by technical personnel via tabletop exercises. All systems are continuously monitored for potential breaches, with immediate alert pathways and clear escalation protocols to respond to any breach.

Download Control

Controls and audits access to and download of datasets (if download is permitted).

Redivis has built in data restrictions available to administrators. Data download or export can be restricted completely or limited only to specific external systems / upon administrator approval. All data downloads are logged for subsequent audit and review. Downloads are only available to authenticated users who have access to the underlying data.

Violations

Has procedures for addressing violations of terms-of-use by users and data mismanagement by the repository.

Violations of the terms-of-use may lead to account suspension or revocation, as outlined in the terms. Additional technical controls are in place to prevent abuse or misuse of Redivis's systems.

As a matter of policy, Redivis aims to be as permissive as possible, recognizing that often misuse is the result of accidental behavior or misunderstanding. These controls are designed to protect the system for all users, and are not intended to ever be punitive towards good-faith actors.

Request Review

Makes use of an established and transparent process for reviewing data access requests.

Administrators can define access requirements on any restricted dataset. These access requirements are transparent to all users, and researchers must apply and be approved for a given set of requirements in order to gain access. Requirements also have space for both administrators and applicants to leave comments specifically in the context of the data application.

.csv, .tsv, .psv, .dsv, .txt, .tab, *

Delimited

Redivis will auto-infer the delimiter, or you can specify it manually. This is also the default format for files with missing file extensions.

See working with text-delimited files

.jsonl, .ndjson

JSON-lines (a.k.a. Newline-delimited JSON)

A newline-delimited list of JSON objects, with one object on each line. Each object's keys correspond to the variable names in the table.

.json

JSON

Must be a JSON array of objects, where each top-level object represents one row in the table. The keys of each top-level object correspond to the variable names in the table.

Importing newline-delimited JSON (see above) will be significantly faster and is recommended for larger files. If your file is formatted as GeoJSON, but has the ending .json, make sure to explicitly choose "GeoJSON" as the file format.

.avro

Avro format

Compressed data blocks using the DEFLATE and Snappy codecs are supported.

Nested and repeated fields are not supported.

.parquet

Parquet format

Nested and repeated fields are not supported.

.orc

Orc format

Nested and repeated fields are not supported.

.sas7bdat

SAS data file

Default formats will be interpreted to the corresponding variable type, and variable labels will automatically be imported.

User-defined formats (.sas7bcat) are not supported.

.dta

Stata data file

Variable labels and value labels will automatically be imported.

.sav

SPSS data file

Variable labels and value labels will automatically be imported.

.xls, .xlsx

Excel file

Only the first sheet will be ingested. The legacy .xls format will have all dates and times represented as dateTimes. Due to the variability of Excel files, and inconsistencies in how Excel internally represents dates, this format is typically not recommended if other options are available.

Google sheets

Sheets file stored in Google Drive

Only the first tab of data will be ingested.

Incorrect encoding (quote characters inside a quoted field should not be backslash-escaped):
val1,val2,"string with \"quotes\" inside"

Correct encoding (quote characters inside a quoted field are escaped by doubling them):
val1,val2,"string with ""quotes"" inside"
SELECT .... 
INTO OUTFILE '/.../out.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"', ESCAPED BY '"'
LINES TERMINATED BY '\n';
import keras
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential, Model
from tensorflow.keras.optimizers import RMSprop
from keras.layers import Activation, Dropout, Flatten, Dense, GlobalMaxPooling2D, Conv2D, MaxPooling2D
from keras.callbacks import CSVLogger
from livelossplot.keras import PlotLossesCallback
import efficientnet.keras as efn
import redivis
import os
TRAINING_LOGS_FILE = "training_logs.csv"
MODEL_SUMMARY_FILE = "model_summary.txt"
MODEL_FILE = "cats_vs_dogs.h5"

# Data
path = f"{os.getcwd()}/cats_and_dogs/"
training_data_dir = path + "training/"
validation_data_dir = path + "validation/" 
test_data_dir = path + "test/" 
# Hyperparams
IMAGE_SIZE = 200
IMAGE_WIDTH, IMAGE_HEIGHT = IMAGE_SIZE, IMAGE_SIZE
EPOCHS = 20
BATCH_SIZE = 32
TEST_SIZE = 30

input_shape = (IMAGE_WIDTH, IMAGE_HEIGHT, 3)
# CNN Model 5 (https://towardsdatascience.com/image-classifier-cats-vs-dogs-with-convolutional-neural-networks-cnns-and-google-colabs-4e9af21ae7a8)
model = Sequential()

# Four blocks of stacked 3x3 convolutions, each followed by 2x2 max pooling
model.add(Conv2D(32, (3, 3), padding='same', input_shape=input_shape, activation='relu'))
model.add(Conv2D(32, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Conv2D(128, (3, 3), padding='same', activation='relu'))
model.add(Conv2D(128, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Conv2D(256, (3, 3), padding='same', activation='relu'))
model.add(Conv2D(256, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
            optimizer=RMSprop(learning_rate=0.0001),
            metrics=['accuracy'])

with open(MODEL_SUMMARY_FILE,"w") as fh:
    model.summary(print_fn=lambda line: fh.write(line + "\n"))
# Data augmentation
training_data_generator = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True)
validation_data_generator = ImageDataGenerator(rescale=1./255)
test_data_generator = ImageDataGenerator(rescale=1./255)
# Data preparation
training_generator = training_data_generator.flow_from_directory(
    training_data_dir,
    target_size=(IMAGE_WIDTH, IMAGE_HEIGHT),
    batch_size=BATCH_SIZE,
    class_mode="binary")
validation_generator = validation_data_generator.flow_from_directory(
    validation_data_dir,
    target_size=(IMAGE_WIDTH, IMAGE_HEIGHT),
    batch_size=BATCH_SIZE,
    class_mode="binary")
test_generator = test_data_generator.flow_from_directory(
    test_data_dir,
    target_size=(IMAGE_WIDTH, IMAGE_HEIGHT),
    batch_size=1,
    class_mode="binary", 
    shuffle=False)
# Training
model.fit_generator(
    training_generator,
    steps_per_epoch=len(training_generator.filenames) // BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=validation_generator,
    validation_steps=len(validation_generator.filenames) // BATCH_SIZE,
    callbacks=[PlotLossesCallback(), CSVLogger(TRAINING_LOGS_FILE,
                                            append=False,
                                            separator=";")], 
    verbose=1)
model.save_weights(MODEL_FILE)
# Testing
probabilities = model.predict_generator(test_generator, TEST_SIZE)
for index, probability in enumerate(probabilities):
    image_path = test_data_dir + "/" +test_generator.filenames[index]
    img = mpimg.imread(image_path)
    plt.imshow(img)
    if probability[0] > 0.5:
        plt.title("%.2f" % (probability[0]*100) + "% dog")
    else:
        plt.title("%.2f" % ((1-probability[0])*100) + "% cat")
    plt.show()

Work with data in notebooks

Overview

Redivis notebooks provide a performant, flexible environment for analysis, allowing you to analyze and visualize data in workflows using Python, R, Stata, or SAS. With the notebook computation happening on Redivis, you don't need to configure an environment on a local machine or server, or export data from Redivis. This makes for easy iteration and collaboration, not to mention better security and data throughput.

Before working with a notebook you'll want to get started first by creating a workflow and adding data. You can then create a notebook off of any table in your workflow.

If you are working with very large tables (>10GB is a good rule of thumb), it's always a good idea to first reshape and reduce the data via transforms, since they can be significantly more performant for large data operations than running code in Python, R, Stata, or SAS.

1. Create a notebook

Once you have a table that you're ready to analyze, you can create a notebook by clicking the + Notebook button at any time. You'll need to name it and choose a kernel (Python, R, Stata, or SAS).

Notebooks can only reference tables within their workflow, so we recommend keeping all related work together in the same workflow.

Python

Python notebooks come pre-installed with a variety of common scientific packages for python. Learn more about working with python notebooks.

R

R notebooks come pre-installed with a variety of common scientific packages for R. Learn more about working with R notebooks.

Stata

Stata notebooks are based off of python notebooks, but offer affordances for moving data between Python and Stata. Learn more about working with Stata notebooks.

SAS

SAS notebooks are based off of python notebooks, but offer affordances for moving data between Python and SAS. Learn more about working with SAS notebooks.

2. Define dependencies

All notebooks come with a number of common packages pre-installed, depending on the notebook type. But if there is something specific you'd like to include, you can add versioned packages or write a pre-/post- install script by clicking the Edit dependencies button in the start modal or the toolbar.

Learn more in the Notebooks reference section.

3. Compute resources

The default notebook configuration is free, and provides access to 2 CPUs and 32GB working memory, alongside a 60GB (SSD) disk and gigabit network. The computational power of these default notebooks is comparable to most personal computers, and will be more than enough for many analyses.

If you're working with larger tables, creating an ML model, or performing other particularly intensive tasks, you may choose to configure additional compute resources for the notebook. This will cost an hourly rate to run based on your chosen environment, and require you to purchase compute credits on your account.

Clicking Edit compute configuration button in the start modal or the toolbar will allow you to choose from different preconfigured machine types. The notebook will then default to this compute configuration each time it starts up.

Learn more in the Compute resources reference section.

4. Start the notebook

Notebook nodes need to be started in order to edit or execute cells. When first clicking on a notebook node, you will see a read-only view of its contents (including cell outputs). Click the Start notebook button in the toolbar to connect this notebook to compute resources.

When you create a notebook for the first time it will start automatically.

5. Load data

To do meaningful work in your notebook, you'll want to bring the tabular and/or unstructured data that exists in your workflow into your notebook.

Referencing tables

Notebooks come pre-populated with templated code that pulls in data from the notebook's source table. You will need to run this cell to pull the data into the notebook, and you can see that it worked because this code will print a preview of the loaded data.

You can reference any other tables in this workflow by replicating this script and executing it with a different table reference. As a rule of thumb, notebooks will easily support interactive analysis of tables up to ~1GB; if your table is larger, try reducing it first by creating a transform, or make sure to familiarize yourself with the tools for working with larger tables in the notebook's programming language.

import redivis

# The source table of this notebook can always be referenced as "_source_"
table = redivis.table("_source_")

# Load table as a pandas dataframe. 
# Consult the documentation for more load options.
df = table.to_pandas_dataframe()

# We can also reference any other table in this workflow by name.
df2 = redivis.table("my_other_table").to_pandas_dataframe()

print(df)
print(df2)

See more examples in the Python notebooks reference.

# The source table of this notebook can always be referenced as "_source_"
redivis_table <- redivis$table("_source_")

# Load table as a tidyverse tibble. 
# Consult the documentation for more load options.
df <- redivis_table$to_tibble()

# We can also reference any other table in this workflow by name.
df2 <- redivis$table("my_other_table")$to_tibble()

print(df)
print(df2)

See more examples in the R notebooks reference.

# In order to load data into Stata, we first have to bring it into Python.
# This code loads the "_source_" table in the python variable `df`
# We can then pass this variable as our stata dataset.

import redivis

# The source table of this notebook can always be referenced as "_source_"
# Reference any other table in this workflow by name.
table = redivis.table("_source_")

df = table.to_pandas_dataframe(dtype_backend="numpy")
%%stata -d df -force
/*
# Use the %%stata magic to load our dataframe, specified by the -d parameter
# The -force flag replaces the the current working dataset in Stata

# The rest is just Stata code!
*/

describe

See more examples in the Stata notebooks reference.

import saspy
sas = saspy.SASsession(results='HTML')

# We first load the table via python, and then pass the dataframe into SAS
df = redivis.table("_source_").to_pandas_dataframe(dtype_backend="numpy")

sas_data = sas.df2sd(df, '_df') # second argument is the name of the table in SAS
sas_data.heatmap('msrp', 'horsepower')

See more examples in the SAS notebooks reference.

Referencing files

Any files with unstructured data stored in Redivis tables can be referenced by their globally unique file_id. You can also reference these file_id's in any derivative tables, allowing you to query and download specific subsets of files.

When working with large files, you'll want to consider saving the files to disk and/or working with the streaming interfaces to reduce memory overhead and improve performance.

redivis_file = redivis.file("rnwk-acs3famee.pVr4Gzq54L3S9pblMZTs5Q")

# Download the file
download_location = redivis_file.download("./my-downloads")
f = open(download_location, "r")

# Read the file into a variable
file_content = redivis_file.read(as_text=True)
print(file_content)

# Stream the file as bytes or text
with redivis_file.stream() as f:
  f.read(100) # read 100 bytes

with TextIOWrapper(redivis_file.stream()) as f:
  f.readline() # read first line
  
# We can also iterate over all files in a table
for redivis_file in redivis.table("_source_").list_files():
  pass # Do stuff with each file

See more examples in the Python notebooks reference.

redivis_file <- redivis$file("s335-8ey8zt7bx.qKmzpdttY2ZcaLB0wbRB7A")

# Download a file
redivis_file$download("/path/to/dir/", overwrite=TRUE)

# Read a file
data <- redivis_file$read(as_text=TRUE)
  
# Stream a file (callback gets called with each chunk)
data <- redivis_file$stream(function(x) {
  print(length(x))
})

# We can also iterate over all files in a table
for (redivis_file in redivis$table("_source_")$list_files()){
  # Do stuff with file
}

See more examples in the R notebooks reference.

6. Analyze data

At this point, you have all the tools you need to work with your data in your chosen language. The Python, R, Stata, and SAS ecosystems contain myriad tools and libraries for performing sophisticated data analysis and visualization.
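For instance, assuming the dataframe df loaded above and matplotlib (one of the common scientific packages pre-installed in Python notebooks), a quick exploratory look might be:

import matplotlib.pyplot as plt

# Summary statistics for all numeric columns
print(df.describe())

# Bar chart of the most frequent values in the first column (column choice is arbitrary here)
df[df.columns[0]].value_counts().head(20).plot(kind="bar")
plt.title("Most frequent values")
plt.show()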

The notebook interface is based off of Jupyter notebooks, and has similar capabilities. You can also export a read-only copy of your notebook as an .ipynb, PDF, or HTML file.

Learn more in the Notebooks reference section.

7. Create an output table

Notebooks can produce an output table, which you can sanity check and further analyze in your workflow by including in other notebooks or exporting to other systems.

# Read table into a pandas dataframe
df = redivis.table('_source_').to_pandas_dataframe()

# Perform various data manipulation actions
df2 = df.apply(some_processing_fn)

# Create an output table with the contents of this dataframe
redivis.current_notebook().create_output_table(df2)

See more examples in the Python notebooks reference.

# Read table into a tibble
tbl = redivis$table('_source_')$to_tibble()

# Perform various data manipulation actions
tbl2 = tbl %>% mutate(...)

# Create an output table with the contents of this dataframe
redivis$current_notebook()$create_output_table(tbl2)

See more examples in the R notebooks reference.

%%stata -doutd df2
/*
  Once this cell executes, the current dataset will be pushed 
  to the python variable df2
*/
rename v* newv*
# Then, in a subsequent Python cell, pass this dataframe to the output table
redivis.current_notebook().create_output_table(df2)

See more examples in the Stata notebooks reference.

# Convert a SAS table to a pandas dataframe
df = sas_table.to_df()

# Create an output table with the contents of this dataframe
redivis.current_notebook().create_output_table(df)

See more examples in the SAS notebooks reference.

Next steps

Share and collaborate

All Redivis notebooks support real-time collaboration, allowing multiple editors to edit and run cells in a running notebook. When another editor is active in a notebook, you will see a colored cursor associated with them (much like a Google Doc).

Share your workflow to work with collaborators in real time, and make it public so that others can fork off of and build upon your work.

Cite datasets in your publications

If the work you're doing leads to a publication, make sure to reference the dataset pages from datasets you've used for information from the data administrators on how to correctly cite it.

Reshape data in transforms

Overview

Transforming tables in a workflow is a crucial step when working with data on Redivis. Conceptually, transforms execute a query on a source table and results are materialized in a new output table. They are optimized to run on billions of records in seconds, and create a transparent, reproducible record of data transformation steps you've taken in your workflow.

In most cases you'll want to use transforms to create an output table containing all the information you're interested in before analyzing that table in a notebook or exporting it for further use.

1. Create a transform

Once you've created a workflow and added a dataset to it that you'd like to work with, get started by creating a Transform. You can do this by clicking on any dataset or table.

To build this transform, you will add steps which each take an action to shape the data in the output table. You can choose from many named steps which include a point and click interface that compile to SQL code, or you can add a SQL query step to write code directly. You can always view the code that your step is generating and switch to code if you'd like to edit it.

While SQL might not be a familiar language, it is optimized for data cleaning procedures and allows transforms to execute extremely quickly. It also allows you to write your data transformations in a declarative, reproducible manner.

2. Join additional tables

The first thing you'll want to do is decide if you have all the information you need in this table or if you'd like to join in an additional table or tables. You can reference any other table from either 1) this dataset, 2) another dataset, or 3) an output table from a different transform in this workflow. Any table you want to reference needs to be in this workflow, so you can click the Add dataset button to add a new dataset to the workflow if it's not here already.

To join a table, add a Join step and select the table you'd like to join. You'll then need to select what type of join it will be, and build your join condition. In most cases your join condition will be linking two variables of the same name and type together (e.g. join all records where id = id).

Learn more in the Joins reference section.

3. Create new variables

You might want to generate new columns in this table by creating a new variable. You can do so by adding a Create variables step. Start by giving this variable a name and then choosing what method you want to use to create it.

Some methods are restricted to the type of variable you are working with. For example there are options to add or subtract years from a date variable, or concatenate string variables.

One of the most common new variable methods is Case (if/else). This allows you to set up a statement that looks at the content of each record and evaluates it based on conditions you've set up to generate the value. For example you can say that in the new variable you're creating, if the amount in column A is greater than 1000, the value of this new variable will be set to "high" and if not, then it will be set to "low."

You can create any number of new variables, and they will execute sequentially, allowing you to reference variables you've created in other variables and subsequent sections.

Learn more in the Create variables reference section.

4. Filter records

You'll probably want to reduce the number of records in this table to exclude any that aren't relevant to the final output table you're creating. This will allow you to execute transforms quickly and get a better understanding of the number of relevant records you have.

To filter records, add a Filter step to your transform and start building the conditions that records will need to meet in order to stay in your output table. These statements can be nested and allow you to reference any variables or record values.

If you find yourself working with a standard list of values you're using in multiple places, this might also be a place to save time and enhance reproducibility by creating and referencing a value list.

Learn more in the Filters reference section.

5. Aggregate data

Depending on the structure of your data you might want to aggregate the data to collapse multiple records down to one, while preserving some information about what was dropped. Conceptually this might look like aggregating a table of charges from one record per charge into one record per person including a variable for the total charge per person.

To get started, add an Aggregate step and select the variables that you want to aggregate on. These will be the variables that exist in your output data after the aggregation is finished. All records that match exactly across these selected variables will be collapsed into a single record.

You can also capture information about records being dropped by creating a new aggregate variable. For example, maybe you are aggregating a table with multiple test scores per person down to a table with just one record per person. You can create an aggregate variable with the average test score, or the count of the number of tests each person took.

Learn more in the Aggregate reference section.

6. Select variables

Finally, before running your transform you'll always need to select which variables you want to keep in your output table. Perhaps you referenced a variable in this transform to create a new one, and now you don't need it anymore. Cutting variables means faster execution, and it is easy to add any variables back later and re-run the transform if you realize you need it, so in general try to keep this list as short as possible.

To propagate variables forward into your output table, they need to be in the right-hand box at the top of the transform labeled Keep. You can select any variable, or set of variables and click the > arrow button to move them over.

Learn more in the Variable selection reference section.

7. Run transform and sanity check output

When you're ready to execute the transform click the Run button in the top right of the toolbar.

If this button is disabled, it might be because the transform is invalid for some reason. Hovering on this button will give you more information, or you can look for the alert symbol (!) to see where you'll need to fix the transform in order to make it valid to run.

After you run a transform, you can investigate the downstream output table to get feedback on the success and validity of your querying operation – both the filtering criteria you've applied and the new features you've created.

Understanding the content of an output table allows you to perform important sanity checks at each step of your research process, answering questions like:

  • Did my filtering criteria remove the rows I expected?

  • Do my new variables contain the information I expect?

  • Does the distribution of values in a given variable make sense?

  • Have I dropped unnecessary variables?

To sanity check the contents of a table node, you can inspect the general table characteristics, check the summary statistics of different variables, look at the table's cells, or create a notebook for more in-depth analysis.
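For example, a minimal sanity check in a Python notebook on the transform's output table might look like this (the variable name is a placeholder for one you created):

import redivis

out = redivis.table("_source_").to_pandas_dataframe()

print(len(out))                                            # Did the filter keep the expected number of rows?
print(out["my_new_variable"].value_counts(dropna=False))   # Placeholder for a variable you created
print(out.describe())                                      # Distributions of numeric variables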

8. Make changes and re-run

If there are any issues with your output table, or if you decide to go in a different direction, or add another step, it is easy to go back to your transform and start making changes.

You'll notice that this transform is now yellow and so is its output table. These edited and stale state indications help you keep track of work you're doing in your workflow, and you can easily revert to the previous successfully executed transform state at any point.

Run this transform again to see changes in the output table. This guide describes all the steps you can take in a single transform but perhaps you want to do one step at a time and run it in between each step to sanity check your output. This system is designed for iteration so make as many changes as you want to experiment and build the output you want.

9. Create another transform

From here you can continue to create transforms on your output table, tables from the dataset you added, or add any other dataset on Redivis. You might want to fit as many steps as you can into one transform, or make a long chain to more easily track your work and communicate it to others.

As you build more transforms you'll see that sometimes actions you take create stale chains of transforms. You can easily make upstream changes in your workflow (such as upgrading a dataset from the sample to the full dataset, or updating to a new version) and then run all transforms in your workflow with one click (a Run all option is available in the Map button menu).

Next steps

Work with data in a notebook

You can use notebooks in a workflow to analyze data using Python, R, Stata, or SAS. These notebooks run in the browser with no additional configuration and seamless sharing with collaborators.

Learn more in the Work with data in notebooks guide.

Export data

If you'd like to export data to a different system, you can download it in various file formats, reference it programmatically in Python / R, or visualize in tools such as Google Data Studio.

Learn more in the Export to other environments guide.

Upload your own datasets

Augment your data analysis in Redivis by uploading your own datasets, with the option to share with your collaborators (or even the broader research community).

Learn more in the Create and populate a dataset guide.

Share and collaborate

Redivis workflows are built for collaboration and include real-time visuals to see where collaborators with edit access are in the workflow, and a comments interface to discuss changes asynchronously.

Share your workflow to work with collaborators in real time, and make it public so that others can fork off of and build upon your work.


Workflow concepts

Overview

Workflows are used to analyze any type of data on Redivis, at any scale. They allow you to organize your analysis into discrete steps, where you can easily validate your results and develop well-documented, reproducible analyses.

Workflows are owned by either a user or an organization, and can be shared with other users and organizations.

Creating a workflow

To create a workflow, navigate to a dataset that you're interested in working with and press the Analyze data in a workflow button. If you do not have data access to this dataset you may need to apply for access first.

You can also create a workflow from the "Workflows" tab of your workspace, or from the administrator panel of an organization (in this latter case, the workflow will be "owned" by the organization and its administrators).

Once you've created your workflow, you'll be able to add any dataset or workflow that you have access to as a data source.

The workflow page

The workflow page consists of a top title bar, a left panel, and a right panel.

The left panel displays the workflow tree, allowing you to visualize how data is moving through the workflow and its nodes.

The right panel shows the contents of the currently selected node in the tree. If no node is selected, this panel will display the workflow's documentation. You can click on the workflow title, or empty space in the workflow tree, to return to the workflow documentation at any time.

The title bar provides an entry point to common actions, broken into two sections: the left section contains actions that are global to the workflow, while the right section contains actions relevant to the currently selected node (e.g., running a transform).

If no node is selected, information about the workflow will be populated in the right panel:

Overview

The workflow overview contains various metadata, provenance information, and narrative about the workflow.

Data sources

A filterable list of all Data sources within the workflow. Clicking on an item will navigate to the corresponding data source in the workflow tree.

Tables

A filterable list of all Tables within the workflow. Clicking on an item will navigate to the corresponding table in the workflow tree.

Transforms

A filterable list of all Transforms within the workflow. Clicking on an item will navigate to the corresponding transform in the workflow tree.

Notebooks

A filterable list of all Notebooks within the workflow. Clicking on an item will navigate to the corresponding notebook in the workflow tree.

The workflow tree

The workflow "tree" is represented visually in the left pain of the workflow. This tree is made up of a collection of nodes, with each node having various inputs and outputs, such that the output (result) of one node can serve as the input of another.

Data in the tree flows from the top to bottom, and circular relationships are not allowed. Formally, this is known as a "Directed Acyclic Graph" (DAG).

Clicking on a node within the tree will display that node's contents within the right pane of the workflow, while highlighting the ancestors and descendants of that node on the tree.

You can right-click on any node for a list of other options, or if preferred, click on the node and then click the three-dot "More" menu at the top-right.

Workflow nodes

The workflow tree is made up of the following node types:

Data sources represent datasets or workflows that have been added to your workflow, and are the mechanism for bringing data into your workflow.

Tables are either tables associated with a data source, or the resulting output table of a transform or notebook.

Transforms are queries that are used to reshape and combine data, always creating a single table as an output.

Notebooks are flexible, interactive programming environments, which can optionally produce a table as an output.

Building a workflow

The main way to build your workflow is to add and edit nodes. You will start by adding data to your workflow, and then create a series of additional nodes that reshape and analyze the data.

Add data to a workflow

You can click the Add data button in the top left corner of the workflow toolbar to select a dataset or another workflow you want to add to this workflow. This will add a copy of the selected data source to the workflow and allow you to reference its tables.

Each data source can only be added to a workflow one time. By default, datasets are added at their current version, but you can right click on the dataset in this modal to select a different version to add.

Reshape and analyze data

All data cutting, reshaping, and analysis on Redivis happens in either a transform or a notebook. These nodes must be attached to a source table, so they can only be created after you've added a data source.

To create a transform or notebook, click on a table and select either the transform or notebook button that appears beneath it. If the table already has a downstream node you can press the plus icon beneath it instead.

Transforms vs. notebooks?

There are two mechanisms for working with data in workflows: transforms and notebooks. Understanding when to use each tool is key to taking full advantage of the capabilities of Redivis, particularly when working with big datasets.

Transforms are better for:

  • Reshaping + combining tabular and geospatial data

  • Working with large tables, especially at the many GB to TB scale

  • Preference for a no-code interface, or preference for programming in SQL

  • Declarative, easily documented data operations

Notebooks are better for:

  • Interactive exploration of any data type, including unstructured data files

  • Working with smaller tables (though working with bigger data is possible)

  • Preference for Python, R, Stata, or SAS

  • Interactive visualizations and figure generation

Copy and paste nodes

You can right click on any transform or notebook in the workflow tree to copy it. Once you've copied a node, you can right click on any table to paste the copied transform or notebook.

Insert nodes

If you would like to place a copied transform or notebook between other nodes, you can click on a transform or notebook and select Insert transform.

If you have a transform copied to the clipboard you can insert it between other nodes by right clicking on a transform or notebook and selecting Paste copied transform above. This will insert both the transform and its output table into the branch of the workflow you've selected.

Split and combine transforms

All transforms can be split at the step level into two different transforms by clicking Split in any step's menu. Additionally, two adjacent transforms can be combined into one by right clicking on the table between them and selecting Remove.

You might want to split a transform above a tricky step to see what the output table would look like at that point in the process. This can be a key tool in troubleshooting any issues and understanding what might be going wrong.

After splitting a transform to check an output table, the next logical step might be to combine these two transforms back into one again. Or perhaps you have a string of transforms which you no longer need the output tables for and want to reduce the size of your workflow.

Delete nodes

To delete a node, right click on the node and select Delete. Tables cannot be deleted directly, but are rather deleted when their parent node is deleted.

When deleting a transform or notebook:

  • The transform or notebook and its output table will be deleted.

  • If the workflow tree has additional nodes downstream, the transform or notebook and its output table will be 'spliced' out, i.e. the upstream node nearest the deleted transform will be connected to the downstream node nearest to the deleted output table.

When deleting a data source:

  • The data source and all directly downstream nodes will be deleted. If additional branches are joined into the branch downstream of the deleted dataset, those branches will be retained up to but not including the transform located in the deleted branch.

Since you can't undo a deletion, you'll receive a warning message before proceeding.

Node states

As you build out a workflow, node colors and symbols will change on the tree to help you keep track of your work progress.

Detailed information about each of these states can be found in the documentation for each node, though some common states are outlined here.

Stale nodes

Stale nodes are indicated with a yellow background. If a node is stale, it means that its upstream content has changed since the node was last run, and likely that the node should be re-run to reflect these upstream changes.

Edited nodes

If a node has been edited since it was last run, it will be indicated with hashed vertical lines.

Tree-level actions

Run all

You can select the Map button on the left side of the workflow toolbar to begin a run of all stale nodes in the workflow. This will execute all transform and notebook nodes in a logical sequence to update the workflow completely.

Shift nodes

To shift a node, select the node and click the arrow that appears next to most nodes when selected. Shifting nodes is purely an organizational tool and it has no effect on the data produced in the workflow.

Navigate nodes

Along with clicking a node to select it, all nodes have a Source and Output navigation button on the right side of the workflow toolbar. You can click this button to jump directly to the immediate upstream or downstream node.

Saving

Workflows are continuously saved as you work, and any analyses will continue to run in the background if you close your browser window. You can always navigate back to this workflow later from the "Workflows" tab of your workspace.

A complete version history is available for all transforms and notebooks in your workflow, allowing you to access historic code and revert your work back to a previous point in time.

Workflow ownership and sharing

All workflows are owned by either a user or an organization, and can then be shared with other users and organizations. When a workflow is owned by or shared with an organization, all administrators of that organization will have corresponding access.

Workflows can also be associated with a study, which may be necessary if access to certain datasets in the workflow was granted to that study. In this case, you can specify a level of access to the workflow for other collaborators on the study.

Workflow collaborators will still need access to the underlying data in a workflow to view node contents. For more information, see Workflow access & sharing.

Forking workflows

You have a couple options for how this workflow can be reused in other analyses. Click the Fork button in the toolbar to get started.

  • Add to another workflow

    • Select this option to choose a workflow you'd like to add this workflow to, as a data source. This will be a linked copy that will update as the original workflow is updated. This can be a very powerful tool in building out complex workflows that all reference the same source analysis or cohort.

  • Clone this workflow

    • This will create a duplicate copy of the workflow, with a link back to the original workflow encoded in its provenance information.

Collaboration

Workflows are made to be a space for collaborative analysis. You can easily share access to your workflow with colleagues.

Comments

Any node in a workflow can be annotated with a comment by workflow collaborators. Comments are intended to be a space for conversation grounded in a specific area of a workflow. They can be replied to in threads by multiple collaborators and resolved when the conversation is complete.

Simultaneous editors

Multiple users with edit access can work on a workflow at the same time. When this is the case, you will see their picture in the top menu bar alongside your own, and a colored dot on the workflow tree to the right of the node they currently have selected. When a notebook is started you will see any collaborator's code edits in real time.

Workflow DOIs

Any workflow editor can issue a DOI (Digital Object Identifier) for a workflow. A DOI is a persistent identifier that can be used to reference this workflow in citations. DOIs are issued through DataCite and do not require any configuration with your own or your organization's DataCite account.

Open the Provenance section and click Issue DOI. Once created, you will be able to see the DOI and view the record on DataCite.

Draft status

When DOIs are issued they enter a "Draft" status where the identifier is assigned but it has not been permanently created. All DOIs issued for workflows will remain in this draft status for seven days to allow for removal of the DOI.

You can start referencing the DOI immediately while it is still in draft status, since the final DOI will not change once it becomes permanent. After the seven day draft period, the DOI will automatically become permanent if your workflow is set to be publicly visible.

Since DOIs are intended for public reference, they will not be issued for workflows that remain fully private.

Note that granting public access to a workflow does not grant access to any restricted data it contains. Any individual viewing the workflow will need to also gain data access to see workflow nodes that reference restricted data.

Reproducibility and change management

Every time a transform or notebook is run in your workflow, a snapshot of the code in that node is permanently saved. On any transform or notebook, you will see a "History" button that will bring up all of the previous executions of that node, with the ability to view its historic contents and revert to a previous version of the code. This historic code will also be associated with the corresponding log entry in your workspace.

While the tables within a workflow should be considered "live" in that their data can regularly change as upstream nodes are modified, the ability to permanently persist code (alongside the built-in version control for datasets) ensures that any historic output can be reproduced by simply re-running the historic code that produced it.

Date

Current date

Returns the current date. Return type: date.

New date

Constructs a date from a year, month, and day. Return type: date.

Date add

Add a period of time to a date. Return type: date.

Date diff

Calculate the distance between two dates. Return type: integer.

Date subtract

Subtract a period of time from a date. Return type: date.

Date truncate

Truncates a date to the nearest boundary. Return type: date.

Date extract

Extracts the date part (e.g., month) from a date. Return type: integer.

Format date

Returns a formatted string from a date. Return type: string. More details about format strings are available in the linked documentation.

Parse date

Parses a date from a string. Return type: date. More details about format strings are available in the linked documentation.

The signature and parameters for each function are listed below.

CURRENT_DATE([@time_zone])

  • @time_zone (enum; optional; defaults to UTC):

any of: ACDT, ACST, ACT, ACT, ACWST, ADT, AEDT, AEST, AFT, AKDT, AKST, AMST, AMT, AMT, ART, AST, AST, AWST, AZOST, AZOT, AZT, BDT, BIOT, BIT, BOT, BRST, BRT, BST, BST, BST, BTT, CAT, CCT, CDT, CDT, CEST, CET, CHADT, CHAST, CHOT, CHOST, CHST, CHUT, CIST, CIT, CKT, CLST, CLT, COST, COT, CST, CST, CST, CT, CVT, CWST, CXT, DAVT, DDUT, DFT, EASST, EAST, EAT, ECT, ECT, EDT, EEST, EET, EGST, EGT, EIT, EST, FET, FJT, FKST, FKT, FNT, GALT, GAMT, GET, GFT, GILT, GIT, GMT, GST, GST, GYT, HDT, HAEC, HST, HKT, HMT, HOVST, HOVT, ICT, IDLW, IDT, IOT, IRDT, IRKT, IRST, IST, IST, IST, JST, KALT, KGT, KOST, KRAT, KST, LHST, LHST, LINT, MAGT, MART, MAWT, MDT, MET, MEST, MHT, MIST, MIT, MMT, MSK, MST, MST, MUT, MVT, MYT, NCT, NDT, NFT, NPT, NST, NT, NUT, NZDT, NZST, OMST, ORAT, PDT, PET, PETT, PGT, PHOT, PHT, PKT, PMDT, PMST, PONT, PST, PST, PYST, PYT, RET, ROTT, SAKT, SAMT, SAST, SBT, SCT, SDT, SGT, SLST, SRET, SRT, SST, SST, SYOT, TAHT, THA, TFT, TJT, TKT, TLT, TMT, TRT, TOT, TVT, ULAST, ULAT, UTC, UYST, UYT, UZT, VET, VLAT, VOLT, VOST, VUT, WAKT, WAST, WAT, WEST, WET, WIT, WST, YAKT, YEKT

DATE(@year, @month, @day)

  • @year (variable or literal; required): any integer
  • @month (variable or literal; required): any integer
  • @day (variable or literal; required): any integer

DATE_ADD(@date_expression, INTERVAL @integer_expression @date_part)

  • @integer_expression (variable or literal; required): any integer
  • @date_part (enum; required; placeholder: e.g., days): any of year, month, quarter, week, day
  • @date_expression (variable or literal; required): any date

DATE_DIFF(@date_expression, @date_expression_2, @date_part)

  • @date_expression_2 (variable or literal; required): any date
  • @date_expression (variable or literal; required): any date
  • @date_part (enum; required; placeholder: e.g., days): any of day, week, month, quarter, year

DATE_SUB(@date_expression, INTERVAL @integer_expression @date_part)

  • @integer_expression (variable or literal; required): any integer
  • @date_part (enum; required; placeholder: e.g., days): any of day, week, month, quarter, year
  • @date_expression (variable or literal; required): any date

DATE_TRUNC(@date_expression, @date_truncate_part)

  • @date_expression (variable or literal; required): any date
  • @date_truncate_part (enum; required; placeholder: e.g., month): any of day, month, quarter, year, week(sunday), week(monday), week(tuesday), week(wednesday), week(thursday), week(friday), week(saturday), ISOquarter, ISOyear

EXTRACT(@date_part FROM @date_expression)

  • @date_part (enum; required; placeholder: e.g., month): any of DAYOFWEEK, DAY, DAYOFYEAR, WEEK, MONTH, QUARTER, YEAR, ISOWEEK, ISOYEAR
  • @date_expression (variable or literal; required): any date, dateTime, time

FORMAT_DATE(@format_string, @date_expression)

  • @date_expression (variable or literal; required): any date
  • @format_string (literal; required; placeholder: e.g., %Y-%m-%d): any string

[@safe]PARSE_DATE(@format_string, @date_string)

  • @date_string (variable or literal; required): any string
  • @format_string (literal; required; placeholder: e.g., %Y-%m-%d): any string
  • @safe (boolean; required): any boolean
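For orientation, here are a few illustrative expressions built from the functions above. The literal values are examples only, and the results shown assume the BigQuery-style semantics implied by these signatures:

DATE(2020, 4, 17)                                   -- 2020-04-17
DATE_ADD(DATE(2020, 1, 15), INTERVAL 1 month)       -- 2020-02-15
DATE_SUB(DATE(2020, 1, 15), INTERVAL 2 week)        -- 2020-01-01
DATE_DIFF(DATE(2020, 6, 1), DATE(2020, 4, 1), day)  -- 61
DATE_TRUNC(DATE(2020, 4, 17), month)                -- 2020-04-01
EXTRACT(MONTH FROM DATE(2020, 4, 17))               -- 4
FORMAT_DATE("%Y-%m-%d", DATE(2020, 4, 17))          -- "2020-04-17"
PARSE_DATE("%Y-%m-%d", "2020-04-17")                -- 2020-04-17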


Abstract

The abstract is limited to 256 characters and will show up in previews and search results for the dataset. This should be a concise, high-level summary of this dataset.

Provenance

View and update information about this workflow's creators and contributors, citation, and any related identifiers detected by Redivis automatically or added by a workflow editor. This is also where you can issue a DOI for this workflow.

Provenance

→ Creators

This section automatically defaults to displaying the owner of the workflow. Workflow editors can add or remove anyone from this list. Anyone included here will be added to the citation generated for this workflow.

Provenance

→ Contributors

This section automatically includes anyone who edited this workflow. Workflow editors can add or remove anyone from this list.

Provenance

→ Citation

This section shows the automatically generated citation for this workflow in your chosen format. This can be copied or downloaded for use elsewhere.

Changes made to the "Creators" field will be reflected in this citation. Any DOI issued for this workflow will automatically be included in this citation.

Provenance

→ Related identifiers

This section automatically includes any datasets or workflows referenced by this workflow, including data sources, study collaborations, or the workflow this one was forked from. Workflow editors can add related identifiers from outside of Redivis through links or DOIs, including DMPs, referenced papers, and more.

Provenance

→ Bibliography

You can launch a bibliography which displays the citation of this workflow and all of its related identifiers.

Methodology

Document the details of your research aim or data analysis strategy. You can also embed links or images.

Sharing

You can give other users access to view or edit your workflow, or transfer ownership to another user in the Sharing section. You can also set this workflow's visibility and discoverability. Anyone viewing your workflow will need to have gained data access to any restricted datasets to view the relevant node contents.

Study

You can add your workflow to a study in order to facilitate collaboration with others. For certain restricted datasets, your workflow will need to be part of an approved study in order to run queries.

Tags

You can add up to 25 tags to your workflow, which will help researchers discover and understand it.

Usage

You can see the date of the last workflow activity, and how often the workflow has been forked or viewed (if it is a public workflow).

Tags

In addition to documentation, you may add up to 25 tags to your dataset, which will help researchers discover and understand the dataset.

Other metadata

Additionally, information about the dataset's size and temporal range will be automatically computed from the metadata on its tables. Additional table documentation, as well as the variable metadata, will be indexed and surfaced as part of the dataset discovery process.


DateTime

Current DateTime

Returns the current dateTime. Return type: dateTime.

New DateTime

Constructs a dateTime from a year, month, day, hour, minute, and second. Return type: dateTime.

DateTime add

Add a period of time to a dateTime. Return type: dateTime.

DateTime diff

Calculate the distance between two dateTimes. Return type: integer.

DateTime subtract

Subtract a period of time from a dateTime. Return type: dateTime.

DateTime truncate

Truncates a dateTime to the nearest boundary. Return type: dateTime.

DateTime extract

Extracts the date or time part (e.g., hour) from a dateTime. Return type: integer.

Format dateTime

Returns a formatted string from a dateTime. Return type: string. More details about format strings are available in the linked documentation.

Parse dateTime

Parses a dateTime from a string. Return type: dateTime. More details about format strings are available in the linked documentation.

The signature and parameters for each function are listed further below.

Notebook concepts

Overview

Notebooks provide a highly flexible compute environment for working with data on Redivis. In a notebook, you can reference any table in your workflow, install dependencies, perform analyses in Python, R, Stata, or SAS, store and download files, and generate an output table for downstream analysis.

Transforms vs. notebooks?

There are two mechanisms for working with data in workflows: transforms and notebooks. Understanding when to use each tool is key to taking full advantage of the capabilities of Redivis, particularly when working with big datasets.

Transforms are better for:

  • Reshaping + combining tabular and geospatial data

  • Working with large tables, especially at the many GB to TB scale

  • Preference for a no-code interface, or preference for programming in SQL

  • Declarative, easily documented data operations

Notebooks are better for:

  • Interactive exploration of any data type, including unstructured data files

  • Working with smaller tables (though working with bigger data is possible)

  • Preference for Python, R, Stata, or SAS

  • Interactive visualizations and figure generation

Working with data

Loading data

From within your notebook, you can load any data available in your workflow. You can reference the primary source table of the notebook via the special _source_ identifier, or reference any other table in the workflow by its name. To ensure that your notebook doesn't break when tables get renamed, make sure to use the qualified reference for non-primary tables. For example:

Analyzing data

Redivis notebooks support the following kernels (programming languages). For more details and examples on how to use notebooks in each language, consult the language-specific documentation:

  • Python notebooks
  • R notebooks
  • Stata notebooks
  • SAS notebooks

Outputting tables

A notebook can generate an output table as the result of its execution. This output table is created programmatically, e.g.:

Storing files

As you perform your analysis, you may generate files that are stored on the notebook's hard disk. There are two locations that you should write files to: /out for persistent storage, and /scratch for temporary storage.

Any files written to persistent storage will be available when the notebook is stopped, and will be restored to the same state when the notebook is run again. Alternatively, any files written to temporary storage will only exist for the duration of the current notebook session.

To write files to these directories, use the standard tools of your programming language for writing files. E.g.,:

You can inspect and download these files at any time.

Notebook management

Creation

Create a notebook by clicking on a table node in a workflow and selecting + Notebook. This table will become the default source table for your new notebook, and the notebook will include pre-generated code that references the table's data.

Starting and stopping

Notebook nodes need to be started in order to edit or execute cells. Click the purple Start notebook button in the top right to start the notebook and provision compute resources. You can also elect to "Clear outputs and start", which will remove all outputs and reset any referenced tables in the notebook.

Run notebooks in the background

When starting a notebook, you can select the option to "Run in background". This will run the notebook similarly to a transform: all code is executed in series, and the notebook stops once all cells have run or an error occurs. This can be helpful for quickly re-running notebooks after upstream changes have been made.

Server-side execution (alpha)

When starting a notebook, you will see an option to enable server-side execution. This allows the Jupyter notebook to keep receiving outputs even if your browser is closed or disconnected, which can be particularly helpful for long-running operations. This is a new feature within Jupyter notebooks, and still has a few rough edges (e.g., progress bars often don't display), so it is currently an opt-in feature. In the future, this will become the default behavior for all notebooks.

Compute configuration

By default, notebooks are provisioned with 32GB memory and 2 CPU cores, with compute power comparable to most personal computers. You can view and alter the notebook's compute resources in the More menu.

Persistence

All notebooks are automatically saved as you go. Every time a notebook is stopped, all cell inputs are saved to the notebook version history, giving you a historical record of all code that was run. Additionally, all cell outputs from the last notebook session will be preserved, as will any files written to the /out directory.

Clearing outputs

When starting a notebook, you'll be presented with the option to "Clear all outputs and start". This can be helpful because it resets all access rules associated with the notebook, since no data from previous sessions remains associated with it.

Choosing this option will clear all output cells in your notebook, any files saved in the /out directory, and any output tables from the notebook.

Logs

You can click the three-dot More menu to open the logs for this notebook. Opening the logs when a notebook is stopped will show the logs from the notebook's previous run.

Lifecycle

The default notebooks have a maximum lifetime of 6 hours, and after 30 minutes of inactivity a running notebook will automatically be stopped. If you are using a notebook with paid custom compute, these values can be modified.

Activity is determined based on the Jupyter kernel; if you have a long-running computation, the notebook will be considered active for the entire time.

Collaboration

All Redivis notebooks support real-time collaboration, allowing multiple editors to edit and run cells in a running notebook. When another editor is active in a notebook, you will see a colored cursor associated with them. Workflow viewers will see a read-only version of the notebook.

Changing the source table

To change a notebook's primary source table, either right-click on the notebook or click the three-dot (ⵗ) icon and select the "change source table" option.

Limitations

Notebooks are subject to certain concurrency and duration limits.

Dependencies

All notebooks come with a number of common packages pre-installed. You can install additional packages by clicking the Edit dependencies button in the notebook start modal or toolbar.

For more detailed information about the default dependencies and adding new packages, consult the documentation for your notebook type:

  • Python base image & dependencies
  • R base image & dependencies
  • Stata base image & dependencies
  • SAS base image & dependencies

For notebooks that reference restricted data, internet access will be disabled while the notebook is running. This means that the dependencies interface is the only place from which you can install dependencies – e.g., running pip install for Python or devtools::install() for R within your notebook will fail.

Moreover, it is strongly recommended to always install your dependencies through the dependencies interface (regardless of whether your notebook has internet access), as this provides better reproducibility and documentation for future use.

Secrets

Secrets are simple key/value pairs that are securely stored within an organization or under your account. These secrets can then be loaded in a notebook – a common use case is for storing external API tokens that then enable you to interface with these APIs from within your notebook.

Secrets are accessed via the Python or R client libraries:

Files

Notebooks offer special capabilities for files written to specific directories on the notebook's hard disk. Any files you've stored in a notebook's /out and /scratch directories will be available in the files modal. This modal allows you to preview and download specific file outputs from your notebook.

Moreover, files written to the /out directory are always available, and will persist across notebook sessions. This allows for workflows where you can cache certain results between notebook sessions, avoiding the need to rerun time-intensive computations.

The files in the /scratch directory are only available when the notebook is running, and will be cleared once it is stopped. The default "working directory" of all notebooks is /scratch – this is where files will be written if you do not specify another location.

You can view the files in either directory by pressing the Files button at the top right of the notebook.

You can list the files in either directory by pressing the corresponding tab, and click on any file to view it. Redivis supports interactive previews for many file types in the file inspector, and you can also download the file for further inspection and analysis. To download all files in a directory, click the Download all button in the files modal.

Version history

Every time you stop your notebook, all cell inputs (your code and markdown) will be saved and associated with that notebook session. You can view the code from all previous sessions by pressing the History button at the top right of your notebook, allowing you to view and share your code as it was at any previous point in time.

Access rules

Determining notebook access

Your access to a notebook is determined by your corresponding access to all tables (and their antecedent datasets) referenced by the notebook. These linkages persist across notebook sessions, as a future session could reference data from a previous session. In order to reset the tables referenced by your notebook, which will also clear all outputs in the notebook, you can choose to Clear outputs and start when starting the notebook.

Access levels

In order to view a notebook, you must first have view access to the corresponding workflow, and in order to run and edit the notebook, you must also have edit access to that workflow.

Additionally, your access to a notebook is governed by your access to its source tables. In order to run a notebook and see its outputs, you must have data access to all source tables. If you have metadata access, you will be able to see cell inputs in a notebook (that is, the code), but not outputs. If you only have overview (or no) access to the source tables, you will not be able to see notebook contents.

External internet access

If a notebook contains data with export restrictions, access to the external internet will be disabled while the notebook is running.

When the internet is disabled in a notebook you can still specify packages and other startup scripts in the Dependencies modal that will be installed on notebook start. Additionally, if any of your packages require internet access to run, you'll need to attempt to "preload" any content using a post-install script. For example, if you're using the tidycensus package in R, you could preload content as follows:

Downloading files

Typically, you will be able to download any files written to the notebook's /out or /scratch directories. However, if a notebook references data with export restrictions, you will not be able to download these files unless the file size is smaller than the relevant size-based export restrictions specified on the source datasets.

Exporting notebooks

Notebooks can be downloaded as PDF, HTML, and .ipynb files by clicking the three-dot More button at the top right of the notebook.

You will be given the option of whether to include cell outputs in your export — it is important that you ensure the outputs displayed in your notebook do not contain sensitive data, and that your subsequent distribution is in compliance with any data use agreements.

DateTime function signatures and parameters

CURRENT_DATETIME([@time_zone])

  • @time_zone (enum; optional; defaults to UTC):

any of: ACDT, ACST, ACT, ACT, ACWST, ADT, AEDT, AEST, AFT, AKDT, AKST, AMST, AMT, AMT, ART, AST, AST, AWST, AZOST, AZOT, AZT, BDT, BIOT, BIT, BOT, BRST, BRT, BST, BST, BST, BTT, CAT, CCT, CDT, CDT, CEST, CET, CHADT, CHAST, CHOT, CHOST, CHST, CHUT, CIST, CIT, CKT, CLST, CLT, COST, COT, CST, CST, CST, CT, CVT, CWST, CXT, DAVT, DDUT, DFT, EASST, EAST, EAT, ECT, ECT, EDT, EEST, EET, EGST, EGT, EIT, EST, FET, FJT, FKST, FKT, FNT, GALT, GAMT, GET, GFT, GILT, GIT, GMT, GST, GST, GYT, HDT, HAEC, HST, HKT, HMT, HOVST, HOVT, ICT, IDLW, IDT, IOT, IRDT, IRKT, IRST, IST, IST, IST, JST, KALT, KGT, KOST, KRAT, KST, LHST, LHST, LINT, MAGT, MART, MAWT, MDT, MET, MEST, MHT, MIST, MIT, MMT, MSK, MST, MST, MUT, MVT, MYT, NCT, NDT, NFT, NPT, NST, NT, NUT, NZDT, NZST, OMST, ORAT, PDT, PET, PETT, PGT, PHOT, PHT, PKT, PMDT, PMST, PONT, PST, PST, PYST, PYT, RET, ROTT, SAKT, SAMT, SAST, SBT, SCT, SDT, SGT, SLST, SRET, SRT, SST, SST, SYOT, TAHT, THA, TFT, TJT, TKT, TLT, TMT, TRT, TOT, TVT, ULAST, ULAT, UTC, UYST, UYT, UZT, VET, VLAT, VOLT, VOST, VUT, WAKT, WAST, WAT, WEST, WET, WIT, WST, YAKT, YEKT

DATETIME(@year, @month, @day, @hour, @minute, @second)

  • @year (variable or literal; required): any integer
  • @month (variable or literal; required): any integer
  • @day (variable or literal; required): any integer
  • @hour (variable or literal; required): any integer
  • @minute (variable or literal; required): any integer
  • @second (variable or literal; required): any integer

DATETIME_ADD(@dateTime_expression, INTERVAL @integer_expression @dateTime_part)

  • @integer_expression (variable or literal; required): any integer
  • @dateTime_part (enum; required; placeholder: e.g., days): any of year, quarter, month, week, day, hour, minute, second, millisecond, microsecond
  • @dateTime_expression (variable or literal; required): any dateTime

DATETIME_DIFF(@dateTime_expression, @dateTime_expression_2, @dateTime_part)

  • @dateTime_expression_2 (variable or literal; required): any dateTime
  • @dateTime_expression (variable or literal; required): any dateTime
  • @dateTime_part (enum; required; placeholder: e.g., days): any of year, quarter, month, week, day, hour, minute, second, millisecond, microsecond

DATETIME_SUB(@dateTime_expression, INTERVAL @integer_expression @dateTime_part)

  • @integer_expression (variable or literal; required): any integer
  • @dateTime_part (enum; required; placeholder: e.g., days): any of year, quarter, month, week, day, hour, minute, second, millisecond, microsecond
  • @dateTime_expression (variable or literal; required): any dateTime

DATETIME_TRUNC(@dateTime_expression, @dateTime_truncate_part)

  • @dateTime_expression (variable or literal; required): any dateTime
  • @dateTime_truncate_part (enum; required; placeholder: e.g., minutes): any of hour, minute, second, millisecond, microsecond, day, month, quarter, year, week(sunday), week(monday), week(tuesday), week(wednesday), week(thursday), week(friday), week(saturday), ISOquarter, ISOyear

EXTRACT(@dateTime_part FROM @date_expression)

  • @dateTime_part (enum; required; placeholder: e.g., hour): any of YEAR, QUARTER, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND, DAYOFYEAR, DAYOFWEEK, ISOYEAR, ISOWEEK
  • @date_expression (variable or literal; required): any dateTime

FORMAT_DATETIME(@format_string, @dateTime_expression)

  • @dateTime_expression (variable or literal; required): any dateTime
  • @format_string (literal; required; placeholder: e.g., %Y-%m-%d %H:%M:%S): any string

[@safe]PARSE_DATETIME(@format_string, @dateTime_string)

  • @dateTime_string (variable or literal; required): any string
  • @format_string (literal; required; placeholder: e.g., %Y-%m-%d %H:%M:%S): any string
  • @safe (boolean; required): any boolean
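As with the date functions, here are a few illustrative dateTime expressions (example values only; results assume the BigQuery-style semantics implied by these signatures):

DATETIME(2020, 4, 17, 14, 35, 50)                                                       -- 2020-04-17 14:35:50
DATETIME_ADD(DATETIME(2020, 4, 17, 14, 35, 50), INTERVAL 30 minute)                     -- 2020-04-17 15:05:50
DATETIME_DIFF(DATETIME(2020, 4, 1, 12, 0, 0), DATETIME(2020, 4, 1, 9, 30, 0), minute)   -- 150
DATETIME_TRUNC(DATETIME(2020, 4, 17, 14, 35, 50), hour)                                 -- 2020-04-17 14:00:00
FORMAT_DATETIME("%Y-%m-%d %H:%M:%S", DATETIME(2020, 4, 17, 14, 35, 50))                 -- "2020-04-17 14:35:50"
PARSE_DATETIME("%Y-%m-%d %H:%M:%S", "2020-04-17 14:35:50")                              -- 2020-04-17 14:35:50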

# Reference the source table with the special "_source_" identifier:
df = redivis.table("_source_").to_pandas_dataframe()

# Reference any other table via its name:
# The last 4 characters are the reference id. This is optional, 
#     but recommended to ensure the notebook works as tables get renamed.
df2 = redivis.table("daily_observations:vdwn").to_pandas_dataframe()

# If our table is a file index table, we can load and process those files
for f in redivis.table("_source_").list_files():
    data = f.read()
# Reference the source table with the special "_source_" identifier:
df <- redivis$table("_source_")$to_tibble()

# Reference any other table via its name:
# The last 4 characters are the reference id. This is optional, 
#     but recommended to ensure the notebook works as tables get renamed.
df2 <- redivis$table("daily_observations:vdwn")$to_tibble()

# If our table is a file index table, we can load and process those files
for (f in redivis$table("_source_")$list_files()){
    data <- f$read()
}
# We first load the table via python, and then pass the dataframe into stata
df = redivis.table("_source_").to_pandas_dataframe(dtype_backend="numpy")
%%stata -d df -force
/* Run stata code! All stata cells must be prefixed with %%stata */
describe
import saspy
import redivis

sas_session = saspy.SASsession()

# We first load the table via python, and then pass it into SAS
df = redivis.table("_source_").to_pandas_dataframe(dtype_backend="numpy")

# Load the table into SAS, giving it the name "df"
sas_data = sas_session.df2sd(df, table="df")
%%SAS sas_session
/* 
    Run SAS code! All SAS cells must be prefixed with %%SAS,
    and reference the sas_session variable
*/
proc print data=df(obs=5);
run;
# Multiple types for "df" are supported
# Consult the language-specific docs for more info
df = get_dataframe_somehow()

redivis.current_notebook().create_output_table(df)
# Multiple types for "df" are supported
# Consult the language-specific docs for more info
df <- get_dataframe_somehow()

redivis$current_notebook()$create_output_table(df)
%%stata -doutd df2
/*
  Once this cell executes, the current dataset will be pushed 
  to the python variable df2
*/
rename v* newv*
# Via python, pass this dataframe to the output table
redivis.current_notebook().create_output_table(df2)
# Reference the table named "some_table" in SAS
sas_table = sas_session.sasdata("some_table")

# Convert the sas_table to a pandas dataframe
df = sas_table.to_df()

redivis.current_notebook().create_output_table(df)
df = get_dataframe_somehow()

# Write temporary files to /scratch, and files you want to persist to /out:
df.to_csv("/out/data.csv")
df <- get_dataframe_somehow()

write.csv(df, "/out/data.csv", na="")
%%stata
save "/out/my_dataset.dta"
%%SAS sas_session
proc export data=datasetname
  outfile='/out/filename.csv'
  dbms=csv
  replace;
run;
# secrets are case-sensitive
secret = redivis.user("my_username").secret("EXTERNAL_API_TOKEN")
# Or for organization secrets:
# secret = redivis.organization("organization_name").secret("...")

# Avoid printing secrets, but rather pass them to other methods as needed
make_http_request("https://example.com/api", auth_token=secret.get_value())
# secrets are case-sensitive
secret <- redivis$user("my_username")$secret("EXTERNAL_API_TOKEN")
# Or for organization secrets:
# secret <- redivis$organization("organization_name")$secret("...")

# Avoid printing secrets, but rather pass them to other methods as needed
make_http_request("https://example.com/api", auth_token=secret$get_value())
R -e '
  library(tidycensus)
  library(tidyverse)

  census_api_key("YOUR API KEY GOES HERE")
  get_decennial(geography = "state", 
                 variables = "P13_001N", 
                 year = 2020,
                 sumfile = "dhc")
'

Aggregate

Any value

Returns any value from the input or NULL if there are zero input rows –> learn more

ANY_VALUE(@variable)

Return type: dynamic (input-dependent)

  • @variable (variable; required): any Redivis type

Average

Returns the average of all non-null values –> learn more

AVG(@variable)

Return type: float

  • @variable (variable; required): any integer, float

Count

Returns the count of all non-null values –> learn more

COUNT([@distinct ]@variable)

Return type: integer

  • @variable (variable; optional; placeholder: *): any Redivis type
  • @distinct (boolean; required): any boolean

Logical and

Returns the logical AND of all non-NULL expressions –> learn more

LOGICAL_AND(@variable)

Return type: boolean

  • @variable (variable; required): any boolean

Logical or

Returns the logical OR of all non-NULL expressions –> learn more

LOGICAL_OR(@variable)

Return type: boolean

  • @variable (variable; required): any boolean

Max

Returns the maximum value of all non-null inputs –> learn more

MAX(@variable)

Return type: dynamic (input-dependent)

  • @variable (variable; required): any Redivis type

Min

Returns the minimum value of all non-null inputs –> learn more

MIN(@variable)

Return type: dynamic (input-dependent)

  • @variable (variable; required): any Redivis type

String aggregate

Returns a string obtained by concatenating all non-null values –> learn more

STRING_AGG([@distinct ]@variable[, @delimiter][ LIMIT @limit])

Return type: string

  • @variable (variable; required): any string
  • @delimiter (literal; optional; placeholder: ","): any Redivis type
  • @distinct (boolean; required): any boolean
  • @limit (literal; optional): any Redivis type

Sum

Returns the sum of all values, ignoring nulls –> learn more

SUM(@variable)

Return type: dynamic (input-dependent)

  • @variable (variable; required): any integer, float
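To make the reference concrete, here is a sketch of how several of these aggregations might appear together in a SQL-style query over a hypothetical scores table grouped by student (the table and variable names are placeholders, not part of the reference above):

SELECT
  student,
  COUNT(DISTINCT test)            AS tests_taken,
  AVG(score)                      AS average_score,
  SUM(score)                      AS total_score,
  STRING_AGG(test, ", " LIMIT 3)  AS test_list
FROM scores
GROUP BY student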

Step: Aggregate

Overview

The Aggregate step collapses rows that are identical across a set of variables, optionally creating new aggregate variables in the process.

Example starting data:

/*---------+------------+---------+--------*
 | test    | date       | student | score  |
 +---------+------------+---------+--------+
 | quiz    | 2020-04-01 | jane    | 83     |
 | quiz    | 2020-04-01 | pat     | 35     |
 | midterm | 2020-05-01 | jane    | 74     |
 | midterm | 2020-05-01 | pat     | 62     |
 *---------+------------+---------+--------*/

Example output data

Collapsing on variables test and date, and creating new variable average_score to aggregate data from the score variable.

/*---------+-------------+---------------*
 | test    | date        | average_score |
 +---------+-------------+---------------|
 | quiz    | 2020-04-01  | 59            |
 | midterm | 2020-05-01  | 68            |
 *---------+-------------+---------------*/

Step structure

  • There is one collapse block where we will define how the data will be reshaped. On execution:

    • Your data will be cut to only include the variables chosen in this block.

    • Duplicate records across the collapsed variables will be dropped.

  • There can be one or more aggregation blocks where we can capture aggregate information in a newly created variable.

    • You can successfully collapse your table without creating any new variables in aggregation blocks.

    • Each aggregation block in the step represents one new variable in your output data.

    • Aggregation blocks are how you can capture information about records dropped in your collapse block.

Field definitions

Collapse block:

Field
Description

Variables to collapse on

All variables you want to include in your output.

Aggregation block(s):

Field
Description

Name

The name of the new variable being created.

Aggregation method

How the new variable will be summarized (e.g. SUM or COUNT).

[Method fields]

After you select your aggregation method, you will be prompted to input the information your chosen method needs to execute. You can see more specifics in the Variable creation methods section.

Examples

Example 1: Basic collapse

We have test score data recorded per test, student, and date. However we want to know the average score on each test overall.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Variables to collapse on: Of the variables in our table, the only ones we want in our final output are test and date, so we select these here. We leave out student because that information doesn't matter to us anymore, and we leave out score because we are creating a new variable to replace it with aggregated information.

  • Name: We give our new variable a descriptive name average_score.

  • Aggregation method: We want to average all values in the score variable per test, so we choose Average.

  • Variable to aggregate: Here is where we choose score as the variable containing the data we want to Average. If we had chosen a different Aggregation method, we might have different input fields here to answer.

Execution:

All variables that aren't collapsed on or used in an aggregation are removed.

/*---------+-------+------------*
 | test    | score | date       |
 +---------+-------+------------+
 | quiz    | 83    | 2020-04-01 |
 | quiz    | 35    | 2020-04-01 |
 | quiz    | 89    | 2020-04-01 |
 | midterm | 74    | 2020-05-01 |
 | midterm | 62    | 2020-05-01 |
 | midterm | 93    | 2020-05-01 |
 | final   | 77    | 2020-06-01 |
 | final   | 59    | 2020-06-01 |
 | final   | 92    | 2020-06-01 |
 *---------+-------+------------*/

The Average of score is computed across the rows where test and date (our collapsed-on variables) have exactly the same values.

/*---------+-------+------------+---------------*
 | test    | score | date       | average_score |
 +---------+-------+------------+---------------+
 | quiz    | 83    | 2020-04-01 | 69            |
 | quiz    | 35    | 2020-04-01 | 69            |
 | quiz    | 89    | 2020-04-01 | 69            |
 | midterm | 74    | 2020-05-01 | 76.3333       |
 | midterm | 62    | 2020-05-01 | 76.3333       |
 | midterm | 93    | 2020-05-01 | 76.3333       |
 | final   | 77    | 2020-06-01 | 76            |
 | final   | 59    | 2020-06-01 | 76            |
 | final   | 92    | 2020-06-01 | 76            |
 *---------+-------+------------+---------------*/

The score variable is removed, since it was not collapsed on.

/*---------+------------+---------------*
 | test    | date       | average_score |
 +---------+------------+---------------+
 | quiz    | 2020-04-01 | 69            |
 | quiz    | 2020-04-01 | 69            |
 | quiz    | 2020-04-01 | 69            |
 | midterm | 2020-05-01 | 76.3333       |
 | midterm | 2020-05-01 | 76.3333       |
 | midterm | 2020-05-01 | 76.3333       |
 | final   | 2020-06-01 | 76            |
 | final   | 2020-06-01 | 76            |
 | final   | 2020-06-01 | 76            |
 *---------+------------+---------------*/

Then all exact duplicate records are dropped, to create the output.

Output data:

/*---------+-------------+---------------*
 | test    | date        | average_score |
 +---------+-------------+---------------|
 | quiz    | 2020-04-01  | 69            |
 | midterm | 2020-05-01  | 76.3333       |
 | final   | 2020-06-01  | 76            |
 *---------+-------------+---------------*/
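For readers who prefer SQL, the collapse in Example 1 is roughly equivalent to the following query (the table name scores is a placeholder for the source table):

SELECT
  test,
  date,
  AVG(score) AS average_score
FROM scores
GROUP BY test, date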

Example 2: Multiple aggregation variables

We can build on our previous example, but say we want to know additionally how many students took each test.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

2nd aggregation block

  • Name: We want to give the new variable a descriptive name unique from our other block (and other variables in our table), in this case test_count.

  • Aggregation method: We want to count all values in the test variable per date, so we choose Count.

  • Variable to count: We choose test as the variable containing the data we want to Count. Since none of our variables have null entries, we could choose any variable here and get the same result. If we did have nulls, they would not be included in the Count. Conceptually we do not want to count only distinct values, so we leave the Distinct option off.

Output data:

/*---------+-------------+---------------+-------------*
 | test    | date        | average_score | test_count  |
 +---------+-------------+---------------+-------------+
 | quiz    | 2020-04-01  | 69            | 3           |
 | midterm | 2020-05-01  | 76.3333       | 3           |
 | final   | 2020-06-01  | 76            | 3           |
 *---------+-------------+---------------+-------------*/

Example 3: Drop duplicates

This step can be used to drop duplicated records even in cases where no aggregation happens.

Let's say we have data that we know had duplicate records that we don't need.

Starting data:

/*---------+-----------+------------*
 | test    | questions | date       |
 +---------+-----------+------------+
 | quiz    | 10        | 2020-04-01 |
 | quiz    | 35        | 2020-04-01 |
 | quiz    | 10        | 2020-04-01 |
 | midterm | 20        | 2020-05-01 |
 | midterm | 20        | 2020-05-01 |
 | midterm | 20        | 2020-05-01 |
 | final   | 45        | 2020-06-01 |
 | final   | 45        | 2020-06-01 |
 | final   | 45        | 2020-06-01 |
 *---------+-----------+------------*/

Input fields:

Variables to collapse on: To drop all records that are an exact duplicate across all variables, we just need the collapse block with no aggregation blocks. We need to select all variables here and can do so by typing all of them out, or inputting *.

Output data:

/*---------+-----------+------------*
 | test    | questions | date       |
 +---------+-----------+------------+
 | quiz    | 10        | 2020-04-01 |
 | midterm | 20        | 2020-05-01 |
 | final   | 45        | 2020-06-01 |
 *---------+-----------+------------*/
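Expressed as SQL, this de-duplication is roughly a SELECT DISTINCT over every variable (again, the table name is a placeholder):

SELECT DISTINCT
  test,
  questions,
  date
FROM tests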

Step: Join

Overview

A Join step will combine data from two tables based on a join condition so they can be queried together.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      *---------+-------*/

Example output data:

Inner join where student = student

/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 *---------+----------*-------*/

Step structure

  • There will be at least one join block where you will define a table to join and a join condition for that table.

  • When multiple blocks exist, the tables will be joined in sequence.

Field descriptions

Field
Description

Join table

The table containing the data you wish to combine with this transform's data. Once selected, this table's variables will be marked t1 (or t2, etc., if other joins exist).

Join type

The way that the tables will be combined. (More information below.)

Source variable

The variable in your source table that you want to join on. All records in this variable will be matched against all records in the Joined table variable.

Joined table variable

The variable in your selected Join table that you want to join on. All records in this variable will be matched against all records in the Source variable.

Join types

Inner join

If a row in either the source or join table doesn’t have a match, it will be dropped. If a row is matched multiple times, it will be multiplied in the output.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      *---------+-------*/

INNER JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 *---------+----------*-------*/
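In SQL terms, the inner join above corresponds roughly to the query below. The joined table's variables are marked t1 in the transform interface; t0 here stands in for the source table, and the table names are placeholders:

SELECT
  t0.student,
  t0.absences,
  t1.score
FROM table_a AS t0
INNER JOIN table_b AS t1
  ON t0.student = t1.student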

Left join

If a row in the source doesn’t have a match, all joined variables will be null for that row. If a row in the join table doesn’t have a match, it will be dropped. If a row is matched multiple times, it will be multiplied in the output.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
 | zay     | 2        |       | toni    | 30    |
*----------+----------*/      *---------+-------*/

LEFT JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 | zay     | 2        | NULL  |
 *---------+----------*-------*/

Right join

If a row in the source doesn’t have a match, it will be dropped. If a row in the join table doesn’t have a match, all source variables will be null for that row. If a row is matched multiple times, it will be multiplied in the output. This is the same as a left join, but reversed.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
 | zay     | 2        |       | toni    | 30    |
*----------+----------*/      *---------+-------*/

RIGHT JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 | toni    | NULL     | 30    |
 *---------+----------*-------*/

Full join

If a row in either the source or join table doesn’t have a match, the variables from the other table will be null for that row, including the column being joined upon (which is why both student columns appear separately in the output below). If a row is matched multiple times, it will be multiplied in the output.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
 | zay     | 2        |       | toni    | 30    |
*----------+----------*/      *---------+-------*/

FULL JOIN on student

Output:
/*-----------+-----------+----------+--------*
 | student_A | student_B | absences | score  |
 +-----------+-----------+----------+--------+
 | jane      | jane      | 0        | 85     |
 | sam       | sam       | 6        | 64     |
 | pat       | pat       | 1        | 88     |
 | zay       | NULL      | 2        | NULL   |
 | NULL      | toni      | NULL     | 30     |
 *-----------+-----------+----------+--------*/

Cross join

Every row in the source table will be combined with every row in the joined table. This might be used to perform a join on a new variable that will be created downstream (such as in a geospatial join). You will almost always need to use a filter after this join for the query to successfully execute.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
*----------+----------*/      *---------+-------*/

CROSS JOIN

Output:
/*---------+----------+---------+---------*
 | student | absences | student | score   |
 +---------+----------+---------+---------+
 | jane    | 0        | jane    | 85      |
 | jane    | 0        | sam     | 64      |
 | sam     | 6        | jane    | 85      |
 | sam     | 6        | sam     | 64      |
 *---------+----------*---------+---------*/

It is strongly recommended to use a Filter step with a cross join to avoid a massively expanded table.

Geospatial joins

When querying geospatial data, you'll often want to match records where one polygon is contained in another, or otherwise overlaps. To perform a geospatial join, you'll typically perform the following steps:

  1. First, create an inner join with the table you'd like to join on.

  2. Next, create variable(s) using a geography method that will represent your join condition. For example, you might use the contains method to join all geometries in one table that exist within another.

  3. Finally, implement a row filter that tests against the newly created variable. In our example above, if we created an is_contained_by variable, we would filter on the condition is_contained_by = TRUE

There are a few performance pitfalls when performing geospatial joins. Most notably, combining other equality comparisons with the geospatial condition in a filter can prevent the query planner from leveraging geospatial indexes, leading to a severe performance penalty.

For example, executing a filter of the form t0.state = t1.state AND is_contained_by=TRUE would actually be significantly less performant than if the state equality comparison is removed, even though it seems that this would reduce the number of times the geospatial condition needs to be evaluated.

For more discussion on geospatial performance in BigQuery, Redivis's underlying querying engine, see here and here.
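For orientation, the overall pattern corresponds roughly to the following SQL sketch, where the table names counties and sites and their variables are hypothetical:

-- A minimal sketch of a geospatial join, assuming hypothetical tables
-- counties(county_id, geom) and sites(site_id, point)
SELECT
  counties.county_id,
  sites.site_id
FROM counties
INNER JOIN sites
  -- keeping the join predicate purely geospatial lets the engine use its spatial index
  ON ST_CONTAINS(counties.geom, sites.point)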

Join structures

Depending on the structure of your data, joins might multiply in rows.

1-to-1 joins

If the variable you are joining on has no duplicate values in either the source table or the joined table, then your output table will never have more rows than the row counts of the two tables added together.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      *---------+-------*/

INNER JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 *---------+----------*-------*/

1-many joins

If the variable you are joining on has duplicate values in either table, any row with a duplicated value will be matched once for each duplicate.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      | pat     | 86    |
                              *---------+-------*/

INNER JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 | pat     | 1        | 86    |
 *---------+----------*-------*/

Many-to-many joins

If the variable you are joining on has duplicate values in both tables, each row will be matched once against every duplicate in the other table, so the matches multiply.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
 | pat     | 3        |       | pat     | 68    |
*----------+----------*/      *---------+-------*/

INNER JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 | pat     | 1        | 68    |
 | pat     | 3        | 88    |
 | pat     | 3        | 68    |
 *---------+----------*-------*/

Examples

Example 1: Simple join condition

Let's say our information about student absences and test scores is stored in separate tables and we want to join them together.

Starting data:

Source table (t0)             Table 1 (t1)    
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      *---------+-------*/

Input fields:

  • Join table: The data we want to join is in Table 1 so we select it here.

  • Join type: Since we only care about students that have both absence and score information we choose inner join which will drop any rows without a match in both tables.

  • Source variable / Joined table variable: We want to match on the variable student which is present in both tables. So we select our source table (t0)'s student on the left, and our joined table (t1)'s student variable on the right.

    • Since this variable has the same name in both tables, they will be combined into one variable in the output table.

Output data:

/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 *---------+----------*-------*/

Example 2: More specific join condition

Let's say instead of a table with aggregated absences we have a daily timeliness chart. We want to join the scores onto the corresponding attendance information, matching on both the student and the date in question.

Starting data:

Source table (t0)                        Table 1 (t1)    
/*---------+------------+---------*      /*---------+------------+-------*
 | student | date       | on_time |       | student | date       | score |
 +---------+------------+---------+       +---------+------------+-------+
 | jane    | 2020-01-01 | TRUE    |       | jane    | 2020-01-01 | 85    |
 | sam     | 2020-01-01 | FALSE   |       | sam     | 2020-01-01 | 65    |
 | jane    | 2020-02-01 | FALSE   |       *---------+------------+-------*/
*----------+------------+--------*/     

Input fields:

Join type: Since we want to keep our absence data whether a test was taken or not, we do a Left join.

Join condition: We join rows where BOTH the student and date values match, by selecting both variables in our join condition.

Output data:

/*---------+------------+---------+-------*
 | student | date       | on_time | score |
 +---------+------------+---------+-------+
 | jane    | 2020-01-01 | TRUE    | 85    |
 | sam     | 2020-01-01 | FALSE   | 65    |
 | jane    | 2020-02-01 | FALSE   | NULL  |
*----------+------------+---------+-------*/     

You can use a SQL query step to build more complex join conditions
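For reference, Example 2's join corresponds roughly to the following SQL, where the table names attendance and tests are hypothetical stand-ins for t0 and t1:

SELECT
  t0.student,
  t0.date,
  t0.on_time,
  t1.score
FROM attendance AS t0        -- source table
LEFT JOIN tests AS t1        -- joined table
  ON t0.student = t1.student
  AND t0.date = t1.date      -- both variables must match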

Example 3: Complex join conditions with cross joins

In some cases you might need to do a more complex join, such as using an operator other than =, or joining on a condition that uses a variable from your joined table that needs to be retyped or created first. This is common in geospatial joins but might come up in any situation.

To use a more complex join condition you can use a cross join, followed by a filter. The cross join will combine every row of one table with every row of the other, and the filter will then limit the result to rows matching your condition. Effectively this executes the same as if you had done your initial join with a more complex condition.

Let's say in the example below we want to do an inner join where T0 student_id = T1 student_id but in the source table student_id is an integer type variable and in Table 1 it is a string. We want to retype the variable in Table 1 before we can join on it, but we can't retype it until we have joined in Table 1. So we will do a cross join, a retype, then a filter to narrow down to only records meeting our condition.

Starting data:

Source table (t0)                Table 1 (t1)    
/*------------+----------*      /*------------+-------*
 | student_id | absences |       | student_id | score |
 +------------+----------+       +------------+-------+
 | 01         | 0        |       | 01         | 85    |
 | 02         | 6        |       | 02         | 64    |
 | 03         | 1        |       | 03         | 88    |
*-------------+----------*/      *------------+-------*/

Input fields:

Output data:

/*------------+----------+-------*
 | student_id | absences | score |
 +------------+----------+-------+
 | 01         | 0        | 85    |
 | 02         | 6        | 64    |
 | 03         | 1        | 88    |
 *------------+----------*-------*/
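As a rough SQL sketch of this cross join / retype / filter sequence (the table names are hypothetical):

SELECT
  t0.student_id,
  t0.absences,
  t1.score
FROM source_table AS t0
CROSS JOIN table_1 AS t1                            -- every combination of rows
WHERE t0.student_id = CAST(t1.student_id AS INT64)  -- retype t1.student_id, then keep only matching rows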

Step: Retype

Overview

The Retype step converts a variable of a given type to another type.

Example starting data

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 | sam     | 74     |
 | pat     | 62     |
 *---------+--------*/

Example output data:

Retype score from integer to float.

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83.0   |
 | neal    | 35.0   |
 | sam     | 74.0   |
 | pat     | 62.0   |
 *---------+--------*/

Step structure

  • There will be at least one retype block where you will define a variable and a new type.

Field definitions

Field
Definition

Source variable

The variable that you want to retype. Note that you can see its current type by locating it in the variable selector at the bottom of the page, or hovering over the variable in this selection menu to see a tooltip with more information.

New type

The type that the Source variable will be converted to. Can be any of the supported variable types on Redivis. More on conversion rules in the reference below.

If invalid for type, set to null

Whether failing type conversions should be converted to a null value. By default, failed type conversions will throw an error. Note that this might significantly change the content of your data and we suggest using this option with full understanding of how it might affect your outcome and verifying the output results.

Specify custom format

Informs how the data will be read by the method. Only relevant for conversions of strings to date, time, and dateTime. More on format elements in the reference below.

Examples

Example 1: Basic usage

We can convert score data currently stored as an integer type into a float, to use in another formula elsewhere that only accepts the float type.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Source variable: The variable we want to convert is score so we select it here.

  • New type: We want this to be a float, and since the score variable is currently an integer it is compatible with conversion to the float type, so we can select Float here.

  • If invalid for type, set to null: Since there are no incompatible values in this variable it doesn't matter what we choose. We leave it unchecked to validate that we understand our data and to confirm that this transform will execute without failing.

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83.0  | jane    | 2020-04-01 |
 | quiz    | 35.0  | pat     | 2020-04-01 |
 | quiz    | 89.0  | sam     | 2020-04-01 |
 | midterm | 74.0  | jane    | 2020-05-01 |
 | midterm | 62.0  | pat     | 2020-05-01 |
 | midterm | 93.0  | sam     | 2020-05-01 |
 | final   | 77.0  | jane    | 2020-06-01 |
 | final   | 59.0  | pat     | 2020-06-01 |
 | final   | 92.0  | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

The output data looks as expected, and we can confirm the new type by clicking on the variable in the output table.

Example 2: Handling invalid conversions

Let's say instead that in our initial data, the score variable was stored as a string. Converting this to a float would be a bit trickier since the data entry wasn't as clean.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35%   | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Source variable: Same as above example.

  • New type: Same as above example.

  • If invalid for type, set to null: Since this data has 35% as a value, this can't be converted to a float. If we leave this box unchecked our transform will fail. Checking it will set that value to null.

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83.0  | jane    | 2020-04-01 |
 | quiz    | null  | pat     | 2020-04-01 |
 | quiz    | 89.0  | sam     | 2020-04-01 |
 | midterm | 74.0  | jane    | 2020-05-01 |
 | midterm | 62.0  | pat     | 2020-05-01 |
 | midterm | 93.0  | sam     | 2020-05-01 |
 | final   | 77.0  | jane    | 2020-06-01 |
 | final   | 59.0  | pat     | 2020-06-01 |
 | final   | 92.0  | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Note that while this retype was successful, the result might not be what we want; in this case it removes information.
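In SQL terms, checking this box behaves much like BigQuery's SAFE_CAST, which returns null instead of raising an error when a value can't be converted. A minimal sketch, assuming a hypothetical table named scores:

SELECT
  test,
  SAFE_CAST(score AS FLOAT64) AS score,  -- '35%' cannot be parsed as a float, so that row becomes null
  student,
  date
FROM scores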

Example 3: Parsing dates

Let's say we want to convert our date variable, which is currently a string, to the Date variable type, but the starting format does not cleanly translate.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 04/01/2020 |
 | quiz    | 35    | pat     | 04/01/2020 |
 | quiz    | 89    | sam     | 04/01/2020 |
 | midterm | 74    | jane    | 05/01/2020 |
 | midterm | 62    | pat     | 05/01/2020 |
 | midterm | 93    | sam     | 05/01/2020 |
 | final   | 77    | jane    | 06/01/2020 |
 | final   | 59    | pat     | 06/01/2020 |
 | final   | 92    | sam     | 06/01/2020 |
 *---------+-------+---------+------------*/

Input fields:

  • Source variable: The variable we want to convert is date so we select it here.

  • New type: We want this to be a date so we can select Date here.

  • If invalid for type, set to null: Since there are no incompatible values in this variable it doesn't matter what we choose. We leave it unchecked to validate that we understand our data and to confirm that this transform will execute without failing.

  • Specify custom format: Since our data does not fit the standard format (%Y-%m-%d, e.g. 2020-10-01) we need to specify what format it is in. We can use the reference table at the bottom of this page to specify our format: MM/DD/YYYY

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/
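Under the hood this is similar to BigQuery's CAST with a format clause. A minimal sketch, assuming a hypothetical table named scores:

SELECT
  test,
  score,
  student,
  CAST(date AS DATE FORMAT 'MM/DD/YYYY') AS date  -- parses '04/01/2020' as 2020-04-01
FROM scores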

Reference: Type conversion

Note you can see more information on Redivis variable types here.

Starting type
Possible destinations
Notes

String

Integer Float Boolean Date DateTime Time Geography

To integer: A hex string can be cast to an integer. For example, 0x123 to 291 or -0x123 to -291.

To float: Returns x as a floating point value, interpreting it as having the same form as a valid floating point literal.

To boolean: Returns TRUE if x is "true" and FALSE if x is "false". All other values of x are invalid and throw an error instead of casting to a boolean. Strings are case-insensitive when converting to a boolean.

To date, dateTime, or time: Uses the canonical format by default (see information below)

Integer

String Float Boolean

To float: Returns a close but potentially not exact floating point value.

To boolean: Returns FALSE if x is 0, TRUE otherwise.

Float

String Integer

To integer: Returns the closest integer value. Halfway cases such as 1.5 or -0.5 round away from zero.

Boolean

String Integer

To string: Returns true if x is true, false otherwise. To integer: Returns 1 if x is true, 0 otherwise.

Date

String

DateTime

String Date Time

To date, dateTime, or time: Uses the canonical format by default (see information below)

Time

String Date DateTime

To date, dateTime, or time: Uses the canonical format by default (see information below)

Geography

String

Reference: Canonical representation

When retyping between a String type variable and a Date, DateTime, or Time type variable it is presumed that the data will be in the format below.

Layout
Example

Date

(Four digit year)- (1 or 2 digit month)- (1 or 2 digit day)

2023-01-01 2023-1-1

Time

(1 or 2 digit hour): (1 or 2 digit minute): (1 or 2 digit second). (Up to 6 fractional seconds)

01:01:01.123456 6:2:9 22:19:3

DateTime

(Date specification) (space or T or t) (Time specification)

2023-01-01 01:01:01.123456 2023-1-1T6:2:9

If it is not in this canonical format you can click the Specify custom format field and use Format elements (below) to indicate otherwise.

Reference: Format elements

Since Date, DateTime, and Time variable types contain structured information you can use format strings to indicate how you want different pieces of date and time information translated to and from string format when retyping.

For example when converting a date to a string you can choose whether it will become JAN 1 2023 or 2023-01-01. When converting from a string to a DateTime you'll need to outline how the information is structured in your data so it can be read in correctly.

You can use these elements in the Specify custom format field.
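For example, in the SQL CAST method these elements work in both directions. A minimal sketch:

SELECT
  CAST('01/15/2023' AS DATE FORMAT 'MM/DD/YYYY')          AS parsed,    -- 2023-01-15
  CAST(DATE '2023-01-15' AS STRING FORMAT 'MON DD, YYYY') AS formatted  -- JAN 15, 2023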

Element
Return

YYYY

Four (or more) digit year

Input: 2023-01-01 Output: 2023

Input: 23-01-01 Output: 0023

Input: 20000-01-01 Output: 20000

YYY

Last three digit year

Input: 2023-01-01 Output: 023

Input: 23-01-01 Output: 023

YY

Two digit year

Input: 2023-01-01 Output: 23

Input: 2-01-30 Output: 02

Y

Last one digit of year

Input: 2023-01-01 Output: 3

MM

Two digit month

Input: 2023-01-01 Output: 01

MON

Three character month: JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, DEC

Input: 2023-01-01 Output: JAN

MONTH

Month name

Input: 2023-01-01 Output: JANUARY

DDD

Three digit day of the year

Input: 2023-01-01 Output: 001

DD

Two digit day of the month

Input: 2023-01-01 Output: 01

D

Day of the week (1-7) with Sunday being 1

Input: 2023-01-01 Output: 1

DAY

Day of the week. Spaces are padded on the right side to make the output size exactly 9.

Input: 2023-01-01 Output: SUNDAY

DY

Three character day: MON, TUE, WED, THU, FRI, SAT, SUN

Input: 2023-01-01 Output: SUN

HH

Two digit hour of the day (valid values from 00 to 12)

Input: 20:10:15 Output: 08

HH12

Hour of the day (valid values from 00 to 12)

Input: 20:10:15 Output: 08

HH24

Two digit hour (valid values from 00 to 24)

Input: 20:10:15 Output: 20

MI

Two digit minute

Input: 20:10:15 Output: 10

SS

Two digit second

Input: 20:10:15 Output: 15

SSSSS

Five digit second of the day

Input: 20:10:15 Output: 72615

FFn

(Replace n with a value from 1 to 9. For example, FF5.)

Fractional part of the second, n digits long. The fractional part of the second is rounded to fit the size of the output.

FF1

Input: 20:10:15 Output: 1

FF2 Input: 20:10:15 Output: 15

FF3 Input: 20:10:15 Output: 015

A.M. or AM P.M. or PM

A.M. (or AM) if the time is less than 12, otherwise P.M. (or PM). The letter case of the output is determined by the first letter case of the format element.

AM

Input: 09:10:15 Output: AM

A.M. Input: 20:10:15 Output: P.M.

PM Input: 09:10:15 Output: AM

PM Input: 20:10:15 Output: PM

TZH

Hour offset for a time zone. This includes the +/- sign and 2-digit hour.

Input: 2008-12-25 05:30:00+00 Output: −08

TZM

Minute offset for a time zone. This includes only the 2-digit minute.

Input: 2008-12-25 05:30:00+00 Output: 00

A space

Input: Output:

-./,'l;:

Same character in the output

Input: -./,'l;: Output: -./,'l;:

"Text"

Output is the value within the double quotes. To preserve a double quote or backslash character, use the \" or \\ escape sequence. Other escape sequences are not supported.

Input: "abc" Output: abc

Input: "a\"b\\c" Output: a"b\c

These format elements will only work in the Retype step (or CAST method in SQL). Format elements for using other date parsing or formatting methods are detailed elsewhere and might be useful if your data is not coercible using the format elements described here.

Math

Absolute value

Returns the absolute value of a variable –> learn more

ABS(@variable)

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any ,

true

-

Arithmetic

Compute simple arithmetic (+, -, *, /) –> learn more

(@expression @operator @expression_2)

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any ,

true

-

@operator

any of: +, -, *, /

true

(E.g., +)

@expression_2

or

any ,

true

-

Ceiling

Returns the smallest integral value that is not less than the provided value –> learn more

CEILING(@variable)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

Integer divide

Divide two integer values, rounding down any remainder –> learn more

DIV(@expression, @expression_2)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@expression_2

or

any

true

-

e ^ x

Compute the natural exponential of a value –> learn more

EXP(@expression)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any ,

true

-

Floor

Returns the largest integral value that is not greater than the provided value –> learn more

FLOOR(@variable)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

Greatest

Find the largest of several values –> learn more

GREATEST(@expression)

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

s or s

any , , , , , ,

true

-

Is infinity

Return true if the value is positive or negative infinity, false otherwise. Returns NULL for NULL inputs –> learn more

IS_INF(@variable)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any ,

true

-

Is NaN

Determines whether input value is not a number (NaN) –> learn more

IS_NAN(@variable)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any ,

true

-

Least

Find the smallest of several values –> learn more

LEAST(@expression)

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

s or s

any , , , , , ,

true

-

Log

Compute the logarithm of a value to a provided base; generates an error if the variable is <= 0. If no base is provided, defaults to natural log –> learn more

LOG(@expression[, @literal])

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any ,

true

-

@literal

any ,

false

-

Mod

Modulo: compute the remainder of the division of two integers –> learn more

MOD(@expression, @expression_2)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@expression_2

or

any

true

-

Power

Raises a value to a power –> learn more

POW(@expression, @expression_2)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any ,

true

-

@expression_2

or

any ,

true

-

Random

Generate a pseudo-random float between [0, 1) –> learn more

RAND()

Return type

float

Parameters

Round

Rounds a value to the nearest integer (or, if specified, to the provided number of decimal places) –> learn more

ROUND(@expression[, @literal])

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any ,

true

-

@literal

any

false

-

Safe divide

Equivalent to the division operator, but returns null if an error occurs, such as a division by zero error –> learn more

SAFE_DIVIDE(@expression, @expression_2)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any ,

true

-

@expression_2

or

any ,

true

-

Sign

Returns the sign (-1, 0, +1) of a numeric variable –> learn more

SIGN(@variable)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any ,

true

-

Sqrt

Compute the square root of a value; generates an error if the variable is less than 0 –> learn more

SQRT(@expression)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any ,

true

-

Truncate

Similar to round, but rounds to the nearest integer whose absolute value is not greater than the absolute value of the provided variable (always rounds towards zero) –> learn more

TRUNC(@expression[, @literal])

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any ,

true

-

@literal

any

false

-


String

Concat

Concatenates multiple strings into a single string. Treats NULL values as an empty string. –> learn more

CONCAT(@expression)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

s or s

any

true

-

Ends with

Determines whether a string is a suffix of another string –> learn more

ENDS_WITH(@expression, @expression_2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@expression_2

or

any

true

-

Format string

Creates a formatted string from input variables. Similar to the C printf function. –> learn more

FORMAT(@format_string, @expression)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

s or s

any , , , , ,

true

-

@format_string

any

true

-

Length

Returns the number of characters in a string variable –> learn more

LENGTH(@variable)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

Lower case

Returns the string value converted to lowercase –> learn more

LOWER(@variable)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

Pad left

Pad a string to the left with characters up to a certain total length –> learn more

LPAD(@variable, @return_length[, @pattern])

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

@pattern

or

any

false

-

@return_length

or

any

true

-

Trim left

Removes all leading characters that match the provided pattern –> learn more

LTRIM(@expression[, @pattern])

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@pattern

or

any

false

-

Normalize

Normalization is used to ensure that two strings are equivalent. Normalization is often used in situations in which two strings render the same on the screen but have different Unicode code points. –> learn more

NORMALIZE(@variable[, @normalization_mode])

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

@normalization_mode

any of: NFC, NFKC, NFD, NFKD

false

(NFC (default))

Regexp contains

Match a variable's values against a regular expression –> learn more

REGEXP_CONTAINS(@variable, @regex)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

@regex

any

true

-

Regexp extract

Returns the first substring that matches a regular expression –> learn more

REGEXP_EXTRACT(@variable, @regex)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

@regex

any

true

-

Regexp replace

Replaces all substrings that match a given regular expression with a new string –> learn more

REGEXP_REPLACE(@variable, @regex, @expression_2)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

@regex

any

true

-

@expression_2

or

any

false

-

Repeat

Return a string of a provided value repeated a set number of times –> learn more

REPEAT(@expression, @literal)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@literal

any

true

-

Replace

Replaces all substrings that match a given string with a replacement string –> learn more

REPLACE(@expression, @match, @expression_2)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@match

or

any

true

-

@expression_2

or

any

false

-

Reverse

Reverses a string –> learn more

REVERSE(@expression)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

Pad right

Pad a string to the right with characters up to a certain total length –> learn more

RPAD(@variable, @return_length[, @pattern])

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-

@pattern

or

any

false

-

@return_length

or

any

true

-

Trim right

Removes all trailing characters that match the provided pattern –> learn more

RTRIM(@expression[, @pattern])

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@pattern

or

any

false

-

Starts with

Determines whether a string is a prefix of another string –> learn more

STARTS_WITH(@expression, @expression_2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@expression_2

or

any

true

-

String position

Returns the 1-based index of the first occurrence of a substring within a string. Returns 0 if not found. –> learn more

STRPOS(@expression, @match)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@match

or

any

true

-

Substring

Return the substring of a specified string –> learn more

SUBSTR(@expression, @position[, @length])

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@position

or

any

true

-

@length

or

any

false

-

Trim

Removes all leading and trailing characters that match the provided pattern –> learn more

TRIM(@expression[, @pattern])

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

or

any

true

-

@pattern

or

any

false

-

Upper case

Returns the string value converted to uppercase –> learn more

UPPER(@variable)

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

any

true

-


Continuous enrollment

The following example is illustrated on Redivis in the MarketScan Continuous Enrollment workflow – you'll need access to MarketScan data to view the details.‌

Many insurance claims datasets on Redivis contain information about enrollment, detailing the periods of time when an individual represented in a dataset was covered by an insurance plan. If you intend to characterize patients based on their insurance coverage (or lack thereof) during certain key events (procedures, diagnoses, etc), it's often important to identify periods of continuous enrollment for each individual – and capture each continuous enrollment period for each patient in a single row.‌

These claims datasets describe enrollment information in multiple discrete rows per patient, each corresponding to one patient for one month of coverage. However, an overall continuous enrollment period may be broken up across rows into months or other non-uniform chunks, so we'll employ the following process to combine multiple sequential rows into a single row with one start date and one end date, describing one continuous period.

In this example, we will process the MarketScan Enrollment Detail table to create a table in which each row describes a single period of continuous enrollment. We show an artificial snapshot of the dataset below, where patient 1 has multiple periods of continuous enrollment due to some gaps in coverage, and patient 2 has a single period of continuous enrollment.

patient_id

enrollment_start_date

enrollment_end_date

1

2012-01-01

2012-01-31

1

2012-02-01

2012-02-28

1

2012-04-01

2012-04-30

1

2012-06-01

2012-06-30

1

2012-07-01

2012-07-31

1

2012-08-01

2012-08-31

2

2012-01-01

2012-01-31

We want to create a final table with 3 rows for patient 1 to account for gaps in enrollment in March and May of 2012, and 1 row for patient 2. Our desired output has a row for each continuous period per patient, shown below:

patient_id

enrollment_start_date_continuous

enrollment_end_date_continuous

1

2012-01-01

2012-02-28

1

2012-04-01

2012-04-30

1

2012-06-01

2012-08-31

2

2012-01-01

2012-01-31

The variable names in this example are not actual MarketScan variable names. But, with appropriate data access, you can see the real variables used in the first transform of the Redivis example workflow by hovering over the (renamed) variables patient_id, enrollment_start_date, and enrollment_end_date in the Keep section.‌

(1) Add start and end of adjacent periods to each row

Throughout this example, we'll create variables that partition the dataset by patient identifier (here, patient_id) to ensure that each patient is processed individually. But, to account for the fact that a single patient may have many periods of continuous enrollment, each spanning many months, we need to identify the correct start (from enrollment_start_date) and end (from enrollment_end_date) of a continuous period out of multiple rows and capture them in a single row.‌

First, we create a partition variable lag_end_date using the lag method, which will order the rows by enrollment_start_date within a given patient_id partition and copy the previous row's enrollment_end_date value into each row. We also create lead_start_date using lead, to copy the following row's enrollment_start_date value into each row.

These methods generate values which tell us how close the preceding and following enrollment periods are with respect to each row's enrollment period.

patient_id

enrollment_start_date

enrollment_end_date

lag_end_date

lead_start_date

1

2012-01-01

2012-01-31

NULL

2012-02-01

1

2012-02-01

2012-02-28

2012-01-31

2012-04-01

1

2012-04-01

2012-04-30

2012-02-28

2012-06-01

1

2012-06-01

2012-06-30

2012-04-30

2012-07-01

1

2012-07-01

2012-07-31

2012-06-30

2012-08-01

1

2012-08-01

2012-08-31

2012-07-31

NULL

2

2012-01-01

2012-01-31

NULL

NULL
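For orientation, these two partition variables correspond roughly to SQL window functions. A minimal sketch, assuming a hypothetical table named enrollment:

SELECT
  patient_id,
  enrollment_start_date,
  enrollment_end_date,
  LAG(enrollment_end_date) OVER (
    PARTITION BY patient_id ORDER BY enrollment_start_date
  ) AS lag_end_date,      -- previous row's end date within each patient
  LEAD(enrollment_start_date) OVER (
    PARTITION BY patient_id ORDER BY enrollment_start_date
  ) AS lead_start_date    -- following row's start date within each patient
FROM enrollment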

(2) Select rows with start and end of continuous periods

In a second, downstream transform we'll create new variables, which will use the above lead and lag values to identify which rows correspond to the beginning and end of a continuous enrollment period.‌

First, we'll compare enrollment_start_date and lag_end_date to find the difference (in days) between the start of each period and the end of the previous period, and compare enrollment_end_date and lead_start_date to find the difference between the end of each period and the start of the following period.

We see via diff_lag_end_enrollment_start (created using the date diff method) which rows describe an enrollment period directly following the previous period, and which rows describe an enrollment period with a larger gap since the previous period. We also create diff_enrollment_end_lead_start to identify gaps between an enrollment period and the next period.

patient_id

enrollment_start_date

enrollment_end_date

lag_end_date

lead_start_date

diff_lag_end_enrollment_start

diff_enrollment_end_lead_start

1

2012-01-01

2012-01-31

NULL

2012-02-01

NULL

1

1

2012-02-01

2012-02-28

2012-01-31

2012-04-01

1

31

1

2012-04-01

2012-04-30

2012-02-28

2012-06-01

31

31

1

2012-06-01

2012-06-30

2012-04-30

2012-07-01

31

1

1

2012-07-01

2012-07-31

2012-06-30

2012-08-01

1

1

1

2012-08-01

2012-08-31

2012-07-31

NULL

1

NULL

2

2012-01-01

2012-01-31

NULL

NULL

NULL

NULL

Next, we'll encode booleans from our difference variables to simplify further filtering. We'll identify rows corresponding to the start of a continuous period as those with a diff_lag_end_enrollment_start value of either NULL (the row is the first period in the partition) or greater than 1 (the row comes after a gap in enrollment). And we identify rows corresponding to the end of a continuous period as those with a diff_enrollment_end_lead_start value of either NULL (the row is the last period in the partition) or greater than 1 (the row comes before a gap in enrollment).‌

We see via the boolean variables is_start_continuous_period and is_end_continuous_period (created with the case method) whether a given row corresponds to the start of a continuous period, the end of a continuous period, or both.

patient_id

enrollment_start_date

enrollment_end_date

lag_end_date

lead_start_date

diff_lag_end_enrollment_start

diff_enrollment_end_lead_start

is_start_continuous_period

is_end_continuous_period

1

2012-01-01

2012-01-31

NULL

2012-02-01

NULL

1

true

false

1

2012-02-01

2012-02-28

2012-01-31

2012-04-01

1

31

false

true

1

2012-04-01

2012-04-30

2012-02-28

2012-06-01

31

31

true

true

1

2012-06-01

2012-06-30

2012-04-30

2012-07-01

31

1

true

false

1

2012-07-01

2012-07-31

2012-06-30

2012-08-01

1

1

false

false

1

2012-08-01

2012-08-31

2012-07-31

NULL

1

NULL

false

true

2

2012-01-01

2012-01-31

NULL

NULL

NULL

NULL

true

true

Then, to capture only the start and end of a continuous period, we'll use a filter to keep only rows which are true for either is_start_continuous_period or is_end_continuous_period.‌

This leaves us with either 1 or 2 rows corresponding to a continuous enrollment period. If a continuous period spans multiple rows (months, in this case), we'll have 2 rows (a start row and an end row). But if a period only spans one row, we'll have both start and end captured by that 1 row. In our example, the row containing the middle (neither start nor end) of the 2012-06-01 to 2012-08-31 enrollment period for patient 1 was dropped. We can also ignore our intermediate lead..., lag..., and diff... variables, since our final processing step will only consider a row's is_start_continuous_period and is_end_continuous_period values.

patient_id

enrollment_start_date

enrollment_end_date

is_start_continuous_period

is_end_continuous_period

1

2012-01-01

2012-01-31

true

false

1

2012-02-01

2012-02-28

false

true

1

2012-04-01

2012-04-30

true

true

1

2012-06-01

2012-06-30

true

false

1

2012-08-01

2012-08-31

false

true

2

2012-01-01

2012-01-31

true

true
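A minimal SQL sketch of this step, assuming the output of step (1) is available as a table named step1:

SELECT *
FROM (
  SELECT
    patient_id,
    enrollment_start_date,
    enrollment_end_date,
    -- start of a continuous period: no previous period, or a gap of more than 1 day before it
    (lag_end_date IS NULL
      OR DATE_DIFF(enrollment_start_date, lag_end_date, DAY) > 1)  AS is_start_continuous_period,
    -- end of a continuous period: no following period, or a gap of more than 1 day after it
    (lead_start_date IS NULL
      OR DATE_DIFF(lead_start_date, enrollment_end_date, DAY) > 1) AS is_end_continuous_period
  FROM step1
)
WHERE is_start_continuous_period OR is_end_continuous_period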

(3) Add end date of following enrollment periods to each row and collapse

Finally, we want to collapse our table to ensure 1 row per continuous enrollment period per patient. Since we now have only rows corresponding to the start and end of continuous periods, we create another partition variable lead_end_date (creating it in the same transform is fine, since this step happens after the previous filter) which copies the enrollment_end_date value of the following row onto each row.

We can also use the partition row filter to keep only rows with is_start_continuous_period as true, since our lead_end_date has copied over the end date of the continuous enrollment period, contained in each row's following row.

We end up with a table where every row contains the start date of a continuous enrollment period and, in either enrollment_end_date or lead_end_date, the end date of that continuous enrollment period.

patient_id

enrollment_start_date

enrollment_end_date

lead_end_date

is_start_continuous_period

is_end_continuous_period

1

2012-01-01

2012-01-31

2012-02-28

true

false

1

2012-04-01

2012-04-30

2012-06-30

true

true

1

2012-06-01

2012-06-30

2012-08-31

true

false

2

2012-01-01

2012-01-31

NULL

true

true

A final processing step captures the correct end date of a continuous period. We now have only rows whose enrollment_start_date value contains the start of a continuous period, but these rows fall into two categories:‌

  • First, we have the rows that do not also correspond to the end of an enrollment period (where the is_end_continuous_period value is false). For these, we want to look at lead_end_date, the end date of the next row, which represents the final date of the continuous period, since we filtered out all the intermediate rows above.

  • Second, we have the rows which also correspond to the end of an enrollment period (where the is_end_continuous_period value is true) – in this example, when the continuous period lasted only 1 month. For these, the row defines a period (in this example, 1 month) that also contains the end date, so we just take the enrollment_end_date from that same row. Note that the lead_end_date value is incorrect in this case, since the next row contains the start of the next continuous period, or NULL if the period falls at the end of a partition.

We capture the above logic in a new variable enrollment_end_date_continuous, created in an additional transform, since our previous operation involved a partition.
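A minimal SQL sketch of this final step, assuming the output of step (3) is available as a table named step3 that already contains only rows where is_start_continuous_period is true:

SELECT
  patient_id,
  enrollment_start_date AS enrollment_start_date_continuous,
  CASE
    WHEN is_end_continuous_period THEN enrollment_end_date  -- the period starts and ends in this same row
    ELSE lead_end_date                                       -- otherwise the next kept row held the true end date
  END AS enrollment_end_date_continuous
FROM step3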

We end up with the final table below, containing the patient identifier, the start date (renamed to enrollment_start_date_continuous for consistency), and the end date of each continuous enrollment period.

patient_id

enrollment_start_date_continuous

enrollment_end_date_continuous

1

2012-01-01

2012-02-28

1

2012-04-01

2012-04-30

1

2012-06-01

2012-08-31

2

2012-01-01

2012-01-31


Geography

Angle

Takes three point GEOGRAPHY values, which represent two intersecting lines. Returns the angle between these lines. Point 2 and point 1 represent the first line and point 2 and point 3 represent the second line. The angle between these lines is in radians, in the range [0, 2pi). The angle is measured clockwise from the first line to the second line. –> learn more

ST_ANGLE(@geography, @geography_2, @geography_3)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

@geography_3

any

true

-

Area

Returns the area in square meters covered by the polygons in the input GEOGRAPHY –> learn more

ST_AREA(@geography)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

As GeoJSON

Returns the RFC 7946 compliant GeoJSON representation of the input GEOGRAPHY –> learn more

ST_ASGEOJSON(@geography)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

As text

Returns the WKT representation of an input GEOGRAPHY –> learn more

ST_ASTEXT(@geography)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Azimuth

Takes two point GEOGRAPHY values, and returns the azimuth of the line segment formed by points 1 and 2. The azimuth is the angle in radians measured between the line from point 1 facing true North to the line segment from point 1 to point 2. –> learn more

ST_AZIMUTH(@geography, @geography_2)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Boundary

Returns a single GEOGRAPHY that contains the union of the boundaries of each component in the given input GEOGRAPHY. –> learn more

ST_BOUNDARY(@geography)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Buffer

Returns a GEOGRAPHY that represents the buffer around the input GEOGRAPHY. This function is similar to ST_BUFFERWITHTOLERANCE, but you specify the number of segments instead of providing tolerance to determine how much the resulting geography can deviate from the ideal buffer radius. –> learn more

ST_BUFFER(@geography, @buffer_radius[, num_seg_quarter_circle => @num_seg_quarter_circle][, endcap => @endcap][, side => @side])

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@buffer_radius

or

any ,

true

-

@num_seg_quarter_circle

or

any ,

true

-

@endcap

any of: ROUND, FLAT

false

(Round (default))

@side

any of: BOTH, LEFT, RIGHT

false

(Both (default))

Buffer with tolerance

Returns a GEOGRAPHY that represents the buffer around the input GEOGRAPHY. This function is similar to ST_BUFFER, but you provide tolerance instead of segments to determine how much the resulting geography can deviate from the ideal buffer radius. –> learn more

ST_BUFFERWITHTOLERANCE(@geography, @buffer_radius, @tolerance_meters[, endcap => @endcap][, side => @side])

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@buffer_radius

or

any ,

true

-

@tolerance_meters

or

any ,

true

-

@endcap

any of: ROUND, FLAT

false

(Round (default))

@side

any of: BOTH, LEFT, RIGHT

false

(Both (default))

Centroid

Returns the centroid of the input GEOGRAPHY as a single point GEOGRAPHY. –> learn more

ST_CENTROID(@geography)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Closest point

Returns a GEOGRAPHY containing a point on Geography 1 with the smallest possible distance to Geography 2. This implies that the distance between the point returned by ST_CLOSESTPOINT and Geography 2 is less than or equal to the distance between any other point on Geography 1 and Geography 2. –> learn more

ST_CLOSESTPOINT(@geography, @geography_2)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Contains

Returns TRUE if no point of Geography 2 is outside Geography 1, and the interiors intersect; returns FALSE otherwise. –> learn more

ST_CONTAINS(@geography, @geography_2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Convex hull

Returns the convex hull for the input GEOGRAPHY. The convex hull is the smallest convex GEOGRAPHY that covers the input. A GEOGRAPHY is convex if for every pair of points in the GEOGRAPHY, the geodesic edge connecting the points are also contained in the same GEOGRAPHY. –> learn more

ST_CONVEXHULL(@geography)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Covered by

Returns FALSE if Geography 1 or Geography 2 is empty. Returns TRUE if no points of Geography 1 lie in the exterior of Geography 2. –> learn more

ST_COVEREDBY(@geography, @geography_2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Covers

Returns FALSE if Geography 1 or Geography 2 is empty. Returns TRUE if no points of Geography 2 lie in the exterior of Geography 1. –> learn more

ST_COVERS(@geography, @geography_2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Difference

Returns a GEOGRAPHY that represents the point set difference of Geography 1 and Geography 2. Therefore, the result consists of the part of Geography 1 that does not intersect with Geography 2. –> learn more

ST_DIFFERENCE(@geography, @geography_2)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Dimension

Returns the dimension of the highest-dimensional element in the input GEOGRAPHY. –> learn more

ST_DIMENSION(@geography)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Disjoint

Returns TRUE if the intersection of Geography 1 and Geography 2 is empty, that is, no point in Geography 1 also appears in Geography 2. –> learn more

ST_DISJOINT(@geography, @geography_2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Distance

Returns the shortest distance in meters between two non-empty GEOGRAPHYs. –> learn more

ST_DISTANCE(@geography, @geography_2)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

DWithin

Returns TRUE if the distance between at least one point in Geography 1 and one point in Geography 2 is less than or equal to the Distance argument, otherwise, returns FALSE –> learn more

ST_DWITHIN(@geography, @geography_2, @distance)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

@distance

or

any ,

true

-

Endpoint

Returns the last point of a linestring geography as a point geography. Returns an error if the input is not a linestring or if the input is empty. Use the SAFE prefix to obtain NULL for invalid input instead of an error. –> learn more

[@safe]ST_ENDPOINT(@geography)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@safe

any

true

-
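
A hedged SQL sketch of the SAFE behavior (the SAFE. prefix is the raw-SQL counterpart of the @safe option; aliases are arbitrary):

SELECT
  ST_ENDPOINT(ST_GEOGFROMTEXT('LINESTRING(0 0, 1 1, 2 2)')) AS last_vertex,  -- POINT(2 2)
  SAFE.ST_ENDPOINT(ST_GEOGPOINT(0, 0)) AS not_a_linestring                   -- NULL instead of an error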

Equals

Returns TRUE if Geography 1 and Geography 2 represent the same GEOGRAPHY value. More precisely, one of the following conditions holds: ST_COVERS(geography, geography_2) = TRUE and ST_COVERS(geography_2, geography) = TRUE; or both Geography 1 and Geography 2 are empty. –> learn more

ST_EQUALS(@geography, @geography_2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Exterior ring

Returns a linestring geography that corresponds to the outermost ring of a polygon geography. If the input geography is a polygon, gets the outermost ring of the polygon geography and returns the corresponding linestring. If the input is the full GEOGRAPHY, returns an empty geography. Returns an error if the input is not a single polygon. Use the SAFE prefix to obtain NULL for invalid input instead of an error. –> learn more

[@safe]ST_EXTERIORRING(@geography)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@safe

any

true

-

Geo from

Converts an expression for a STRING or BYTES value into a GEOGRAPHY value. If expression represents a STRING value, it must be a valid GEOGRAPHY representation in one of the following formats: WKT, WKB, GeoJSON –> learn more

ST_GEOGFROM(@expression)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

variable or literal

any

true

-

Geo from GeoJSON

Returns a GEOGRAPHY value that corresponds to the input GeoJSON representation. –> learn more

ST_GEOGFROMGEOJSON(@geojson[, @make_valid])

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geojson

variable or literal

any

true

-

@make_valid

any

true

-

Geo from text

Returns a GEOGRAPHY value that corresponds to the input WKT representation. –> learn more

ST_GEOGFROMTEXT(@wkt[, @oriented][, @planar][, @make_valid])

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@wkt

variable or literal

any

true

-

@oriented

any

true

-

@planar

any

true

-

@make_valid

any

true

-
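
A minimal SQL sketch converting a WKT string literal (longitude before latitude) into a GEOGRAPHY; the alias is arbitrary:

SELECT ST_GEOGFROMTEXT('LINESTRING(-122.40 37.80, -122.30 37.80)') AS line_geog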

Geo from wkb

Converts an expression for a hexadecimal-text STRING or BYTES value into a GEOGRAPHY value. The expression must be in WKB format –> learn more

ST_GEOGFROMWKB(@wkb)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@wkb

variable or literal

any

true

-

Geogpoint

Creates a GEOGRAPHY with a single point. ST_GEOGPOINT creates a point from the specified FLOAT64 longitude and latitude parameters and returns that point in a GEOGRAPHY value. –> learn more

ST_GEOGPOINT(@longitude, @latitude)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@longitude

variable or literal

any integer, float

true

-

@latitude

variable or literal

any integer, float

true

-
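
A minimal SQL sketch; note that longitude comes first, then latitude:

SELECT ST_GEOGPOINT(-122.4194, 37.7749) AS point_geog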

Geogpoint from Geohash

Returns a GEOGRAPHY value that corresponds to a point in the middle of a bounding box defined in the GeoHash –> learn more

ST_GEOGPOINTFROMGEOHASH(@geohash)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geohash

variable or literal

any

true

-

Geohash

Returns a GeoHash representation of the input GEOGRAPHY. The resulting GeoHash will contain at most Max chars characters. Fewer characters correspond to lower precision (or, described differently, to a bigger bounding box). –> learn more

ST_GEOHASH(@geography, @maxchars)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@maxchars

variable or literal

any

true

-
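
A minimal SQL sketch requesting a 6-character GeoHash; the alias is arbitrary:

SELECT ST_GEOHASH(ST_GEOGPOINT(-122.4194, 37.7749), 6) AS geohash_6
-- fewer characters correspond to a coarser (larger) bounding box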

Geometry type

Returns the Open Geospatial Consortium (OGC) geometry type that describes the input GEOGRAPHY as a STRING. The OGC geometry type matches the types that are used in WKT and GeoJSON formats and printed for ST_ASTEXT and ST_ASGEOJSON. ST_GEOMETRYTYPE returns the OGC geometry type with the "ST_" prefix. –> learn more

ST_GEOMETRYTYPE(@geography)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Intersection

Returns a GEOGRAPHY that represents the point set intersection of the two input GEOGRAPHYs. Thus, every point in the intersection appears in both Geography 1 and Geography 2 –> learn more

ST_INTERSECTION(@geography, @geography_2)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Intersects

Returns TRUE if the point set intersection of Geography 1 and Geography 2 is non-empty. Thus, this function returns TRUE if there is at least one point that appears in both input GEOGRAPHYs. If ST_INTERSECTS returns TRUE, it implies that ST_DISJOINT returns FALSE. –> learn more

ST_INTERSECTS(@geography, @geography_2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Intersects box

Returns TRUE if geography intersects the rectangle between [lng1, lng2] and [lat1, lat2]. The edges of the rectangle follow constant lines of longitude and latitude. lng1 and lng2 specify the westmost and eastmost constant longitude lines that bound the rectangle, and lat1 and lat2 specify the minimum and maximum constant latitude lines that bound the rectangle. –> learn more

ST_INTERSECTSBOX(@geography, @lng1, @lat1, @lng2, @lat2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@lng1

variable or literal

any integer, float

true

-

@lat1

variable or literal

any integer, float

true

-

@lng2

variable or literal

any integer, float

true

-

@lat2

variable or literal

any integer, float

true

-
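
A minimal SQL sketch; the rectangle spans longitudes -123 to -122 and latitudes 37 to 38, so the point falls inside it:

SELECT ST_INTERSECTSBOX(
  ST_GEOGPOINT(-122.4, 37.8),
  -123, 37,   -- lng1, lat1
  -122, 38    -- lng2, lat2
) AS intersects_box   -- TRUE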

Is collection

Returns TRUE if the total number of points, linestrings, and polygons is greater than one. An empty GEOGRAPHY is not a collection. –> learn more

ST_ISCOLLECTION(@geography)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Is empty

Returns TRUE if the given GEOGRAPHY is empty; that is, the GEOGRAPHY does not contain any points, lines, or polygons. –> learn more

ST_ISEMPTY(@geography)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Length

Returns the total length in meters of the lines in the input GEOGRAPHY. –> learn more

ST_LENGTH(@geography)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Make line

Creates a GEOGRAPHY with a single linestring by concatenating the point or line vertices of each of the input GEOGRAPHYs in the order they are given. –> learn more

ST_MAKELINE(@geography, @geography_2)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Make polygon

Creates a GEOGRAPHY containing a single polygon from a linestring input, where the input linestring is used to construct a polygon ring. –> learn more

ST_MAKEPOLYGON(@geography)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-
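
A minimal SQL sketch building a polygon from a closed linestring; the alias is arbitrary:

SELECT ST_MAKEPOLYGON(
  ST_GEOGFROMTEXT('LINESTRING(0 0, 0 1, 1 1, 1 0, 0 0)')
) AS square_polygon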

Max distance

Returns the longest distance in meters between two non-empty GEOGRAPHYs; that is, the distance between two vertices where the first vertex is in the first GEOGRAPHY, and the second vertex is in the second GEOGRAPHY. If Geography 1 and Geography 2 are the same GEOGRAPHY, the function returns the distance between the two most distant vertices in that GEOGRAPHY. –> learn more

ST_MAXDISTANCE(@geography, @geography_2)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Num geometries

Returns the number of geometries in the input GEOGRAPHY. For a single point, linestring, or polygon, ST_NUMGEOMETRIES returns 1. For any collection of geometries, ST_NUMGEOMETRIES returns the number of geometries making up the collection. ST_NUMGEOMETRIES returns 0 if the input is the empty GEOGRAPHY. –> learn more

ST_NUMGEOMETRIES(@geography)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Num points

Returns the number of vertices in the input GEOGRAPHY. This includes the number of points, the number of linestring vertices, and the number of polygon vertices. –> learn more

ST_NUMPOINTS(@geography)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Perimeter

Returns the length in meters of the boundary of the polygons in the input GEOGRAPHY. –> learn more

ST_PERIMETER(@geography)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Point N

Returns the Nth point of a linestring geography as a point geography, where N is the index. The index is 1-based. Negative values are counted backwards from the end of the linestring, so that -1 is the last point. Returns an error if the input is not a linestring, if the input is empty, or if there is no vertex at the given index. Use the SAFE prefix to obtain NULL for invalid input instead of an error. –> learn more

[@safe]ST_POINTN(@geography, @index)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@index

variable or literal

any

true

-

@geography

any

true

-

@safe

any

true

-
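
A minimal SQL sketch showing the 1-based index and a negative index counting back from the end; aliases are arbitrary:

SELECT
  ST_POINTN(ST_GEOGFROMTEXT('LINESTRING(0 0, 1 1, 2 2)'), 1)  AS first_vertex,  -- POINT(0 0)
  ST_POINTN(ST_GEOGFROMTEXT('LINESTRING(0 0, 1 1, 2 2)'), -1) AS last_vertex    -- POINT(2 2)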

Simplify

Returns a simplified version of the input GEOGRAPHY. The input GEOGRAPHY is simplified by replacing nearly straight chains of short edges with a single long edge. The input geography will not change by more than the tolerance specified by tolerance_meters. Thus, simplified edges are guaranteed to pass within tolerance_meters of the original positions of all vertices that were removed from that edge. The given tolerance_meters is in meters on the surface of the Earth. –> learn more

ST_SIMPLIFY(@geography, @tolerance_meters)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@tolerance_meters

variable or literal

any integer, float

true

-
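
A hedged SQL sketch, assuming a hypothetical table routes with a geography variable route_geog:

SELECT ST_SIMPLIFY(route_geog, 100) AS simplified_route   -- simplified edges stay within 100 meters of the removed vertices
FROM routes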

Snap to grid

Returns the input GEOGRAPHY, where each vertex has been snapped to a longitude/latitude grid. The grid size is determined by the grid_size parameter which is given in degrees. –> learn more

ST_SNAPTOGRID(@geography, @grid_size)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@grid_size

variable or literal

any integer, float

true

-

Start point

Returns the first point of a linestring geography as a point geography. Returns an error if the input is not a linestring or if the input is empty. Use the SAFE prefix to obtain NULL for invalid input instead of an error. –> learn more

[@safe]ST_STARTPOINT(@geography)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@safe

any

true

-

Touches

Returns TRUE provided the following two conditions are satisfied: (1) Geography 1 intersects Geography 2 and (2) the interior of Geography 1 and the interior of Geography 2 are disjoint. –> learn more

ST_TOUCHES(@geography, @geography_2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Union

Returns a GEOGRAPHY that represents the point set union of all input GEOGRAPHYs. –> learn more

ST_UNION(@geography, @geography_2)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

Within

Returns TRUE if no point of Geography 1 is outside of Geography 2 and the interiors of Geography 1 and Geography 2 intersect. Given two geographies a and b, ST_WITHIN(a, b) returns the same result as ST_CONTAINS(b, a). Note the opposite order of arguments. –> learn more

ST_WITHIN(@geography, @geography_2)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@geography_2

any

true

-

X Min

Returns a float representing the west-most constant longitude line that bounds the geometry –> learn more

ST_BOUNDINGBOX(@geography).xmin

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

X Max

Returns a float representing the east-most constant longitude line that bounds the geometry –> learn more

ST_BOUNDINGBOX(@geography).xmax

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Point X

Returns the longitude in degrees of the single-point input GEOGRAPHY –> learn more

[@safe]ST_X(@geography)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@safe

any

true

-

Point Y

Returns the latitude in degrees of the single-point input GEOGRAPHY –> learn more

[@safe]ST_Y(@geography)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@safe

any

true

-

Y Min

Returns a float representing the minimum constant latitude line that bounds the geometry. –> learn more

ST_BOUNDINGBOX(@geography).ymin

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Y Max

Returns a float representing the maximum constant latitude line that bounds the geometry. –> learn more

ST_BOUNDINGBOX(@geography).ymax

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-
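
A hedged SQL sketch of the four bounding-box accessors above (X Min, X Max, Y Min, Y Max) applied to a literal geography; aliases are arbitrary:

SELECT
  ST_BOUNDINGBOX(geog).xmin AS west_longitude,
  ST_BOUNDINGBOX(geog).xmax AS east_longitude,
  ST_BOUNDINGBOX(geog).ymin AS south_latitude,
  ST_BOUNDINGBOX(geog).ymax AS north_latitude
FROM (SELECT ST_GEOGFROMTEXT('LINESTRING(-122.5 37.7, -122.3 37.9)') AS geog)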

Centroid aggregate

Computes the centroid of the set of input GEOGRAPHYs as a single point GEOGRAPHY. –> learn more

ST_CENTROID_AGG(@geography)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

Geo aggregate

Returns a geography variable that represents the point set union of all input geographies. –> learn more

ST_UNION_AGG(@geography)

Return type

geography

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-
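
A hedged SQL sketch of the two aggregate functions above (Centroid aggregate and Geo aggregate), assuming a hypothetical table sites with a geography variable site_geog:

SELECT
  ST_CENTROID_AGG(site_geog) AS overall_centroid,
  ST_UNION_AGG(site_geog)    AS combined_area
FROM sites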

DBSCAN clustering

Identifies high-density geography clusters and marks outliers in low-density areas of noise –> learn more

ST_CLUSTERDBSCAN(@geography, @epsilon, @minimum_geographies)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@geography

any

true

-

@epsilon

any integer, float

true

-

@minimum_geographies

any

true

-
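
A hedged SQL sketch: in raw SQL this clustering function is evaluated with an OVER clause, here assuming a hypothetical table observations with a geography variable point_geog:

SELECT
  point_geog,
  ST_CLUSTERDBSCAN(point_geog, 1000, 5) OVER () AS cluster_id   -- epsilon = 1000 meters, at least 5 geographies per cluster
FROM observations
-- geographies classified as noise are expected to receive a NULL cluster_id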
