
Redivis Documentation

Data retention policy

Overview

Retaining data is vital to establishing long-lasting data linkages and making research available. Redivis' data retention policy is aligned with the goal of keeping all data available and accessible whenever possible.

Owner-initiated deletion

Data owners always have the ability to remove their data from Redivis or make it inaccessible if they choose to do so.

Redivis also has policies in place in case of an accidental deletion:

  • All data is fully backed up, with point-in-time recovery, over a 7-day rolling window. Deletions and modifications to data are permanent after 7 days.

  • All metadata is backed up with point-in-time recovery over a 7-day rolling window, with daily backups of metadata retained for up to 1 year.

Redivis-initiated deletion

Data storage that exceeds the free tier must be paid for by the data owner or other sponsor. Upon payment failure, every effort will be made to contact the data owner and re-establish payment.

If unsuccessful, data may be destroyed after 30 days of non-payment, though all metadata and permanent landing pages will be persisted.

Open access

Overview

The ability to view, reproduce, and build upon other works is a core tenet of the scientific process that Redivis is built to uphold and facilitate. Redivis tools for data storage and analysis comply with federal mandates for open access and automatically generate artifacts documenting all steps in the data process, ensuring reproducibility.

Background

In August 2022, the United States Office of Science and Technology Policy (OSTP) released a memorandum on public access to federally funded research (known as the "Nelson Memo"), which outlined new requirements affecting both faculty and students who conduct research using federal funding:

  1. Make publications and their supporting data resulting from federally funded research publicly accessible without an embargo on their free and public release.

  2. Enact transparent procedures that ensure scientific and research integrity is maintained in public access policies.

  3. Ensure equitable delivery of federally funded research results and data.

1. Public accessibility

Redivis is built to make data sharing as easy and safe as possible, defaulting to fully accessible data, metadata, and documentation. However, some data must be restricted to protect privacy and to comply with data licensing agreements. In these cases Redivis follows an "as much as securely possible" sharing philosophy.

  • Multiple levels of access allow data owners to grant access to documentation, metadata, samples, and data separately, so that as much access as possible can be granted.

  • When research is done using restricted data, analysis code is made available when possible.

2. Transparent procedures

Redivis systems are built to automatically capture and document any work done on datasets and analyses in standardized formats, while giving researchers and administrators the ability to supplement or override these records when necessary.

  • Datasets are automatically versioned. All changes to variables and rows are recorded and made available to administrators and data viewers.

  • Redivis offers a no-code transform interface to build queries that compiles to SQL code. This code is available to the analyst and any viewers.

  • Every step taken in an analysis workflow is recorded sequentially in a visual format that is easy to follow.

  • Workflows are version controlled. Every step is versioned and time-stamped, allowing users to revert to a previous iteration, or a viewer to understand how queries might have changed over time.

  • Workflows can be forked, preserving relationships between workflows and datasets.

3. Equitable delivery

Redivis is free and accessible to any researcher, reviewer, or casual data browser. By centralizing multiple complex data storage and analysis systems behind a clear UI, full workflows can be understood without specialized technical knowledge.

  • Redivis uses transparent, open-source formats, allowing data, analyses, and workflows to be exported and used within other systems.

  • No-code interfaces allow researchers at any skill level to build and understand research steps.

  • Public data and analysis workflows on Redivis can be viewed by anyone and do not require a Redivis account.

  • If any piece of analysis or data is restricted, a user will need a Redivis account to apply for access. Redivis accounts are completely free with no restrictions.

  • APIs make data and workflows available to researchers and administrators using their own systems.

  • Common, open-source coding languages (SQL, Python, R) are a foundation of Redivis tools and are supported in notebooks.

  • There is no cost to researchers to store personal data or run data queries.

FAIR data practices

Overview

Redivis is built on FAIR data principles to support data discovery and reusability at all levels. Data practices adhering to these principles emphasize data that is Findable, Accessible, Interoperable, and Reusable.

Findable

Data is findable when (1) data and metadata are assigned a globally unique and persistent identifier, (2) data are described with rich metadata, (3) metadata clearly and explicitly include the identifier of the data they describe, and (4) metadata and data are registered or indexed in a searchable resource.

At Redivis:

  • DOIs can be issued for datasets, facilitating authoritative citation and attribution.

  • All datasets have a provenance section of documentation which auto-populates with administrator actions and allows for additional linking to other artifacts that were part of the dataset creation process.

  • Redivis has comprehensive search tools that index all aspects of a dataset, including its metadata, documentation, variable names, and variable documentation.

Accessible

Data is accessible when: (1) data and metadata are retrievable by their identifier using a standardized communications protocol, (2) the protocol is open, free, and universally implementable, (3) the protocol allows for an authentication and authorization procedure, where necessary, (4) metadata are accessible, even when the data are no longer available.

At Redivis:

  • Data can be explored through any web browser.

  • Researcher accounts to apply for data access or run analyses are always free.

  • Public data and analyses can be explored without an account.

  • Researcher accounts can be linked to institutional login credentials.

Interoperable

Data is interoperable when: (1) data and metadata use a formal, accessible, shared, and broadly applicable language for knowledge representation, (2) data and metadata use vocabularies that follow FAIR principles, (3) data and metadata include qualified references to other data and metadata.

At Redivis:

  • Robust APIs support interoperability with other tools.

  • Analysis tools use common languages such as SQL, Python, R, Stata, and SAS.

  • Data and metadata are available for download in multiple common formats.

Reusable

Data is reusable when: (1) data and metadata are richly described with a plurality of accurate and relevant attributes, (2) data and metadata are released with a clear and accessible data usage license, (3) data and metadata are associated with detailed provenance, and (4) data and metadata meet domain-relevant community standards

At Redivis:

  • Many dataset, table, and variable metadata fields are automatically populated based on user action, with the ability to be manually adjusted.

  • All datasets have a Usage tab with information on how they have been viewed and used across Redivis.

  • Redivis cloud-based analysis tools encourage data users to do analysis alongside the data rather than downloading it and breaking linkages.

  • Datasets are automatically versioned.

  • Workflows containing analyses are self-documenting and capture the full analysis pipeline.

  • Anyone can fork a workflow they have access to in order to continue an analysis.

Introduction

What is Redivis?

Redivis is a data platform for academic research. It provides organizations with a central hub where researchers can easily discover, access, and analyze their data.

Interested in bringing your organization onto Redivis? Contact us!

Who uses Redivis?

Organization administrators upload, catalogue, and grant access to datasets.

Researchers discover, query, and analyze data.

Redivis makes it easy for everyone to:

  • Discover datasets

  • Apply for access

  • Query large datasets

  • Analyze data

  • Collaborate

  • Host, brand, and distribute data within an organization

  • Version & document datasets

  • Integrate with existing data pipelines

Using this documentation

This documentation is broken up into Guides, Examples, and Reference sections.

We recommend that new users begin by following the relevant guides and examples, while the reference section can serve as a comprehensive overview of everything that Redivis can do.

Certain users may want to dive into the more technical API documentation.


Discover & access data

Overview

Most data on Redivis is uploaded through an Organization, and that is the best place to find Datasets to work with. Some datasets will be public, while others will require certain steps before you can get full access to the data.

1. Find datasets

From an Organization's home page, click on the Datasets tab to browse all datasets and filter by their metadata.

You can use filters on the left bar to narrow down your results. All searches perform a full-text search across a dataset, its documentation, tables, variables, and rich metadata content.

If your organization is part of an institution (such as Stanford or Columbia) you can also go to the institution page to search all datasets in your institution.

Not finding what you need? You can also upload your own data to Redivis through your workspace to augment a research project you're working on and to share with other researchers. Or if you represent a research group, center, or institution, contact us about setting up your own Organization page.

2. Get to know a dataset

Click on any dataset title to go to that Dataset page.

You can view the metadata (including an abstract, documentation, citation information, and a version history) and dig into the data directly on the Tables and/or Files tab.

Each table contains a data browser to view cells, generate summary statistics, and display metadata for this table and each variable. Each file can be previewed and downloaded. Both tables and files can be added to workflows for analysis.

Many datasets on Redivis are public, while others have requirements for certain levels of access enforced by the data owner.

On the right side of the dataset page, you can see your current access level. If you have Metadata access you can see the variable information and summary statistics, but you'll need to gain Data access in order to view the data contents or work with this data.

3. Apply for access

On any restricted dataset you can click the Apply for access button in the top right of the page to see a list of steps required for you to gain different access levels to this dataset.

The first step is to become a member of the organization. When you go to apply, you will be prompted to join using the credentials on your account.

Once you're a member of this organization, you might need to fill out some requirements that administrators have set up, or request access directly from them.

Requirements are forms that you will need to complete on this page and then submit. Once these are approved by your organization's administrators you will gain access to the dataset.

Next steps

Start working with your data

Once you have data you're interested in, bring it into a workflow to transform and analyze it, leveraging lightning-fast tools from your browser. Learn more in the Analyze data in a workflow guide.

Upload your own datasets

Augment your data analysis in Redivis by uploading your own datasets, with the option to share them with your collaborators (or even the broader research community). Learn more in the Create & manage datasets guide.

Getting started

Overview

Redivis is a platform that allows researchers to seamlessly discover, access, and analyze data. This brief guide will walk you through the basics to get up and running and provide a launching point for exploring other resources in this documentation.

1. Create your account

Many datasets on Redivis are public, and you can browse them without creating an account. However, you'll need an account in order to analyze data, as well as to apply for access to restricted datasets.

Click the Create account button in the top right of any page to sign up. You can use your academic institution's login credentials or any Google account.

Once you create an account, you'll be navigated to your Workspace, which is your private area for creating workflows to work with data, uploading your own datasets, and managing your Redivis account.

2. Join your organization

The main entity on Redivis is a dataset. Most datasets on Redivis are uploaded by organizations.

You might already know the organization that you'll be working with, or you can browse organizations by clicking the Find organizations button on the Organizations tab of your Workspace. You might also know your organization's URL already, or see it listed in the Recent section of your Dashboard if you have already viewed that page.

Once you join an organization, you'll be able to apply for any of their restricted datasets, and also see a link on the left bar of your workspace to quickly get back to the organization's home page.

3. Find datasets

The best place to find datasets on Redivis is to search on an organization's home page. Here you can click on the Datasets tab to browse all datasets and filter by their metadata. All searches perform a full-text search across a dataset, its documentation, tables, variables, and rich metadata content.

If your organization is part of an institution (such as Stanford or Columbia) you can also go to the institution page to search all datasets in your institution. Learn more in the Discover & access data guide.

Click on any dataset title to go to that Dataset page where you can view the data and metadata it contains. The data for this dataset will be available on the Tables tab, which you can explore further.

You will also probably come across restricted datasets. For these you will need to click the Apply for access button on the top right of the Dataset page and complete the requirements to gain approval from the dataset's administrators.

4. Analyze data

Once you've found a dataset that you want to work with, you can add it to a Workflow. Workflows are the fundamental analysis interface on Redivis, where you can query, merge, reshape, and analyze any dataset that you have access to — all from within your web browser.

Add a dataset to a workflow by clicking the Add to workflow button on the top right of the dataset page.

In a workflow we can create a Transform by selecting any table and clicking the +Transform button, which allows us to combine and reshape our data into a final output table that best serves our analysis. These transforms use a powerful SQL engine under the hood, allowing us to query incredibly large tables - even billions of records - in seconds.

After creating output tables for analysis, we can create a computational Notebook in R, Python, Stata, or SAS to further analyze our data and develop our final figures. Select any table and click the +Notebook button to get started. The notebook will initialize and pull in the table you've selected (or a sample of the table if it is a large table).

We can also export and query this table from external environments, allowing us to use whatever analytical tools best suit our research question, by clicking the Export table button on the right side of any table. Here we can download the table in a number of common formats, interface with it programmatically via the API, or export the table to supported integrations.
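
As a rough sketch of the programmatic route, the snippet below uses the Redivis Python client (assuming it is installed, e.g. via pip install redivis) to read a table into a dataframe. The organization, dataset, and table names are placeholders, and the exact reference formats and accessors are covered in the API documentation.

import redivis

# Placeholder names: substitute the owner, dataset, and table you want to read.
# Workflow output tables can be referenced similarly; see the API documentation.
table = redivis.organization("demo_organization").dataset("example_dataset").table("example_table")

# Load the table into a pandas dataframe for use with local tools.
df = table.to_pandas_dataframe()
print(df.head())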

Next steps

Upload your own datasets

Augment your data analysis in Redivis by uploading your own datasets, with the option to share them with your collaborators (or even the broader research community). Learn more in the Create & manage datasets guide.

Administer your organization

Organizations allow research groups and centers to securely distribute data to their research community. Organization administrators can create datasets, manage access, review logs, and create customized reports of their data utilization.

Contact an existing administrator to add you to their organization, or contact us to set up a new organization. Learn more in the Administer an organization guide.


Create a study

Overview

Studies are a way for one or more collaborators to work with data on the same conceptual topic. You can add multiple collaborators, datasets, and workflows to a study to organize your work. In addition, some restricted data requires you to submit access applications as a study via study requirements.

1. Create a new study

On the studies tab of your workspace you can create a new study and see all studies you are a part of. This study might represent a topic or group you are working with, maybe to investigate a similar cluster of research questions. You can give the study a name and description to reflect that purpose.

2. Add collaborators

If you are working with anyone else on this study you can add them as a collaborator. Anyone who is a collaborator on this study will be able to view and edit the study the same as you can. Once added, this study will appear in their workspace.

PI

One person on the study can be designated as the PI. This designation has no functional meaning within Redivis interfaces; it is there to inform others in the group, and the administrators you are applying to for access, of the group structure.

3. Add datasets

You can add any dataset to this study that you plan on working with, and remove it later if your topic shifts. This space is intended for you to gather resources that you are using in one place to make it easier to create workflows or manage data access. Datasets can be added to any number of studies.

4. Add workflows

You can create a new workflow in this study or move an existing one in. All workflows can be in at most one study, and you can see what study a workflow is in on the workflow's overview page.

Granting workflow access to study members

One of the options for sharing this workflow is to grant view or edit access to all members of the study it's in. This can be an easy way to give everyone in a group access to a workflow you are working on.

As with all workflows, even though someone might have access to the workflow they will need to have independent access to all the datasets it contains in order to view or query them.

5. Apply for data access via a study

If you're working with restricted data you might come across a dataset that has a study requirement as part of its access requirements. You can identify this by the study icon next to the requirement name.

Study requirements are filled out and submitted on behalf of an entire study, rather than one person. An approved study requirement will be valid for all collaborators on the study, unlike member requirements where each individual needs to complete the requirement on their own account.

To submit a study requirement you can navigate to a restricted dataset and open the access modal. You will need to select your study from the dropdown menu, and a submit button will appear. One member of your study will fill this out and submit it on behalf of your group. Once approved, all study collaborators will see an approved requirement in their access modal.

Data usage

When restricted data has a study requirement, the data administrator can approve your application submission for use of the data within that study. In order to query data in a workflow, that workflow will need to be in the approved study. If it is not, you will see a badge noting Limited access and you will not be allowed to run transforms or query this data in a notebook.

Next steps

Start analyzing your data

Once you've gained access and set up a study, it's time to add your datasets to a workflow to transform and analyze them, leveraging lightning-fast tools from your browser. Learn more in the Analyze data in a workflow guide.

Apply to access restricted data

Overview

Many datasets on Redivis are public, while others have requirements for certain levels of access enforced by the data owner. If the dataset is public, you can skip over this section, otherwise you'll need to gain access before you can fully utilize the data.

1. View access rules

On the right side of all restricted dataset pages is a box which displays your current access level. If this doesn't say Data access, then your next step will be to apply for access.

Click the Apply for access button on the top of the page to open the access modal. This modal allows you to manage your access to this particular dataset throughout the platform.

Access levels

Dataset access has five levels:

  1. Overview: the ability to see a dataset and its documentation.

  2. Metadata: the ability to view variable names and summary statistics.

  3. Sample: the ability to view and query a dataset's 1% sample. This will only exist for datasets that have a sample configured.

  4. Data: the ability to view and query a dataset's tables, and work with them in workflows.

  5. Edit: the ability to edit the dataset and release new versions.

Access levels are cumulative. For example, in order to gain data access you will need to have gained metadata access as well.

Usage rules

Usage rules are restrictions placed on the way you can use this data once you've gained access. For example, data exports may be restricted to certain export environments, and/or require administrator approval.

While no action is required when you first apply for access to a dataset, it's helpful to review these restrictions to better understand what limitations, if any, they might pose to your research.

2. Apply for access

The access modal for the dataset will include all steps designated by this dataset's administrators to gain access to work with data.

Membership

To apply for access to datasets hosted by a Redivis organization, you will first need to Join the organization.

To join an organization, you will need to provide information about your identity via an email authentication, usually the email associated with your Redivis account.

After submitting your membership application, it will either be automatically approved or will be marked as pending while an administrator reviews your submission. You'll receive a notification if your submission is approved or rejected.

Requirements

Some datasets have requirements as part of their access applications. Requirements are global to an organization and usually contain some type of form. Once all requirements for a particular level of access are approved, you will gain access to the relevant dataset(s).

In the access modal, click the Apply button on any requirement to fill out and submit the form.

When you submit a requirement, an administrator of the organization will be alerted and can review your submission for approval (in some cases, the requirement may be configured for auto-approval upon submission). You will receive a notification if your submission is rejected or approved, or in the future when it's about to expire. You can also leave a comment for an organization's administrators here, and they can reply to you.

There are two types of requirements: those you fill out on behalf of yourself (member requirements) and those you fill out on behalf of a study (study requirements). If the dataset you are interested in has a study requirement, you'll need to create a study and select it in the access modal in order to submit the requirement. Additionally, any workflows that utilize this dataset will need to be a part of the approved study.

The data owner is notified as soon as a requirement is submitted, and you will receive a notification when it is approved or rejected, or if there are any requests for additional information. By default, you will receive an email as well as a notification within Redivis; you can customize this notification behavior in your workspace.

Direct access

Sometimes you don't need to fill out any requirements and can simply Request access at a specific level. You can do so in this modal, and will receive a notification when you are granted access by the dataset owner.

3. Wait for data owner approval

For some membership and requirement submissions (and all direct access requests), you will need to wait for the data owner to respond to your submission. If the dataset is owned by a user, they will need to approve your request; if it is owned by an organization, any administrator at that organization can approve it.

4. Re-authenticate (if necessary)

If you are accessing restricted data hosted by an organization, you may have to re-authenticate after a period of inactivity anytime you perform an action governed by the "Data" access level.

After a period of inactivity (set between 15 minutes and 3 days by the organization administrator), a popup will ask you to re-authenticate anytime you view data, query tables in a transform, or export data from Redivis.

Next steps

Start working with your data

Add this dataset to a workflow to transform and analyze it leveraging lightning fast tools from your browser.

Reshape data in transforms

Overview

Transforming tables in a workflow is a crucial step when working with data on Redivis. Conceptually, transforms execute a query on a source table and results are materialized in a new output table. They are optimized to run on billions of records in seconds, and create a transparent, reproducible record of data transformation steps you've taken in your workflow.

In most cases you'll want to use transforms to create an output table containing all the information you're interested in before analyzing that table in a notebook or exporting it for further use.

1. Create a transform

Once you've created a workflow and added a dataset to it that you'd like to work with, get started by creating a Transform. You can do this by clicking on any dataset or table.

To build this transform, you will add steps which each take an action to shape the data in the output table. You can choose from many named steps which include a point and click interface that compile to SQL code, or you can add a SQL query step to write code directly. You can always view the code that your step is generating and switch to code if you'd like to edit it.

While SQL might not be a familiar language, it is optimized for data cleaning procedures and allows transforms to execute extremely quickly. It also allows you to write your data transformations in a declarative, reproducible manner.

2. Join additional tables

The first thing you'll want to do is decide if you have all the information you need in this table or if you'd like to join in an additional table or tables. You can reference any other table from either 1) this dataset, 2) another dataset, or 3) an output table from a different transform in this workflow. Any table you want to reference needs to be in this workflow, so you can click the Add dataset button to add a new dataset to the workflow if it's not here already.

To join a table, add a Join step and select the table you'd like to join. You'll then need to select what type of join it will be, and build your join condition. In most cases your join condition will be linking two variables of the same name and type together (e.g. join all records where id = id).
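
For reference, the SQL generated by a simple left join looks roughly like the sketch below; the step builder writes the exact code for you, and the table and variable names here are hypothetical.

# Illustrative only: roughly the SQL a Join step compiles to.
# Table and variable names are hypothetical.
join_step_sql = """
SELECT t.*, r.extra_variable
FROM source_table AS t
LEFT JOIN other_table AS r
  ON t.id = r.id
"""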

3. Create new variables

You might want to generate new columns in this table by creating a new variable. You can do so by adding a Create variables step. Start by giving this variable a name and then choosing what method you want to use to create it.

Some methods are restricted to the type of variable you are working with. For example there are options to add or subtract years from a date variable, or concatenate string variables.

One of the most common new variable methods is Case (if/else). This allows you to set up a statement that looks at the content of each record and evaluates it based on conditions you've set up to generate the value. For example you can say that in the new variable you're creating, if the amount in column A is greater than 1000, the value of this new variable will be set to "high" and if not, then it will be set to "low."
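
Under the hood this compiles to a CASE expression; a rough, purely illustrative sketch for the high/low example (with hypothetical column names) is below.

# Illustrative only: roughly the SQL a Case (if/else) step compiles to
# for the high/low example above. Column names are hypothetical.
case_step_sql = """
SELECT
  *,
  CASE
    WHEN amount > 1000 THEN 'high'
    ELSE 'low'
  END AS amount_level
FROM source_table
"""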

You can create any number of new variables, and they will execute sequentially, allowing you to reference variables you've created in other new variables and subsequent sections.

4. Filter records

You'll probably want to reduce the number of records in this table to exclude any that aren't relevant to the final output table you're creating. This will allow you to execute transforms quickly and get a better understanding of the number of relevant records you have.

To filter records, add a Filter step to your transform and start building the conditions that records will need to meet in order to stay in your output table. These statements can be nested and allow you to reference any variables or record values.

If you find yourself working with a standard list of values that you're using in multiple places, this is also a place to save time and enhance reproducibility by creating and referencing a value list.
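
Filters compile to an ordinary WHERE clause; a rough illustration with hypothetical variables and values is below.

# Illustrative only: roughly the SQL a Filter step compiles to.
# Variable names and values are hypothetical.
filter_step_sql = """
SELECT *
FROM source_table
WHERE state = 'CA'
  AND (amount > 1000 OR amount IS NULL)
"""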

5. Aggregate data

Depending on the structure of your data you might want to aggregate the data to collapse multiple records down to one, while preserving some information about what was dropped. Conceptually this might look like aggregating a table of charges from one record per charge into one record per person including a variable for the total charge per person.

To get started, add an Aggregate step and select the variables that you want to aggregate on. These will be the variables that exist in your output data after the aggregation is finished. All records that are an exact match across these selected variables will be collapsed into one.

You can also capture information about records being dropped by creating a new aggregate variable. For example, maybe you are aggregating a table with multiple test scores per person down to a table with just one record per person. You can create an aggregate variable with the average test score, or the count of the number of tests each person took.
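
For the test-score example, the generated SQL is roughly a GROUP BY query like the sketch below (table and variable names are hypothetical).

# Illustrative only: roughly the SQL an Aggregate step compiles to
# for the test-score example above. Names are hypothetical.
aggregate_step_sql = """
SELECT
  person_id,
  AVG(test_score) AS avg_test_score,
  COUNT(*) AS test_count
FROM source_table
GROUP BY person_id
"""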

6. Select variables

Finally, before running your transform you'll always need to select which variables you want to keep in your output table. Perhaps you referenced a variable in this transform to create a new one, and now you don't need it anymore. Cutting variables means faster execution, and it is easy to add any variables back later and re-run the transform if you realize you need them, so in general try to keep this list as short as possible.

To propagate variables forward into your output table, they need to be in the right-hand box at the top of the transform labeled Keep. You can select any variable, or set of variables and click the > arrow button to move them over.

7. Run transform and sanity check output

When you're ready to execute the transform click the Run button in the top right of the toolbar.

If this button is disabled, it might be because the transform is invalid for some reason. Hovering on this button will give you more information, or you can look for the alert symbol (!) to see where you'll need to fix the transform in order to make it valid to run.

Successfully running a transform will create a new output table. Click on this table to see your output and verify that the transform executed as you would expect. Does it have the approximate number of records you expect to see? If you click on a variable you can see summary statistics which can help you verify key variable concepts, such as the distinct number of values.

Every time you run a transform you will want to sanity check your output.

8. Make changes and re-run

If there are any issues with your output table, or if you decide to go in a different direction or add another step, it is easy to go back to your transform and start making changes.

You'll notice that an edited transform turns yellow, and so does its output table. These edited and stale states help you keep track of work you're doing in your workflow, and you can easily revert to the previous successfully executed transform state at any point.

Run this transform again to see changes in the output table. This guide describes all the steps you can take in a single transform but perhaps you want to do one step at a time and run it in between each step to sanity check your output. This system is designed for iteration so make as many changes as you want to experiment and build the output you want.

9. Create another transform

From here you can continue to create transforms on your output table, tables from the dataset you added, or add any other dataset on Redivis. You might want to fit as many steps as you can into one transform, or make a long chain to more easily track your work and communicate it to others.

As you build more transforms you'll see that sometimes actions you take create stale chains of transforms. You can easily make upstream changes in your workflow (such as upgrading a dataset from the sample to the full dataset, or updating to a new version) and then run all transforms in your workflow with one click (a Run all option is available in the Map button menu).

Next steps

Work with data in a notebook

You can use notebooks in a workflow to analyze data using Python, R, Stata, or SAS. These notebooks run in the browser with no additional configuration and offer seamless sharing with collaborators. Learn more in the Work with data in notebooks guide.

Export data

If you'd like to export data to a different system, you can download it in various file formats, reference it programmatically in Python / R, or visualize it in tools such as Google Data Studio. Learn more in the Export to other environments guide.

Upload your own datasets

Augment your data analysis in Redivis by uploading your own datasets, with the option to share them with your collaborators (or even the broader research community). Learn more in the Create and populate a dataset guide.

Share and collaborate

Redivis workflows are built for collaboration and include real-time visuals to see where collaborators with edit access are in the workflow, and a comments interface to discuss changes asynchronously.

Share your workflow to work with collaborators in real time, and make it public so that others can fork off of and build upon your work.

Work with data in notebooks

Overview

Redivis notebooks provide a performant, flexible environment for analysis, allowing you to analyze and visualize data in workflows using Python, R, Stata, or SAS. With the notebook computation happening on Redivis, you don't need to configure an environment on a local machine or server, or export data from Redivis. This makes for easy iteration and collaboration, not to mention better security and data throughput.

Before working with a notebook you'll want to get started first by creating a workflow and adding data. You can then create a notebook off of any table in your workflow.

1. Create a notebook

Once you have a table that you're ready to analyze, you can create a notebook by clicking the + Notebook button at any time. You'll need to name it and choose a kernel (Python, R, Stata, or SAS).

Notebooks can only reference tables within their workflow, so we recommend keeping all related work together in the same workflow.

If you are working with very large tables (>10GB is a good rule of thumb), it's always a good idea to first reshape and reduce the data via transforms, since they can be significantly more performant for large data operations than running code in Python, R, Stata, or SAS.

Python

Python notebooks come pre-installed with a variety of common scientific packages for Python. Learn more about working with Python notebooks.

R

R notebooks come pre-installed with a variety of common scientific packages for R. Learn more about working with R notebooks.

Stata

Stata notebooks are based off of Python notebooks, but offer affordances for moving data between Python and Stata. Learn more about working with Stata notebooks.

SAS

SAS notebooks are based off of Python notebooks, but offer affordances for moving data between Python and SAS. Learn more about working with SAS notebooks.

2. Define dependencies

All notebooks come with a number of common packages pre-installed, depending on the kernel. If there is something specific you'd like to include, you can add your own dependencies by clicking the Edit dependencies button in the start modal or the toolbar.

3. Compute resources

The default notebook configuration is free, and provides access to 2 CPUs and 32GB of working memory, alongside a 60GB SSD disk and gigabit network. The computational power of these default notebooks is comparable to most personal computers, and will be more than enough for many analyses.

If you're working with larger tables, creating an ML model, or performing other particularly intensive tasks, you may choose to configure additional compute resources for the notebook. This will cost an hourly rate to run based on your chosen environment, and require you to purchase compute credits on your account.

Clicking the Edit compute configuration button in the start modal or the toolbar will allow you to choose from different preconfigured machine types. The notebook will then default to this compute configuration each time it starts up. Learn more in the Compute resources reference section.

4. Start the notebook

Notebook nodes need to be started in order to edit or execute cells. When first clicking on a notebook node, you will see a read-only view of its contents (including cell outputs). Click the Start notebook button in the toolbar to connect this notebook to compute resources.

When you create a notebook for the first time it will start automatically.

5. Load data

To do meaningful work in your notebook, you'll want to bring the tabular and/or unstructured data that exists in your workflow into your notebook.

Referencing tables

Notebooks come pre-populated with templated code that pulls in data from the notebook's source table. You will need to run this cell to pull the data into the notebook, and you can see that it worked because this code will print a preview of the loaded data.

Python

import redivis

# The source table of this notebook can always be referenced as "_source_"
table = redivis.table("_source_")

# Load table as a pandas dataframe.
# Consult the documentation for more load options.
df = table.to_pandas_dataframe()

# We can also reference any other table in this workflow by name.
df2 = redivis.table("my_other_table").to_pandas_dataframe()

print(df)
print(df2)

R

# The source table of this notebook can always be referenced as "_source_"
redivis_table <- redivis$table("_source_")

# Load table as a tidyverse tibble.
# Consult the documentation for more load options.
df <- redivis_table$to_tibble()

# We can also reference any other table in this workflow by name.
df2 <- redivis$table("my_other_table")$to_tibble()

print(df)
print(df2)

Stata

# In order to load data into Stata, we first have to bring it into Python.
# This code loads the "_source_" table into the Python variable `df`.
# We can then pass this variable as our Stata dataset.

import redivis

# The source table of this notebook can always be referenced as "_source_"
# Reference any other table in this workflow by name.
table = redivis.table("_source_")

df = table.to_pandas_dataframe(dtype_backend="numpy")

%%stata -d df -force
/*
# Use the %%stata magic to load our dataframe, specified by the -d parameter
# The -force flag replaces the current working dataset in Stata

# The rest is just Stata code!
*/

describe

SAS

import redivis
import saspy

sas = saspy.SASsession(results='HTML')

# We first load the table via Python, and then pass the dataframe into SAS
df = redivis.table("_source_").to_pandas_dataframe(dtype_backend="numpy")

sas_data = sas.df2sd(df, '_df') # second argument is the name of the table in SAS
sas_data.heatmap('msrp', 'horsepower')

You can reference any other tables in this workflow by replicating this script and executing it with a different table reference. As a rule of thumb, notebooks will easily support interactive analysis of tables up to ~1GB; if your table is larger, try reducing it first by creating a transform, or make sure to familiarize yourself with the tools for working with larger tables in the notebook's programming language. See more examples in the Python, R, Stata, and SAS notebooks reference sections.

Referencing files

Any files with unstructured data stored in Redivis tables can be referenced by their globally unique file_id. You can also reference these file_ids in any derivative tables, allowing you to query and download specific subsets of files.

When working with large files, you'll want to consider saving the files to disk and/or working with the streaming interfaces to reduce memory overhead and improve performance.

Python

import redivis
from io import TextIOWrapper

redivis_file = redivis.file("rnwk-acs3famee.pVr4Gzq54L3S9pblMZTs5Q")

# Download the file
download_location = redivis_file.download("./my-downloads")
f = open(download_location, "r")

# Read the file into a variable
file_content = redivis_file.read(as_text=True)
print(file_content)

# Stream the file as bytes or text
with redivis_file.stream() as f:
  f.read(100) # read 100 bytes

with TextIOWrapper(redivis_file.stream()) as f:
  f.readline() # read first line

# We can also iterate over all files in a table
for redivis_file in redivis.table("_source_").list_files():
  pass # Do stuff with each file

R

redivis_file <- redivis$file("s335-8ey8zt7bx.qKmzpdttY2ZcaLB0wbRB7A")

# Download a file
redivis_file$download("/path/to/dir/", overwrite=TRUE)

# Read a file
data <- redivis_file$read(as_text=TRUE)

# Stream a file (callback gets called with each chunk)
data <- redivis_file$stream(function(x) {
  print(length(x))
})

# We can also iterate over all files in a table
for (redivis_file in redivis$table("_source_")$list_files()){
  # Do stuff with each file
}

6. Analyze data

At this point, you have all the tools you need to work with your data in your chosen language. The Python, R, Stata, and SAS ecosystems contain myriad tools and libraries for performing sophisticated data analysis and visualization.

The notebook interface is based off of Jupyter notebooks, and has similar capabilities. You can also export a read-only copy of your notebook as an .ipynb, PDF, or HTML file. Learn more in the Notebooks reference section.

7. Create an output table

Notebooks can produce an output table, which you can sanity check and further analyze in your workflow by including it in other notebooks or exporting it to other systems.

Python

# Read table into a pandas dataframe
df = redivis.table('_source_').to_pandas_dataframe()

# Perform various data manipulation actions
df2 = df.apply(some_processing_fn)

# Create an output table with the contents of this dataframe
redivis.current_notebook().create_output_table(df2)

R

# Read table into a tibble
tbl = redivis$table('_source_')$to_tibble()

# Perform various data manipulation actions
tbl2 = tbl %>% mutate(...)

# Create an output table with the contents of this tibble
redivis$current_notebook()$create_output_table(tbl2)

Stata

%%stata -doutd df2
/*
  Once this cell executes, the current dataset will be pushed
  to the python variable df2
*/
rename v* newv*

# Via Python, pass this dataframe to the output table
redivis.current_notebook().create_output_table(df2)

SAS

# Convert a SAS table to a pandas dataframe
df = sas_table.to_df()

# Create an output table with the contents of this dataframe
redivis.current_notebook().create_output_table(df)

Next steps

Share and collaborate

All Redivis notebooks support real-time collaboration, allowing multiple editors to edit and run cells in a running notebook. When another editor is active in a notebook, you will see a colored cursor associated with them (much like a Google Doc).

Share your workflow to work with collaborators in real time, and make it public so that others can fork off of and build upon your work.

Cite datasets in your publications

If the work you're doing leads to a publication, make sure to reference the dataset pages of the datasets you've used for information from the data administrators on how to correctly cite them.

Redivis for open science

Redivis is a platform for data distribution and analysis that prioritizes open access and FAIR data practices. We believe that supporting open science principles drives better research outcomes, collaboration, and scientific impact.

Redivis is creating tools not just for data distribution and analysis, but also to help drive forward standards for transparent research that is accessible to all.

Reproducibility

Every Redivis dataset has a versioned history that is automatically generated while the data owner uploads and edits data contents. This history is fully exposed to end users and the dataset can be accessed at any version. Citations are generated for every version of a dataset and include the ORCID iDs of all creators. Data owners can opt to create a Digital Object Identifier (DOI) for all versions of their dataset, which is available at a persistent URL along with the required metadata describing the content at that point in time.

As users analyze data in a Redivis workflow, each step they take is recorded in a visual tree that can be navigated sequentially. Every step has a full history which can be viewed at any point in time.

Availability

Datasets in any format can be deposited into Redivis and made available immediately to the world via open access licenses at no cost to the end user.

All Redivis datasets and analysis workflows are available to anyone on the internet without a Redivis account unless the owner decides to add restrictions. In this case someone wishing to view the work would need to create a free Redivis account to request access from the owner.

Longevity

The stable technical infrastructure and dedicated Redivis team are funded by a coalition of academic institutions paying license fees for large-scale data distribution. This diverse group of funders ensures the long-term availability of the datasets.


Example workflows

We've started building a library of common workflow types and analytic tasks that might be useful when you're getting started working with data.

Please let us know if there are other actions or concepts you would like us to provide examples for by reaching out or emailing contact@redivis.com.

No-code visualization

This guide demonstrates using a Redivis workflow to gather key variables from different tables, clean them, and consolidate them into a single table for analysis in Google Looker Studio, all using point-and-click interfaces.

Workflow objective

We'll use Gini coefficient data (a common measure of income distribution) from the American Community Survey (ACS) to build a map that shows the range of income inequality by US county in 2018.

1. Explore data and start a workflow

We'll work with data from the following sources:

  • ACS Gini coefficient table

  • ACS table that measures population by county

  • National Weather Service dataset that maps a US county to lat/long coordinates.

Add this dataset to your workflow and click on the dataset node to see all of the available tables. Select Gini Index by County and look at the data format by clicking Cells.

2. Clean data

We want to map the Gini estimate and population by US county in our final data visualization. To map data points by county, we will need a variable that represents the latitude, longitude coordinates of that county. We also want the state, county name, and five-digit county code.

Therefore, we want our final dataset to include six distinct variables:‌

  • Gini Estimate

  • State

  • County Name

  • County Code

  • Latitude, Longitude

  • Population

We will get Lat/Long and Population by joining our table with other tables later on in the workflow. For now, we can create a transform to start reshaping our data.

Rename: Gini Estimate

New variables: County and State Names

We'll use a pattern-matching method, which allows users to create a new variable from the characters of an existing variable that match a specific pattern. For example, to create a new variable for county, we can select all characters in NAME preceding the comma. We can add a new block and do a similar process on the NAME variable to get the State.

New variable: County Code

The last five digits in the GEO_ID variable represent the county code. We create another new variable block and name our new variable County_code. Select the Substring method and select all characters starting at index 10 up to a max length of five.
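
For readers who prefer to see the code, the Substring step compiles to SQL roughly like the sketch below (illustrative only).

# Illustrative only: roughly the SQL behind the Substring step.
# SUBSTR here is 1-based; the UI's index convention may differ slightly.
county_code_sql = """
SELECT
  *,
  SUBSTR(GEO_ID, 10, 5) AS County_code
FROM gini_index_by_county
"""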

Variable selection

Finally, we choose which variables to keep or discard in our transform. We’ll keep all variables except for GEO_ID, NAME (county and state combined), and B19083_001M (the margin of error for the Gini estimate).

Click the run button in the upper right-hand corner to run this transform.

3. Sanity check the output table

Click on the newly generated output table to investigate the output of the transform. We can see that this table has the same number of records as our initial table and has the four variables we selected. If we click on the County and State variables we can see in the frequency tables that they look like we wanted them to.

Since this table looks like we expect we can move on to the next step! Otherwise we'd need to go back to the initial transform to change our inputs.

4. ‌Join geographic information

We now have a table with the Gini estimate, state, county name, and county code, but we need the latitude and longitude information in order to map each county and we also want to account for the size of each county with a population variable. To do this, we'll perform two joins.

We could continue to work in our initial transform but we are choosing to create new transforms for each join to keep steps separated conceptually and provide output tables to reference along the way.

Matching county to latitude, longitude data

First, we will match each county code to its lat/long coordinates by joining with the "County to Lat/Long Coordinates" table.

A Left Join returns the complete source table (Gini Index by County) and all matching records from the right table (County to Lat/Long). We set the County_code variable in the source table equal to the County_code variable in the right table, which matches a set of lat/long coordinates to each county code (if there is no coordinate, it will return Null).

When you set two variables equal to each other in a Join, they must be of the same type. If, for example, you set a string type variable equal to an integer type variable, you will have to retype the string as an int (or vice versa).

Some county codes, however, map to more than one set of latitude, longitude coordinates so we must create two new variables in this transform: unique_lat and unique_long.

For unique_lat, we will partition on all kept variables except LAT and take the average of the LAT variable. This says: for each county, compute the average of all possible latitudes and store that average in the variable unique_lat.
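
Expressed as SQL, this partitioned average looks roughly like the sketch below (illustrative only; the partition list is abbreviated, and the longitude variable name is assumed by analogy with LAT).

# Illustrative only: roughly the SQL behind the partitioned averages.
# The real step partitions on all kept variables except the coordinates;
# the LONG variable name is an assumption by analogy with LAT.
unique_coords_sql = """
SELECT
  *,
  AVG(LAT) OVER (PARTITION BY County_code) AS unique_lat,
  AVG(LONG) OVER (PARTITION BY County_code) AS unique_long
FROM joined_table
"""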

We will select all variables from the source table as well as unique_lat and unique_long in our output table. Finally, we check "distinct" in our variable selector so that duplicate records are dropped and each county appears only once.

Running this transform outputs the following table, where each county now corresponds to a unique set of latitude and longitude variables.

Matching county to population data

We now have all the desired variables in our dataset except for population per county, which we will need when we create the final visualization.‌

Using the same steps from the previous join, we will perform a second join to match each county to its total population by incorporating the Population by County table (also in the original dataset). We select a Left Join and set the County_code variable in the source table (Gini Index by County) equal to the County_code variable in the right table (Population by County).

Selecting all variables to remain in our output table and running this transform outputs the following table.

5. Final data cleaning

We will be using Google Looker Studio to visualize this data so we will need to update our data to make it work smoothly with their specified format for geographic data. Using their reference sheet, we can see that we will need our latitude and longitude information in the format:

Comma-separated latitude and longitude decimal values (e.g., "51.5074,-0.1278" corresponds to London, England)

We will need to combine our two separate values and add a comma between them.

We can do that using the Concat method for creating a new variable, but that method only accepts string inputs, so we'll first need to retype the unique_lat and unique_long variables from float to string using the Retype step.

Then we use the Create new variables step to make a new variable named latitude_longitude and use the Concat method to concatenate the unique_lat string, a string containing a comma, and the unique_long string.

We can now discard unique_lat and unique_long and keep the combined variable, latitude_longitude.
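
Together, the Retype and Concat steps compile to SQL along the lines of this purely illustrative sketch.

# Illustrative only: roughly the SQL behind the Retype + Concat steps.
latitude_longitude_sql = """
SELECT
  *,
  CONCAT(CAST(unique_lat AS STRING), ',', CAST(unique_long AS STRING)) AS latitude_longitude
FROM cleaned_table
"""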

Running this transform yields the final table. From here, we can edit variable metadata and/or download our new dataset in the format that we'd like.

6. Export data to Looker Studio

The next step would normally be to create a notebook in this workflow and use Python, R, or Stata to analyze this table. However if we want to create a quick visualization and aren't familiar with any of those coding languages we have easy options to export data to other systems such as Google Looker Studio.

For this example we will link this table directly to Google Looker Studio by clicking Export table on this table.

We could download our final table and then re-upload it into Looker Studio, but we choose to link the table through the Redivis connector so that if we come back to this original workflow and make changes they will be reflected in the visual that we're about to make there.

Follow the prompts of the connector to log in and authorize access, then we'll need to indicate the table we'd like to use. For that we'll need:

  • The owner of the workflow's ID (shown in the top menu bar of the workflow)

  • The name of the workflow (shown in the middle of the black workflow toolbar)

  • The name of the table (shown on the title of the table node)

Note that the connector won't recognize some symbols (such as :) so you might need to update one of them if there is an issue connecting.

When the table is being imported we will have the option to change the type for variables. We'll need to change the type of our lattitude_longitude variable from string to the Latitude, Longitude geography type.

Then click Create report to get started!

7. Build a visualization

For this example we are going to build a bubble map that shows the size of population against the intensity of Gini disparity in certain regions.

Get started by adding a chart and selecting the bubble map type. We will need to select Redivis as the data source for this map.

Then we will need to define which of our variables map to which parts of the visual. We will also want to change how they are aggregated, from SUM to AVG.

  • Location: lattitude_longitude

  • Tooltip: County

  • Size: (AVG) population

  • Color metric: (SUM) gini_estimate

The resulting map is automatically interactive. Users can hover over each bubble and view the county name, population, and Gini coefficient. The bubble size is determined by the county population and the bubble color is determined by the Gini value. View the interactive report:

You can continue to create a variety of visualizations in this report, including scatter plots, Google maps, and stacked bar charts.

Next steps

Analyzing large tabular data

This guide demonstrates using a Redivis workflow to gather key variables from different tables, clean them, and consolidate them into a single table for analysis in a notebook.

Workflow objective

We want to take weather data collected around the world and use it to understand how precipitation trends in the US have changed over time.

1. Explore data

To get started we want to understand this dataset and what information is in each table. We can look at the dataset page to learn more about it, including its overview information, metadata, and variable summary statistics. Since this dataset is public we can also look directly at the data to confirm it has the information we need. Some tables jump out as ones we will want to work with:

Daily observations

This table has nearly 3 billion records and seems to be the table in the dataset with the most information so we will start here. There is a variable that looks like it will be very helpful for our goal: element

If we click on the variable name, we can see a searchable frequency table with all this variable's values. One value that jumps out is PRCP which we can see on the value label represents the precipitation. Paired with this is the variable value which contains a numeric value recording the magnitude of the element. We can learn even more about this by clicking on the data dictionary in the dataset's overview page to see that this value is measured in either mm or inches.

We can see that this table doesn't contain any information about location though, so we'll need to find that somewhere else.

Stations

This table contains latitude and longitude variables as well as a linking variable id which will allow us to link in precipitation information from the Daily observations table, since it has the same id variable.

Since there are two tables we want information from we know that we will need to do a join as part of our data reshaping.

2. Create a workflow

At the top of this dataset page we can click the Analyze in workflow button to get started working with this data.

You can add this dataset to an existing workflow you already have access to, or create a new workflow to start from scratch.

3. Clean and reshape data

We will use transforms to clean the data, as they are best suited for reshaping data and will quickly output new tables we can continue to work with. Even though we might be more comfortable with Python or R, the table we want to work with is 132GB and we would not be able to work with it in a notebook without specialized equipment.

Filter for precipitation

If we click on the element variable we can see that there are many types of elements recorded in this table. Since we only want information about precipitation (PRCP), we will start by filtering out any records with a different value in the element variable.

Compute annual precipitation

Looking at the data, it's clear there are many observations of precipitation at each location over time. We know we want to look at annual trends so we can start by aggregating this data to only have one value per year per location.

Conceptually, we will need to aggregate on the year of each record, but if we look at the date variable in this table it also contains the month and day. For aggregation we will need a field which contains only the year which we can collapse on.

But the most important step here is that we want to gather information about the duplicate records that were dropped! If there were 400 records of precipitation in a year for a particular station, we want to know what those records all add up to. To do this we will Create a new aggregate variable within this aggregation step named annual_precip. We want to aggregate the information in the value column, since that contains the numeric amount of the corresponding element variable. Since we want the total amount across all dropped records we will use the Sum method.
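For readers who think in pandas, the filter, year extraction, and sum aggregation look roughly like this on a toy table (daily_df is a stand-in for the Daily observations table, not the actual 3-billion-record data):

import pandas as pd

# Toy stand-in for the Daily observations table (id, date, element, value)
daily_df = pd.DataFrame({
    "id": ["US1", "US1", "US1", "US2"],
    "date": ["2020-01-01", "2020-01-02", "2020-02-01", "2020-01-01"],
    "element": ["PRCP", "PRCP", "TMAX", "PRCP"],
    "value": [5, 10, 31, 2],
})

# Filter step: keep only precipitation records
prcp = daily_df[daily_df["element"] == "PRCP"].copy()

# Create variables step: pull the year out of the date
prcp["year"] = pd.to_datetime(prcp["date"]).dt.year

# Aggregate step: one record per station per year, summing value into annual_precip
annual = (
    prcp.groupby(["id", "year"], as_index=False)["value"]
    .sum()
    .rename(columns={"value": "annual_precip"})
)
print(annual)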

Select variables and run

The final step in a transform is selecting which variables we would like to populate the resulting output table. Since we did an aggregation step, this list becomes limited to only the variables that we aggregated on. Since we want both of these variables in our output we make sure both are selected and visible in the right side list of the footer.

With everything in place we will run this transform to create a new table, by pressing the Run button in the top right corner!

4. Sanity check the output table

Now that we have created a new table, we can inspect it to make sure our steps accomplished what we expected them to.

Table characteristics

Click on the output table below the transform to view it. We can see that it contains the two variables we expected it to, based on our variable selection. The table has almost 3 million records, down from the almost 3 billion in our original table, which makes sense given our aggregation step.

Variable characteristics

We can also inspect each variable further. If we click on the year variable we can see that all values are a four digit year value, and we can see from the min and max values in the summary statistics that the years range from 1781 - 2021, which makes sense. Looking at the frequency table we can see that we have a high frequency of each year represented in the records, which is what we expect since there are multiple ids or locations per year.

If we click on the annual precipitation variable we can see a max value of 792,402 (mm or in) which seems very high but possible for a given year. The minimum value is -148,917 which doesn't seem right. When we look at the data dictionary it doesn't indicate that there should be negative values.

We can investigate our data further by clicking on the Query tab of the table, choosing "Sort on variable" from the templates menu, and sorting on our annual_precip variable by ascending (ASC). It looks like there are relatively few stations with negative values.

If we were doing a rigorous analysis of this data we might dig deeper into why these values exist in this dataset, or decide to exclude these outliers. However since this is an example we can decide to leave them in and see how it affects our outputs. We can easily go back and exclude them from the workflow at a later point.

5. Join geographic information

Now that we have a table we are satisfied with that contains annual precipitation information, we want to join in the latitude and longitude information about each station so we can make a geographical visual as part of our analysis.

This geographic information is included in the Stations table, so we will need to do a join which connects records from two tables.

While we could do this transformation step in our previous transform and rerun it, let's create a new transform to make this join so that we will have our previous output table saved to compare the output to.

Since our identifier in both tables is the variable id, set id from the source table (t0) as the left side of the join condition and set id from the joined table (t1) as the right side of the join condition. This means that for every record in our source table, look for a corresponding value in the Stations table and join the values from additional columns in that table.

When you set two variables equal to each other in a Join, they must be of the same type (string, integer, etc). In this example both id variables are string type so it works, but if one had been an integer we would have needed to retype one of them first to match.

There are a number of additional variables in the Stations table but we are only interested in the latitude and longitude variables. In our variable selector we keep all variables from our source table (year, id, annual_precip) as well as latitude and longitude from the joined table.

With everything set up, run this transform and sanity check the output table! We started with a 132GB table and have narrowed it down to 134MB containing only the data we want in the format we will need.
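For comparison, the same inner join expressed in pandas (continuing the sketch above, with stations_df as a hypothetical stand-in for the Stations table) would be:

import pandas as pd

# Hypothetical stand-in for the Stations table
stations_df = pd.DataFrame({
    "id": ["US1", "US2"],
    "latitude": [37.35, 29.86],
    "longitude": [-121.97, -95.39],
})

# Inner join: keep only records with a matching id in both tables
joined = annual.merge(stations_df, on="id", how="inner")
print(joined)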

6. Analyze outputs in a notebook

For this workflow we will use Python but you can also use R, Stata, or SAS if you'd like. When you create the notebook for the first time it will start up. Notebooks must be running to execute code.

Redivis notebooks come with many common packages preinstalled and we will use those in this example. If you'd like you can install additional packages by clicking the Dependencies button.

Since this notebook contains only public data we can install packages at any time, but for restricted data notebooks do not have internet access and packages can only be installed when they are stopped.

Reference data

Newly created notebooks come with standard code to import the Redivis library and reference the source table in a pandas dataframe within the notebook.
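The pre-populated cell typically looks something like the sketch below; the exact reference syntax may differ in your notebook, so treat this as an assumption and use the starter code you're given:

import redivis

# "_source_" refers to the table the notebook was created from (assumed convention)
df = redivis.table("_source_").to_pandas_dataframe()
df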

Select data from the United States

Now we will create a new cell to organize our work by pressing the + button in the top bar.

In this cell we want to limit our records to only ones whose latitude and longitude values fall in a specific range of the continental United States.

us_df = df[
    (df.latitude >= 24.396308)
    & (df.latitude <= 49.384358)
    & (df.longitude >= -124.848974)
    & (df.longitude <= -66.885444)
]

Compute average change

We want to see how our annual precipitation variable has changed at stations in the United States, so for each station we will group by the id variable and compute an average for annual_precip.

# Compute avg precip for each station between 1980-2010
df_1980_2010_average = us_df[(us_df.year > 1980) & (us_df.year < 2010)].drop(columns=['year'])
df_1980_2010_average = df_1980_2010_average.groupby(['id']).mean()
df_1980_2010_average.rename(columns={"annual_precip": "_1980_2010_avg"}, inplace=True)

# Compute avg precip for each station between 1990-2020
df_1990_2020_average = us_df[(us_df.year > 1990) & (us_df.year < 2020)].drop(columns=['year'])
df_1990_2020_average = df_1990_2020_average.groupby(['id']).mean()
df_1990_2020_average.rename(columns={"annual_precip": "_1990_2020_avg"}, inplace=True)

diff_df = df_1980_2010_average.join(df_1990_2020_average["_1990_2020_avg"])
diff_df['deviation'] = (diff_df._1990_2020_avg - diff_df._1980_2010_avg) / diff_df._1980_2010_avg * 100
diff_df = diff_df[(diff_df.deviation < 25) & (diff_df.deviation > -25)]

diff_df = diff_df.dropna()
diff_df

Plot the results

import plotly.figure_factory as ff

import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
px.set_mapbox_access_token("pk.eyJ1IjoiaW1hdGhld3MiLCJhIjoiY2thdnl2cGVsMGtldTJ6cGl3c2tvM2NweSJ9.TXtG4gARAf4bUbnPVxk6uA")

fig = ff.create_hexbin_mapbox(
    data_frame=diff_df, lat="latitude", lon="longitude",
    color="deviation",
    agg_func=np.mean,
    title="% Change precipitation, 1981-2010 vs 1991-2020",
    range_color=[-15,15],
    nx_hexagon=50, opacity=0.4, labels={"color": "Percent change"}, color_continuous_scale="Icefire_r",
)

fig.update_layout(margin=dict(b=0, t=0, l=0, r=0))
fig.show("nteract")

Now we've created an interactive figure showing change in precipitation by location in the United States.

Next steps

Citations

Citing datasets

It's important to make sure that you fully cite any datasets used in your work. Each dataset page includes citation information in the provenance section in multiple format options. For example, the Global Historical Climatology Network dataset hosted by the Stanford Center for Population Health Sciences would be cited in the APA format as:

Stanford Center for Population Health Sciences. (2022). GHCN Daily Weather Data (Version 1.2) [Data set]. Redivis. https://doi.org/10.57761/h9ff-vy04

Citing workflows

Workflows across Redivis also include citation information in the provenance section in multiple format options.

Bibliography

Datasets and workflows on Redivis can be supplemented with related identifiers that situate a resource in its broader context. You can view the bibliography for a dataset or workflow from the corresponding provenance section.

Citing Redivis

To cite Redivis in publications use:

Redivis (RRID:SCR_023111). https://redivis.com

If referencing Redivis inline, the abbreviated citation is also acceptable:

Redivis (RRID:SCR_023111)

To cite Redivis, you can use the following BibTeX:

@misc{https://doi.org/10.25495/7gxk-rd71,
  doi = {10.25495/7GXK-RD71},
  url = {https://redivis.com/},
  author = {Redivis},
  keywords = {FOS: Dataset, Publication},
  language = {en},
  title = {Redivis},
  publisher = {Redivis},
  year = {2024}
}

You can also refer to the Redivis Re3data identifier

Running ML workloads

Overview

Notebooks on Redivis offer a performant and highly flexible environment for doing data analysis. This includes the ability to run state-of-the-art machine learning (ML) models, including the ability to train new models, fine-tune various existing models, and use these models to perform inference and generate novel outputs.

This guide is focused on a number of common use cases when running ML workloads on Redivis. Here we generally focus on using the Hugging Face + PyTorch ecosystem in Python, though these examples are broadly applicable to other ML libraries and languages.

For a detailed example of using Redivis to fine-tune a large-language model, see the complementary example: Fine tuning a Large Language Model (LLM).

1. Create a notebook with appropriate computation capacity

Training ML models and running inference can require a substantial amount of compute capacity, depending on various factors such as your model and dataset size, usage parameters, and performance goals.

The default, free notebook on Redivis offers 2 CPUs and 32GB of RAM. While this may work for initial exploration, running practical machine learning workflows typically requires the availability of a GPU. When creating your notebook, you can choose a custom compute configuration to match your needs.

Redivis offers a number of custom compute configurations, mapping to the various machine types available on Google Cloud. We recommend starting with a more modest GPU for initial exploration, and then upgrading as needed when computational or performance bottlenecks are reached. For this example, we'll use the NVIDIA L4 GPU, which provides reasonable performance at a reasonable cost.

2. Define dependencies

The Redivis python notebook is based on the jupyter-pytorch notebook image, with PyTorch, CUDA bindings, and various common data science libraries pre-installed. However, if you require additional dependencies for your work, you can specify them under the "dependencies" section of your notebook.

2a. [Optional]: Pre-load external models when internet is disabled

Models can also be stored within a dataset on Redivis, in which case you won't need to pre-load the model, but can rather import it directly from the Redivis dataset (see step 3 below)

If your notebook references export-restricted data, for security reasons, internet will be disabled while the notebook is running. This can present a challenge for some common approaches to ML in python, such as downloading a model or dataset from Hugging Face. For example, we might reference a model as follows:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")

This code typically downloads the model from Hugging Face and caches it to our local disk. However, if internet is disabled, this command will hang and ultimately fail. Instead, we need to download the model during notebook startup, before the internet is disabled, as part of the post_install.sh script under the notebook's dependencies:

python -c '
from huggingface_hub import snapshot_download
snapshot_download(repo_id="sentence-transformers/all-MiniLM-L6-v2")
'

This will download the model weights and other files to the default Hugging Face cache directory, ~/.cache/huggingface/hub.

Now, within our notebook, we can load the cached model. Make sure to set local_files_only=True, so that Hugging Face doesn't try to connect to the internet to check for a newer version of the model.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2", local_files_only=True)

Some libraries may require you to provide a full path to the model files. For the model mentioned above, this would be ~/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2/snapshots/<snapshot_id> .

To find the appropriate path, list the contents of the ~/.cache/huggingface/hub directory and its descendants.
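One way to resolve that snapshot path programmatically (relying on the cache populated during post_install.sh, and assuming the same model as above) is:

from huggingface_hub import snapshot_download

# With local_files_only=True this returns the already-cached snapshot directory
# without making any network requests.
model_path = snapshot_download(
    repo_id="sentence-transformers/all-MiniLM-L6-v2",
    local_files_only=True,
)
print(model_path)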

3. Load a model

3a. Load a model from Redivis

Machine learning models can be stored directly within a Redivis dataset as unstructured files. For example, this dataset contains the various files that make up the bert-base-cased model on Hugging Face. We can then download the model to our notebook's local filesystem:

import redivis
table = redivis.organization("demo").dataset("huggingface_models").table("bert_base_cased")
table.download_files("/scratch/bert-base-cased")

And then reference this as a local model. E.g.:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("/scratch/bert-base-cased", num_labels=5)
tokenizer = AutoTokenizer.from_pretrained("/scratch/bert-base-cased")

When using models stored on Redivis, we don't have to worry about whether our notebook has internet access, nor do we need to rely on the future availability of that particular model on Hugging Face.

To download a model from Hugging Face and save it to a Redivis dataset, you can either download the files from Hugging Face + re-upload them to Redivis, or alternatively, you can use a notebook to programmatically upload the files. E.g.,:

from huggingface_hub import snapshot_download
import redivis

# Download the model files from Hugging Face
snapshot_download(repo_id="google-bert/bert-base-cased")

# Specify an existing dataset and table on Redivis. 
# Consult the python docs for how to programmatically create datasets (apidocs.redivis.com)
table = redivis.organization("demo").dataset("huggingface_models").table("bert_base_cased")

# Add the downloaded model files to the table
table.add_files(directory='/home/root/.cache/huggingface/hub/models--google-bert--bert-base-cased/snapshots/cd5ef92a9fb2f889e972770a36d4ed042daf221e')

3b. Load a model from an external source

If your notebook has internet access, you can also use any other models that may be available on the internet. For example, we can load the same bert-base-cased model directly from Hugging Face:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

# Do cool things!

Note that if you are working in a notebook with disabled internet, this approach won't work, and you'll need to use the methods mentioned in either 2a or 3a above.

4. Load data

As a final step, you'll likely want to load data into your notebook to either fine-tune the model, perform inference, or otherwise experiment. There are thousands of datasets on Redivis, and you can upload your own data as well. You can learn more about loading data into a notebook in the python notebooks documentation. As a quick example:

import redivis

# Load tabular data as a pandas data frame, or a number of other formats (arrow, polars, iterator, etc)
table = redivis.organization("demo").dataset("ghcn_daily_weather_data").table("stations")
df = table.to_pandas_dataframe()

# Download unstructured data as files to your local disk
files_table = redivis.organization("demo").dataset("chest_x_ray_8").table("images")
files_table.download_files("/scratch/xray_images")

Of course, assuming your notebook has access to the external internet, you can also call various APIs to load external data sources.

from datasets import load_dataset

# load a dataset from Hugging Face
hf_dataset = load_dataset("Yelp/yelp_review_full") 

Next steps

At this point, you have all the tools at your disposal to perform cutting edge ML research. But of course, what you do next is totally up to you. We do recommend further familiarizing yourself with the examples and detailed documentation to take full advantage of the capabilities of Redivis notebooks:

Analyze data in a workflow

Overview

Workflows are where you work with data on Redivis. In a workflow you can query, merge, reshape, and analyze any data that you have access to, all from within your web browser.

In a workflow, you can construct reproducible data transformations and analyses, and share and collaborate with your peers in real time.

1. Create a workflow

Add a dataset to a new or existing workflow from any Dataset page where you have "Data access" by clicking the Analyze in workflow button.

Within workflows you can navigate between entities on the left side of the screen, and inspect them further on the right panel. You can inspect your dataset further by clicking on any table to see its cells and summary statistics.

To add more data to this workflow you can click the Add data button in the top left of the workflow toolbar. It is also possible to add other linked workflows to this workflow. This is useful as you develop more complex analyses that you want to segment into discrete pieces of work that you can link together.

You can find this workflow later by going back to your workspace.

2. Transform data

Transforming tables is a crucial step in working with data on Redivis. Conceptually, transforms execute a query on source table(s), whose results are materialized in a new output table. In most cases you'll want to use transforms to reshape your data to contain the information you're interested in, before analyzing that table in a notebook or exporting it for further use.

To create a Transform, select a table in this dataset and click the +Transform button. You can get started here building a query through the point and click interface or writing SQL code by adding a SQL step.

For all transforms you will need to select which variables you want to keep in your output table. The rest of the steps are up to you. Some common operations you can get started with include:

  • Joining in any other dataset table or output table in this workflow

  • Creating new variables

  • Filtering records to match defined parameters

  • Renaming variables or changing their type

  • Aggregating data

Once you've built your query, execute it by clicking the Run button in the top right of the transform. This will create a new output table where you can inspect the output of your query by clicking on the table beneath the transform in the map and making sure it contains the data we would expect.

From here you can create a new transform from this output table to continue reshaping your data, or go back to your original transform to make changes and rerun it.

As you become more familiar with transforms, you can start doing more advanced work such as geospatial joins, complex aggregations, and statistical analyses.

3. Analyze data in a notebook

Once you have a table you're ready to analyze, you can select any table and click the + Notebook button to create a notebook that references this table.

The default notebook configuration is free, and provides access to 2 CPUs and 32GB working memory, alongside a 60GB (SSD) disk and gigabit network. The computational power of these default notebooks is comparable to that of most personal computers, and will be more than enough for many analyses.

From here it’s all up to you in how you want to analyze and visualize your data. Once you’ve finalized your notebook, you can easily export it in different formats to share your findings!

4. Share and collaborate

You can share your in-progress work or finished results with collaborators by sharing this workflow.

Researchers can work side by side in this workflow in real-time. Leave comments to communicate, and see a visual cue for what each person is working on. You can even collaborate within a running notebook at the same time.

If any of the data in your workflow is restricted, your collaborator must also have access to those datasets in order to view their derivatives within your workflow.

Next steps

Share and collaborate

Redivis workflows are built for collaboration and include real-time visuals to see where collaborators with edit access are working in the workflow, and a comments interface to discuss changes asynchronously.

Export data

Browse our example workflows

Redivis workflows excel at working with large tables, whether it's filtering and joining, complex aggregation and date manipulation, or visualization and analysis.

Export & publish your work

Overview

Redivis combines powerful functionality to reshape and analyze data on platform, with an easy export and publishing flow, to ensure the results of your work can be displayed in the format of your choice and shared with your collaborators and research community.

Working out, first, the broader picture of the content you'd like to publish and where you'd like to publish it will help you then determine the desired shape and format of the assets to be exported and, finally, the specific sources of the data in your Redivis workspace.

1. Develop a publishing strategy

Ask yourself: What story are you trying to tell? Who/where is the audience? What component pieces are crucial to building the narrative? Your ideal package of assets may be very different if you're hoping to share progress on a workflow with a principal investigator, take a snapshot of a single variable distribution for a colleague, publish results in a journal, or build a custom dashboard to highlight multiple trends.

You may use a combination of tables, unstructured files, code snippets, graphs, and descriptive text to be published, so sketching out these component pieces in increasing detail will help define your end product.

2. Choose your desired formats

With an initial strategy in mind, learning about the different types of Redivis exports will help define your list of assets to generate.

3. Identify the data sources

Next steps

Upload your own datasets

Augment your data analysis in Redivis by uploading your own datasets, with the option to share with your collaborators (or even the broader research community).

Build a custom dashboard

For full customizability, you can publish a static site that accesses Redivis data to power an interactive visual dashboard – a more permanent, web-based approach to highlighting results of your data analysis or generated data.

Discover datasets

Overview

Datasets are the core entity on Redivis, and finding the dataset(s) you want to work with is generally one of your first steps. All datasets are either uploaded by an organization or a user. If a user has uploaded a dataset they will need to share it with you directly, but you can browse datasets uploaded by any organization on their home page.

1. Find your organization

There are many ways to find your organization. If you know what you're looking for you can type it into the "Go to" bar on the top of the page. To browse all organizations you can go to the Organizations tab of your workspace where you can Find organization.

Note that some organizations may be hidden from public view, in which case you will need to contact someone at the organization to get the direct link.

2. Browse available datasets

On the organization home page you can see all of the organization's datasets that you have at least Overview access to. If you're looking for a dataset you expect to be here and you don't see it, you might need to contact the organization administrators to get access.

You can see the datasets that this organization has featured on the left section of the page, type a term into the search bar, or click the Datasets tab to browse everything. By default these datasets are sorted by popularity but you can reorder them however you'd like to find what you want.

You can also click the Workflows tab to see workflows featured by this organization which showcase how their data can be used. More on workflows later!

3. Search for data

Once on the dataset tab you can use multiple filters along with search terms to find what you'd like. All searches take a comprehensive look at information from the dataset documentation as well as variable metadata, including value labels.

You can also use the filters to find specific information you're looking for. If you know the name of the variable or year of the data you're interested in, narrow your search here.

4. Preview tables

Each dataset that appears here matches all search and filter configurations. You can click on the dataset tile to see more information, including a preview of all of the tables this dataset contains.

Click on any of these tables to dig into it further. As long as you have metadata access you can see the variable information and summary statistics.

Once you find a dataset you want to explore further, click on the title to go to the dataset page.

5. Explore the dataset page

This dataset page has all of the information you'll need to familiarize yourself with the data and get started working with it further. Add this dataset to your dataset library to easily find it later on by clicking the button alongside the dataset title.

Overview

This tab has a short summary of the dataset along with different sections filled out by the administrators such as the Methodology, Documentation, and Provenance information. This Provenance section will also have citation information and a DOI if one has been created. You can also see your current access level on the right side of the page.

Tables

This tab contains all of the dataset's tabular data. You can see each table alongside information about its contents. Click on each one to dig deeper into the data.

Files

This tab contains all of the dataset's unstructured data. You can preview each file by clicking on it.

Usage

This tab has information about how other people have used this dataset on Redivis. You can see the most popular variables across all tables to get a sense for which ones might be a good starting place for understanding the structure of the data.

Next steps

Apply for access to restricted data

If you don't have data access to this dataset you'll need to apply for access before you can work with it further

Start working with your data

Add this dataset to a workflow to transform and analyze it leveraging lightning fast tools from your browser.

Data repository characteristics

Overview

Redivis is designed to support data-driven research throughout the research lifecycle, and can serve as a permanent repository for your data, analytical workflows, and data derivatives.

Redivis is specifically designed to meet these desirable characteristics of a data repository, outlined below:

Desirable Characteristics for All Data Repositories

Additional Considerations for Human Data


Detailed information

Unique Persistent Identifiers

Assigns datasets a citable, unique persistent identifier, such as a digital object identifier (DOI) or accession number, to support data discovery, reporting, and research assessment. The identifier points to a persistent landing page that remains accessible even if the dataset is de-accessioned or no longer available.

Long-Term Sustainability

Has a plan for long-term management of data, including maintaining integrity, authenticity, and availability of datasets; building on a stable technical infrastructure and funding plans; and having contingency plans to ensure data are available and maintained during and after unforeseen events.

Redivis undergoes annual security audits and penetration testing by an external firm, and utilizes a formalized software development and review process to maintain the soundness of its technical infrastructure.

Funding for Redivis is provided by recurring annual subscriptions from its member academic institutions. This model provides consistent annual revenue to support the ongoing maintenance of the platform. Redivis is an employee-owned company without any external investors with an equity stake, allowing us to focus solely on the needs of our customers, our employees, and our mission of improving accessibility in research data science.

Metadata

Ensures datasets are accompanied by metadata to enable discovery, reuse, and citation of datasets, using schema that are appropriate to, and ideally widely used across, the community(ies) the repository serves. Domain-specific repositories would generally have more detailed metadata than generalist repositories.

As a generalist repository, the metadata schema is intentionally broad and flexible, but specific groups on Redivis can choose to enforce more specific metadata standards within their datasets.

Curation and Quality Assurance

Provides, or has a mechanism for others to provide, expert curation and quality assurance to improve the accuracy and integrity of datasets and metadata.

It is ultimately up to the editors of a dataset to provide curation, though Redivis is designed to support this process as much as possible. Redivis automatically computes checksums and runs fixity checks on all uploaded files, and computes univariate summary statistics of all variables to aid in the quality assurance process. Metadata completeness is also reported to editors, encouraging them to provide as much information as possible.

Free and Easy Access

Provides broad, equitable, and maximally open access to datasets and their metadata free of charge in a timely manner after submission, consistent with legal and ethical limits required to maintain privacy and confidentiality, Tribal sovereignty, and protection of other sensitive data.

Broad and Measured Reuse

Makes datasets and their metadata available with broadest possible terms of reuse; and provides the ability to measure attribution, citation, and reuse of data (i.e., through assignment of adequate metadata and unique PIDs).

Clear Use Guidance

Provides accompanying documentation describing terms of dataset access and use (e.g., particular licenses, need for approval by a data use committee).

Security and Integrity

Has documented measures in place to meet generally accepted criteria for preventing unauthorized access to, modification of, or release of data, with levels of security that are appropriate to the sensitivity of data.

Redivis is well-designed to handle workflows around sharing of sensitive and high-risk data. It provides technical mechanisms to support the reuse of sensitive data when allowed, while enabling the enforcement of appropriate guardrails and access controls defined by data administrators.

Confidentiality

Has documented capabilities for ensuring that administrative, technical, and physical safeguards are employed to comply with applicable confidentiality, risk management, and continuous monitoring requirements for sensitive data.

In addition to an annual security audit, Redivis also undergoes annual penetration testing by an outside firm to further ensure the soundness of its security posture.

Common Format

Allows datasets and metadata downloaded, accessed, or exported from the repository to be in widely used, preferably non-proprietary, formats consistent with those used in the community(ies) the repository serves.

Provenance

Has mechanisms in place to record the origin, chain of custody, and any modifications to submitted datasets and metadata.

Retention Policy

Provides documentation on policies for data retention within the repository.


Fidelity to Consent

Uses documented procedures to restrict dataset access and use to those that are consistent with participant consent and changes in consent.

Access rules on Redivis are defined and enforced by the dataset owner.

Restricted Use Compliant

Uses documented procedures to communicate and enforce data use restrictions, such as preventing reidentification or redistribution to unauthorized users.

Privacy

Implements and provides documentation of measures (for example, tiered access, credentialing of data users, security safeguards against potential breaches) to protect human subjects’ data from inappropriate access.

Additional technical safeguards ensure data privacy. Data users can establish their identity through their institutional identity provider, and Redivis undergoes regular security audits to ensure the soundness of its security posture.

Plan for Breach

Has security measures that include a response plan for detected data breaches.

Redivis has detailed internal security protocols and documented security breach plans which are regularly exercised by technical personnel via tabletop exercises. All systems are continuously monitored for potential breaches, with immediate alert pathways and clear escalation protocols to respond to any breach.

Download Control

Controls and audits access to and download of datasets (if download is permitted).

Violations

Has procedures for addressing violations of terms-of-use by users and data mismanagement by the repository.

As a matter of policy, Redivis aims to be as permissive as possible, recognizing that often misuse is the result of accidental behavior or misunderstanding. These controls are designed to protect the system for all users, and are not intended to ever be punitive towards good-faith actors.

Request Review

Makes use of an established and transparent process for reviewing data access requests.

This workflow is available on Redivis! We also suggest you recreate this workflow as we go to best learn the process.

We have already uploaded all relevant data for this workflow into a single dataset in the Redivis Demo organization called the Example Workflow Dataset, which contains three tables.

We already have this variable in our dataset: B19083_001E. We will create a Rename step to rename this variable to gini_estimate so we can keep better track of it.

We want to break up our NAME variable into two separate columns, county and state. So we will add a Create variables step and use the Regexp extract method.

Create a new transform and add a Join step.

Create a new transform and add a Join step.

Refer to the Looker Studio Help pages for additional specific guidance!

Perhaps we see something in this workflow we want to tweak, or we want to go back and change some of our data decisions. Workflows are iterative and at any point you can go back and change your source data, your transform configuration, or your notebooks, and rerun them.

Notebooks can also generate output tables, which allow you to sanity check the work we did in the notebook or perhaps create a table to use in another notebook or transform. You can also fork this workflow to work on a similar analysis, or export any table in this workflow for work elsewhere.

This workflow is available on Redivis! We also suggest you recreate this workflow as we go to best learn the process.

All the weather data we need is contained in the Demo organization dataset GHCN Daily Weather Data.

Start by creating a new transform on the Daily observations table, and add a Filter step. Configure the filter to select for records where element = PRCP.

In this same transform we will add a Create variables step and make a new variable named year using the date extract method, which will pull the year out of the date value.

Then we will add an Aggregate step to aggregate the values based on our new year variable. The first thing we will do is select the variables to aggregate on, that is, the variables whose duplicate combinations we want grouped together. Since we want information on precipitation per station per year we should choose to aggregate on the id variable (station) and the year variable. When executed this will group all records with the same combination of values in id and year, drop all other variables from the table, and then drop duplicate records down to just one record per unique combination in those two variables.

In the new transform add a Join step. Set the joined table to the Stations table from our original dataset. The join type will depend on which records we want to keep based on how the matching process goes. Since we only want to keep records that have both information from our source table (annual precipitation) AND information from our joined table (station latitude and longitude), we set the join type to Inner join.

Next we want to generate an interactive visual to understand annual precipitation events in different locations over time. To do this we will create a Notebook node on our output table.

You can use this pandas code or replace it to use the libraries of your choice. To use the standard code, click inside this cell and press the run button, or the Shift + Enter keys.

Now that we have the information calculated, we can plot it! We'll use Plotly to do the heavy lifting, and point it at our relevant variables.

Perhaps we see something in this plot we want to investigate further, or we want to go back and change some of our assumptions. Workflows are iterative and at any point you can go back and change your source data, your transform configuration, or your notebooks, and rerun them.

Notebooks can also generate output tables, which allow you to sanity check the work we did in the notebook or perhaps create a table to use in another notebook or transform. You can also fork this workflow to work on a similar analysis, or export any table in this workflow for work elsewhere.


Learn more in the Reshape data in transforms guide.

Notebooks are available in Python and R, as well as Stata or SAS (with a corresponding license). Notebooks come pre-installed with common libraries in the data science toolkit, but you can also customize the notebook’s dependencies and startup script to create a custom, reproducible analysis environment that meets your needs.

If you're working with larger tables, creating an ML model, or performing other particularly intensive tasks, you may choose to configure additional compute resources for the notebook. This will cost an hourly rate to run based on your chosen environment, and require you to purchase compute credits on your account.

Notebooks come pre-populated with some starter code you can use to import data, and the notebook reference pages contain comprehensive documentation and further examples.

To learn more about analyzing data, see our Work with data in notebooks guide.

Share your workflow to work with collaborators in real time, and make it public so that others can fork off of and build upon your work.

If you'd like to export data to a different system, you can download it in various file formats, reference it in Python / R, or visualize it in tools such as Google Data Studio.

Learn more in the Export to other environments guide.

Learn more in the Example workflows guide.

Tabular data containing rows and columns can be exported in a variety of formats, accessed from many environments via our client libraries, or visualized in a website.

Unstructured data files of any type can be downloaded in their original format, or accessed programmatically.

Notebooks containing code inputs and corresponding outputs can be exported as an .ipynb file or PDF or HTML, or (coming soon!) embedded in your site.

Learn more in the Export to other environments guide.

In a workflow, you can generate output tables to capture the result of a set of data transformations, or use notebooks to show a line-by-line data analysis flow and relevant figures.

As a dataset creator, you can upload your own tabular data and unstructured files to create assets to use in your workflows or share with others.

Whether you're manipulating data in a workflow to showcase results or hosting your own dataset, you'll want to build a set of specific tables or notebooks you're trying to share. As you iteratively modify the shape and content of these assets – transforming data in a workflow to output new tables or using notebooks to build analysis flows in Python or R – you'll fine-tune each piece of your final publication.

Learn more in the Create & manage datasets guide.

Learn more in the Build your own site guide.

If your organization is part of an institution (such as Stanford or Columbia) you can also go to the institution page to browse all organizations in that institution.

Learn more in the Apply to access restricted data guide.

Learn more in the Work with data in a workflow guide.

In May 2022, the Subcommittee on Open Science (SOS) of the United States Office of Science and Technology Policy (OSTP) released a document outlining the "desirable characteristics of data repositories".

These characteristics are intended to help agencies direct Federally funded researchers toward repositories that enable management and sharing of research data consistent with the principles of FAIR data practices. Various agencies have adopted these guidelines, including the NIH.

Assigns a persistent identifier (PID), such as a DOI

Identifier points to a persistent landing page

Plan for long-term data management

Maintain integrity, authenticity, and availability of datasets

Stable technical infrastructure

Stable funding plans

Contingency plans to ensure data are available and maintained

Datasets accompanied by metadata

Aids in the easy discovery, reuse, and citation of datasets

Schema appropriate to relevant data communities

Provide or allow others to provide expert curation

Quality assurance for accuracy and integrity of datasets and metadata

Broad, equitable, and maximally open access to datasets and metadata

Access is free of charge in a timely manner consistent with privacy

Makes datasets and metadata available to reuse

Provides ability to measure attribution, citation, and reuse of data

Provides documentation for access and use

Prevents unauthorized access, modification, and release of data

Ensures administrative, technical, and physical safeguards

Continuous monitoring of requirements

Download, access, and export available in non-proprietary formats

Ability to record the origin, chain of custody, and modification of data or metadata

Provides policy for data retention

Utilizes consistent consent

Enforces data use restrictions

Implements measures to protect data from inappropriate access.

Has a response plan for detected data breaches.

Controls and audits access to and download of datasets.

Has procedures for addressing violations and data mismanagement.

Process for reviewing data access requests.

Data can be uploaded to datasets within an organization, where every version of that dataset is assigned a persistent identifier (DOI) through the organization's DOI configuration. DOIs will always resolve to the URL of the dataset page. In the case when a dataset is restricted or deleted, base metadata will remain available.

Redivis uses highly-available and redundant Google Cloud infrastructure to ensure data is stored securely and to the highest technical standards. Redivis maintains a formal disaster recovery and business continuity plan that is regularly exercised to ensure our ability to maintain availability and data durability during unforeseen events.

All Redivis datasets contain extensive metadata and documentation. Some metadata fields (including summary statistics for all variables) are automatically generated. Some fields are optional depending on the editor's insight and preference. Every dataset has space for short and long-form descriptions, methodology, and provenance information, alongside variable labels.

Metadata is available in various machine readable formats, such as schema.org and DataCite JSON, and can be viewed through the interface or downloaded via the API.

All datasets on Redivis are owned either by an Organization (curated by any administrator) or an individual User. Additional users may be added as editors to a dataset, so as to provide further curation and quality assurance.

Datasets and workflows can be explored on Redivis without any requirement to have an account or log in. If someone needs to apply to access restricted data, or to analyze and download data, they will need to make an account to do so. All individual accounts are completely free and require no specific affiliation. Data access restrictions are set and maintained by the data owner.

The data owner can choose to publish any dataset publicly or set appropriate access restrictions based on the sensitivity of the data. Redivis imposes no additional limits on the availability and reuse of data. Any dataset that has a DOI can be cited and tracked in publications, and any reuse on Redivis is displayed on the dataset's usage tab.

Dataset owners can describe any usage agreements in their access requirements and have space to document any additional usage rules. Redivis imposes no additional limits on the availability and use of data.

Redivis is SOC2 certified and prioritizes technical security. There are multiple layers of administrative controls to make it clear what actions data owners are taking, and all actions, whether by administrator or researcher, are automatically logged for review.

Redivis is SOC2 certified and prioritizes technical security. All data is encrypted in transit and at rest, and stored on Google Cloud infrastructure that maintains robust technical and physical access controls.

Data and metadata on Redivis can be imported and exported in a variety of common formats. Data analysis is performed in common, generally open-source programming languages (SAS and Stata being available exceptions). Redivis does not introduce any of its own proprietary formats or programming languages.

All datasets contain provenance documentation, which is automatically populated based on administrator actions. This information can be further edited or supplemented with additional related identifiers.

All modifications to a dataset are tracked in the administrative logs.

Redivis publishes a data retention policy. Data is always owned by the user or organization who uploads the dataset, and they have control over a dataset's presence and availability. The dataset owner may apply additional policies towards data retention.

Redivis provides extensive options to limit access to sensitive or restricted data, including allowing access at different levels (e.g. metadata, sample, data). Access is granted and revoked on Redivis instantly, allowing for immediate changes in access based on changing circumstances.

Redivis has built-in data export restrictions available to administrators. Data download or export can be restricted completely or limited only to administrator approval in order to prevent redistribution.

Data administrators can also communicate and collect formal acknowledgement of other use restrictions through access requirements. Moreover, data administrators can easily audit the use of restricted data in the administrative logs to further check for and limit any non-compliance or other misuse.

Redivis has a built-in tiered access system that allows administrators to restrict access to sensitive data. These controls apply to data derivatives as well, where any data output inherits the access rules of the source dataset(s) used to create that output. These controls on derivative data allow researchers to share analyses with colleagues without worrying about accidentally leaking sensitive information, since all collaborators will need to comply with the access rules in order to view those outputs.

Redivis has built-in data export restrictions available to administrators. Data download or export can be restricted completely or limited only to specific external systems / upon administrator approval. All data downloads are logged for subsequent audit and review. Downloads are only available to authenticated users who have access to the underlying data.

Violations of the terms-of-use may lead to account suspension or revocation, as outlined in the terms. Additional technical controls are in place to prevent abuse or misuse of Redivis's systems.

Administrators can define access requirements on any restricted dataset. These access requirements are transparent to all users, and researchers must apply and be approved for a given set of requirements in order to gain access. Requirements also have space for both administrators and applicants to leave comments specifically in the context of the data application.


Fine tuning a Large Language Model (LLM)

This guide demonstrates using a Redivis workflow to import an existing LLM, fine-tune it with relevant data, and then run it on another, similar set of data we are interested in.

Workflow objective

Here, we want to fine-tune a pre-trained "foundational" LLM so that it can be used to score reviews. We will leverage an existing dataset that contains a collection of Yelp reviews and their scores to perform the fine-tuning, and then apply this classification model to other reviews (from Reddit) that do not contain an accompanying score. The goal here is to demonstrate how Redivis can be used to leverage, modify, and ultimately apply state-of-the-art LLMs to novel data.

1. Choose and explore data

Yelp reviews

To get started we want to understand this dataset and what information is in each table. We can look at the dataset page to learn more about it, including its overview information, metadata, and variable summary statistics. Since this dataset is public we can also look directly at the data to confirm it has the information we need.

It looks like there are two tables, one with reviews for testing a model and another with reviews for training a model. Clicking on each table in this interface shows that they both have two variables (label and text) and that the Train table has 650,000 records while the Test table has 50,000 records.

This data seems to be formatted exactly how we'll want to use it, so we don't need to do additional cleaning.

Reddit

For this workflow, we just want to look at reviews from one specific subreddit which reviews mouse traps: MouseReview. If we click on the Subreddit variable name, we can see a searchable frequency table with all this variable's values. If we search MouseReview we can see that this dataset contains 26,801 posts.

To move forward with this workflow we'll want to train a model on the Yelp dataset, and filter and clean the Reddit table to make it more usable with our model. In order to clean or transform data and do our analysis we'll need to create a workflow.

2. Identify a base model to fine-tune

We will use a pretrained BERT model as our base. This model is hosted on Hugging Face, so we could load it directly into our notebook at runtime. However, if our notebook uses restricted data, it might not have access to the external internet, in which case we'll need to load the model into a dataset on Redivis.

3. Create a workflow

At the top of any dataset page, we can click the Analyze in workflow button to get started working with this data.

You can add this dataset to an existing workflow you already have access to, or create a new workflow to start from scratch.

Add the additional datasets by clicking the + Add data button in the top left corner of the workflow and searching for the dataset by name.

4. Create a notebook and load model + data

Once we've added all our datasets to the workflow, we can get started. To begin, we'll create a python notebook based on the Yelp reviews training data, by selecting that table and clicking the + Notebook button.

We'll also need to install a few additional dependencies to perform training and inference via Hugging Face python packages:
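
A typical set of dependencies for this kind of fine-tuning might look like the following; the exact package list and versions are an assumption here, so adjust them to match your notebook (you can install packages via the Dependencies panel, or directly in a cell as shown):

# Notebook cell: install the Hugging Face stack used in the steps below
# (illustrative package list; pin versions as needed)
!pip install transformers datasets accelerate torch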

The general steps here are as follows (a sketch of the corresponding notebook code follows the list):

  1. Load the training and test data from the Yelp reviews dataset

  2. Load the pretrained BERT model

  3. Train this base model on the Yelp data to create a fine-tuned model that can classify text reviews from a score of 0-4.
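
Below is a minimal sketch of these steps, assuming the transformers and datasets packages. The table names, the bert-base-cased checkpoint, and the output directory are placeholders to adapt to your own workflow:

# Minimal fine-tuning sketch (table names and checkpoint are placeholders)
import redivis
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

# 1. Load the training and test tables from the workflow into dataframes
train_df = redivis.table("yelp_reviews_train").to_pandas_dataframe()
test_df = redivis.table("yelp_reviews_test").to_pandas_dataframe()
train_ds = Dataset.from_pandas(train_df[["text", "label"]])
test_ds = Dataset.from_pandas(test_df[["text", "label"]])

# 2. Load the pretrained BERT tokenizer and model (5 output classes for scores 0-4)
checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=5)

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

train_ds = train_ds.map(tokenize, batched=True)
test_ds = test_ds.map(tokenize, batched=True)

# 3. Fine-tune the base model on the Yelp reviews and save the result
training_args = TrainingArguments(
    output_dir="yelp_classifier",
    num_train_epochs=1,
    per_device_train_batch_size=16,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
)
trainer.train()
trainer.save_model("yelp_classifier")

Training on the full 650,000 reviews can take a while, so it's often worth iterating on a smaller subsample first.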

5. Prepare the Reddit data for inference

The Yelp data was ready to go as-is, with a simple text field for the review and integer value for the score. For the Reddit data, we just need to run a quick filter to choose posts from the appropriate sub-reddit.

Create a transform

Click on the Posts table in the Reddit dataset and press the + Transform button. This is the interface we will use to build our query.

The final step in a transform is selecting which variables we would like to populate the resulting output table. In this case we just need the variables title and selftext.

With everything in place we will run this transform to create a new table, by pressing the Run button in the top right corner.

7. Use the fine-tuned model to classify Reddit reviews

Finally, we can apply our fine-tuned model to the subset of Reddit posts that we want to analyze. Ultimately, we produce a single output table from the notebook, containing each Reddit post and the associated score generated by our model.
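
A sketch of that inference step is shown below, assuming the model directory saved during fine-tuning, the filtered Reddit table produced by the transform, and the notebook output-table helper in the redivis library (all names are placeholders; check the client library documentation for the exact methods available in your environment):

# Inference sketch (table and model directory names are placeholders)
import redivis
from transformers import pipeline

reddit_df = redivis.table("reddit_posts_filtered").to_pandas_dataframe()

classifier = pipeline(
    "text-classification",
    model="yelp_classifier",        # directory saved by trainer.save_model above
    tokenizer="bert-base-cased",
)

# Predicted labels correspond to the 0-4 score classes used during fine-tuning
predictions = classifier(reddit_df["selftext"].fillna("").tolist(), truncation=True)
reddit_df["predicted_score"] = [p["label"] for p in predictions]

# Materialize the scored posts as the notebook's output table
redivis.current_notebook().create_output_table(reddit_df)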

Next steps

We recommend familiarizing yourself with the examples and detailed documentation to take full advantage of the capabilities of Redivis notebooks.

Build your own site with Observable

Overview

1. Identify your Redivis data

2. Build your Observable project

3. Load data with a client library of choice

To successfully authorize data fetching in your development environment, be sure to export a REDIVIS_API_TOKEN variable in your environment with the following terminal command: export REDIVIS_API_TOKEN='MY_API_TOKEN'

IMPORTANT: API access tokens operate like passwords, and can allow another party to access resources on your behalf.

Writing a "data loader"

Using the redivis-js client library

Using the redivis-python client library

python3 -m pip install --upgrade pip
pip install -r requirements.txt
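
For example, a data loader written with redivis-python might look like the sketch below. In Observable Framework, a script named something like src/data/my_table.csv.py runs at build time, and whatever it writes to stdout becomes the generated CSV file; the organization, dataset, and table identifiers here are placeholders:

# src/data/my_table.csv.py (hypothetical path and identifiers)
import sys
import redivis

table = (
    redivis.organization("my_organization")   # or redivis.user("my_username")
    .dataset("my_dataset")
    .table("my_table")
)

# Write the table to stdout as CSV; Observable captures this as the data file
table.to_pandas_dataframe().to_csv(sys.stdout, index=False)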

4. Visualize data

5. Deploy and share your site

With our dashboard finalized, you can deploy your project anywhere on the web. Observable provides a quick command (npm run build) to generate static files to be hosted on a server of your choice, as well as a deployment service if you'd like to host with Observable.

Deploying to Observable

Deploying to Github Pages

Github Pages provides a hosting service that's easy to integrate with your Github repository and any Github Actions needed to deploy your project.

In our example, we deployed our project to Github Pages in a few simple steps.

First, we need to specify a REDIVIS_API_TOKEN to support Redivis client library authorization in the live deployment. We used an "action secret" within the Settings of the Github repo, which sets the REDIVIS_API_TOKEN environment variable to our appropriately scoped token and keeps it private even while our repository remains public.

  • Checkout the code

  • Set up a Node.js environment, with a specific node version (20.6)

  • Install our python dependencies (via pip, specified in requirements.txt)

  • Build the static files for the site

  • Deploy the files (in the generated dist folder) to Github Pages via a specific branch (gh-pages)

Create an image classification model

This guide demonstrates using a Redivis workflow to train an ML model on a set of images stored in a Redivis dataset.

Workflow objective

This is an example workflow demonstrating image classification via Convolutional Neural Networks. It imports an example dataset containing several thousand test and training images of cats and dogs, with which we can train and evaluate our model.

This workflow is heavily adapted from its initial publication at: https://gsurma.medium.com/image-classifier-cats-vs-dogs-with-convolutional-neural-networks-cnns-and-google-colabs-4e9af21ae7a8

1. Explore data

We can go to this dataset and browse its tables to understand the structure of the data it contains.

We see three tables here, and all of them are file index tables. That means that each table contains an index of the files (unstructured data) this dataset contains, sorted by the folder the administrator uploaded them into. We can click on the Files tab of the dataset to see each file individually, and click on it to see a preview.

This dataset has three groupings of files:

  • Training images (we will use these to build the model)

  • Test images (images not included in the training set that we can verify the model with)

  • Example file types (unrelated to this workflow)

If we click on the Tables tab, and click on the training images table, we can see high level information about this set of files. We can see that there are 25,000 files, and when we click the Cells tab, all of the file names we can see end in .jpg. We can hover on these to see a preview of the image, and we can click on the file_id variable to see a preview of the image with more information.

2. Create a workflow

At the top of this dataset page we can click the Analyze in workflow button to get started working with this data.

You can add this dataset to an existing workflow you already have access to, or create a new workflow to start from scratch.

3. Define a training set of images

We will use transforms to clean the data, as they are best suited for reshaping data and will quickly output new tables we can continue to work with.

Define training set

We need to start by defining the training set, which conceptually means the set of images we know are cats and know are dogs to train the model on. Information about whether an image is a cat or dog is in the file name, so we need to pull it out into a new variable we can more easily sort on.

We only want to include some of our training set images in the set we use to train the model, since we want to leave some aside to validate it. Here we want to include exactly 5000 cat images and 5000 dog images, so we will create a new variable rank and filter on it to keep only the first 5000 images of each type.

The final step in the transform is deciding which variables we want in our output table. We will keep our new boolean variable is_cat to use later, along with file_id and file_name.

With everything in place we can run this transform by clicking the Run button in the top right corner.

4. Sanity check the output table

Now that we have created a new table, we can inspect it to make sure our steps accomplished what we expected them to.

Click on the output table below the transform to view it. We can see that it contains 10,000 records, which is exactly what we expected.

We can also inspect each variable further. If we click on the is_cat variable we can see that there are 5000 true values and 5000 false values, which shows that our filtering was successful. We can also validate that the method we used to determine if an image is a cat or a dog worked by clicking on the Cells tab. Here we can see that records marked True have "cat" in their file name, and when we hover on the file_id value to see a preview, the image clearly contains a cat.

Since this table looks like we expect we can move on to the next step! Otherwise we'd need to go back to the initial transform to change our inputs.

5. Define validation set of images

We need to create a set of image files separate from our training set where we know if the image contains a cat or dog. This will be used to validate the model training.

Create a new transform and take all the same steps as we did in the previous transform, but we will change the filter to keep images ranked 5001-7500, rather than 1-5000.

We will keep the same variables as we did in our training model, and then run this transform.

When we run this transform and inspect the output table we see what we expect here as well. There are 5000 total files and we can validate a few of them visually on the Cells tab.

6. Training the model in a notebook

When you create the notebook for the first time it will start up. Notebooks must be running to execute code.

Install packages

Redivis notebooks come with many common packages preinstalled, and you can install additional packages by clicking the Dependencies button and importing libraries in the code.

Since this notebook contains only public data we can install packages at any time. For notebooks that use restricted data, there is no internet access, so packages can only be installed while the notebook is stopped.

import keras
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential, Model
from tensorflow.keras.optimizers import RMSprop
from keras.layers import Activation, Dropout, Flatten, Dense, GlobalMaxPooling2D, Conv2D, MaxPooling2D
from keras.callbacks import CSVLogger
from livelossplot.keras import PlotLossesCallback
import efficientnet.keras as efn
import redivis
import os

Load training and validation sets

Newly created notebooks come with standard code that imports the Redivis library and loads the source table into a pandas dataframe. For this example we will replace this sample code with our own data-loading code, tailored to the libraries we're using.

TRAINING_LOGS_FILE = "training_logs.csv"
MODEL_SUMMARY_FILE = "model_summary.txt"
MODEL_FILE = "cats_vs_dogs.h5"

# Data
path = f"{os.getcwd()}/cats_and_dogs/"
training_data_dir = path + "training/"
validation_data_dir = path + "validation/" 
test_data_dir = path + "test/" 
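
These directories need to be populated with the image files referenced by our training and validation output tables. One possible approach is sketched below; the output table names are placeholders, and the exact file-download helpers should be double checked against the redivis-python documentation:

# Download files from the workflow's file index tables (a sketch; adjust names)
def download_images(table_name, destination):
    index_df = redivis.table(table_name).to_pandas_dataframe()
    for _, row in index_df.iterrows():
        # flow_from_directory expects one subdirectory per class
        label_dir = os.path.join(destination, "cats" if row["is_cat"] else "dogs")
        os.makedirs(label_dir, exist_ok=True)
        redivis.file(row["file_id"]).download(os.path.join(label_dir, row["file_name"]))

download_images("training_set_output", training_data_dir)      # placeholder table names
download_images("validation_set_output", validation_data_dir)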

Define model parameters

This is where we will heavily rely on our selected libraries to build the model.

# Hyperparams
IMAGE_SIZE = 200
IMAGE_WIDTH, IMAGE_HEIGHT = IMAGE_SIZE, IMAGE_SIZE
EPOCHS = 20
BATCH_SIZE = 32
TEST_SIZE = 30

input_shape = (IMAGE_WIDTH, IMAGE_HEIGHT, 3)
# CNN Model 5 (https://towardsdatascience.com/image-classifier-cats-vs-dogs-with-convolutional-neural-networks-cnns-and-google-colabs-4e9af21ae7a8)
model = Sequential()

model.add(Conv2D(32, (3, 3), padding='same', input_shape=input_shape, activation='relu'))
model.add(Conv2D(32, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Conv2D(128, (3, 3), padding='same', activation='relu'))
model.add(Conv2D(128, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Conv2D(256, (3, 3), padding='same', activation='relu'))
model.add(Conv2D(256, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
            optimizer=RMSprop(learning_rate=0.0001),
            metrics=['accuracy'])

with open(MODEL_SUMMARY_FILE,"w") as fh:
    model.summary(print_fn=lambda line: fh.write(line + "\n"))
# Data augmentation
training_data_generator = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True)
validation_data_generator = ImageDataGenerator(rescale=1./255)
test_data_generator = ImageDataGenerator(rescale=1./255)
# Data preparation
training_generator = training_data_generator.flow_from_directory(
    training_data_dir,
    target_size=(IMAGE_WIDTH, IMAGE_HEIGHT),
    batch_size=BATCH_SIZE,
    class_mode="binary")
validation_generator = validation_data_generator.flow_from_directory(
    validation_data_dir,
    target_size=(IMAGE_WIDTH, IMAGE_HEIGHT),
    batch_size=BATCH_SIZE,
    class_mode="binary")
test_generator = test_data_generator.flow_from_directory(
    test_data_dir,
    target_size=(IMAGE_WIDTH, IMAGE_HEIGHT),
    batch_size=1,
    class_mode="binary", 
    shuffle=False)

Model training

This is where we will train the model we just built using the image data we cleaned.

# Training
model.fit(
    training_generator,
    steps_per_epoch=len(training_generator.filenames) // BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=validation_generator,
    validation_steps=len(validation_generator.filenames) // BATCH_SIZE,
    callbacks=[PlotLossesCallback(), CSVLogger(TRAINING_LOGS_FILE,
                                            append=False,
                                            separator=";")], 
    verbose=1)
model.save_weights(MODEL_FILE)

Evaluate model results

Now we will use the validation set to see how well our model works

# Testing
probabilities = model.predict(test_generator, steps=TEST_SIZE)
for index, probability in enumerate(probabilities):
    image_path = test_data_dir + "/" + test_generator.filenames[index]
    img = mpimg.imread(image_path)
    plt.imshow(img)
    if probability[0] > 0.5:
        plt.title("%.2f" % (probability[0]*100) + "% dog")
    else:
        plt.title("%.2f" % ((1-probability[0])*100) + "% cat")
    plt.show()

Next steps

Export to other environments

Overview

Redivis workflows contain powerful tools to reshape and analyze data, but if you prefer to export data into a different environment you can easily do so. Redivis uses open source tools and common formats to make the transition as easy as possible.

1. Check export restrictions

Some datasets on Redivis have export restrictions which prohibit or limit removing data from the system.

You can check this by going to the dataset page and clicking the top right button Manage access, or by right-clicking on a dataset in a workflow and selecting the View access option.

The bottom section of this access modal defines any export restrictions in place. It might not have any restrictions, might be completely restricted, or might have some limited options for export.

If there are limited options for export, this section will detail the available locations and any restrictions on using those (such as the size of the table being exported). These restrictions are enforced automatically when you try to take actions within the system.

If there are no options for export, you still have the option to work with your data in a Redivis workflow, where you can reshape and clean your data using transforms, and then analyze it in notebooks.

2. Prepare your data

If you would like to export the entire table, you can skip this step.

Otherwise, we recommend that you use transforms to cut your data down to a smaller size and reshape it into the table format you need for analysis before exporting it. Especially for large datasets, Redivis tools are created specifically for these purposes and work seamlessly in the browser with no additional setup.

3. Export data

You can open the Export modal to initiate an export from any table. You can do this on the dataset page by right clicking a table and selecting the Export menu option, or by right clicking on any table in a workflow and selecting the same option.

Here you can see all the options available for your table to export.

If your table has export restrictions set by the data owner, some options will be disabled here. In cases where your table does not meet export requirements but the data owner allows exception requests, you will see the option to request an exception here.

Download

The first tab of this modal gives you different options for formats to download your data. Most common data formats are supported, including csv, json, avro, parquet, SAS, Stata, and SPSS. Select the format you'd like and click the Download button to start downloading the file.

Programmatic reference

You can reference this table from your computer or another computational environment using the Redivis Python and R client libraries. This modal gives specific information on how to reference this table, and you can reference our docs for other options.
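
For example, with the redivis-python client library a table can be loaded directly into a dataframe (the identifiers below are placeholders; the R library exposes an analogous interface):

# Load a Redivis table into pandas (placeholder identifiers)
import redivis

df = (
    redivis.organization("demo_organization")
    .dataset("example_dataset")
    .table("example_table")
    .to_pandas_dataframe()
)
print(df.head())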

Integrations: Google Looker Studio

This is a free Google dashboard visualization program you can directly link your table to. Use their point and click interface to build common visuals that will update as the underlying table updates.

Integrations: Google Cloud Storage

You can export your table directly into a Google Cloud Storage bucket that you have access to.

Integrations: Google BigQuery

You can export your table directly into a Google BigQuery project that you have access to.

Next steps

Cite datasets in your publications

If the work you're doing leads to a publication, make sure to reference the dataset pages from datasets you've used for information from the data administrators on how to correctly cite it.

Create and populate a dataset

Overview

Datasets are a core component of Redivis. Consisting of documentation, metadata, and tables, datasets allow you to store, version, and distribute a wide variety of data.

Anyone with a Redivis account can create a dataset in their workspace, and organization administrators can upload one to an organization via the administrator panel.

1. Create the dataset

All datasets must have a name (unique among the datasets owned by the user or organization).

You can set up your dataset in whatever order you'd like, but we recommend following the order below when getting started.

2. Upload data

This data might be in a tabular format (.csv, .tsv, .sas, etc.) or, less commonly, unstructured data such as images and text files.

Tabular data

If you haven't already worked with data in a workflow, we strongly recommend exploring that before creating a dataset so you can understand how researchers will work with your data.

When you're ready to upload data, we have broken that out into a separate guide.

Unstructured data

For unstructured data, go to the Files tab where you can upload files from your computer or another location your data is stored via an integration. You can put these into folders and create index tables to better keep track of them.

Note that any files uploaded here can't be transformed in the workflow tool or queried across Redivis (which require the table format).

Make sure any files you upload here contain this dataset's data. Any files with information about the data (such as data dictionaries or usage guides) should be uploaded as documentation on the Overview tab.

3. Edit metadata

It's easy to feel "done" after uploading your data, but documentation and metadata are essential to the usability of your dataset. Moreover, rich metadata will improve the discoverability of your dataset by providing more information and terms to the Redivis search engine.

Metadata can always be updated after your dataset has been released. While good metadata is essential, producing it can be a time consuming and iterative process, so you might prefer to provide some basic content initially and then improve it over time.

Dataset metadata

On the overview tab of the dataset, you can provide an abstract, detailed documentation blocks, supporting files and links, and subject tags for the dataset.

Table metadata

Variable metadata

4. Create a sample

If your dataset is particularly large, or if you want to control access to a sample of the data separately from the whole dataset, you should configure sampling on your dataset. This will allow researchers to work with a 1% sample of the data during initial exploration, and allow you to grant access to the sample independently of the full dataset.

To update the dataset's sample configuration, click on any table, and then click "Configure sample". When configuring the sample, you can generate a random sample for each table, or sample on a particular variable that is common across tables. If researchers will be joining tables across your dataset, it is highly recommended that you sample on that common join variable so that researchers can obtain a consistent 1% sample as they work with your data.

5. Configure access

Before releasing your dataset, it is important to define who can access the dataset and what the procedures are for applying and gaining access. Click the Configure access button on the top of the page to set up the access configuration.

Datasets owned by organizations have more options for access than datasets owned by users.

Access levels

Dataset access has five levels:

  1. Overview: the ability to see a dataset and its documentation.

  2. Metadata: the ability to view variable names and summary statistics.

  3. Sample: the ability to view and query a dataset's 1% sample. This will only exist for datasets that have a sample configured.

  4. Data: the ability to view and query a dataset's tables, and work with them in workflows.

  5. Edit: the ability to edit the dataset and release new versions.

Access levels are cumulative. For example, in order to gain data access you will need to have gained metadata access as well.

Usage rules

Even with data access, you may want to limit what other users can do with your dataset. Currently, you can configure export restrictions that limit:

  • The download location (e.g., to prevent researchers from downloading to their personal computer)

  • The download size, in bytes and/or rows

  • Whether administrator approval is required before any export

Editors

You may also add additional dataset editors to help upload data and provide metadata content. These editors will be able to create and release new versions, and will have full access to the underlying data, though they cannot add other users, modify the access configuration, or bypass the dataset usage rules.

If the dataset is hosted by an organization, all administrators of the organization will be able to edit the dataset as well as its access configuration.

Permission groups

6. Release the dataset

Congratulations! Your dataset is ready to be released and utilized by the research community. But first, it is highly recommended that you validate and audit your dataset beforehand. Take a look at the number of rows, variables, and uploads in each table. Validate some of the variable summary statistics against what you expect. And to be truly thorough, add the dataset to a workflow and run some queries as if you were a researcher. Catching a mistake now will prevent headaches down the line if researchers uncover unexpected discrepancies in the data.

Once a version has been released, the data can no longer be edited. While you can unrelease a version within 7 days, this should generally be avoided; you'll need to release a new version to modify the data.

When you're confident that you're ready to go, click the "Release" button on the top of the page. If the button is disabled, hover over it to understand what issues are currently preventing you from releasing.

After clicking the button, you'll be presented with a final checklist of tasks. When you click the Release version button, the dataset will be immediately released and available to all users with access.

This dataset is now also considered Published. If you need to pause all activity and access to this dataset, you can return to this page in the future and Unpublish it temporarily.

7. Make updates as new versions

How to work with versions when updating a dataset:

  • Any edits to the data content in tables will need to be released as a new version.

  • Edits to the dataset information, table information, or variable metadata can be made on the current version (or historic versions) and will be live as soon as it's saved.

  • Edits to the dataset name and access configuration will always affect all versions.

Creating the next version

All data within a dataset is encapsulated in discrete, immutable versions. Every part of the dataset except for the name and access settings are versioned. All tables in a dataset are versioned together.

After releasing the first version of the dataset, you can choose to create a new version at any time by clicking the button in the top right "Create next version". This version will be created as vNext, and you may toggle between this and historic versions at any time.

Subsequent versions always build on the previous version of the dataset, and changes made in the next version will have no effect on previous versions. Alongside modifications to the dataset's metadata, you may create, update, or delete any of the previous version's tables.

Replacing vs appending data

When uploading data to a previous table, you can choose whether you want to append these new uploads to your existing data, or replace the entire table with the new data.

Version storage costs

Redivis computes row-level diffs for each version, efficiently storing the complete version history in one master table. This allows you to regularly release new versions and maintain a robust version history without ballooning storage costs.

Next steps

Start working with your data

Once your dataset is released, bring it into a workflow to transform and analyze it leveraging lightning fast tools from your browser.

Upload unstructured data as files

Overview

Any type of file can be uploaded to a Redivis dataset, and we support previews for the most common file types. These data can be analyzed in notebooks within workflows.

Note that if you have tabular data we strongly recommend uploading it as a table (rather than a file) so you and your researchers can take advantage of our extensive toolkits for previewing and manipulating tabular data.

1. Locate the data you want to upload

If importing, you'll want to get the relevant external account configured to your Redivis account before getting started.

The import tools allow for multiple uploads, so it's helpful to have them all in the same place.

2. Upload files

When you're ready click the Upload files button on the Files tab of the dataset page to get started and select the location your files will be coming from.

If you select "Computer" then you will get a browser window where you can choose the file, files, or folder you want to upload from. Choosing a different location will bring up the corresponding mechanism to select your files.

Click the "Upload" button to start the process. If you are uploading from your computer you'll need to wait on this screen until it's complete. Closing your window will end the upload process. If uploading from a linked source, you can close this window to allow this process to continue in the background.

3. Manage files and folders

You can click on the file name in the Files tab to preview the file and see its information.

You can also go to the Tables tab to view the index table for the folder you've uploaded files to. View the cells and hover or click on the file ID variable to see the preview here. This table can be used in Transforms within a Workflow to work with files on a large scale.

If you want to rename a folder you can do so on the Files tab by right clicking on the folder name in the right bar.

You can also create new folders and move files between two existing folders from this bar. When moving between folders, you can use a conditional statement to only move some files that match your conditions.

Next steps

Continue uploading your dataset

Great metadata makes your dataset usable. Complete your metadata, along with configuring access, creating a sample, and releasing this version.

Upload tabular data as tables

Overview

Redivis offers extensive tools for previewing tabular data and transforming it in workflows, but the data needs to be uploaded correctly in a tabular format for researchers to utilize these tools.

1. Locate the data you want to upload

If importing, you'll want to get the relevant external account configured to your Redivis account before getting started.

The import tools allow for multiple uploads, so no need to combine files together before importing them, but it's helpful to have them all in the same place.

2. Create tables

On your newly created dataset, the first step to uploading data is to create one or more tables that the data will be uploaded to.

The data files you currently have may or may not be how you want to store them on this dataset, so it's important to think about your data's structure before getting started.

For example, if you have multiple files that all follow the same schema, we strongly recommend uploading them as one table (for example, if you have a separate table for each year, or a separate table for each state, but the structure of each is the same). In the example of one file per state, this would allow researchers to query across all states, skipping the first step of doing up to 50 joins. Additionally, you generally shouldn't split out tables for performance reasons — even when querying billions of records, Redivis will execute in seconds.

When you're ready click the Create new table button on the Tables tab of the dataset page and name your table to get started.

3. Upload tabular file(s) to create a table

To get started uploading, choose the data source. By default this is your computer, but you can choose any option from the dropdown menu.

Next, choose the file(s) or enter the paths of the file(s) you want to import.

If you select multiple files here, they will be automatically appended into this single table on upload, based on common variable names. If a variable is missing in some of the files, that's OK; it will simply be recorded as null for all records from that file.

Once your files are selected, click the Import button. If the files are coming from your computer, you might need to wait until they are finished uploading to the browser before they can be imported into Redivis.

4. Verify uploads

As you upload files, you will see an overview of any files' progress and can click to view each file's data and additional information.

Once all uploads have completed, you can inspect the table (representing the concatenation of all of your uploads). Make sure to check the summary statistics and other analytical information to validate that the data are as you expected.

If you have more files to upload you can click the Manage imports button on the right side of the table at any time (up until releasing this version of the dataset).

Next steps

Continue uploading your dataset

Great metadata makes your dataset usable. Complete your metadata, along with configuring access, creating a sample, and releasing this version.

Administer an organization

Overview

Organizations are where administrators centrally manage datasets and who can access them. All organization administration happens from the Administrator panel which administrators can access from the Organization home page.

1. Add other administrators

You can add any Redivis account to be an administrator of your organization in the Administrator panel from the Settings tab.

Note that all administrators will have full access to the organization and its datasets, including the ability to modify access rules, approve access requests, and add or remove other administrators, including you.

2. Build your organization's presence

To help brand your organization and best communicate the organization’s purpose to others, you'll want to fully populate and brand the Organization home page.

In the Administrator panel on the Settings tab / Public profile section you can customize:

  • The organization's full name

  • An organization description which will appear at the top of your about section, and in any institution searches

  • A custom brand color

  • The organization's logo

  • The data portal's cover photo

  • A header image that will appear in the top left of users' windows when on your data portal

  • Header links to other resources, such as organization-specific documentation and events

  • Rich text and images that will display in the main content section of the data portal. A place to provide high level information about your organization and how researchers should use your data portal.

3. Create datasets

You'll need to create datasets to upload and distribute data to your researchers. You can create datasets from the Datasets tab of your Administrator panel.

Here are some pieces to consider before getting started:

Dataset structure

It's worth thinking through your broader data management strategy. Assess the data that you currently have — what format are the files in? Where are they located? How are they organized?

Consider what datasets you'll want to create. Datasets on Redivis are made up of one or more semantically related tables, and each table can contain data from one or more uploaded files. Moreover, all permissions happen at the dataset level — if you grant a user access to a dataset, they will have access to all of its tables.

Citations and DOIs

If you’d like to issue DOIs for your organization’s datasets, you can configure your organization to issue DOIs in the Advanced section Settings tab. DOIs will make it easier for researchers to cite your data when they publish, and allow you to track publications using your data as well as their broader impact.

4. Build access systems

Before you can configure access to a dataset, you'll want to consider and build out the relevant access systems for your organization as a whole.

Direct access & Member access

If you have a smaller organization where you know everyone working with your data, you might want to consider directly granting access to datasets to specific members, or granting access to everyone who is a member.

This can be set up on any dataset individually or you can create a Permission group on the Permission group tab to save the entire configuration of access levels and assign it to multiple datasets.

Requirement-based access

In many cases, however, you'll want to develop more process-driven access control systems and gather additional information from your members. To do so, you can set up requirements that must be completed by members or their studies, and approved by an administrator, before gaining access.

You can build requirements on the Requirements tab of the administrator panel, and save groups of them together as Permission groups on the Permission groups tab which you can assign to multiple datasets.

Export restrictions

Finally, you may want to control how your data is exported from Redivis, if at all. While members with data access will always be able to work with data in a Redivis workflow, you can limit their ability to export data. For example, you may choose to allow exports only to certain environments, or only upon administrator approval, or not at all.

5. Manage members and studies

As members and studies apply to access data, you may need to take action to approve (or reject) their access requests. You can find all of your pending access requests wherever you see an alert on the administrator panel.

Members

As users join your organization and apply to work with data, they will appear in the Members tab of your administrator panel. You can also invite anyone to become a member by adding them to the members list.

Studies

Studies allow for researchers to organize their work around a conceptual study, or research project, and for you to grant access in the context of that study. Any studies that are working with your organization's data will appear in the Studies tab of the administrator panel automatically, and you can approve any requirements for studies here.

To be alerted to new access requests, you should configure a notification email (and frequency) from the Settings tab of the administrator panel.

6. Gain insight into usage

Redivis provides tools to help you keep an eye on your organization and how its data is being utilized.

You can filter, sort, and change the columns for any list in your administrator panel, and download the corresponding report for a snapshot of current information. You can also dig into the logs and generate analytical reports for more information.

Next steps

Expand your reach

Continuous enrollment

Many insurance claims datasets on Redivis contain information about enrollment, detailing the periods of time when an individual represented in a dataset was covered by an insurance plan. If you intend to characterize patients based on their insurance coverage (or lack thereof) during certain key events (procedures, diagnoses, etc), it's often important to identify periods of continuous enrollment for each individual – and capture each continuous enrollment period for each patient in a single row.‌

These claims datasets describe enrollment information in multiple discrete rows per patient, each corresponding to one patient for one month. However, an overall continuous enrollment period may be broken up across rows into months or other non-uniform chunks, so we'll employ the following process to combine multiple sequential rows into a single row with one start date and one end date, describing one continuous period.

We want to create a final table with 3 rows for patient 1 to account for gaps in enrollment in March and May of 2012, and 1 row for patient 2. Our desired output has a row for each continuous period per patient, shown below:

The variable names in this example are not actual MarketScan variable names. But, with appropriate data access, you can see the real variables used in the first transform of the Redivis example workflow by hovering over the (renamed) variables patient_id, enrollment_start_date, and enrollment_end_date in the Keep section.‌

(1) Add start and end of adjacent periods to each row

Throughout this example, we'll create variables that partition the dataset by patient identifier (here, patient_id) to ensure that each patient is processed individually. But, to account for the fact that a single patient may have many periods of continuous enrollment, each spanning many months, we need to identify the correct start (from enrollment_start_date) and end (from enrollment_end_date) of a continuous period out of multiple rows and capture them in a single row.‌

First, we create a partition variable lag_end_date using the lag method, which will order each row by enrollment_start_date in a given patient_id partition and copy the previous row's enrollment_end_date value into each row. We also create lead_start_date using lead, to copy the following row's enrollment_start_date value into each row.

These methods generate values which tell us how close the preceding and following enrollment periods are with respect to each row's enrollment period.

(2) Select rows with start and end of continuous periods

In a second, downstream transform we'll create new variables, which will use the above lead and lag values to identify which rows correspond to the beginning and end of a continuous enrollment period.‌

First, we'll compare enrollment_start_date and lag_end_date to find the difference (in days) between the start of each period and the end of the previous period, and compare enrollment_end_date and lead_start_date to find the difference between the end of each period and the start of the next.

We see via diff_lag_end_enrollment_start (created using the date diff method) which rows describe an enrollment period directly following the previous period, and which rows describe an enrollment period with a larger gap since the previous period. We also create diff_enrollment_end_lead_start to identify gaps between an enrollment period and the next period.

Next, we'll encode booleans from our difference variables to simplify further filtering. We'll identify rows corresponding to the start of a continuous period as those with a diff_lag_end_enrollment_start value of either NULL (the row is the first period in the partition) or greater than 1 (the row comes after a gap in enrollment). And we identify rows corresponding to the end of a continuous period as those with a diff_enrollment_end_lead_start value of either NULL (the row is the last period in the partition) or greater than 1 (the row comes before a gap in enrollment).‌

This leaves us with either 1 or 2 rows corresponding to a continuous enrollment period. If a continuous period spans multiple rows (months, in this case), we'll have 2 rows (a start row and an end row). But if a period only spans one row, we'll have both start and end captured by that 1 row. In our example, the row containing the middle (neither start nor end) of the 2012-06-01 to 2012-08-31 enrollment period for patient 1 was dropped. We can also ignore our intermediate lead..., lag..., and diff... variables, since our final processing step will only consider a row's is_start_continuous_period and is_end_continuous_period values.

(3) Add end date of following enrollment periods to each row and collapse

Finally, we want to collapse our table to ensure 1 row per continuous enrollment period per patient. Since we have only rows corresponding to the start and end of continuous periods, we create another partition variable lead_end_date (this can happen in the same transform, since this step occurs after the previous filter) which copies the enrollment_end_date value of the following row onto each row.

We can also use the partition row filter to keep only rows with is_start_continuous_period as true, since our lead_end_date has copied over the end date of the continuous enrollment period, contained in each row's following row.

We end up with a table where each and every row contains both the start date of the continuous enrollment period and the end date of that continuous enrollment period.

A final processing step captures the correct end date of a continuous period. We now have only rows whose enrollment_start_date value contains the start of a continuous period, but these rows fall into two categories:‌

  • First, we have the rows that do not also correspond to the end of an enrollment period (where the is_end_continuous_period value is false). For these, we want to look at lead_end_date, the end date of the next row, which represents the final date of the continuous period, since we filtered out all the intermediate rows above.

  • Second, we have the rows which also correspond to the end of an enrollment period (where the is_end_continuous_period value is true) – in this example, if the continuous period was only 1 month. For these, the row defines a period (in this example, 1 month) that also contains the end date, so we just get the enrollment_end_date from that same row. Note that the lead_end_date value is incorrect in this case, since the next row contains the start of the next continuous period, or NULL if the period falls at the end of a partition.

We capture the above logic in a new variable enrollment_end_date_continuous, created in an additional transform, since our previous operation involved a partition.

We end up with the final table below, containing the patient identifier, the start date (renamed to enrollment_start_date_continuous for consistency), and the end date of each continuous enrollment period.
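
For reference, the pandas sketch below mirrors the three steps above on a small, made-up enrollment table (the rows are illustrative only, and the column names follow the renamed variables used in this example):

import pandas as pd

# Hypothetical input: one row per patient per month, with gaps for patient 1
df = pd.DataFrame({
    "patient_id": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "enrollment_start_date": pd.to_datetime([
        "2012-01-01", "2012-02-01", "2012-04-01", "2012-06-01", "2012-07-01", "2012-08-01",
        "2012-01-01", "2012-02-01", "2012-03-01",
    ]),
    "enrollment_end_date": pd.to_datetime([
        "2012-01-31", "2012-02-29", "2012-04-30", "2012-06-30", "2012-07-31", "2012-08-31",
        "2012-01-31", "2012-02-29", "2012-03-31",
    ]),
}).sort_values(["patient_id", "enrollment_start_date"])

# (1) Start/end dates of adjacent periods on each row
df["lag_end_date"] = df.groupby("patient_id")["enrollment_end_date"].shift(1)
df["lead_start_date"] = df.groupby("patient_id")["enrollment_start_date"].shift(-1)

# (2) Flag rows that start or end a continuous period, then drop interior rows
diff_lag = (df["enrollment_start_date"] - df["lag_end_date"]).dt.days
diff_lead = (df["lead_start_date"] - df["enrollment_end_date"]).dt.days
df["is_start_continuous_period"] = diff_lag.isna() | (diff_lag > 1)
df["is_end_continuous_period"] = diff_lead.isna() | (diff_lead > 1)
df = df[df["is_start_continuous_period"] | df["is_end_continuous_period"]].copy()

# (3) Copy the next row's end date, keep only start rows, and pick the correct end date
df["lead_end_date"] = df.groupby("patient_id")["enrollment_end_date"].shift(-1)
df = df[df["is_start_continuous_period"]].copy()
df["enrollment_end_date_continuous"] = df["enrollment_end_date"].where(
    df["is_end_continuous_period"], df["lead_end_date"]
)

result = df.rename(columns={"enrollment_start_date": "enrollment_start_date_continuous"})[
    ["patient_id", "enrollment_start_date_continuous", "enrollment_end_date_continuous"]
]
print(result)  # 3 continuous periods for patient 1, 1 for patient 2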

Configure access systems

Overview

One of the main tasks of organization administrators is to manage who has access to datasets. Redivis has tools to support any level of complexity in data systems for your restricted data and makes it easy for organization administrators to define, manage, and audit access rules across these various use cases.

1. Consider your organization's access needs

Do you mostly have public datasets? Will you have a small group of researchers with access to pretty much every dataset? Will you need to gather information from researchers before they use your datasets? Do you have high risk data that can't ever leave the system? Or some combination of the above?

Redivis supports access systems for all of these configurations – for organizations of varying sizes – but you'll want to use different tools to achieve these very different goals.

For example, if you have few datasets, you might find that permission groups for bulk managing access rules aren't particularly helpful. Or if you have a small group of researchers who you all know closely, you might not need to set up process-driven access rules through requirements, and can instead grant access to certain researchers on an individual basis.

We'll walk through all the tools available to organizations for managing access, but you'll likely want to pick and choose from what's available to meet your organization's specific needs.

2. Understand access levels

Dataset access has five levels:

  1. Overview: the ability to see a dataset and its documentation.

  2. Metadata: the ability to view variable names and univariate summary statistics, but not to retrieve any identifiable information or multivariate relationships from the data.

  3. Sample: the ability to view and query a dataset's 1% sample. This will only exist for datasets that have a sample configured.

  4. Data: the ability to view and query a dataset's tables, and work with them in workflows.

  5. Edit: the ability to edit the dataset and release new versions.

Access levels are cumulative. For example, in order to gain data access you will need to have gained metadata access as well.

We strongly recommend making your dataset's metadata as open as possible. This will reveal variable names and aggregate summary statistics, but will not allow researchers to view, query, or export the raw data in any way.

Being able to see metadata greatly improves researchers' discovery experience, allows them to better assess a dataset's utility upfront, and can even reduce your administrative workload: if researchers can understand a dataset before applying for access, they'll submit fewer access applications for datasets that are ultimately a dead end.

3. Understand available access tools

Membership

Anyone wanting to apply for access to restricted datasets in your organization must first be a member. You can configure whether memberships are restricted to certain identity providers (such as your institutional login), and whether they are approved automatically or require administrator review. You also have the option to configure access to datasets to "All members."

Direct access

Permission granted directly to a researcher to instantly gain access to a dataset at a specific level. Researchers can also request access to datasets with this configuration.

Example usage: a dataset that will only be shared with a small number of people who are already known to administrators

Member requirements

A form for members to fill out. This can be set to require approval from an administrator or be automatically approved, and it can have an expiration date. Requirements are global to your organization; when one is assigned to multiple datasets, a user only needs to fill it out once.

Example usage: a demographic form gathering researcher personal information, or a data use agreement signed PDF

Study requirements

Similar to requirements, but instead of each user needing to fill them out individually, only one requirement needs to be completed for the entire study, which can include multiple researchers. A single researcher may also have multiple studies. Each study working with the dataset will need to fill out its own study requirement, and any queries or exports of the data will be tied to that study.

Example usage: a funding proposal for a research project

Data export restrictions

A rule defining that a dataset can only be exported to a specific export environment, as configured on the Administrator panel Settings tab.

Example usage: limiting exports to a specific server environment

4. Create requirements

If you want to work with requirements, you'll want to get started making them and planning out how they will work across datasets.

Perhaps you want one requirement about their field of research for all members to fill out, which is necessary to gain access to any of your datasets, plus another four requirements with different data use agreements that apply only to specific datasets.

To get started, go to the Requirements tab of the administrator panel and click the New requirement button. You will need to start by selecting if this will be a Member requirement or a Study requirement.

You can use the form builder to collect different information, including standard form responses, file uploads, and e-signatures.

5. Create permission groups

A permission group is an access configuration that can be assigned to multiple datasets and managed centrally.

You don't need to use permission groups, and it might not make sense to do so if each of your datasets has a different access configuration and you aren't using requirements. But if there is any overlap between datasets and you want to enforce consistency, or you want to use requirements, you'll want to make one.

To get started, go to the Permission groups tab of the administrator panel and click the New permission group button.

This interface requires you to set an access paradigm for each access level of the dataset.

Perhaps you will set the overview access level to be public, the metadata access level to be available to all members, and data access level to be direct access (meaning you will have to directly grant access to users, or respond to their requests for access).

Or perhaps you will set the overview access level to public, and assign multiple requirements to the metadata and data access levels (meaning that anyone who is approved for all of the requirements will automatically gain that access level).

For any case you can assign data export restrictions here and choose whether you want to manage access to the dataset's sample (if it exists) differently than the full data.

If overview access to a dataset isn't public, non-approved users will not be able to see the dataset or its name in any way. In some cases, this may be the intended behavior, but remember there will be no way for researchers to apply for these datasets.

Instead, for these hidden datasets, an administrator will need to first explicitly grant overview access before researchers can view the dataset and request further access.

6. Assign access permissions to datasets

Finally you'll need to apply these access permissions to actual data!

Open any of your datasets and click the Configure access button on the top of the page. This configuration setup will look very similar to configuring the permission group.

You can either create a custom configuration here, or you can assign this dataset to one of the Permission groups by clicking the dropdown menu in the top right corner of this modal.

This is also where you will manage any direct access requests for the dataset.

7. Verify your setup

As an administrator of this organization, you will have access to all datasets no matter what your access configuration is or what requirements you have filled out.

We recommend checking that your access system works as you expect by either looking at a dataset while you are logged out or in your browser's incognito mode, or by making a second Redivis account using a different email address that is not linked to your administrator account.

Next steps

Grant access to data

You have some shiny access systems, but they won't work if you don't approve user requests for access.

Expand your reach

Cleaning tabular data

Overview

Redivis datasets are a great place to host data for interrogation in a workflow, but you can also edit the underlying data.

Perhaps you found an issue with the source, or want to restructure it, before making it available to others. Redivis has all the tools you'll need to do this in a versioned, step-by-step process with transparency into the changes you've made.

1. Upload the raw data

2. Add this dataset to a workflow

Create a new workflow and add this dataset to it.

If you want to share the data transformation process with others for transparency, you can make this workflow public in the share modal. (Due to the way Redivis works, still only people with access to the underlying data will be able to see the data in the workflow even though the workflow is public.)

3. Transform data

Select the table you want to make changes to and create a new transform.

Example edit 1: Rename and retype a variable

Example edit 2: Recoding the values of a variable

Create a new variable with the same name. "Keep" this variable and "Discard" the original variable. Select the method as "Case (if/else)" and create the conditions you want to recode to.

Example edit 3: Add a new variable

This example shows a new variable with today's date as date_uploaded, but you could create any variable you want from the data in the table. Perhaps an aggregation that would be helpful to see with this data? Or the sum of multiple other variables?

You can transform this table however you want, using the Graphical interface or SQL code. Make sure to move all variables in this table from the "Discard" section to the "Keep" section except for any variables that have been replaced and will be left behind (in this case the store_and_fwd_flag from the source table).

Validate that this new table looks correct by looking at the output table below this transform!

Make changes to any tables in this dataset in this same workflow.

4. Re-upload the finished tables

Go back to the original dataset, and create a new version.

Open a table that you made changes to in your workflow, and click to "Import data."

Choose your merge strategy as "Replace" since you want to replace the existing table with the new one you created.

Select the data source as "Redivis" and type in the reference to the table in the workflow where you made your changes. The sequence here will be:

  1. Your user name

  2. The name of the workflow (underscores replace spaces)

  3. The table name (underscores replace spaces)

For example: username.testworkflow.table_name

Once it's done uploading you can validate that the data looks like it's supposed to.

Now you can close this table and repeat this process for any other tables in this dataset you've edited.

Once this is finished, you can add the link to the workflow into this dataset's documentation or release notes along with a note on the changes. Remember that if you made the workflow public, anyone who has data access to this dataset can view the changes you made.

Redivis stores data as compactly as possible, so if you are concerned about data storage costs, only newly created records will add to your storage footprint. If you would like to delete the first version of the data, you can do that by opening the version modal in the dataset editor and clicking "Delete version".

Next steps

Start working with your data

Once your dataset is released, bring it into a workflow to transform and analyze it leveraging lightning fast tools from your browser.

Grant access to data

Overview

One of the strengths of Redivis is that researchers can apply for, and gain access to, data on the same page as where they discover the dataset. Moreover, all access applications are tracked, both for administrators' reference and to provide transparency and timely notifications to researchers.

Responding to access requests in a timely manner creates a great environment for researchers and takes full advantage of these automatic systems.

1. Configure notifications

You can configure multiple email addresses to receive an email when a request for access is made to your organization, as well as the frequency you receive them. Especially if you are a smaller organization that isn't getting access requests every day, we very strongly recommend that you have at least one administrator receiving emails so you know when to respond to a request.

2. Approve membership requests

Users must be approved as members of your organization before any other access requests can be approved.

When a membership request is pending you will see an Alert (!) on the Members tab of your workspace. Click on any member with an alert to see their pending submission awaiting approval.

Pending members will have an Approve / Reject button as soon as you click on their page, alongside the authentication information they submitted when requesting to be a member. It is important to verify that this person has a genuine email that seems consistent with who they say they are.

3. Approve requirement submissions

As users ask for approval you will receive an alert (!) in the administrator panel where your attention is needed.

Member requirements and export restriction exceptions will appear on the Member tab.

Study requirements will appear on the Studies tab.

Direct access requests will appear on the Datasets tab.

When all of the pending requests are resolved, the alert will disappear.

For each request you will have the option to approve or reject the submission. You can always come in later to revoke the approval if needed. You might also see that the member has submitted updates to their existing approval. In that case the approval is still valid while updates are pending, unless it is otherwise revoked or expired.

4. Leave comments for clarity

Both administrators and members have the option to leave a comment on any requirement submission. The act of leaving a comment will trigger an alert for administrators, or a notification for members.

Especially when you are rejecting or revoking a requirement submission, we recommend leaving a comment to let the member know what they did wrong, or what they still need to do to gain access.

5. Audit access

It's often a good idea to double check which researchers have access to your datasets.

In order to audit a particular member's access to your organization's datasets, double click on that member, and navigate to the Access overview tab. This will list all of your organization's datasets, and the member's corresponding access to each.

In order to audit all members that have access to a particular dataset, click the filter icon to the right of the member search bar. This will allow you to filter all members by their access to, and usage of, your organization's datasets.

Next steps

Expand your reach

Select first/last encounter

E.g.: I want the date of the first and last hospital admission for every patient in my table

Let's take an example table which contains all hospital visits in one facility over many years. Each record contains information about what happened in the visit, including a patient id to identify who was present. An individual patient might be represented in one or many records in this table since each record represents a separate hospital visit.

Let's say that for our research we want to find the first and last encounter — or more formally, the min and max encounter date for each patient.

In this example we will use a partition to create a new variable since we want the rest of the data to remain unchanged.

Variable concepts

Conceptually we will define our aggregation variable as the date of the encounter since that is where we want to look for the min and max values.

We will define our partition as the patient id since we want to find the min or max for each patient. This will limit the min or max value scan to all values of our aggregation variable across each unique value of our partitioned variable (patient id).

Defining the new variable

For this particular case, we want to find the MIN and MAX of the encounter_date of each patient; in other words, partitioned on patient_id . This will create two new variables, min_encounter_date and max_encounter_date for each patient.

When we run this query, note that no records will be dropped; rather, these new variables will have consistent values for any given patient_id.

Now, let's say we only want to keep records which contain each patient's first and last encounter. We can do this easily by creating a new row filter:

We can also generalize this approach to find the Nth encounter for each patient — take a look at the RANK, DENSE_RANK, and ROW_NUMBER methods.

And if we want to be more specific in our partition (e.g., for the first / last encounter in a given calendar year), we can always apply additional variables (such as year) when defining our partition.
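If it helps to see the same logic outside of the transform interface, here is a minimal pandas sketch of the partitioned min/max plus the follow-up row filter. The toy data and column names mirror the example above; this is an illustration of the logic under those assumptions, not the transform itself.

```python
import pandas as pd

# Toy version of the encounters table from the example above
encounters = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "encounter_date": pd.to_datetime(
        ["2012-01-01", "2011-01-01", "2013-01-01", "2009-01-01", "2015-01-01"]
    ),
})

# Partitioned min/max: every row keeps its patient's first and last encounter date
grouped = encounters.groupby("patient_id")["encounter_date"]
encounters["min_encounter_date"] = grouped.transform("min")
encounters["max_encounter_date"] = grouped.transform("max")

# Row filter equivalent: keep only rows that are a patient's first or last encounter
first_last = encounters[
    (encounters["encounter_date"] == encounters["min_encounter_date"])
    | (encounters["encounter_date"] == encounters["max_encounter_date"])
]
print(first_last)
```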


We also suggest you recreate this workflow as we go, to best learn the process.

For this workflow we'll need our initial data to train the model on (in this case Yelp reviews) and the data we want to apply the model to (Reddit posts). These data are already on Redivis, split across two datasets uploaded to the Redivis Demo organization: Yelp Reviews (Hugging Face) and Reddit.

The Reddit dataset has over 150 million Reddit posts along with subreddit information, split across two tables. We can look more closely at the 33 variables in the Reddit posts table, including univariate statistics.

We want to leverage an existing model that understands language and can generally be used for language classification. There are many open-source models that might meet our needs here; in this example, we'll use Google's BERT-base-cased model.

The Redivis dataset for this model is viewable on Redivis. You can also learn more about loading ML models into Redivis datasets in our guide to running ML workloads on Redivis.

To enable GPU acceleration, before starting the notebook we'll choose a compute configuration with an NVIDIA L4 GPU, which costs about $0.75 per hour to run (we could use the default, free notebook for this analysis, but it would take substantially longer to execute).

With that, we can start our notebook! The full annotated notebook is embedded below.

We will use a transform to clean the data, as transforms are best suited for reshaping data at scale. Even though we might be more comfortable with Python or R, this dataset table is 83GB and it will be much easier and faster to filter it in a transform rather than a notebook.

Add a Filter step. Conceptually we want to keep records that are part of the subreddit we are interested in, and that are not empty, deleted, or removed.
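As a rough illustration of that filter logic outside of the transform, here is a small pandas sketch. The column names (subreddit, body) and the example subreddit are assumptions made for illustration; the actual transform applies the same conditions through the graphical interface or SQL.

```python
import pandas as pd

# Hypothetical slice of the Reddit posts table; column names are illustrative only
posts = pd.DataFrame({
    "subreddit": ["AskReddit", "wallstreetbets", "AskReddit"],
    "body": ["Great question!", "[deleted]", ""],
})

# Keep posts from the subreddit of interest whose text is not empty, deleted, or removed
keep = posts[
    (posts["subreddit"] == "AskReddit")
    & posts["body"].notna()
    & ~posts["body"].isin(["", "[deleted]", "[removed]"])
]
print(keep)
```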

Workflows are iterative, and at any point you can go back and change the source data, transform configurations, or notebooks and re-run them. Perhaps we want to look at other subreddits, or run the model on a larger sample of the Yelp data.

You can also fork this workflow to work on a similar analysis, or export any table in this workflow to analyze elsewhere.

This guide outlines an example workflow hosted by the Redivis team. Follow along in the workflow on Redivis to see all the details and build your own!

It's easy to use Redivis data anywhere on the web, either by interfacing programmatically via our client libraries, via one of our export integrations, or by embedding a table directly in another webpage.

To present Redivis data in a fully customizable way, you may want to build your own dashboard-style site. The following example shows a static site built with Observable Framework, deployed via Github Pages, that connects to Redivis data using our client libraries.

To try a lighter-weight "dashboard" for analyzing and visualizing your data, you can use integrated Redivis notebooks. With notebooks, you can work directly in your Redivis workflow alongside relevant source data, leverage a breadth of techniques and libraries available in the Python, R, Stata, or SAS ecosystems, and easily maintain a central asset to annotate, collaborate, and share.

Identify the data sources on Redivis that will provide the content to build your site. You may be interested in presenting facets of a dataset already hosted on Redivis, highlighting tables you uploaded yourself, or showing outputs generated in one of your workflows. In our example, we'll use weather station locations contained in the GHCN Daily Weather dataset, and outputs from a public Redivis workflow that uses the same source data, to visualize locations and aggregate precipitation measurements from around the world.

To reference a Redivis table, choose the "Export table" option and navigate to the "Programmatic" tab (e.g., on the GHCN Daily Weather "Stations" table), and see the code snippet for unique identifiers for the owner, dataset/workflow, and table. More details about referencing Redivis resources are available in the documentation.

With chosen data content in mind, the next step is to choose a set of tools to build our custom dashboard and deploy it to the web. Of the numerous choices, Observable Framework provides an approachable option with support for working with data in many programming languages (not just javascript), powerful visualization libraries built in, and excellent documentation. To start with a copy of our example dashboard, you can clone our public Github repo and follow the README.md instructions to develop a similar site in a local environment. To create a brand new project, and to reference additional details and development strategies, see the Observable Getting started guide.

You'll need to generate a Redivis API access token with appropriate permissions. See the Authorization section in our API documentation to create your own token to read data from Redivis.

You should never share tokens, and avoid committing them to source control where collaborators may have access (either now or in the future). See the Deploy and share your site section below for details on how to use secrets to store the Redivis API token your project needs.

In Observable, a data loader is a file that accesses data and writes it to a standard output channel for consumption by renderers, to be ultimately displayed on your site.

In our example, we'll illustrate use of both javascript and python to pull data from Redivis into our dashboard, using the Redivis API via the js and python client libraries.

In a web environment, the redivis-js client library allows for fetching and manipulating data via simple javascript functions.

In our example, we first install the library by specifying the latest version in our package.json and running npm install in our development environment to install or update all specified packages.

Then, in the Observable data loader file redivis-js-precipitation.json.js, we authorize access, and then listVariables and listRows from our specified precipitation table.

In a python environment, the redivis-python client library allows for fetching and manipulating data via simple python functions.

Using the redivis-python client library requires that you develop your web application in a virtual python environment. Observable documents this process – essentially, create an environment with uv, activate it on your machine, and install relevant packages to the environment.

In our example, we'll create a requirements.txt file that specifies the latest version of the client library, both for development and future deployment. Running the following command in your virtual python environment will install pip, and then any package specified in the requirements.txt file.

Then, in the Observable data loader file redivis-python-geodata.json.py, we create a pandas dataframe from our specified stations table, and output the records.
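For orientation, a data loader along these lines might look like the following sketch. The owner, dataset, and table names are placeholders, and the exact redivis-python method names should be checked against the client library documentation for your version; this is an illustration of the pattern, not the example repository's actual loader.

```python
# Sketch of an Observable data loader (e.g. a *.json.py file); identifiers are placeholders
import json
import sys

import redivis

# Reference a table by owner / dataset / table name
# (follows the redivis-python pattern; confirm against the client library docs)
table = redivis.user("my_username").dataset("my_weather_dataset").table("stations")

# Pull the table into a pandas dataframe
df = table.to_pandas_dataframe()

# Observable consumes whatever the loader writes to stdout, here JSON records
json.dump(df.to_dict(orient="records"), sys.stdout, default=str)
```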

With data loaded and parsed into JSON format, it can now be manipulated and presented as you see fit. Among a wealth of modern web visualization tools is Observable Plot, which provides a polished selection of simple visualizations and corresponding functions to group and filter data, and pairs nicely with our development framework.

In our example, we use several Observable Plot functions to display our data on a map – using Plot.plot() with Plot.geo() to show a globe – and in a hex-binned histogram – using Plot.dot() and Plot.hexgrid() for data aggregation. We also add a range input to allow viewers to rotate the globe visualization.

Then, we use the built-in display command to print the JSON-shaped payloads, to give the viewer a quick look at the raw data.

You can run npm run deploy in the command line, and follow the prompts to deploy your project to an address on Observable. Further configuration and permissioning is available through your Observable account.

Next, we write a Github Action to build and deploy the project, which will run anytime we push to the main branch of our repository. The action is specified via the deploy.yaml file, and relies on a set of open-source actions and our own custom code to do the following steps:

After adding this file to your Github repository, any push to the main branch will trigger a deployment following the steps above, the status of which can be viewed under the Actions tab of the repository.

Finally, Redivis has hosted a number of different team projects at a custom endpoint specified in our organization's Github Pages settings, which you can see at labs.redivis.com.

We also suggest you recreate this workflow as we go, to best learn the process.

All the image data we need is contained in the Demo organization dataset Example data files.

Click on the table Training images and create a transform. This interface is where we will define a query which will run against our source table and create an output table. You can choose to write the query in SQL, but we will use the graphical interface for this example since it is faster and easier to use.

Add a Create variables step and name the new variable is_cat. The method will be Regexp contains, which allows us to easily identify the presence of the string cat in the file_name variable. This new variable will be a boolean variable where true means the image contains a cat and false means it does not.

To do this, click + Add block in the Create variables step and use the Rank method. This is an analytic method, which means you will use the partition ability to partition on true and false values. For each partitioned value (true and false) a rank will be assigned.

Create a new Filter step. Conceptually we will keep records up to 5000 in the rank variable, which means it will include 5000 true values and 5000 false values.
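If it's useful to see these three steps expressed as code, here is a compact pandas sketch of the same logic: a boolean derived from a regular expression on file_name, a rank within each class, and a filter keeping up to 5000 rows per class. The file names are made up, and the ordering behind the rank is simplified relative to the transform's Rank method.

```python
import pandas as pd

# Illustrative stand-in for the Training images index table
images = pd.DataFrame({
    "file_name": ["cat_001.jpg", "dog_001.jpg", "cat_002.jpg", "dog_002.jpg"],
})

# "Regexp contains": true when the file name contains the string "cat"
images["is_cat"] = images["file_name"].str.contains("cat")

# "Rank" partitioned on is_cat: number rows separately within the true and false groups
images["rank"] = images.groupby("is_cat").cumcount() + 1

# Filter: keep up to 5000 rows per class (5000 cat images and 5000 non-cat images)
balanced = images[images["rank"] <= 5000]
print(balanced)
```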

Next we want to train and test a model using Python code with the help of various Python libraries. Transforms are more powerful than notebooks for processing data at scale, but they are based on SQL and operate linearly, with only a single output table allowed. In order to work in Python, R, Stata, or SAS to generate visuals and other outputs, we will create a Notebook node on our Training data output table.

The main libraries used to create this model are Keras and Tensorflow. You can view their documentation for further details.

Perhaps we see something in this model we want to tweak, or we want to go back and change some of our underlying data. Workflows are iterative, and at any point you can go back and change the source data, transform configurations, or notebooks and re-run them.

Notebooks can also create output tables, which allow you to sanity check the work done in the notebook, or to create a table to use in another notebook or transform. You can also fork this workflow to work on a similar analysis, or export any table in this workflow for work elsewhere.


You can create a new dataset by navigating to the Datasets tab of your workspace or administrator panel and clicking the New dataset button.

All tabular data is associated with a table, and each dataset can have one or more tables. While you may release a dataset without any tables, this will be of limited use to other researchers, as Redivis provides numerous tools for understanding, querying, and generally working with tabular data.

Learn more in the guide.

Learn more in the guide.

The abstract should be a brief overview of the dataset, while the rest of the documentation can be as thorough as you'd like. Each documentation block has a header for suggested content, and any you don't fill out won't be shown on the dataset page. These blocks contain a rich text editor, complete with support for embedded images. Most of this information will be visible to anyone with overview access, though you can also create custom documentation sections that require a higher level of access.

Make sure to audit your data's provenance information to give attribution to whoever is working on the data. If this dataset is part of an organization, you can configure a DataCite account to issue a DOI for each dataset. Note that if your organization is configured to issue DOIs, then one will automatically be issued for this dataset when you first publish it.

To help users understand what each table represents, you should update the description, entity, and temporal range for each table in the dataset. The entity should define what each row in a table represents: is it a person? an event? a charge? The temporal range can be tied to a specific variable (using the min/max of that variable), or defined explicitly.

The tables in your dataset are made up of named variables, though rarely is this name enough to understand what the variable measures. On any table, click "Edit variable metadata" in order to populate the variable metadata.

On each variable, Redivis supports a label, a description, and value labels. The label is the most essential item; think of it as a more human-readable variable name. The description should contain more detailed information, everything from caveats and notes to collection methodology. Value labels are only applicable when the variable is encoded with keys (often integers or short strings) that map to the actual value — for example, a survey might be encoded as 0: "No", 1: "Yes", 2: "Don't know", 3: "Declined to answer".

Editing variable metadata can be a tedious process, but Redivis does support the ability to upload metadata in bulk, and will also automatically extract metadata if it's present in the uploaded data files (e.g., Stata or SAS upload types).
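To make the value-label idea concrete, here is a tiny pandas sketch that decodes a coded variable using the survey example above. The variable name is hypothetical.

```python
import pandas as pd

# Value labels for a coded survey variable, as in the example above
value_labels = {0: "No", 1: "Yes", 2: "Don't know", 3: "Declined to answer"}

# Hypothetical coded responses as stored in the table
responses = pd.Series([1, 0, 3, 2, 1], name="survey_response")

# Decode the stored keys into their human-readable labels
print(responses.map(value_labels))
```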


If the dataset is hosted by an organization, you will have additional options for configuring access to the dataset. The dataset can be assigned to a permission group to help standardize access procedures, and this permission group can contain requirements that help data managers fulfill contractual requirements and gather relevant information about the research being done on the dataset.

Learn more in the guide.

Once a dataset is released, you can return to it to make changes at any time. Changes to datasets are tracked in Redivis as versions. Anyone with access to a dataset can view and work with any of its versions.

Learn more in the related reference sections.

Learn more in the guide.

This guide assumes you have already started by creating a dataset.

You can upload files directly from your computer, or import them from a linked account.

Once selected, you will need to choose a folder to put them in. All files in a dataset live within a folder, which has a corresponding index table to help researchers understand and navigate those files. You can change this later if needed. If this is your first time uploading files to this dataset you will need to create your first folder.

Learn more in the guide.

This guide assumes you have already started by creating a dataset.

You can upload data directly from your computer, or import it from a linked account.

When uploading files to tables, remember that every row in a table should represent the same "thing", or entity; we wouldn't want to combine county-level and state-level observations in one table.

If you haven't already, we very strongly recommend experimenting with the reshaping and analytic tools in a workflow, which researchers will use to work with your dataset. Knowing how they will work with it might inform how you structure it during this setup process, and can save time for everyone. You can even add your unreleased dataset to a workflow for testing — click on "View dataset page" from the dataset overview, and then add the unreleased dataset to your workflow.

For a full list of supported file types, as well as advanced functionality (such as wildcard imports) and error handling techniques, consult the reference section on uploading data.

Learn more in the reference section.

Learn more in the guide.

If you're interested in using Redivis for your research group, center, or institution, contact us to get an organization set up.

Want a custom URL for your data portal (e.g., data.university.edu)? Contact us and we'll work with your IT to get it configured.


If you are part of a larger institution, you can contact us about getting an institution-wide Redivis page to help users discover new organizations and datasets.

The following example is illustrated on Redivis in the MarketScan Continuous Enrollment workflow – you'll need access to MarketScan data to view the details.

In this example, we will process the MarketScan Enrollment Detail table to create a table in which each row describes a single period of continuous enrollment. We show an artificial snapshot of the dataset below, where patient 1 has multiple periods of continuous enrollment due to some gaps in coverage, and patient 2 has a single period of continuous enrollment.

We then see via the boolean variables is_start_continuous_period and is_end_continuous_period (created with the case method) whether a given row corresponds to the start of a continuous period, the end of a continuous period, or both.

Then, to capture only the start and end of a continuous period, we'll use a filter to keep only rows which are true for either is_start_continuous_period or is_end_continuous_period.
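For readers who prefer to see the logic in code, here is a minimal pandas sketch of the same approach: compute lag/lead dates per patient, flag period starts and ends based on gaps of more than one day, and collapse to one row per continuous period. It mirrors the example tables but is only an illustration of the logic, not the transform itself.

```python
import pandas as pd

# Artificial enrollment spans from the example (monthly coverage with some gaps)
enroll = pd.DataFrame({
    "patient_id": [1, 1, 1, 1, 1, 1, 2],
    "enrollment_start_date": pd.to_datetime([
        "2012-01-01", "2012-02-01", "2012-04-01", "2012-06-01",
        "2012-07-01", "2012-08-01", "2012-01-01",
    ]),
    "enrollment_end_date": pd.to_datetime([
        "2012-01-31", "2012-02-28", "2012-04-30", "2012-06-30",
        "2012-07-31", "2012-08-31", "2012-01-31",
    ]),
}).sort_values(["patient_id", "enrollment_start_date"]).reset_index(drop=True)

# Lag/lead within each patient: previous span's end and next span's start
enroll["lag_end_date"] = enroll.groupby("patient_id")["enrollment_end_date"].shift(1)
enroll["lead_start_date"] = enroll.groupby("patient_id")["enrollment_start_date"].shift(-1)

# A row starts a continuous period if there is no prior span, or a gap of more than one day
gap_before = (enroll["enrollment_start_date"] - enroll["lag_end_date"]).dt.days
enroll["is_start_continuous_period"] = gap_before.isna() | (gap_before > 1)

# A row ends a continuous period if there is no following span, or a gap before the next one
gap_after = (enroll["lead_start_date"] - enroll["enrollment_end_date"]).dt.days
enroll["is_end_continuous_period"] = gap_after.isna() | (gap_after > 1)

# Collapse to one row per continuous period by pairing each start with the next end
# (valid because rows are sorted by patient and start date)
starts = enroll[enroll["is_start_continuous_period"]].reset_index(drop=True)
ends = enroll[enroll["is_end_continuous_period"]].reset_index(drop=True)
periods = pd.DataFrame({
    "patient_id": starts["patient_id"],
    "enrollment_start_date_continuous": starts["enrollment_start_date"],
    "enrollment_end_date_continuous": ends["enrollment_end_date"],
})
print(periods)
```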

All interactions with data on Redivis require the user to have the appropriate access level for a given action. Ideally your data would be as permissive as possible to allow for the greatest exploration from researchers before they need to start applying for access.


Learn more in the guide.

If you are part of a larger institution, you can contact us about getting an institution-wide Redivis page to help users discover new organizations and datasets.

If you haven't yet, upload the data you want to work with to a Redivis dataset and release the version. A personal dataset or one belonging to an organization works just as well.

For this guide you can follow along in the Demo tables edits workflow on Redivis.

Use this transform to edit this table. Some common actions include:

Create new steps to rename and retype any variables you'd like to update.

Now release this version! This new version of the data contains the edited data. Anyone using this data in a workflow will see that this dataset has a new version next time they open a workflow. You can see the release notes and the updated table of our Demo tables updates live.

Learn more in the guide.


If you are part of a larger institution, you can contact us about getting an institution-wide Redivis page to help users discover new organizations and datasets.

Computing a min or max value is an aggregate operation that scans all records and chooses the highest or lowest value. To compute the min or max value in a variable we will want to use an analytic method. We can do this either by aggregating our table (which would drop records and/or variables) or by using a partition to calculate a new variable without changing the rest of the data.

You can see more examples of using partitions with analytic methods on the Create variables step page.


patient_id | enrollment_start_date | enrollment_end_date
1 | 2012-01-01 | 2012-01-31
1 | 2012-02-01 | 2012-02-28
1 | 2012-04-01 | 2012-04-30
1 | 2012-06-01 | 2012-06-30
1 | 2012-07-01 | 2012-07-31
1 | 2012-08-01 | 2012-08-31
2 | 2012-01-01 | 2012-01-31

patient_id | enrollment_start_date_continuous | enrollment_end_date_continuous
1 | 2012-01-01 | 2012-02-28
1 | 2012-04-01 | 2012-04-30
1 | 2012-06-01 | 2012-08-31
2 | 2012-01-01 | 2012-01-31

patient_id | enrollment_start_date | enrollment_end_date | lag_end_date | lead_start_date
1 | 2012-01-01 | 2012-01-31 | NULL | 2012-02-01
1 | 2012-02-01 | 2012-02-28 | 2012-01-31 | 2012-04-01
1 | 2012-04-01 | 2012-04-30 | 2012-02-28 | 2012-06-01
1 | 2012-06-01 | 2012-06-30 | 2012-04-30 | 2012-07-01
1 | 2012-07-01 | 2012-07-31 | 2012-06-30 | 2012-08-01
1 | 2012-08-01 | 2012-08-31 | 2012-07-31 | NULL
2 | 2012-01-01 | 2012-01-31 | NULL | NULL

patient_id | enrollment_start_date | enrollment_end_date | lag_end_date | lead_start_date | diff_lag_end_enrollment_start | diff_enrollment_end_lag_start
1 | 2012-01-01 | 2012-01-31 | NULL | 2012-02-01 | NULL | 1
1 | 2012-02-01 | 2012-02-28 | 2012-01-31 | 2012-04-01 | 1 | 31
1 | 2012-04-01 | 2012-04-30 | 2012-02-28 | 2012-06-01 | 31 | 31
1 | 2012-06-01 | 2012-06-30 | 2012-04-30 | 2012-07-01 | 31 | 1
1 | 2012-07-01 | 2012-07-31 | 2012-06-30 | 2012-08-01 | 1 | 1
1 | 2012-08-01 | 2012-08-31 | 2012-07-31 | NULL | 1 | NULL
2 | 2012-01-01 | 2012-01-31 | NULL | NULL | NULL | NULL

patient_id | enrollment_start_date | enrollment_end_date | lag_end_date | lead_start_date | diff_lag_end_enrollment_start | diff_enrollment_end_lag_start | is_start_continuous_period | is_end_continuous_period
1 | 2012-01-01 | 2012-01-31 | NULL | 2012-02-01 | NULL | 1 | true | false
1 | 2012-02-01 | 2012-02-28 | 2012-01-31 | 2012-04-01 | 1 | 31 | false | true
1 | 2012-04-01 | 2012-04-30 | 2012-02-28 | 2012-06-01 | 31 | 31 | true | true
1 | 2012-06-01 | 2012-06-30 | 2012-04-30 | 2012-07-01 | 31 | 1 | true | false
1 | 2012-07-01 | 2012-07-31 | 2012-06-30 | 2012-08-01 | 1 | 1 | false | false
1 | 2012-08-01 | 2012-08-31 | 2012-07-31 | NULL | 1 | NULL | false | true
2 | 2012-01-01 | 2012-01-31 | NULL | NULL | NULL | NULL | true | true

patient_id | enrollment_start_date | enrollment_end_date | is_start_continuous_period | is_end_continuous_period
1 | 2012-01-01 | 2012-01-31 | true | false
1 | 2012-02-01 | 2012-02-28 | false | true
1 | 2012-04-01 | 2012-04-30 | true | true
1 | 2012-06-01 | 2012-06-30 | true | false
1 | 2012-08-01 | 2012-08-31 | false | true
2 | 2012-01-01 | 2012-01-31 | true | true

patient_id | enrollment_start_date | enrollment_end_date | lead_end_date | is_start_continuous_period | is_end_continuous_period
1 | 2012-01-01 | 2012-01-31 | 2012-02-28 | true | false
1 | 2012-04-01 | 2012-04-30 | 2012-06-30 | true | true
1 | 2012-06-01 | 2012-06-30 | 2012-08-31 | true | false
2 | 2012-01-01 | 2012-01-31 | NULL | true | true

patient_id | enrollment_start_date_continuous | enrollment_end_date_continuous
1 | 2012-01-01 | 2012-02-28
1 | 2012-04-01 | 2012-04-30
1 | 2012-06-01 | 2012-08-31
2 | 2012-01-01 | 2012-01-31


patient_id | encounter_date
1 | 2012-01-01
1 | 2011-01-01
1 | 2010-01-01
1 | 2013-01-01
2 | 2009-01-01
2 | 2008-01-01
2 | 2015-01-01
3 | 2014-02-01

patient_id | encounter_date | min_encounter_date | max_encounter_date
1 | 2012-01-01 | 2010-01-01 | 2013-01-01
1 | 2011-01-01 | 2010-01-01 | 2013-01-01
1 | 2010-01-01 | 2010-01-01 | 2013-01-01
1 | 2013-01-01 | 2010-01-01 | 2013-01-01
2 | 2009-01-01 | 2008-01-01 | 2015-01-01
2 | 2008-01-01 | 2008-01-01 | 2015-01-01
2 | 2015-01-01 | 2008-01-01 | 2015-01-01
3 | 2014-02-01 | 2014-02-01 | 2014-02-01

Example tasks

We've started building a library of common actions that you might take when administering an organization.

Please let us know if there are other actions or concepts you would like us to provide examples for by reaching out to us or emailing contact@redivis.com.


Creating an account

Overview

By creating an account on Redivis, you can gain access to hundreds of restricted datasets and powerful querying tools. There is no cost to creating an account, and you may permanently delete your account at any time.

Your account tracks all of your workflows, queries, collaborations, and data approvals with different organizations. After creating an account, you will be able to apply for membership with various organizations on Redivis and request access to their restricted data.

Video guides

For more extensive full-workflow video walkthroughs, see our Events and press page:

Or read through our Example workflows:

Emailing subsets of members

Redivis has some built-in methods for helping you find and manage all of your organization's members, workflows, datasets, and access systems that you can use to ease your workflows.

In this example we will imagine that we need to get in touch with everyone working with a specific dataset.

1. Locate the dataset in the administrator panel

We can go to the Datasets tab of the panel and search or filter to locate our dataset.

2. Find relevant workflows

We can right click on this dataset and hover on View related to see all of the options relating to this dataset. In this case we want to see all workflows that have used this dataset.

Clicking this option navigates us to the Workflows tab, with a filter active for this dataset. This list is showing all workflows that contain this dataset.

3. See related members

Click the Select all checkbox to select all of these workflows. With these selected we can right click on any of them (or click the Actions button that appears in the top right) to View related activity.

In this menu we should select Members in workflows. This will navigate us to the Members tab filtered by all members who are owners or collaborators on the workflows we had selected.

4. Show emails

Now that we have narrowed it down to the members we are interested in, we can right click on the header (or the options menu in the top right of the list) to edit the columns shown. In this menu we can add the Contact email column.

The quickest way to get this information out of Redivis is to go back to that same menu and click Download CSV to begin downloading a CSV of the list exactly as you've configured it (all columns, filters, and sorting will be preserved).

Next steps

We can use these similar steps to view:

  • Queries run on a specific set of datasets

  • Exports made from a workflow

  • Requirements a member has filled out

And many more!

Your account

Overview

Your Redivis account establishes your identity on Redivis, and provides a home for various forms of data-driven investigation and collaboration.

Because researchers regularly move between institutions, and often want to be able to reference and access historic work, it is strongly encouraged that you:

  • Only have a single Redivis account, much as you would only have a single Google Scholar account.

Communication and privacy

When you apply for access to an organization's datasets, they will be able to see some information pertinent to your membership in their organization:

  • Full name

  • Contact email

  • Email / institutional identifier

  • Affiliation, if provided by your institution's login provider

Generate a report

Overview

All activity surrounding your organization and its data usage is logged and accessible to organization administrators. This can be helpful for year-end reports or any potential data security concerns!

You can easily organize, visualize, and download any of your organization's information by generating a report in your administrator panel.

1. Create a report

In the future you can return to this page to see your existing reports and create duplicates to further edit.

2. Choose the entity of the report

Start by choosing the entity, or topic, of the report. Whatever you choose here will define what each entry in your report represents.

You can choose a Resource such as Members, Workflows, Studies, or Datasets which will allow you to get a sense of the named entities your organization owns or has interacted with.

Or you can choose a Usage event such as Queries, Notebooks, Exports (or all of these combined) which will allow you to see how your organization's data has been used.

3. Define constraints

From here, define the other criteria of your report:

Time frame

If you choose a usage event as your entity you will need to set a time frame to define which events will be included. (Resources don't have a time associated with them since they are not events.)

Aggregation

Filter

You can additionally choose a filter that will restrict your report to only events or resources that meet your chosen criteria. For example, you might restrict the report to only usage events that referenced a particular dataset.

4. Define the report's fields

Finally, you'll need to select what fields or topics are present in the report. Every field selected here will become a column in your resulting report table. Different fields are available based on what entity you have selected.

Some fields are selected by default that might be helpful but you can make your own selections. For example, maybe you have a report of data usage and you want to include the average number of tables referenced in each notebook session, or the max amount of compute used in queries per day.

5. View the report results

Click the Create button to generate this report! This will generate a table with entries that match all the criteria you've defined above. You can browse this table to understand the data and make sure it looks how you'd expect.

If your report contains any numeric data you can click the Visualize tab to see a visual representation of 1-2 of your numeric fields. The type of graph will be chosen by default based on the shape of your data. You can hover on this graphic to learn more about each entry.

6. Iterate and export

You can update this report by clicking the Edit button. Changing any criteria will refresh the table and visual with new contents based on your new criteria.

This report is now saved in your organization's reports and you can access it in the future.

Any time you edit the report or come back to it after more than one hour has elapsed, the table will be automatically regenerated. Based on the time frame you've selected this might drastically change the contents.

If you'd like to work with the data in a different environment you can download the report table as a CSV by clicking the Download button.

Next steps

Expand your reach

Create & manage datasets

Overview

Datasets are a core component of Redivis. They are a versioned collection of tables containing data, alongside rich documentation and metadata.

Datasets can be hosted by organizations or individual users, and every dataset has its own Dataset page. Datasets can be shared with other users on Redivis according to their access configuration.

1. Create a dataset

Administrators can create datasets for their organization from the Datasets tab of their organization's Administrator panel. These datasets can be seen and managed by any administrator in the organization. When released, they will be visible on the organization's home page to anyone who has overview access to the dataset.

Alternatively, anyone with a Redivis account can create a dataset on the Datasets tab of their Workspace. These datasets are by default only visible to their owner, and have simplified options to support sharing with your collaborators.

When you first create a dataset, it will be unpublished and only visible to other editors. This means you can edit the dataset and validate everything before releasing it. You can also validate and reconfigure its access rules before release.

2. Import data

At the core of every dataset is the data it contains, so we recommend starting here.

All data in a dataset is stored in tables. You can create a new table on the Tables tab of your dataset and start importing data. Redivis can upload data from your computer or another location you’ve linked, such as Box, Google Drive, AWS, and Google Cloud.

Once your data is finished importing you can validate that this table looks as you expect it to.

You can create more tables here if this dataset has multiple separate tables.

However, if your data is split across multiple files that all follow the same structure (such as a different file for each state, or each year of data but with generally the same variables) you will want to import all of these files to the same table, where they will be automatically appended together.

3. Populate metadata

Metadata is essential to helping your researchers find and utilize your dataset. While some metadata will be generated automatically, such as variable summary statistics and counts, other metadata will require additional input.

Dataset metadata

On the Overview tab of the dataset there are multiple suggested sections to help break down information. You can fill out any of these that apply, such as the methodology, tags, contact information, etc.

Redivis will automatically generate citation and provenance information based on what we know about the dataset, but you can update this information with anything more specific.

If you have additional information that you want to include that doesn't fit one of these headers, you can create a custom section. Custom sections can also be set to be visible only to certain access levels if you have sensitive information.

Table metadata

You should also populate the metadata on each table. Tables can have a description, as well as an entity field that defines what each row in the table represents. You can also define the temporal and geographic range on the table, when relevant.

Variable metadata

Each variable within a table has its own metadata. The variable name and type will be pre-determined from your data, but you should add a short label and longer description to each variable to help researchers understand what that variable measures.

Additionally, some variables will contain coded values, in which case you should provide value labels that represent the human-readable term for each code.

4. Release

Once you are ready to make your dataset available to others, you'll need to release it. You can click the Review and publish or Review and release button in the top right of the page.

You'll want to double check all of your data before moving forward. While you can continue to edit the documentation and metadata after the version is released, the data in a version cannot be changed.

You can unrelease a version for up to 7 days after release, though this should generally be avoided. When updating your data to correct for mistakes, you’ll need to release a new version.

You should also confirm the access settings you set up when creating this dataset.

Once this dataset is released, it will become visible and available to anyone who would qualify for access.

Next steps

Edit data

You can use tools right on Redivis to create new versions of your dataset.

Administer your organization

Organizations allow for groups, centers, and institutions to more easily work with data by providing administrators with tools to effectively version and distribute data from a central location.

Datasets

Overview

Studies

Overview

Studies

You can create a study by clicking the + New button on the studies tab of your workspace. Any study you are a part of will appear in your workspace. Any edits you make will appear for all study collaborators.

Edit

You can add collaborators to your study. Collaborators can edit the study and will by default have edit access to all workflows created in this study.

When you create a study, you are assigned as the study's PI in the collaborators list. Organization administrators use this field to better track studies applying for their data.

If you add more users to a study you can change the PI assignment. Hover over a user's name and click Make PI to transfer it. Note that once you change the PI you will no longer have the ability to change the PI assignment.

You can also give a study a description, which will help administrators and other study members better understand its goals.

Workflow

Any workflow created in this study will share edit access with all other collaborators in this study by default.

Datasets

The list of datasets will be automatically populated as you use datasets in workflows in this study. You can also add datasets to it directly in order to make the access process easier later on.

Compute credits and billing

Overview

Compute credits can be used to provision advanced compute environments for notebooks.

The default notebook configuration is a free resource on Redivis, and has access to 2 CPUs and 32GB working memory, alongside a 60GB SSD disk and gigabit networking. This is similar to most personal computers, and for many analyses should be plenty of compute power!

Advanced compute environments will cost a number of compute credits per hour to run depending on the machine type. This amount will be clear when configuring and starting the notebook. In order to start a notebook with an advanced compute configuration you must have enough compute credits to cover at least 15 minutes of running time.

Purchase compute credits

Credits cost $0.10 each and can be purchased in increments of 100 ($10), 200 ($20), 500 ($50), or 1000 ($100). Click the Purchase credits button and choose the amount you'd like to purchase.

You can purchase credits immediately using a credit card or bank account, or through an invoice.

Credit card / Bank

When clicking Checkout you will be redirected to a Stripe payment processing page. The card you enter here will be processed by Stripe and never seen by Redivis or stored in our databases.

Invoice

You can choose to generate an invoice for your credit purchase. Any information you enter in the custom fields section will appear on the invoice and might be required by the organization paying the invoice. Once generated the invoice will be downloaded as a PDF to your computer. You can also return to view the invoice and copy the link to the invoice to pay electronically. Once the invoice is paid the credits will appear in your account.

Auto-purchase compute credits

You can set up your account to purchase credits every time your account dips below 10 compute credits. This ensures that your notebooks will never halt mid-session due to a lack of compute credits. Here, you can select the amount of credits you'd like to purchase every time this condition is met and the card you would like to be charged for it. If you'd like to change the amount or cancel auto-purchase you can do so by returning to this screen.

Refund compute credits

Billing dashboard

Redivis uses Stripe for all payment processing. If you'd like to view or edit your card on file or view previous purchases you can do so in the Stripe billing dashboard, which is linked from this page.

Usage limits

User accounts are currently limited to 10GB of storage. You can see your current usage on the Settings page of your Workspace, under the Compute Credits and Billing tab.

Managing logins

Overview

In order to create a Redivis account, you must authenticate through your academic institution, through any Google account, or via passwordless login. Over 2,500 academic institutions from around the world are currently supported, with more being added regularly.

Redivis will request some basic information when you authenticate, such as your name, email, and any relevant affiliations; it will not have access to any other information through your Google or institutional account.

Adding emails / authentications to your account

It is strongly encouraged that each individual has only one Redivis account, and that your identity on Redivis maps to your real-world identity. However, it is common that you might have multiple emails that you want to associate with your Redivis account: a personal email, university email, or visiting-scholar university email.

If you lose access to all of the emails associated with your account, you will no longer be able to log in to Redivis.

For this reason, it is strongly encouraged that you add a personal email to your account. This way, if you ever lose access to your institutional email, you will still be able to log in to your Redivis account (though you may not have access to certain datasets that required your previous institutional affiliation - see below).

Logins and organization membership

When you join an organization, your membership in that organization will be associated with a particular login. In order to work with that organization's restricted data, you must have recently authenticated with the login associated with your membership. If you sign in to Redivis using a different login (for example, you signed in with your personal email, while your membership is associated with your academic institution's credentials), you'll be prompted to re-authenticate with the relevant login before you can work with restricted data.

If you lose access to the login originally associated with your membership — for example, if you change institutions — you can request that your membership be updated to associate it with a new login. An administrator of the organization will then need to approve this request before access through your membership will be restored.

Merging accounts

If you have more than one Redivis account, we strongly encourage you to merge them. In order to merge multiple accounts, you should perform the following steps:

  1. Identify which account should be your "primary" Redivis account once the process is complete.

  2. Log in to your non-primary account.

  3. Transfer ownership of all datasets, workflows, and studies to your primary account. Alternatively, you can delete any datasets, workflows, and/or studies that are no longer relevant.

  4. Log in to your primary account, and add the authentication(s) that were previously associated with your non-primary account to your primary account.

Workspace

Overview

Your workspace dashboard has links and information to help ground yourself in the work you are doing on Redivis. Beyond the dashboard, you will see pages listed on the left bar with different aspects of your account.

Workflows

Datasets

Organizations

Studies

Dataset library

This section contains any dataset that you have bookmarked. You can manually bookmark a dataset by clicking the bookmark button next to the dataset title on any dataset page. Redivis will also automatically bookmark datasets that you add to workflows, or where you have applied for access. You can remove these at any time by right clicking on a dataset in this list and selecting the Remove from library option.

Logs

All activity you take when working with data on Redivis is logged. You can visit the Logs page of your account to view the details of your Query, Notebook, and Export actions.

Settings

Public profile

You can specify the following information, public to all users and organizations on Redivis:

  • Name: This is your full name displayed across Redivis. This will appear when collaborators search for you to share their workflows, and the name administrators will see when interacting with any access steps you've submitted. It does not need to be unique.

  • Username: A unique handle for your account.

  • Profile picture: A picture visible next to your name throughout the platform.

  • Disciplines: Areas of interest and study that best represent your work and research.

Authentications

Data sources

To allow you to import your data from wherever it resides, Redivis supports multiple integrations with outside services.

API tokens

Compute credits and billing

Advanced

This tab allows configuration of the following:

  • Contact email: Specify a contact email to receive email notifications for any update to your data access applications. If you apply to access data hosted by an organization, that organization will be able to see your contact email. By default, this is the email you used initially to create your Redivis account.

  • Security: Shows all current sessions of Redivis across devices and browsers, and provides the ability to log out of any of these sessions.

  • Delete account: Allows permanent deletion of your account, including all workflows you have created.

Note you cannot delete your account if you are the only remaining administrator of an organization, or if you still have any non-deleted datasets.

Stata license

SAS license

  1. Run SAS Deployment Wizard to install SAS Studio on a supported Linux 64-bit operating system. During the installation, change the default location for the SAS Studio installation to /usr/local/SASHome.

  2. For details about how to install SAS Studio, refer to the installation and configuration instructions provided by the product.

  3. Create a TAR file that includes the SASHome directory.

Once this tar file has been created, please reach out to contact@redivis.com to share the file so that we can enable SAS for your account.

Single Sign-On (SSO)

Overview

When working with non-public data on Redivis, it's often important for users to be able to authoritatively attest to their affiliation with a given institution or other entity. To this end, Redivis supports Single Sign-On (SSO) through most academic institutions, as well as the ability to establish identity through a Google account or validated email address.

Institution SSO via SAML

Troubleshooting institution SSO

In most cases, logging in with your institution will "just work". However, due to inconsistencies in how certain standards are applied around the world, you may run into issues when logging in through your institution. These issues can often be resolved with a quick ticket with your IT support desk – we recommend that you direct them to this page and copy support@redivis.com so that we may provide further technical information if needed.

Some common issues are outlined below:

Your institution does not support the Redivis service provider

Redivis was unable to determine identity

This error will occur after you've logged in with your institution, upon being redirected back to Redivis. In this case, the authentication request completed successfully, but your institution didn't provide enough information for Redivis to know who you are (which is important in order for you to apply for restricted data, so that the data distributor can be confident of who they're granting access to!).

For identity provider administrators

Redivis requires the following attributes:

  • eduPersonPrincipalName (urn:oid:1.3.6.1.4.1.5923.1.1.1.6)

  • email (urn:oid:0.9.2342.19200300.100.1.3)

  • name (urn:oid:2.16.840.1.113730.3.1.241)

The following attributes are optional but encouraged if available:

  • affiliation (urn:oid:1.3.6.1.4.1.5923.1.1.1.1) or scopedAffiliation (urn:oid:1.3.6.1.4.1.5923.1.1.1.9)

  • orcid (urn:oid:1.3.6.1.4.1.5923.1.1.1.16)

  • pairwiseId (urn:oasis:names:tc:SAML:attribute:pairwise-id)

  • eduPersonTargetedId (urn:oid:1.3.6.1.4.1.5923.1.1.1.10)

Other error messages

While uncommon, it's certainly possible that other errors might occur when logging in through your institutional credentials. If you do, please contact support@redivis.com and we'd be happy to help you troubleshoot.

SSO via Google

Redivis also supports the ability to sign in via any Google account. This can be a personal gmail account, or via your organization if it supports Google single sign-on. When you sign in with Google, your name, email, and an opaque persistent identifier will be shared with Redivis.

If your institution supports Google sign-on, but is also listed as a SAML identity provider (see above), the SAML SSO will be preferred. If you try logging in via Google, you will be redirected to your institution's login page.

Email sign-on

If your institution isn't listed and doesn't support SSO through Google (e.g., many .gov emails), you can also sign in via any email address.

Redivis will send a unique code to this email every time you log in, making it such that the account owner continuously "proves" their ownership of the given email address.

For security purposes, you must enter the code sent to your email in the same window from which it was initially requested. If you want to log in from a new window / device, you can request a new code.


To create a new account, click the Create account button at the top of any page. You can create an account using your academic institution's login credentials (e.g., you@your-university.edu), logging in with any google account, or via passwordless authentication through any email. Your account can then be managed from your workspace.

It is strongly encouraged that each individual has only one Redivis account, and your identity on Redivis should map to your real world identity. You can (and should) add multiple emails to one account — e.g., personal email, university email, visiting-scholar university email.

We recommend adding more than one authentication (e.g., university credentials, personal google account) to your Redivis account, to ensure that you can always access it.

Redivis will never, ever distribute your personal information to a third party. We will send data access notifications to your contact email, though this can be disabled in your account settings. You may opt in to receive occasional product updates, though this is turned off by default.

Navigate to the Reports tab of your administrator panel and click the + New report button.

If you chose a usage event you can optionally choose to aggregate events based on a resource or period of time. Perhaps you want to see all queries grouped by member, or want to see all notebooks per day. You can also create custom aggregation groups by using labels with colons. E.g., labeling members with school:medicine and school:business would allow you to aggregate a report by school.

Learn more in the reference section.

If there are any fields or other report information you would like but don't see available in the interface, please reach out to let us know!

If you are part of a larger institution, you can contact us about getting an institution-wide Redivis page to help users discover new organizations and datasets.


Contact us to set up a new organization, and learn more in the guide.

Datasets are a core component of Redivis. Datasets contain various metadata and documentation, as well as one or more tables and/or files containing data.

All datasets have their own persistent URL and are owned by either a user or an organization. Datasets can be added to workflows to analyze and combine with other datasets across Redivis. Datasets are automatically versioned, and you can always update the version you are viewing or working with.

Some components of a dataset may not be available to you until you are granted access. In order to see the existence of a dataset, you must at least have overview access.

You can create your own datasets to use in your workflows and share with colleagues, or create datasets within any organization that you administer.

New to Redivis? Learn more in our guide.

A study is a group of users working together with a common goal. Studies allow users to apply for access to study requirements on datasets as a group. They also make sharing workflows and other resources easier.

If you are working with larger tables, creating an ML model, or need to use a GPU, you can choose to configure a more advanced compute environment when setting up your notebook.

Credits on Redivis never expire and can be refunded at any time. If you would like to refund the compute credits on your account please contact us to initiate the refund process.

If you'd like to increase that limit and host larger datasets, contact us about creating an Organization!

In order to add an email to your account, navigate to the Settings page of your workspace, and click Add authentication in the "Authentications" tab. A new browser window will open requesting the relevant credentials, after which this new email will be associated with your account.

Delete the non-primary account by clicking "Delete account" within your account settings. WARNING: This action cannot be undone.

The workspace is the home of your content on Redivis. It is created when you make a Redivis account and is accessible by clicking the My workspace button at the right of the header bar from anywhere on the site. Your workspace is only visible to you and is where you will manage your work on Redivis.

Any workflows that you've created, or that have been shared with you, will show up here. You can create a workflow by clicking the + New workflow button on this page in order to start working with data. Clicking on a workflow here will open up the workflow interface.

Any datasets that you've uploaded or can edit will appear here. You can create a dataset by clicking the + New dataset button on this page in order to start uploading and sharing your data. Clicking on a dataset in this list will open the dataset editor which will allow you to work on the dataset.

This page lists all that you're a member of. If you have applied for access to any restricted data, a copy of your submissions will be listed under the organization you applied to. You can search for new organizations to join by clicking the + Find new organization button.

allow users to more easily work in groups. You can create a study by clicking the + New study button on the studies tab of your workspace. Any study you are a part of will appear in your workspace. Any edits you make will appear for all study collaborators.

ORCID iD: Link your ORCID ID so organizations can see research you are working on. See for more information on setting up an account.

You can use any email listed here to log in to your account. If you have more than one email address, the one marked as 'Primary' will be the one to receive any emails. You can by clicking Add authentication button. If the email you add is associated with an institution or organization, the authenticated information we receive from them will be listed here. A single authentication will be used by organizations to verify your identity when you .

Redivis will only ever read data from these sources; it will never modify or overwrite content. Additionally, Redivis will only access and transfer content when you specify a location in one of those sources as part of the .

You can create and manage tokens here for working with the Redivis API. See the for more information.

You can use to purchase advanced compute environments for notebook nodes within workflows.

Communications: Allows you to configure email notifications for any update to your data access applications, as well as occasional product update emails regarding changes to Redivis. Note that, if you opt to receive email notifications, you will still see a comprehensive list of notifications in the web platform.

In order to enable Stata for your account, you can provide your Stata license information here (alternatively, you'll be able to use Stata if one of your organizations provides a license). Specifically, you'll need the license "code", "authorization", and "serial number", which should all have been provided as part of purchasing Stata.

In order to enable SAS for your account, you can provide your license information here (alternatively, you'll be able to use SAS if one of your organizations provides a license). Because of how SAS handles cloud deployments, the steps to enable SAS are a bit more complicated than for Stata. Specifically, you'll need to complete the following steps:

Redivis is a registered service provider within the US-based InCommon federation, which in turn is part of the eduGAIN federation, enabling secure, authoritative SSO across thousands of universities around the world, via the SAML 2.0 protocol. If you are a member of an academic institution (as well as certain other research enterprises), you can search for your institution by name and log in to Redivis through your institution's sign-in page.

If, when choosing your institution to log in, you are immediately presented with an error page (before you can type in your password), this likely means that your institution needs to add Redivis to some sort of "service provider allowlist". As a registered service provider within InCommon / eduGAIN, most institutions will automatically accept login requests from Redivis – but some require manual configuration. In this case, your IT desk will need to take a quick action to enable Redivis – it will likely be helpful to direct them to Redivis's SAML metadata, found here: https://redivis.com/auth/saml/metadata

Redivis requires all institution identity providers to provide some minimal information about the individual, such as name, email, and a persistent identifier. These are codified as the "research and scholarship attribute bundle". If your institution uses OpenAthens for SSO, you can view their documentation to learn more about releasing these attributes.

tar -czvf SASHome.tar.gz /usr/local/SASHome

Files

Overview

Inspecting files

You can inspect an individual file by clicking on its name to launch the file viewer. Any file can be downloaded or referenced within a notebook, and many file types can be previewed directly within Redivis, including:

  • 3D models

  • Audio files

  • CIF + PDB files (molecular + protein structures)

  • FITS files (common in astronomy)

  • DICOM

  • HDF5

  • HTML

  • Images

  • PDFs

  • Videos

  • Text/code

  • TIFFs

  • ZIPs

  • TEI

Folders and index tables

All files are assigned to a "folder" within the dataset. You can click on the folder name in the right grey bar to filter this dataset's files by the folder they are in.

Each folder has a corresponding index table that is present on the Tables tab of the dataset. These will match the folder name and have a File index label.

In these index tables, each row of the table represents a file in that folder.

  • file_id: A unique, system-generated identifier for the file. Use the file_id to reference and download specific file(s) in your analysis.

  • file_name: The name of the file.

  • size: Size of the file, in bytes.

  • added_at: Timestamp for when the file was added to this folder.

  • md5_hash: MD5 checksum of the file contents, encoded as a base64 string.
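For example, here is a minimal sketch of working with a file index table from the redivis-python client library. The organization, dataset, and folder names are placeholders, and the file download helper is an assumption; check the client library documentation for the exact methods available:

import redivis

# Reference the index table that corresponds to a folder of files (placeholder names)
table = redivis.organization("Demo").dataset("Example dataset").table("images")

# Load the file index into a dataframe and find the largest files
df = table.to_pandas_dataframe()
largest = df.sort_values("size", ascending=False).head(10)

# Download a specific file by its file_id (assumed helper; verify against the redivis-python docs)
redivis.file(largest.iloc[0]["file_id"]).download("./downloads")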

Working with files

Uploading data

Overview

Redivis supports a wide variety of data types, sources, and upload methodologies. When beginning to upload data, you should first ask:

What type of data is it?

  • Tabular data

  • Geospatial data

  • Unstructured data

Where is the data?

  • A computer

  • Google Drive

  • Box

  • Google Cloud

  • AWS

  • Another table on Redivis

Where is the metadata?

  • Embedded in the file

  • In another file

How do I want to perform the upload?

  • Through the browser (default)

  • Programmatically, via Python or the API

Creating & editing datasets

Overview

When creating a dataset, make sure to create it within the correct context (your workspace vs. one of your organizations), as datasets can't currently be easily moved across contexts.

Dataset visibility

While in the unpublished state, only dataset editors will be able to view the dataset, and the dataset won't be accessible to other users nor will it show up in any search results.

Dataset editors

If a dataset is owned by a user, that user has full ability to edit and modify the dataset. If a dataset is owned by an organization, all administrators of the organization will have such rights.

You can also add other editors to the dataset by selecting Configure access at the top right of the dataset, and adding the specific individuals as editors. Editors will be able to upload data, modify metadata, and release new versions of the dataset, but they cannot modify any of the dataset's access rules or change the dataset's published status.

Dataset editors cannot release the initial version of a dataset, since doing so would change the dataset's published status, which has implications for access to the dataset. Once the initial version is released, editors will be able to create and release subsequent versions.

Dataset settings

Under the dataset settings, additional options for the dataset can be configured:

Dataset name

You can rename the dataset here (or by clicking on the dataset title within the dataset editor).

Administrator notes

These notes are only visible to editors of the dataset, and may provide helpful documentation for administrative processes around maintaining the dataset.

Published status

Control whether the dataset is published. If a dataset is unpublished, only editors will have access to the dataset, though you can configure access for other users such that they'll gain access when the dataset becomes published.

All datasets are initially unpublished when created, and become published when the first version is released. If you ever need to quickly remove a dataset from circulation, you can unpublish the dataset.

Featured status

[Organization datasets only]. Whether the dataset should be featured on your organization's landing page.

DOI status

If you enable DOI issuance on an existing dataset, DOIs will be back-issued for all non-deleted versions of the dataset. These DOIs will become permanent after 7 days.

If you disable DOI issuance on an existing dataset, any draft DOIs (less than 7 days old, or on an unreleased version) will be deleted. Any future versions will not be issued a DOI.

Dataset deletion

You can permanently delete the dataset here. Once deleted, the dataset will no longer be discoverable, though it will still show up in users' workflows that reference the dataset, and bookmarked URLs and DOIs will still resolve to the dataset's landing page.

To ensure future reproducibility, dataset metadata and documentation are preserved upon deletion. However, all data will be fully expunged, and the dataset will no longer be queryable.

About dataset DOIs

When enabled, DOIs are issued for all released versions of a dataset. For datasets with more than one version, a "canonical" DOI for the dataset will be issued as well.

DOI lifecycle

When issued, DOIs remain in a draft state for 7 days, and will be deleted if the dataset's DOI configuration is disabled while still a draft. After 7 days, DOIs become permanent (this 7 day counter only begins once a version is released; unreleased versions always have a draft DOI).

If a dataset is not publicly visible, its DOI will be registered, but it won't be findable – meaning that web crawlers won't index the DOI, and it won't be part of search results on platforms like DataCite. In such a case, the dataset's metadata will be redacted as well (see below).

When a dataset or version is deleted, any draft DOIs will be deleted. All other DOIs will be moved to the registered (non-findable) state, and they will continue to resolve to the appropriate "tombstone" page for the dataset.

DOI metadata

DOIs are more than just persistent identifiers — they are accompanied by rich metadata that allows other tools to surface your datasets and link them to the broader body of academic work. Every field on the dataset page maps to corresponding entries in the DataCite metadata; you can view these metadata by clicking Metadata -> DataCite on the right of the "Overview" section of the dataset page. Redivis will automatically sync the DataCite metadata whenever your dataset is updated.
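Because the DataCite record is public, you can also retrieve it outside of Redivis. Below is a minimal sketch using the public DataCite REST API; the DOI value is a placeholder:

import requests

doi = "10.xxxx/example-doi"  # placeholder: substitute your dataset's DOI
response = requests.get(f"https://api.datacite.org/dois/{doi}")
response.raise_for_status()

# DataCite returns a JSON:API document; the record's metadata lives under data.attributes
attributes = response.json()["data"]["attributes"]
print(attributes.get("titles"), attributes.get("publicationYear"))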

A note on metadata and access:

Redivis will only ever publish metadata to DataCite that is publicly visible. This means that if your dataset is completely hidden from public view, no metadata will be propagated to DataCite (including the dataset's name). Instead, a placeholder mentioning that "This dataset has been hidden from public view" will be used in its place, and only the name of the dataset owner and the dataset's contact information (if present) will be published.

While this situation isn't ideal, it allows DOIs to still be used for disambiguation when discussing the dataset, and provides a potential pathway for future investigators to contact the data owner.

Note that metadata associated with your dataset will automatically be updated if you change the public access level of your dataset.

DOIs and dataset versions

Redivis creates a DOI for every version of a dataset, allowing for authoritative references and reproducibility of data used in researchers' work. Once a dataset has more than one version, Redivis will also issue a DOI for the dataset as a whole, whose metadata will reflect the latest version of the dataset. All version DOIs will point to this canonical DOI, and also link to each other (as previous / next versions) in the DataCite metadata. This structure improves discoverability and disambiguation when referencing versioned datasets.

Discontinue or prevent DOIs

If you've uploaded your DOI issuing credentials in the Settings tab, all new datasets will by default be issued a DOI upon publishing and on every version released. If you would like to disable issuing DOIs for a specific dataset, you can do so in the Settings tab of that dataset. Here you'll see a switch you can turn off to stop issuing DOIs. If the dataset is already published, the dataset and all published versions will already have DOIs, and these will not be removed. If you turn this switch back on at any point in the future, all versions will be back-issued DOIs.

Tables

Overview

A table is where tabular data is stored on Redivis. A table can be thought of as a grid made up of variables (columns) and rows (records). You'll see tables:

  • On the Tables tab of datasets

You can also learn more about the table and take further action

  • View this table's Bibliography

Table characteristics

  • Name: The table's name. If in a dataset, must be unique across all tables for that version of the dataset. If in a workflow, must be unique across all tables currently in the workflow.

  • Description: Optional. A free-form description of the table's contents. May not exceed 5000 characters.

  • Bibliography: This table's citation, and any recorded related identifiers.

  • Variable count: Total number of variables in the table.

  • Row count: Total number of rows, or records, in the table.

  • Size: Total size of the table, in bytes.

  • Entity: Optional. Documents the concept that one record in this table represents. For example, the table's entity might represent a unique patient, or a specific hospitalization, or a prescription.

  • Temporal range: Optional. The range of time that this table covers. This can either be set manually, or calculated from the min/max of a particular variable.

  • Sample: If this table is sampled, you will see a marker for whether you are looking at the full dataset or the 1% sample. To interact with sampled tables, add the dataset to a workflow.

Viewing variables

Viewing cells

Clicking the header or any cell in a column will display the variable's summary statistics. You can also right-click a cell to filter by a specific value, or sort on that variable.

For any Geography type variables, you can hover on the cell or click to see a preview.

This behavior allows for the performance and scaling characteristics of tabular data on Redivis, and is typical for relational databases and other tabular data stores.

Querying tables

Choose one of the templates or type your own SQL query, and click the Run button to see the results. You can select any column in the query results to view the summary statistics for that column.

If you update the query after execution, it will become "stale" (marked in yellow), which is a reminder that the results no longer match the query you are typing. The previous results will still be shown to help inform the new query you are writing.

Accessibility note for keyboard navigation: When you open the query tab, your cursor will be focused on the query editor. To navigate away using the keyboard, first press Escape, then press Tab to move focus to other non-editor elements.
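The same kind of quick, one-off query can also be run from the redivis-python client library. Below is a minimal sketch; the table reference and variable names are placeholders, so check the client library documentation for the exact reference syntax:

import redivis

# Run a one-off SQL query against a table you can access (placeholder table reference)
query = redivis.query("""
    SELECT species, COUNT(*) AS n
    FROM demo.iris_species.iris
    GROUP BY species
    ORDER BY n DESC
""")

# Materialize the results as a pandas dataframe
df = query.to_pandas_dataframe()
print(df)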

Variables

Overview

Characteristics

The characteristics on a variable help researchers understand what that variable measures and how it was collected. All characteristics (except for variable type) are indexed by the Redivis search engine — better, more accurate metadata will help researchers find your dataset!

Field

Notes

Name

The name of the variable, limited to 60 characters. Must be unique within the table. All variable names are case-insensitive, and can only use alpha-numeric characters and underscores, and cannot start with a number.

Type

Label

Optional. A short, human-readable description of the variable name. Limited to 256 characters.

Description

Optional. A longer space for notes about the variable's creation methods, coding, or supplementary information. Limited to 5000 characters.

Value labels

Univariate statistics

Statistic

Description

Count

Distinct

Non-null

Min, max

Min lng, Min lat, Max lng, Max lat

For geography variables: the bounding box containing all geospatial data in a variable

μ, σ

The mean and sample (unbiased) standard deviation of the variable. Only available for continuous variables (integer, float, date, dateTime, time).

Univariate plots

Histogram

A frequency chart of observations, sorted into 64 buckets, clamped to three standard deviations of the mean on either side. To switch between a linear and logarithmic y-axis, click on the bottom left corner of the chart.

Only shown for continuous variables with more than 64 distinct values.

Box plot

A visual display of the distribution of values by frequency. Shown are the minimum, 25%, median, 75% and maximum value. To include or exclude outliers in the calculations, click on the label on the bottom left corner of the chart.

Only shown for continuous variables with a meaningful number of discrete values.

Frequency table

A table showing the frequency of common values for that variable, limited to 10,000 values. If a variable's values are highly heterogeneous, no frequency table will be displayed. You can right-click on values to get a quick entry point to filter on that value in the query tab.

Map

For a variable of type geography, a map displays a heatmap of the geospatial dispersion of that variable.

Editing metadata

To edit a variable's label, description, or value labels, click the Edit metadata button on the right of any table. This will allow you to edit any of the variables in this table, or you can navigate to the "All tables" tab to edit the same variable across multiple tables.

In the metadata editor, you can edit a variable's metadata fields as you would in any spreadsheet, and save changes when you're done.

Exporting metadata

Variable types

All variables in Redivis have a type associated with them. All types in Redivis support NULL values; that is, an empty cell.

String

A string can be used to store any textual data (UTF-8 encoded). Moreover, a string is a "universal" data type — any other type can be converted to a string without error.

Size: 2 bytes + UTF-8 encoded string size (1 byte for ASCII characters).

Integer

A 64-bit signed integer. Supports any integer value between -9,223,372,036,854,775,808 and 9,223,372,036,854,775,807 .

Size: 8 bytes per cell.

Float

Size: 8 bytes per cell.

Boolean

A representation of either TRUE or FALSE

Size: 1 byte per cell.

Date

Represents a calendar date independent of time zone. Supports any value between 0001-01-01 and 9999-12-31

Size: 8 bytes per cell

Format:

'YYYY-[M]M-[D]D'

  • YYYY: Four digit year

  • [M]M: One or two digit month

  • [D]D: One or two digit day

DateTime

Represents a year, month, day, hour, minute, second, and subsecond, independent of timezone. Supports any value between 0001-01-01 00:00:00 and 9999-12-31 23:59:59.999999

Size: 8 bytes per cell

Format:

'YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.DDDDDD]]'
  • YYYY: Four digit year

  • [M]M: One or two digit month

  • [D]D: One or two digit day

  • ( |T): A space or a T separator

  • [H]H: One or two digit hour (valid values from 00 to 23)

  • [M]M: One or two digit minutes (valid values from 00 to 59)

  • [S]S: One or two digit seconds (valid values from 00 to 59)

  • [.DDDDDD]: Up to six fractional digits (i.e. up to microsecond precision)

Time

Represents a time, independent of a specific date. Supports values between 00:00:00 and 23:59:59.999999

Size: 8 bytes per cell

Format:

'[H]H:[M]M:[S]S[.DDDDDD]'
  • [H]H: One or two digit hour (valid values from 00 to 23)

  • [M]M: One or two digit minutes (valid values from 00 to 59)

  • [S]S: One or two digit seconds (valid values from 00 to 59)

  • [.DDDDDD]: Up to six fractional digits (i.e. up to microsecond precision)
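For example, the following literals conform to the canonical Date, DateTime, and Time formats above:

Date:      2021-03-07
DateTime:  2021-03-07 14:30:00.25
Time:      14:30:00.25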

Geography

Size: cell-content dependent

Format:

'(POINT|MULTIPOINT|LINESTRING|MULTILINESTRING|POLYGON|MULTIPOLYGON)(...)'

-- Multiple of above geometries may be wrapped with
'(GEOMETRYCOLLECTION)(...)'

Conversion rules

Some types can be implicitly converted within a query (e.g., 1 (integer) < 2.2 (float)); in other circumstances, you will need to explicitly convert the type of a variable before performing certain operations (e.g., "1.0" (string) < 2.2 (float)).

  • Integer → Float: Returns a close but potentially not exact FLOAT64 value.

  • Integer → Boolean: Returns FALSE if x is 0, TRUE otherwise.

  • Float → Integer: Returns the closest INT64 value. Halfway cases such as 1.5 or -0.5 round away from zero.

  • Float → String: Returns an approximate string representation.

  • Date → DateTime: Returns a DateTime at midnight on the corresponding date. For example, if x is 1970-01-01, returns 1970-01-01 00:00:00

  • DateTime → Date: Returns the part of the DateTime which is the calendar date. Note that this will not round the DateTime to the nearest Date. E.g., if x is 1970-01-01 23:59:59, returns 1970-01-01.

  • DateTime → Time: Returns the part of the DateTime which is the clock time. E.g., if x is 1970-01-01 23:59:59, returns 23:59:59.

  • Boolean → Integer: Returns 1 if x is TRUE, 0 otherwise.

  • Boolean → String: Returns "true" if x is TRUE, "false" otherwise.

  • String → Float: Returns x as a FLOAT64 value, interpreting it as having the same form as a valid FLOAT64 literal. Also supports casts from "inf", "+inf", "-inf", and "nan". Conversions are case-insensitive.

  • String → Boolean: Returns TRUE if x is "true" and FALSE if x is "false". All other values of x are invalid and throw an error instead of casting to BOOL. STRINGs are case-insensitive when converting to BOOL.

  • String → Geography: Use the new variable method.

  • Geography → String: Use the new variable method.
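As a concrete illustration, explicit conversions are written with CAST in a query. Here is a minimal sketch using the redivis-python client; the table and variable names are placeholders:

import redivis

# Explicitly convert a string variable to FLOAT64 and a dateTime variable to DATE
df = redivis.query("""
    SELECT
        CAST(price AS FLOAT64) AS price_float,
        CAST(admitted_at AS DATE) AS admission_date
    FROM demo.example_dataset.example_table
""").to_pandas_dataframe()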

Documentation

Overview

Datasets have metadata fields that you'll want to complete to make your dataset more discoverable and useful. The dataset abstract is required, but the rest of the documentation is optional and will be hidden from the public when empty. You can edit any of these sections by clicking on them in the dataset editor.

This metadata can be viewed on the dataset page and also downloaded in different formats. If your organization has configured DOIs this metadata will also populate the DOI listing.

Populating metadata

Abstract

The abstract is limited to 256 characters and will show up in previews and search results for the dataset. This should be a concise, high-level summary of this dataset.

Provenance

This section is intended to display information about where this dataset came from and how it came to be in its current form. Redivis will auto-populate fields where possible but you can add additional information or override it.

Dataset creator

This field should be the individual(s) or organization(s) responsible for creating the content of this dataset. This will be linked with the appropriate ORCID iD or ROR if the individual or organization has attached them to their Redivis account. You can also include individuals and organizations that don't have Redivis accounts and include their identifier. If you have multiple creators you can edit the order they are shown, which will also be reflected in the citation.

Contributor

This field is to attribute the work that different individuals have done to get the dataset into its current state. Redivis will automatically add anyone who edits the dataset to this field. If they have attached their ORCID iD to their Redivis account, that linkage will be shown as well. The contributor type options are aligned with DataCite standards.

DOI

Citation

This section shows the automatically generated citation for this dataset in your chosen format. This can be copied or downloaded for use elsewhere.

Changes made to the "Creators" field will be reflected in this citation. Any DOI issued for this dataset will automatically be included in this citation.

Citation instructions

If this dataset must be cited in a specific way, you can make that clear in this field. Citation instructions will be included beneath the citation wherever it is shown.

Related identifier

This section is for linking other identifiers (URLs or DOIs) that are relevant to this dataset. The related identifier type options are aligned with DataCite standards. You might want to include identifiers for DMPs, papers, and more.

Bibliography

You can launch a bibliography which displays the citation of this dataset and every related identifier listed above.

Methodology and Usage notes

These documentation sections are intended for more extensive information and can contain formatted text and images. These will be indexed for the Redivis dataset search.

Supporting files

Files of any type and up to 100MB can be uploaded to the dataset page where anyone with access can download them. These should not contain any data for this dataset, as access to them is managed separately.

Links

Links can be added with display names to direct someone to another URL with more information.

License

This is where you can add the license information about your dataset's redistribution policies. If this data is governed by a common redistribution license you can select it here from the menu of standard licenses. If you want to reference a license that isn't listed here you can include the link, or upload a custom license. This will be displayed on the dataset front page to let others know how they can use your data. This information will be included on the dataset's DOI.

Funding

If this dataset was funded by an institution you'd like to recognize, this is the section where you can include information about funder(s). You'll need the funding organization's name and ROR, as well as an award number if applicable. You can add multiple funders to each dataset. This information will be included on the dataset's DOI.

Contact

This section should be used to let someone viewing this dataset know how to get in touch if there is any issue or question.

Custom sections

You can create documentation sections with their own titles and assign them custom access levels.

To protect this information you can create a custom documentation section with a more restrictive access level. Users without the appropriate level of access will only see a placeholder for that section of the documentation.

Tags

In addition to documentation, you may add up to 25 tags to your dataset, which will help researchers discover and understand the dataset.

Other metadata

Usage

The information on the usage tab of the dataset page is automatically generated based on how researchers are working with the dataset on Redivis.

Statistics

Basic information about this dataset's usage

  • Views: How many times this dataset page was loaded. These are only counted for logged-in users, and a user is only recounted after 1hr of not viewing the dataset.

  • Workflows: How many workflows include this dataset.

Variable usage

The featured table shows all variables across all tables in this dataset sorted by their popularity. This popularity is calculated based on the number of times that variable is referenced when this table is used in workflows. This takes into account joining on a variable, propagating it downstream in transforms, and querying it directly.

Programmatic uploads

Overview

Consult the complete client library documentation for more details and additional examples.
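As a rough sketch, a scripted ingest with the redivis-python client might look like the following. The dataset, table, and file names are placeholders, and the specific method names and arguments (create_next_version, upload(...).create(...), release) are assumptions based on the client's general pattern; verify them against the client library documentation:

import redivis

# Placeholder names throughout; method names are assumptions to verify against the docs
dataset = redivis.user("my_username").dataset("My dataset")
dataset = dataset.create_next_version(ignore_if_exists=True)

table = dataset.table("observations")
with open("observations_2024.csv", "rb") as f:
    # Upload a delimited file into the table (assumed signature)
    table.upload("observations_2024.csv").create(f, type="delimited")

# Release the new version once all uploads finish
dataset.release()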

Geospatial data

Overview

Supported file types

Geography data in text-delimited files

Quotas & limits

Tabular data

Overview

Supported tabular file types

Redivis will try to detect the file type based on the file extension, though you can manually specify the type as needed.

Uploading compressed (gzipped) files:

Generally, you should upload uncompressed data files to Redivis, as uncompressed files can be read in parallel and thus upload substantially faster.

If you prefer to store your source data in a compressed format, Avro, Parquet, and ORC are the preferred data formats, as these support parallelized compressed data ingestion at the row level.

Redivis will decompress text-delimited files, though the data ingest process may be substantially slower. If your file is compressed, it must have the .gz file extension if you're uploading locally (e.g., my_data.csv.gz) or have its header set to Content-Encoding: gzip if served from a URL or cloud storage location.

Quotas & limits

Most upload types are limited to 5TB per upload. Stata, SPSS, XLS, and Shapefile ZIP directories are limited to 100GB.

Working with delimited files

A text-delimited file is a file that uses a specific character (the delimiter) to separate columns, with newlines separating rows.

Delimited file requirements

  • Must be UTF-8 encoded (ASCII is a valid subset of UTF-8)

  • Quote characters in cells must be properly escaped. For example, if a cell contains the content: Jane said, "Why hasn't this been figured out by now?" it must be encoded as: "Jane said, ""Why hasn't this been figured out by now?"""

  • The quote character must be used to escape the quote character. For example, the sequence \" is not valid for an escaped quote; it must be ""

  • Empty strings will be converted to null values
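If you are generating delimited files yourself, one way to guarantee correct quoting is to let a CSV writer handle the escaping. Below is a minimal sketch using Python's standard csv module, which doubles embedded quote characters by default:

import csv

rows = [
    ["id", "quote"],
    [1, 'Jane said, "Why hasn\'t this been figured out by now?"'],
]

# csv.writer quotes fields as needed and escapes embedded " characters as ""
with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(rows)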

Advanced options for delimited files

Has header row

Specifies whether the first row is a header containing the variable names. This will cause data to be read beginning on the 2nd row. If you don't provide a header in your file, variables will be automatically created as var1, var2, var3, etc...

Skip invalid records

By default, an upload will fail if an invalid record is encountered. This includes a record that has a mismatched number of columns, or is otherwise not parsable. If this box is checked, the number of skipped records will be displayed on each upload once it has been imported.

Allow jagged rows

Whether to allow rows that contain fewer or more columns than the first row of your file. It is recommended to leave this option unchecked, as jagged rows are generally a sign of a parsing error that should be remedied by changing other options or fixing the file.

Cells contain line breaks

Whether newlines exist within specific data cells (e.g., paragraphs of text). If set to Auto, Redivis will determine the value based on analysis of the beginning of the file.

It is best to only set this value to "Yes" if you know your data contain line breaks, as this will slow down the import and may cause incorrect error reporting.

Delimiter

The delimiter will be auto-inferred based upon an analysis of the file being uploaded. In rare cases, this inference may fail; you can specify the delimiter to override this inference.

Quote character

Specify the character used to escape delimiters. Generally " , though some files may not have a quote character (in which case, they must not include the delimiter within any cells).

Escape character

Cells containing a quote character must have that character escaped.

Typically, the escape sequence character is the same as the quote character, but some files may use a different value, such as a backslash (\).

Variable names and types

Naming variables

Variable names are automatically inferred from the source data. They can only contain alphanumeric or underscore characters, and must start with a letter or underscore. Any invalid characters will be replaced with an underscore (_).

Variable names must be unique within the table. If the same variable is found more than once in any given file, it will automatically have a counter appended to it (e.g., "variable_2").

The max number of characters for a variable name is 60. Any names with more characters will be truncated.

Variable type inference

Please note the following rules:

  • If all values of a variable are null, its type will be string

  • Numeric values with leading zeros will be stored as string in order to preserve the leading zeros (e.g., 000583 )

  • Data stored with decimal values will be stored as a float , even if that value is a valid integer (e.g., 1.0 ).

  • Temporal data types should be formatted using the canonical types below. Redivis will attempt to parse other common date(time) formats, though this will only be successful when the format is unambiguous and internally consistent.

    • Date: YYYY-[M]M-[D]D

    • DateTime: YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.DDDDDD]]

    • Time: [H]H:[M]M:[S]S[.DDDDDD]

Working with multiple uploads

You can create up to 500 uploads per table, per version. Files will automatically be appended to each other based on their variable names (case insensitive), with the goal of creating one continuous table with a consistent schema.

Missing variables

If a variable is missing in some of the files you uploaded, the values for the missing variable will be set to null for all rows in the upload.

Conflicting variable types

Error handling

A file may fail to import for several reasons; in each case, Redivis endeavors to provide a clear error message to help you fix the error.

In order to view full error information, including a snapshot of where the error occurred in your source file (when applicable), click on the failed upload in the upload manager

Network issues

When transferring a file from your computer (or more rarely, from other import sources), there may be an interruption to the internet connection that prevents the file from being fully uploaded. In these cases, you should simply try uploading the file again.

Invalid or corrupted source data

Common import errors

The Redivis data import tool has been built to gracefully handle a wide range of data formats and encodings. However, errors can still occur if the source data is "invalid"; some common problems (and their solutions) are outlined below.

If you're still unable to resolve the issue, please don't hesitate to reach out to support@redivis.com; we'd be happy to assist!

Bad CSV dump from SQL database

Some SQL databases and tutorials will generate invalid CSV escape sequences by default. Specifically:

The "proper" escape sequence is a doubling of the quote character. For MySQL, this would look like

If you only have access to the invalid file generated by a previous database dump, you can specify a custom Quote Character of \ in the advanced import options, and Redivis will reformat the file as part of the ingest process (Redivis will also auto-detect this custom escape sequence in many scenarios). Using a custom escape sequence may cause data import processing to take a bit longer.

Line breaks within cells

If your data has paragraphs of text within a particular data cell, and the "Has quoted newlines" advanced option isn't set, the data import may fail (see above screenshot for example). Redivis will automatically set this option to true if it identifies a quoted newline in the top ~1000 records of the file, but if quoted newlines don't occur until later, you'll need to make sure to set this option manually for the import to succeed.

Connectivity and timeout errors

While rare, it is always possible that data transfers will be interrupted by the vagaries of networking. If this happens, we recommend simply retrying your upload. If the problem persists, please reach out to support@redivis.com.

Unstructured data

Overview

Uploading files

Quotas & limits

Folder management

You can create folders on the grey bar on the right side of the Files tab in a dataset. Folder names must be unique. You can click on the ⋮ menu next to any folder name to manage that folder.

To move files between folders you can right click on an individual file to change its folder. If you have multiple folders you can click the Move files button below files in the right bar to move files between folders.

Metadata

Overview

  • SAS (.sas7bdat): labels

  • Stata (.dta): labels and value labels

  • SPSS (.sav): labels and value labels

For other file types (e.g., csv), you will need to augment the metadata directly. To apply metadata in bulk, you can upload a file containing metadata information directly from your computer. This file can either be a CSV or JSON.

Is your metadata stuck in a PDF? We're truly sorry — if you can, please let the data provider know that it is essential that they provide metadata in a machine-readable format; hopefully in time this will change.

CSV metadata format

The CSV should be formatted without a header, with each row corresponding to a variable, with column 1 as the name, 2 as the label, 3 as the description. If the variable doesn't have a label or description, leave these columns empty.

For example:

JSON metadata format

When uploading a JSON file, specify the name, label, description, and valueLabels using the appropriately named attributes in the object corresponding to each variable. If the variable doesn't have a label, description, or value labels, you don't need to include these attributes.

For example:

To upload value labels in bulk, you must use the JSON format. We no longer support bulk upload of value labels via CSV.

Data sources

When you enable the data source, you'll be prompted to log into the corresponding account. Redivis will only ever read data from these sources when explicitly requested, and it will never modify or overwrite content.

Once configured, you'll see any added data sources appear as an option when uploading data.

Available data sources

Google Cloud Storage

You may import any object that you have read access to in GCS by specifying a bucket name and path to that object, in the form /my-bucket/path/to/file. You may import multiple objects at once by providing a prefix followed by wildcard characters, e.g.: /my-bucket/my-folder/* .

The following wildcard characters are supported:

  • * : Match any number of characters within the current directory level. For example, /my-bucket/my-folder/d* matches my-folder/data.csv , but not my-folder/data/text.csv

  • ** : Match any number of characters across directory boundaries. For example, my-folder/d** will match both examples provided above

  • ? : Match a single character. For example, /my-bucket/da??.csv matches /my-bucket/data.csv

  • [chars] : Match any of the specified characters once. For example, /my-bucket/[aeiou].csv matches any of the vowel characters followed by .csv

  • [char range] : Match any of the range of characters once. For example, /my-bucket/[0-9].csv matches any number followed by .csv

Amazon S3

You may import any object that you have read access to in S3 by specifying a bucket name and path to that object, in the form /my-bucket/path/to/file. You may import multiple objects at once by providing a prefix followed by a wildcard character, following the same syntax and rules as outlined for Google Cloud Storage above.

Google Drive

You may import any file of valid format that you have stored within your Drive, including Google Sheets. Upon choosing Google Drive as your import source, a modal will open that will allow you to browse and select files from your Google Drive.

Google BigQuery

You may import any table that you have read access to in BigQuery, including views, materialized views, and external tables. You must specify the table in the form project.dataset.table . To import multiple tables within a dataset, you may use wildcards. E.g., project.dataset.* or project.dataset.prefix* .

Box

You may import any file of valid format that you have stored within Box. Upon choosing Box as your import source, a modal will open that will allow you to browse and select files from your Box account.

OneDrive

Coming soon. Please contact support@redivis.com if this integration would be helpful for your use case so that we can prioritize.

Redivis

You must have both data access to, and the ability to export, any table or file that you import.

Importing tables

Importing files

granted access
assigned a DOI

Files are data entities uploaded to datasets on Redivis. Files are used to store non-tabular (a.k.a. unstructured) data of any file type. You can view a dataset's files by clicking on the Files tab of any dataset or dataset node in a workflow.

If you have a file type that you think should be supported that isn't, please let us know!

You can view example files on the Redivis Demo organization.

By representing files within an index table, we can query and subset the files within a workflow, while joining them with other tabular metadata. In this way, file index tables behave like any other tabular metadata on Redivis.

To do deeper analysis of files, we can load these files in a notebook. Consult the documentation for your preferred programming language to learn more:

Getting data into Redivis is the first and most important step in making it accessible for your research community. Redivis is designed to make it easy for you to securely ingest data, at scale. To begin the data upload process, first navigate to the dataset editor.

For a guided walkthrough of uploading data to a dataset, please see the Creating a dataset guide.

For a guided walkthrough of how to clean tabular data, please see the Cleaning tabular data guide.

All datasets on Redivis are owned by either an individual user or an organization.

If you accidentally create a dataset in the wrong context, you'll need to copy over its content to the correct context.

To create a dataset owned by you, navigate to your workspace, and under Datasets, choose to create a new dataset.

To create a dataset owned by an organization that you administer, navigate to the organization's administrator panel, and under Datasets, choose to create a new dataset.

For step-by-step guides for creating and editing datasets, see the accompanying Create & manage datasets guide.

When first creating a dataset, you will be able to specify certain access rules for the dataset. However, these access rules will only take effect once the dataset is published, and they can be modified before the dataset's publication.

Dataset editors will be able to find this dataset in their workspace on the Datasets tab.

[Organization datasets only]. If your organization has configured a DOI provider, you can enable DOI issuance for your organization's datasets. Based on your organization's settings, this will either be default-enabled or default-disabled for new datasets.

The dataset's public visibility will be persisted in its deleted state – meaning that if the dataset was previously visible, it will still be visible (but less discoverable) once deleted. Additionally, any users who explicitly had access to the dataset prior to deletion will have their access persisted upon deletion.

On table nodes in a workflow

Every table has a table viewer where you can explore the data, given you have appropriate access. There are three tabs where you can view the data and interact with it in different ways:

Variables view (requires metadata access)

Cells view (requires data access)

Query view (requires data access)

View this table's (if applicable)

this table (requires )

If calculated from a variable, that variable must have type date, dateTime, or integer. If the variable is an integer, its values will be assumed to represent a year and must be in the range [0, 9999].

The Variables tab provides a searchable list of the table's variables. In order to view this tab, you will need metadata access to the corresponding table.

Clicking any row will display the variable's statistics panel, which shows some common summary statistics calculated from the variable's data.

The Cells tab allows you to quickly preview and explore the table's data as a grid, regardless of the table's size. In order to view this tab, you will need data access to the corresponding table.

The order of records within a table is arbitrary and non-deterministic, unless the table is the result of a transform or query that specifies an order clause.

If the order of records is considered to be part of the "data" in your table, you should create an additional "record number" variable that can be sorted on. Alternatively, you can upload your source files as unstructured data, in which case an exact copy of the original file will be stored.

Workflow transforms provide the most robust and reproducible mechanism for querying tables on Redivis. However, you can execute SQL queries directly on a table when you want to perform a quick, one-off analysis. In order to perform a query, you will need data access to the corresponding table.

The Query tab on a table contains a SQL editor with scaffolding to reference that table. The query syntax follows the same rules as queries in transforms. Additionally, query results may not exceed 10GB – if you need to materialize larger results, use a transform for your query.

Variables are present in all tables on Redivis. Conceptually, they represent something that is being measured in the data. When viewed in the table cells view, variables are represented by the table's columns.

In order to view variables, you must have metadata access to the source dataset(s) for the given table.

These characteristics will only be considered in search results if the user has access to the underlying dataset.

Required. The data type of the variable. Learn more about variable types below.

Optional. A map of each value in the data to a longer string of information. For example, a variable with records 0, 1, and 2 might have value labels for 0 = No, 1 = Yes, 2 = Don't know. Value labels will be shown inline on the cells view of a table, alongside the frequency table values in the univariate statistics.

Redivis automatically computes certain statistics for each variable, depending on that variable's type and number of distinct values. To view summary statistics, click on a variable in a table.

The total number of values for this variable in the table. Does not include any null values.

The number of unique values in the table. Does not include null as a value. (For example, a variable with values 0, 1, and null will have a distinct count of 2.)

The percentage of values in the table which are not null. Calculated by dividing the Count by the total number of records in the table.

For continuous variables (integer, float, date, dateTime, time): the minimum and maximum of this variable.

For string variables: the minimum and maximum length of all values.

The box plot uses the BigQuery APPROX_QUANTILES method, which will estimate quantiles (±0.57%, 95% CI) for larger datasets. As such, the box plot values should only be used as an approximation.

When working in bulk, you can also import metadata from a file.

You can export your metadata at any time (in either of the above formats) by clicking Export file in the variable metadata editor. Variable metadata is also available in JSON format via the variables.list API endpoint.

A 64-bit (double precision) floating point value.

All values in brackets are optional. If your Date is not in this format, you can convert it using new variable methods. Redivis will attempt to convert other common date formats, when possible.

All values in brackets are optional. If your DateTime is not in this format, you can convert it using new variable methods. Redivis will attempt to convert other common dateTime formats, when possible.

All values in brackets are optional. If your Time is not in this format, you can convert it using new variable methods. Redivis will attempt to convert other common time formats, when possible.

Represents a Point, Multipoint, Linestring, MultiLinestring, Polygon, or MultiPolygon geometry, as specified by the GeoJSON standard.

You can create or manipulate geographic variables using various new variable methods.

The Well-Known Text (WKT) representation is used to show cell values, though each geography cell can also be viewed on a map by clicking on that cell within a table.

Variable types can be converted in a transform within a workflow. The following conversions are supported:

If your dataset belongs to an organization, you can issue a DOI (Digital Object Identifier) by configuring your organization to issue DOIs. Any DOI issued for this dataset will remain in a draft status for seven days to allow for version unrelease. After seven days the DOI will become permanent.

Do you think a common license is missing? Contact us to let us know what you'd like to see here.

By default, all dataset documentation is visible to anyone with access to the dataset. However, there may be some content in the documentation that is sensitive — for example, information about named variables that would require metadata access.

Additionally, information about the dataset's size and temporal range will be automatically computed from the metadata on its tables. Additional table documentation, as well as the variable metadata, will be indexed and surfaced as part of the dataset discovery process.

Featured workflows: How many workflows featured by organizations include this dataset.

In addition to uploading data through the browser interface, you can leverage the redivis-python client library, as well as the generic REST API, to automate data ingest and data release pipelines. These libraries can be used for individual file uploads similar to the interface, as well as for streaming data ingest pipelines.

Geospatial data on Redivis behaves quite similarly to tabular data: each feature is ingested as a single row within a table, alongside any metadata for that feature. This approach mirrors tools like PostGIS, R spatial features, and geopandas, allowing you to query and join your geospatial data, at scale.

Redivis supports importing geospatial data from several common GIS formats: geojson, shp, shp.zip, kml, and parquet. Parquet files with geospatial metadata (often referred to as GeoParquet) are the most performant and robust option, though as a newer standard, these files are less common. For other geospatial file types, Redivis first converts the file to a geojson representation (using the relevant ogr2ogr driver), and then imports the geojson into a table.

Each feature will be imported as one row, with the geometry column containing the WKT representation for that feature. Additional feature properties will be mapped to variables in your table, with any nested properties flattened using the . separator. Note that Redivis only supports 2-dimensional, unprojected (WGS84) geometries. Other projections might cause the import to fail, and any extra dimensions will be stripped during ingest. If you are uploading a .shp.zip that contains projection information, the geometries will automatically be reprojected as part of the import process.

Type
Description
Notes

In addition to uploading geospatial data using one of the formats listed above, you can also import geographic data encoded within a text-delimited file (e.g., a csv). In this case, the geographic data should be encoded as strings using the Well-Known Text (WKT) representation. This is also the same format used when exporting geography variables as a CSV. WKT in CSVs will be auto-detected during data ingest.
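For example, a text-delimited file with a WKT-encoded geography column might look like this (the quotes are required because the WKT values contain commas):

site_id,site_name,geometry
1,Site A,"POINT(-122.084 37.422)"
2,Site B,"POLYGON((-122.1 37.4, -122.0 37.4, -122.0 37.5, -122.1 37.5, -122.1 37.4))"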

Limits for upload file size, max variables, and other parameters are specified here.

In order to create new tables, or update existing tables with new data, you will need to ingest a valid tabular data file from a supported data source. Redivis supports numerous tabular data formats, with robust error handling and type inference. You can also perform scripted imports via the programmatic interface.

Tables on Redivis can be made up of one or more tabular file uploads. The multi-upload use case applies when data with the same general schema is broken up across files. For example, a dataset where there is a separate file for each year. In general, it's best to combine such files into a single table, as it is easier for researchers to query a single "tall" table than multiple tables (and Redivis's high-performance query engine keeps these queries quick, even at terabyte scale).

Type
Description
Notes

Full documentation on the limits for upload file size, max variables, and other parameters is specified here.

All values of a variable must be compatible with its type. Redivis will automatically choose the most specific, valid type for a variable, with string being the default type.

If files have conflicting types across a given variable, the lowest-denominator type for that variable is chosen when the files are combined.

Data invalidity is most common when uploading text-delimited files, though it can happen with any file format. While some data invalidity errors may require further investigation outside of Redivis, others may be due to incorrect options provided in the file upload process. When possible, Redivis will display ~1000 characters that are near the error in the source file, allowing you to identify the potential source of failure.

In order to upload unstructured data to a dataset, you'll need to upload it as a file on the Files tab in the dataset uploader.

It is possible to upload tabular data as a file, but it will not be represented as a table. This means it is not possible to preview the variable statistics, cells, or query interfaces, use it in a transform node, or control access on different levels, and this is generally not recommended.

From the Files tab of the dataset editor, click the "Upload files" button to import one or many files. You can import files from a variety of data sources, or perform uploads programmatically.

Limits for upload file size and max files per dataset are specified here.

When preparing your upload you will need to select a destination folder. All files must be in one folder. These folders will have a corresponding file index table which will allow researchers to more easily work with the files in bulk. If you are uploading many files, your folder structure will likely be important – for example, a different folder for different categories of imaging files.

Redivis determines variable names and types during the data upload process. Additionally, it will automatically parse certain metadata based on the uploaded file format:

While you can just upload the PDF to the dataset's documentation, you'll be doing your researchers a huge service if you can add structured metadata to the variables. That might mean some manual copying and pasting from the PDF, or you could consider the various (and imperfect) online PDF to CSV conversion tools, or this python library.

If you don't have the bandwidth, consider asking your researchers to contribute by making them a dataset editor.

By default, you may upload data from your local computer, a public URL, or from another dataset or workflow on Redivis. However, Redivis supports numerous integrations for data ingest across common sources. You'll need to enable data sources in your workspace settings in order to import the data they contain.

You can import any table, which can be particularly helpful with ETL workloads where you want to import a cleaned version of your data. You can also import any uploaded files into your table, supporting workflows where tabular data is initially loaded as a file before being loaded into a table.

You can reference any table on Redivis using the form user|organization.dataset|workflow.table. That is, specify its owner (a user or organization), its containing entity (a dataset or workflow), as well as the table name, separated by periods.

All files on Redivis belong to a file index table. To import a file, first specify the index table, followed by a forward slash (/), and then the file name. E.g.:

To import multiple files at once, you can use wildcard characters, following the same pattern rules as specified for Google Cloud Storage above. E.g.:

Incorrect encoding:
val1,val2,"string with \"quotes\" inside"

Correct encoding:
val1,val2,"string with ""quotes"" inside"
SELECT .... 
INTO OUTFILE '/.../out.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"', ESCAPED BY '"'
LINES TERMINATED BY '\n';
variable1_name,variable1_label,variable1_description
variable2_name,variable2_label,variable2_description
sex,patient sex,patient's recorded sex
id,patient identifier,unique patient identifier
// JSON format is an array of objects, with each object representing a variable
[
    {
        "name": "sex",
        "label": "patient sex",
        "description": "patient's recorded sex",
        "valueLabels": [
            {
                "value": 1,
                "label": "Male"
            },
            {
                "value": 2,
                "label": "Female"
            }
        ]
    }
]

.parquet

GeoParquet

.geojson

GeoJSON

.geojsonl, .ndgeojson, .geojsons

Same as the .geojson specification outlined above, except each feature is given its own line. Importing .geojsonl (as opposed to .geojson) will be significantly faster.

.kml

Keyhole Markup Language

.shp

Shapefile

.shp.zip

Zipped ESRI shapefile directory

.csv, .tsv, .psv, .dsv, .txt, .tab, *

Delimited

Redivis will auto-infer the delimiter, or you can specify it manually. This is also the default format for files with missing file extensions.

.jsonl, .ndjson

A newline-delimited list of JSON objects, with one object on each line. Each object's keys correspond to the variable names in the table (see the short example following this format list).

.json

JSON

Must be a JSON array of objects, where each top-level object represents one row in the table. The keys of each top-level object correspond to the variable names in the table.

Importing newline-delimited JSON (see above) will be significantly faster and is recommended for larger files. If your file is formatted as GeoJSON, but has the ending .json, make sure to explicitly choose "GeoJSON" as the file format.

.avro

Avro format

Compressed data blocks using the DEFLATE and Snappy codecs are supported.

Nested and repeated fields are not supported.

.parquet

Parquet format

Nested and repeated fields are not supported.

.orc

Orc format

Nested and repeated fields are not supported.

.sas7bdat

SAS data file

Default formats will be interpreted as the corresponding variable type, and variable labels will automatically be imported.

User defined formats (.sas7bcat) are not supported.

.dta

Stata data file

Variable labels and value labels will automatically be imported.

.sav

SPSS data file

Variable labels and value labels will automatically be imported.

.xls, .xlsx

Excel file

Only the first sheet will be ingested. The legacy .xls format will have all dates and times represented as dateTimes. Due to the variability of Excel files, and inconsistencies in how Excel internally represents dates, this format is typically not recommended if other options are available.

Google sheets

Sheets file stored in Google Drive

Only the first tab of data will be ingested.
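
As an illustration of the newline-delimited JSON format described above, a small .jsonl upload for a table with two variables, id and score (hypothetical names), would contain one JSON object per line:

{"id": 1, "score": 83}
{"id": 2, "score": 35}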

Programmatic

Overview

You can interface with Redivis data programmatically using the Redivis R, Python, and Javascript client libraries, which allow external environments using these programming languages to connect to Redivis, facilitating integration with a wide variety of scientific libraries and paradigms.

These libraries leverage the underlying REST API, which can also be used if you want to connect to Redivis from another programming language.

Python

The Programmatic tab in the table export modal contains a code snippet to help you get started in working with Redivis data via Python. E.g.:

import redivis

dataset = redivis.organization("Demo").dataset("iris_species")
table = dataset.table("Iris")

table.to_pandas_dataframe()
# 	Id	SepalLengthCm	SepalWidthCm	PetalLengthCm	PetalWidthCm	Species
# 0	33	5.2	        4.1	        1.5	        0.1	        Iris-setosa
# ...

# Other methods to read tabular data:
# table.to_arrow_batch_iterator()
# table.to_arrow_dataset()
# table.to_arrow_table()
# table.to_geopandas_dataframe()
# table.to_dask_dataframe()
# table.to_polars_lazyframe()

# Reading unstructured data from file index tables:
# table.list_files()
# table.download_files()

View the full redivis-python documentation here:

R

The Programmatic tab in the table export modal contains a code snippet to help you get started in working with Redivis data via R. E.g.:

library(redivis)

dataset <- redivis$organization("Demo")$dataset("iris_species")
table <- dataset$table("Iris")

table$to_tibble()
# 	Id	SepalLengthCm	SepalWidthCm	PetalLengthCm	PetalWidthCm	Species
# 0	33	5.2	        4.1	        1.5	        0.1	        Iris-setosa
# ...

# Other methods to read tabular data:
# table$to_arrow_batch_reader()
# table$to_arrow_dataset()
# table$to_arrow_table()
# table$to_data_frame()
# table$to_data_table()
# table$to_sf_tibble()

# Reading unstructured data from file index tables:
# table$list_files()
# table$download_files()

View the full redivis-r documentation here:

REST API

The Redivis REST API provides an intuitive, resource-based approach to programmatically interfacing with Redivis. It opens up the ability to query, manipulate, and export data (and other Redivis resources) from within any programming language, supporting the workflows and tools that are familiar and best suited to your use case.

curl --header "Authorization: Bearer YOUR_ACCESS_TOKEN" \
    https://redivis.com/api/v1/tables/demo.ghcn_daily_weather_data.stations/rows?format=csv

View the full REST API documentation here:

Google Data Studio

Overview

Creating a new data source

1. Navigate to the Redivis Connector

2. Authorize your Google and Redivis accounts

Click the authorize buttons to 1) allow your Google account to interface with the Redivis connector; and 2) allow Data Studio to access your data in Redivis.

3. Specify your Redivis table

Finally, specify the table that you want to import. If the table belongs to a workflow, you should provide the name of the workflow and the username of the workflow owner. If the table belongs to a dataset, you should provide the name of the dataset as well as the username of the dataset owner.

Access and security

Data Studio allows you to share your report with other users, and even embed your report on external websites. If you are sharing restricted data, it is important that you set the "Data credentials" to "Viewer". This will enforce that every viewer of the report has access to the underlying Redivis table. If the credential is set to "Owner", the data in the report will be visible so long as you (the owner) have access to the data in Redivis.

Using Data Studio

Sampling

Overview

To configure sampling on your dataset, click the Configure sample button on the Tables tab of a dataset page.

Random sample

As a general rule, you should only use random samples if:

  • You have one table in your dataset, or

  • Researchers won't be joining multiple tables in your dataset together

If this isn't the case, consider sampling on a specific variable. Otherwise, as researchers join different tables together, they will start getting samples of a sample, since there is no consistent cohort of records between tables.

Sampling on a variable

For situations when you want researchers to be able to join tables within your dataset, consider generating a sample on a variable that exists in at least some of the tables in your dataset. Every value for this variable will have a 1% chance of being in the output set.

Importantly, this sampling is deterministic. This guarantees that the same values that fall in the 1% sample for one table will also occur in the 1% sample for another table in the same dataset. In fact, these sampled values will be consistent across Redivis, allowing researchers to even merge samples across datasets.

Note that the sample will be computed on the string representation of the variable. For example, if the value '1234' falls in the 1% sample, then we are guaranteed that the integer value 1234 will also fall within the sample. However, if this value is stored as a float (1234.0), it is unlikely to also fall in the sample, as the string representation of this float is '1234.0', which for the purposes of sampling is entirely different than the string '1234'.
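
Conceptually, you can think of deterministic sampling as hashing the string representation of the sampling variable and keeping only rows whose hash falls into a fixed 1% bucket, so the same values are selected in every table. A minimal sketch in BigQuery-flavored SQL (purely illustrative; the table and variable names are hypothetical, and this is not necessarily how Redivis computes its samples):

-- Keep rows whose hashed patient_id lands in the same 1% bucket in every table
SELECT *
FROM patients
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(patient_id AS STRING))), 100) = 0;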

When sampling on a variable, only tables with that variable will be sampled. This is useful for the case when some tables contain supplementary information to your primary cohort. For example, consider the case when your dataset has a "Patients" table, a "Hospitalizations" table, and a "Hospitals" table. We'd likely want to create a sample on the patient_id variable, which would create a 1% subset of patients and the corresponding hospitalizations for those patients. However, this wouldn't create a sample on the "Hospitals" table — which is what we want, given that the sample of patients is still distributed across a large number of hospitals.

If only some of the dataset's tables are sampled, users with sample access to the dataset will have access to the 1% sample of the sampled tables and full data access to the unsampled tables. While this is likely necessary for researchers to meaningfully work with the dataset sample (see paragraph above), it may have ramifications for how you configure your access rules.

Download

Overview

Through the table export modal, you can easily export full data in several common file formats, as well as the table metadata.

In order to download a table on Redivis, navigate to that table on the dataset page or within your workflow. Click the Export table button at right to open the table export modal.

Download data

Downloading unstructured data files

If this is an index table that contains unstructured data, you will also be able to download a ZIP of all the files represented within that table.

Export types

Redivis supports the following export formats:

  • CSV

  • JSON-lines

  • Avro

  • Parquet

  • dta (Stata)

  • sav (SPSS)

  • sas7bdat (via CSV + load script)

As a general rule, be cautious when using Excel to open .csv files. Never re-save a .csv that’s been opened in Excel, as the data may be modified.

Metadata and SQL downloads

If your table is in a workflow, you can also download a SQL file (via the More menu at the top right of the workflow) that contains the complete code used to generate your output. This code represents a full traversal of your workflow tree, and captures the exact statement necessary to generate the output table at that point in time.

Exporting data

Overview

Redivis is not a closed ecosystem: it is built on open APIs and standards, and designed to work and integrate with the wide variety of open source (and proprietary) analytical toolkits used by the research community.

Version control

Overview

Changes to documentation and metadata do not create a new version, though different versions do have independent documentation and metadata. For example, if a new version contains a new table, or new data, you will likely want to document this information separately from the previous version. However, if you only want to enrich the existing metadata on the current version, you can do so without creating a new version.

Semantic version tags

To help researchers better understand the differences across versions, Redivis uses semantic versioning of the form v[major].[minor]. The first version of every dataset is v1.0. For subsequent versions, the tag will increment automatically depending on the changes being released.

  • Major update: Existing code may not run.

    • Triggered when variables in the new version are renamed, deleted, or retyped. Also occurs if any tables from the previous version were deleted.

  • Minor update: Existing code will generally run.

    • Changes are limited to adding / removing records, recoding variables, adding variables, and adding tables.

Version history

On any dataset page, you can view the current version tag next to the dataset title, and click on this tag to view the full version history and switch to a different version of the dataset.

Creating a new version

When it's time to update a dataset's data, you'll want to create a new version. To do this, navigate to the dataset editor and click Create next version.

Before this version is released, it will be tagged as next. Only dataset editors will be able to see the next version on the dataset page and use it in their workflows.

A dataset can have up to 1,000 versions. If your use case exceeds this limit, consider creating a new dataset that imports the previous dataset's tables once this limit has been reached.

Version storage

This total size will be displayed in the dataset editor, alongside the size for the current version. For datasets with one version, this total size may be slightly larger than the current version, as Redivis stores certain metadata behind the scenes to support future versioning.

As new versions are created, Redivis will efficiently compute a row-level difference between the versions — only additions and updates to existing data will contribute to the dataset's total storage size, preventing data that is consistent across versions from being double-counted.
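
For example (with hypothetical numbers): if version 1.0 contains 10 million rows and version 2.0 adds 500,000 new rows while leaving the rest unchanged, total storage will reflect roughly 10.5 million rows rather than 20 million.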

Adding or removing columns (or changing column types) won’t affect row uniqueness, as the underlying storage represents all values as strings. Only the storage size of the new column would be added.

Version unrelease

If the most recent version of a dataset has been released in the last 7 days, and there is no Next version already created, you'll have the option to unrelease it.

This will revert the dataset to the exact state it was in before the version was released. If anyone who is not a dataset editor has this version in a workflow, they will lose access to the data, though they can revert to a previous version if one exists.

Version deletion

Any version of a dataset can be deleted as long as it is not the most recent released version. Deleting this version will permanently delete all metadata and data associated with it.

This version will no longer be available in any workflows, and researchers will lose access to any data in their workflows that references the deleted version. In order to continue working with the dataset, researchers will need to point their analyses at a non-deleted version.

Deleting a version is permanent and can't be undone.

If you are deleting versions to reduce storage costs, be aware that Redivis stores data efficiently across versions – the storage used by a particular record will be deleted only if it is unique to the deleted version (or, if deleting a series of versions, if that record doesn't exist in any non-deleted version).

Unreleased and published datasets

When a dataset is first created it will be marked as unreleased.

As you work on the initial version of the dataset, you can see the changes you make to the data in the version history modal. In this unreleased state you can add the dataset to a workflow and analyze the data it contains. Note that the only people who can see it in the workflow are dataset editors. If you make changes to the data or delete it, your workflow will change instantly to reflect those changes.

To add an unreleased or unpublished dataset to a workflow, click the View dataset tab from the edit interface, and click the Analyze in workflow button. Since this dataset is not published it will not appear in the Add dataset interface in workflows.

Once you are ready to make your data available to non-editors, you will need to publish it. This will simultaneously release your version as well as make your dataset discoverable. As soon as it is published it will be accessible to all who meet its access rules.

If this is an organization dataset, it can be unpublished at any time. This will "unlist" the dataset so that it will not be listed on your organization home page and even users who meet the access requirements will not be able to see it. Anywhere it is used in a workflow it will instantly be made unavailable.

Unpublishing might be used if you need to temporarily halt usage of the dataset but don't want to disrupt all of the access rules.

Working with previous versions

When analyzing data in a workflow you can change any dataset to a previous version to instantly change every table to the corresponding table in that version. To do this, click on the dataset and click the Version button in the toolbar. Select the version you'd like to analyze and confirm. You can always switch versions at any point.

Normally you can only have one copy of a dataset in a workflow, but it's possible to add a second version if you'd like to compare them. In any workflow, click the + Add data button and locate the dataset. Right click on the dataset and choose to add another version to the workflow.

Google BigQuery

Overview

To export data to BigQuery, click Export, and then fill out the subsequent modal.

You'll need to authenticate a Google account, and then specify the BigQuery project, BigQuery dataset, and BigQuery table.

Once you confirm the export, your data will begin transferring to the specified BigQuery location, and a live-updating status of the transfer will be displayed.

Workflows

Overview

Redivis workflows are high-performance, collaborative environments for analyzing data. In a workflow, you can combine datasets across Redivis, and build out your analytical workflow in a reproducible manner. Workflows can contain:

Google Cloud Storage

Overview

To export data to Cloud Storage, click Export, and then fill out the subsequent modal.

Once you confirm the export, your data will begin transferring to the specified Cloud Storage location, and a live-updating status of the transfer will be displayed.

Documentation

Overview

Workflow documentation is viewed and edited from the workflow overview. To access the overview, click the workflow name in the middle of the black menu bar or click on the white space in the background of the workflow tree.

Rename

Rename your workflow by clicking the header at the top of the workflow information view. Workflow names will be visible anywhere this workflow is referenced.

Abstract

You can add a short description of your workflow here that will be visible in workflow previews across Redivis, including if it is featured on an organization home page.

Provenance

Creators

This section automatically defaults to displaying the owner of the workflow. Workflow editors can add or remove anyone from this list. Anyone included here will be added to the citation generated for this workflow.

Contributors

This section automatically includes anyone who edited this workflow. Workflow editors can add or remove anyone from this list.

Citation

This section shows the automatically generated citation for this workflow in your chosen format. This can be copied or downloaded for use elsewhere.

Changes made to the "Creators" field will be reflected in this citation. Any DOI issued for this workflow will automatically be included in this citation.

Related identifiers

This section automatically includes any datasets or workflows referenced by this project, including data sources, study collaboration, or what this workflow was forked from. Workflow editors can add any related identifiers from outside of Redivis through links or DOIs, including DMPs, papers referenced, and more.

Bibliography

You can launch a bibliography which displays the citation of this workflow and every related identifier listed above.

Methodology

Document the details of your research aim or data cleaning strategy. You can also embed links or images.

Visibility / Access

You can give other users access to view or edit your workflow, or transfer ownership to another user in the Sharing section. You can also set this workflow's visibility and discoverability. Anyone viewing your workflow will need to have gained data access to any restricted datasets to view the relevant node contents.

Study

Tags

You can add up to 25 tags to your workflow, which will help researchers discover and understand it.

Usage

Issue a DOI

Any workflow editor can issue a DOI (Digital Object Identifier) for a workflow. A DOI is a persistent identifier that can be used to reference this workflow in citations. DOIs are issued through DataCite and do not require any configuration with your own or your organization's DataCite account.

Open the Provenance section and click Issue DOI. Once created, you will be able to see the DOI and view the record on DataCite.

Draft status

When DOIs are issued they enter a "Draft" status where the identifier is assigned but it has not been permanently created. All DOIs issued for workflows will remain in this draft status for seven days to allow for removal of the DOI.

Since DOIs are intended for public reference, they will not be issued for workflows that remain fully private.

Note that granting public access to a workflow does not grant access to any restricted data it contains. Any individual viewing the workflow will need to also gain data access to see workflow nodes that reference restricted data.

Data sources / Transforms / Notebooks

The workflow overview has additional tabs that list all the nodes in the workflow. You can click on any node in one of these lists to navigate directly to it.

These lists can be filtered by their relative location in the workflow, variables they contain, and additional metrics.

Workflow concepts

Overview

You can use workflows on Redivis to analyze any type of data, at any scale. Workflows allow you to organize your work into discrete steps, where you can easily validate your results and develop well-documented, reproducible analysis artifacts.

Creating a workflow

Workflow tree layout and operation

In a Redivis workflow, the steps you've taken are represented visually on the left side of the screen in a tree layout, where you can build nodes and easily see their relationships. Each node type can take different actions. The workflow tree automatically expands and lays out all the nodes in your workflow, keeping everything organized as it grows.

Node selection

Clicking on a node will allow you to see its information and/or execute changes. Your selected node will have a purple border in the tree and the right panel will update to show that node's contents. You can expand or collapse this right panel by dragging the center pane side to side.

Node connections

Most nodes in a workflow will be connected to others, and these connections are drawn as lines on the workflow tree. Thicker lines indicate a direct source connection, while thinner lines indicate a join or looser linkage.

When you click on a node to select it, the lines connecting upstream and downstream nodes will turn purple. For larger workflows this can make it easier to trace where the data in your selected node came from, or see what its eventual outputs are.

Nodes that have a direct source will have a large solid line connecting them.

Node types

Building a workflow

The main way to build your workflow is to add and edit nodes. You will start by adding data to your workflow, then add ways to cut, reshape, and analyze it.

Add data to a workflow

Add data editing nodes

To create a transform or notebook node, click on a table and select either the Transform or Notebook button that appears beneath it. If the table already has a downstream node you can press the plus icon beneath it instead.

Transforms vs. notebooks?

Transforms are better for:

  • Reshaping + combining tabular and geospatial data

  • Working with large tables, especially at the many GB to TB scale

  • Preference for a no-code interface, or preference for programming in SQL

  • Declarative, easily documented data operations

Notebooks are better for:

  • Interactive exploration of any data type, including unstructured data files

  • Working with smaller tables (though working with bigger data is possible)

  • Preference for Python, R, Stata, or SAS

  • Interactive visualizations and figure generation

Copy and paste nodes

You can right click on any transform or notebook node in the workflow tree to copy it. Right click on any table node to paste the copied transform or notebook. The table you pasted it onto will become this node's source table. You may need to update the pasted node to match its new source table before it will run properly.

Insert nodes

If you would like to place a node between other nodes you can click on a transform or notebook and select Insert transform.

If you have a transform copied to the clipboard you can insert it between other nodes by right clicking on a transform or notebook and selecting Paste copied transform above. This will insert both the transform and its output table into the branch of the workflow you've selected.

Split and combine transforms

All transforms can be split at the step level into two different transforms by clicking Split in any step's menu. Additionally, two transforms can be combined into one by right clicking on a table to Remove it.

You might want to split a transform above a tricky step to see what the output table would look like at that point in the process. This can be a key tool in troubleshooting any issues and understanding what might be going wrong.

After splitting a transform to check an output table, the next logical step might be to combine these two transforms back into one again. Or perhaps you have a string of transforms which you no longer need the output tables for and want to reduce the size of your workflow.

Delete nodes

Only nodes that you have intentionally created can be deleted or removed from the workflow. Output tables are created by running code and can't be removed.

To delete a node, right click on the node and select Delete.

When deleting a transform or notebook:

  • The transform or notebook and its output table will be deleted.

  • If the workflow tree has additional nodes downstream, the transform or notebook and its output table will be 'spliced' out, i.e. the upstream node nearest the deleted transform will be connected to the downstream node nearest the deleted output table. The next downstream transform will then receive new input variables from the node directly upstream, which will likely make it stale and possibly invalid, since variables it previously referenced may no longer exist.

When deleting a dataset:

  • The dataset and all downstream nodes will be deleted. If additional branches are joined into the branch downstream of the deleted dataset, those branches will be retained up to but not including the transform located in the deleted branch.

Since you can't undo a deletion, you'll receive a warning message before proceeding.

As you make changes in a workflow you will change the status of different nodes connected to it. These changes in status are shown in the left panel of the workflow to help you keep track of any changes.

Workflow-level actions

Run all

You can select the Map button on the left side of the workflow toolbar to begin a run of all stale nodes in the workflow. This will execute all transform and notebook nodes in a logical sequence to update the workflow completely.

Shift nodes

To shift a node, select the node and click the arrow that appears next to most nodes when selected. Shifting nodes is purely an organizational tool and it has no effect on the data produced in the workflow.

Navigate nodes

Along with clicking a node to select it, all nodes have a Source and Output navigation button on the right side of the workflow toolbar. You can click this button to jump directly to the immediate upstream or downstream node.

Fork the workflow

You have a couple options for how this workflow can be reused in other analyses. Click the Fork button in the toolbar to get started.

  • Add to another workflow

    • Select this option to choose a workflow you'd like to add this one to. This will be a linked copy that will instantly update as the original workflow is updated. This can be a very powerful tool in building out complex workflows that all reference the same source analysis or cohort.

  • Clone this workflow

    • This will create a duplicate copy of the workflow, which will have a link back to the original workflow on its overview page. It can be helpful to work from an exact duplicate of a workflow if you're viewing a colleague's workflow and would like to test a similar data manipulation, or if you'll rely on similar data manipulations in a new research effort.

Saving

A complete version history is available for all transforms and notebooks in your workflow, allowing you to access historic code and revert your work back to a previous point in time.

Workflow states

As you work in a workflow, node colors and symbols will change in the tree view to help you keep track of your work progress. More detailed information about each of these states can be found on each node page.

Stale and edited

These are the most important workflow-wide concepts to be aware of. Workflows are designed to be iterative and allow you to see outputs and go back to make changes and rerun for new results.

In order to keep track of any edits, transforms with unrun edits will be shown differently on the map than other transforms, with grey diagonal marks.

Any time data is changed, nodes that referenced that data will become stale (yellow background) until they are rerun on the new inputs.

Collaboration

Comments

Any node in a workflow can be annotated with a comment by any workflow editor. Comments are intended to be a space for conversation grounded in a specific area of a workflow. They can be replied to in threads by multiple collaborators and resolved when the conversation is complete.

Simultaneous editors

Multiple users with edit access can work on a workflow at the same time. When this is the case you will see their picture in the top menu bar alongside your own, and a colored dot on the workflow tree to the right of the node they currently have selected. When a notebook is started you will see any collaborators' code edits in real time.

Embedding tables

Overview

Embeddable link

Navigate to any table on Redivis and click the Export table button, then navigate to the Embed tab.

Here you can copy the embed URL and paste it into any site that supports embed links. You can also copy code to use for embedding this table on your own page.

At any point you can click the View on Redivis link in the bottom right corner to view the original table on its original Redivis dataset page.

You can also view this table fullscreen by clicking the fullscreen icon on the bottom right overlay.

Access


The GeoParquet specification is a modern standard for working with column-oriented geospatial data. If available, this format is the most robust and performant way to ingest geospatial features into Redivis.

Assumes an object with a "features" property, containing an array of valid GeoJSON features. Each feature will be imported as one row, with additional properties mapped to columns in the table. Nested properties will be flattened using the . separator. Note that Redivis only supports 2-dimensional, unprojected (WGS84) geometries. Other projections might cause the import to fail, and any extra dimensions will be stripped during ingest. See the documentation on working with geospatial data for more information.

Will be internally converted to .geojson (via ogr2ogr), and then imported as specified above.

Will be internally converted to .geojson (via ogr2ogr), and then imported as specified above. Note that the shapefile must use the WGS84 (aka EPSG:4326) projection. If you have additional files associated with your shapefile (e.g., .shx, .proj, .dbf), create a .zip of this folder and import according to the .shp.zip specification below.

Many shapefiles will be collocated with additional files containing metadata and projection information. These files are often essential to parsing the shapefile correctly, and should be uploaded together. To do so, create a .zip directory of the folder containing your .shp file and supplemental files. The zip file must end in .shp.zip. These will then be converted to .geojson (via ogr2ogr), and imported as specified for the .geojson format.

If projection information is available, the source geometries will be reprojected into WGS84. If no projection information is available, your data must be projected as WGS84, or the import will fail. Note that only one layer can be imported at a time. If you have a directory containing multiple shapefiles, create a separate .shp.zip for each layer.

See the documentation on working with text-delimited files.

JSON-lines (a.k.a. Newline-delimited JSON)

You can also use R and Python to interface with data directly on Redivis, using the integrated Redivis notebooks. Importantly, the R or Python code that you write in a notebook is the same as what you would write elsewhere, allowing you to easily port your analyses between Redivis and other systems.

Note that, if data exports are restricted, you might first need to request an export exception before exporting to an external environment.

The table export modal Integrations tab allows visualization and analysis of your Redivis tables in Google Data Studio. Data Studio provides comprehensive tooling for creating interactive visualizations and dashboards.

Using the Redivis Data Studio connector, data from your Redivis table is dynamically mirrored in Google Data Studio, allowing live updating of visual dashboards according to your latest data.

The sample dashboard was generated using the US Fires dataset hosted on Redivis.

To create a new data source, navigate to the Redivis connector. You can also search for "Redivis" when creating a new data source to select the Redivis connector.

The Redivis Data Studio connector is subject to the Redivis Privacy Policy and Terms of Use. If you don't yet have a Redivis account, you'll need to create one in order to use this connector.

Finally, dataset owners can assign export restrictions to their datasets. Certain tables may be bound by these export restrictions and cannot be imported into Data Studio (for example, if the data can only be exported to a specific compute environment, or if there are export size limits that the table exceeds). In this case, you can return to Redivis and request an export exception for the table if appropriate.

Data Studio provides a point-and-click interface that can allow you to develop elegant, interactive visualizations in minutes. You can leverage a comprehensive library of default visualizations, or tap into the growing body of community visualizations.

Some researchers may want to dive right in, though we recommend taking a look at the Data Studio help documentation for some helpful resources on getting started.

For datasets with large tables, it is often a good idea to include a 1% sample of the data, supporting faster exploratory queries as new researchers work to understand your data. Moreover, if a sample is configured, you will have the ability to grant access to the sample separately from the full dataset.

Sampling is applied independently to each version of a dataset. You may modify the sampling methodology on a version at any time — even after it's been released — though keep in mind that this may affect researchers that are currently working with the dataset sample. As a best practice, it's good to configure and validate your sample before releasing a new version.

The simplest form of sampling, this will create a corresponding sample for every table in the dataset (including file index tables). Every record / file will have a 1% chance of occurring in the sample.

If your dataset contains unstructured data files, you probably want to sample on either the file_name or file_id variables.

Learn more about controlling sample access in the data access reference.

The export modal gives you options to download the table and its metadata in common formats, as well as using integrations or other programmatic interfaces. You can also export the entire SQL that was generated to create the table (see the More menu in the top right of the workflow), facilitating reproducibility and reference in other systems.

If there are any restrictions around downloading the table or dataset, they will be listed here. If a restriction is preventing you from downloading the data, you will be able to request an exception, if allowed.

You can download any table on Redivis assuming that: 1) you have data access to the table; 2) there are no export restrictions on the dataset, or your export is not in conflict with these restrictions.

To download the table's data, select your preferred file type and click the Download button. Depending on the size of the table, and whether it was recently downloaded, it may take anywhere from a few seconds to several minutes to generate the file.

Exports are subject to the relevant usage quotas.

On every table, you can download the table's metadata as a structured JSON file, containing information about the table as well as all its variables with their names, types, labels, descriptions, and value labels (when present). The schema of this file is the same as that returned by the table.get API endpoint. Use the Download metadata tab of the table export modal, shown below.

Redivis supports data downloads in numerous formats, allowing you to quickly load a table into the software of your choice. Moreover, its integrations, Python and R client libraries, and REST API allow you to interact with and build upon Redivis data in the tools that are familiar and best suited to your use case.

Dataset owners have the ability to define export restrictions on their datasets, which constrain the allowed export destinations, and can limit the size of exports or enforce specific administrator approval before exports can happen.

These restrictions are enforced whenever you are exporting data through an external tool, and may limit how you can export some of the more restricted datasets on Redivis. In some cases, you may be able to request an exception to allow the export of an otherwise restricted table.

All datasets on Redivis are automatically versioned. Any change to a dataset's data in tables or files will require a new version. A version is a locked copy of the data, supporting future reproducibility and confidence in the persistence of researchers' data workflows.

If you are a dataset editor and have new data, find a mistake to correct, or would otherwise like to modify the existing content of a dataset, you can create and release a new version.

Within a workflow, you can change the version of a dataset by selecting the dataset node and clicking the Version button at top right. If there is a new version available, the dataset node will be highlighted to indicate that you might want to upgrade.

All versions of a dataset contribute to that dataset's total size, which in turn will count towards your usage quotas or organization billing (depending on whether the dataset is owned by you or an organization).

The table export modal Integrations tab allows export of Redivis tables to Google BigQuery.

Data source nodes, which contain data across Redivis that you want to work with and have access to. These are usually datasets but can also be other workflows.

Transform nodes, used to join, filter, aggregate, and reshape tabular data. Transforms are particularly important when working with large tables, allowing you to query billions of records in seconds.

Notebook nodes, used to leverage the variety of powerful tools available in Python, R, Stata, and SAS in a high performance compute environment.

Table nodes, created as outputs of your work, which can be used to validate your queries, build further analysis, create new Redivis datasets, or be exported to other environments.

Workflows can be shared with collaborators as a version controlled artifact of your research, or used to collaborate with others in real time.

New to Redivis? Learn more in our Analyze data in a workflow guide.

The table export modal Integrations tab allows export of Redivis tables to Google Cloud Storage.

You'll need to authenticate with a Google account, and then specify the Google Cloud Storage Bucket, Table name, and an optional prefix. In addition, you can choose one of several file types.

View and update information about this workflow's creators and contributors, citation, and any related identifiers detected by Redivis automatically or added by a workflow editor. This is also where you can issue a DOI for this workflow.

You can add your workflow to a study in order to facilitate collaboration with others. For certain restricted datasets, your workflow will need to be part of an approved study in order to run queries.

You can see the date of last workflow activity, and how often the workflow was forked or viewed (if it is a public workflow).

You can start referencing the DOI immediately while it is still in draft status, since the final DOI will not change once it becomes permanent. After the seven day draft period the DOI will automatically become permanent if your workflow is set to be publicly visible.

To create a workflow, navigate to a dataset that you're interested in working with and press the Analyze data in a workflow button. If you do not have access to this dataset you may need to Apply for access first.

You can also create a workflow from the "Workflows" tab of your workspace. Once you've created your workflow, you'll be able to add any dataset that you have access to.

Data source nodes represent datasets or workflows that have been added to your workflow, each representing a copy of a data source elsewhere on Redivis.

Table nodes are either tables within a dataset, or the resulting output table of an upstream computation.

Transform nodes are queries that are used to reshape data by creating new tables.

Notebook nodes are code blocks (and their outputs) which are used to analyze data.

You can click the Add data button in the top left corner of the workflow toolbar to select a dataset or another workflow you want to add to this workflow. This will add a copy of the selected data source to the workflow and allow you to reference its tables.

Each data source can only be added to a workflow one time. You can add multiple data sources in bulk from this modal. By default all datasets are added at the current version, but you can right click on the dataset in this modal to select a different version to add.

All data cutting, reshaping, and analysis on Redivis happens in either a transform or a notebook node. These nodes must be attached to a source table, so they can only be created after you've added a dataset.

There are two mechanisms for working with data in workflows: transforms and notebooks. Understanding when to use each tool is key to taking full advantage of the capabilities of Redivis, particularly when working with big datasets.

Workflows are continuously saved as you work, and any analyses will continue to run in the background if you close your browser window. You can always navigate back to this workflow later from the "Workflows" tab of your workspace.

Workflows are made to be a space for collaborative analysis. You can easily share access to your workflow with colleagues.

Redivis tables can be embedded on any site that allows for embed links, which will allow someone to interact with the live table and underlying data (assuming proper data access), including viewing summary statistics, exploring the cells, and querying the data.

Public data will always be accessible when embedded to anyone viewing the page. However, if the data is restricted, normal data access rules will apply.

In order to view restricted data, users must have a Redivis account and be signed in. If they need to sign in, they can do so when viewing the dataset in its embedded context.

Once logged in, if they already have metadata or data access to the table, it will become available to view and query. If not, they'll need to apply for access on the Redivis dataset page by clicking the Apply for access button.


Transform concepts

Overview

A transform is a series of steps detailing a data transformation that are executed in sequential order when the transform is run. These steps and a final variable selection create an output table that can be inspected and further transformed.

Transforms vs. notebooks?

Transforms are better for:

  • Reshaping + combining tabular and geospatial data

  • Working with large tables, especially at the many GB to TB scale

  • Preference for a no-code interface, or preference for programming in SQL

  • Declarative, easily documented data operations

Notebooks are better for:

  • Interactive exploration of any data type, including unstructured data files

  • Working with smaller tables (though working with bigger data is possible)

  • Preference for Python, R, Stata, or SAS

  • Interactive visualizations and figure generation

Steps

When running the transform, steps will be executed in order, building off of what directly came before them.

Steps have additional options in the ⋮ menu to help in the building process:

  • Collapse / expand

  • Annotate

  • Disable / enable

  • Reorder (this might change the outcome)

Output variables

Running transforms

Transforms are iterative and are designed to easily check your progress as you work. As long as a transform is valid, you can press the Run button to generate your output table. Click on this output table node in the workflow tree to see your results once the run is complete.

Invalid state

The transform interface will prevent you from running a transform that is invalid. This will be displayed with an invalid icon on the step or the part of a step where the issue is. Usually when you have an invalid alert it means some crucial information is missing and you can follow the invalid icons to find the locations you'll need to update.

History and revert

Every time you run the transform, a snapshot of the transform at that point in time is saved. You can click on History to view the log of these runs, and click on any snapshot to revert the transform to that point in time.

Your transform will then be in the Edited state and no changes will be made to the output table. Once you run it, this configuration will become a new snapshot at the top of the transform history list.

Checking the output

Common things to check might be total number of rows and variables in your output table, distinct and null counts of a key variable, or even running a query on the Query tab of the table to filter by a particular value.
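
For example, a quick sanity check on a key variable could be expressed as a single query (the table and variable names are hypothetical):

SELECT
  COUNT(*) AS row_count,
  COUNT(DISTINCT patient_id) AS distinct_patients,
  SUM(CASE WHEN patient_id IS NULL THEN 1 ELSE 0 END) AS null_patient_ids
FROM my_output_table;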

If you find the table isn't what you want, you can go back to the transform to make changes and rerun it. Transforms were created to be iterative!

Additional tools for working with transforms

Name and rename

By default transforms are named with a number based on the order they were created. We suggest renaming your transforms with descriptive names so you and your collaborators can quickly find them later on. You can rename a transform by clicking on the title of the node, or on the Rename option in the transform ⋮ More menu.

If you have not specifically renamed the connected output table, renaming a transform will also rename this output table.

Change the source table

All transform nodes have one source table. You can join in multiple tables, but the source table designation will affect how the joins execute. You can change the source table from the ⋮ More menu in the transform. Note that if the new source table has different variables your transform might become invalid until you update it.

Split and combine transforms

All transforms can be split at the step level into two different transforms by clicking Split in any step's menu. Additionally, two transforms can be combined into one by right clicking on a table to Remove it.

You might want to split a transform above a tricky step to see what the output table would look like at that point in the process. This can be a key tool in troubleshooting any issues and understanding what might be going wrong.

After splitting a transform to check an output table, the next logical step might be to combine these two transforms back into one again. Or perhaps you have a string of transforms which you no longer need the output tables for and want to reduce the size of your workflow.

Copy, paste and insert

As you go, you might find yourself wanting to take certain data cleaning actions multiple times or move them around in your process. You can right click on any transform in the workflow tree to see options for copying it or deleting it. Right click on any table node to paste the copied transform. You can also insert transforms above other transforms.

Steps and parts of steps can also be copied and pasted within the same transform or across multiple transforms.

View SQL code

All transforms generate SQL, which will be executed when the transform is run. To view the code for a particular step, click on View SQL in the step menu. You can also see the SQL for the entire transform from within the ⋮ More menu of the transform.

You can convert this code into a SQL query step and run it to achieve the same outcome, or edit the code directly.

SQL Concepts

Redivis uses SQL as the basis for data transformation for many reasons: it is designed for working quickly and efficiently with large tables, its core concepts and basic operations are easy to learn, and it is highly reproducible.

While Python, R, Stata, and SAS are commonly used in data analysis, they are not well suited to working with very large tables. Executing queries on millions of records in these tools can take hours and in many cases fail. Transforms provide a point-and-click interface so that a perhaps unfamiliar query language doesn't get in the way.

Fundamentals

  • Tables are data structures containing rows and variables

  • Variables represent fields in the table that all rows contain values for

  • Rows represent an individual observation in the table, with values across all variables

  • A SQL query will always operate on data from one or more tables and output a single data table.

Building a query

When you are using the interface or SQL code, it is best to start the process by figuring out what you would like your output table to look like. What are the variables? What do you want each row of the table to represent?

Once you figure out what you want your output to look like, you can work backwards and formulate the steps of your query to build the shape you'd like.
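
For example, suppose you want one row per patient with a count of their 2020 hospitalizations. Working backwards from that target shape, a sketch of the query (table and variable names are hypothetical) might be:

SELECT
  p.patient_id,
  COUNT(*) AS num_hospitalizations
FROM patients AS p
JOIN hospitalizations AS h
  ON h.patient_id = p.patient_id
WHERE h.admission_year = 2020
GROUP BY p.patient_id;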

Tables

Overview

Table nodes in your workflow can either be generated by running a transform, by running code in a notebook that outputs a table, or will be part of a dataset you've added. The main purpose of table nodes in workflows is to store data which you can sanity check and further query.

All table nodes also have an associated export interface to see different ways that you can use this data off of the Redivis platform.

Usage

You can create multiple transforms to operate on a single table, which allows you to create various branches within your workflow. To create a new transform, select the table or dataset and click the small + icon that appears under the bottom right corner of the node.

Sanity check output

After you run a transform, you can investigate the downstream output table to get feedback on the success and validity of your querying operation – both the filtering criteria you've applied and the new features you've created.

Looking at the content of an output table allows you to perform important sanity checks at each step of your research process, answering questions like:

  • Did my filtering criteria remove the rows I expected?

  • Do my new variables contain the information I expect?

  • Does the distribution of values in a given variable make sense?

  • Have I dropped unnecessary variables?

Archival

If you haven't interacted with tables in your workflow for a while, these tables may become archived, which will temporarily limit your ability to view cells and query data in that table. This is done to prevent runaway storage costs, while leveraging the built-in reproducibility of workflows to allow you to unarchive the table and pick up where you last left off.

The archival algorithm prioritizes tables that are large, quick to regenerate, and intermediary (not at the bottom of the tree). It currently does not archive tables less than 1GB; in many cases you may never interact with archived tables.

Note that if the transform immediately upstream (or any additional upstream transforms, if multiple sequential tables are archived) is invalid, you'll have to resolve the invalid state before un-archiving the table.

File index tables

Exporting

Table node states

As you work in a workflow, node colors and symbols will change in the tree view to help you keep track of your work progress.

State
Display
Details

Empty

White background

A table node will be empty when it contains no data because the upstream node either has not been executed or its execution resulted in no data.

Executed

Grey background

A table node will be grey when it has data that aligns with the contents of its upstream node.

Stale

Yellow background

A node will be stale when an upstream change has been made. This means the content of the node does not match the content of the node above it.

Sampled

Black circle with 1% icon

Incomplete access

All black background, or dashed borders

You don't have full access to the node. Click on the Incomplete access button in the top bar to begin applying for access to the relevant datasets.

Data sources

Overview

You can click on any table to view its contents, or click "Transform" to build a transform on it.

Adding data source nodes to a workflow

Adding a data source to a workflow will make a copy of that dataset or workflow in the form of a circular node at the top of your workflow tree. You can add data to the workflow by:

  • Click the + Add data button in the top left of the toolbar within a workflow.

  • Click the Analyze in workflow button on a dataset page.

  • Click the Fork button in the toolbar of any workflow.

Restrictions

No workflow can contain two copies of the same version of the same data source. You can add a different version of a dataset to your workflow by right clicking on the dataset name in the + Add data modal.

Datasets as a data source

Dataset samples

If a 1% sample is available for a dataset, the sample will automatically be added to your workflow by default instead of the full dataset. Samples are indicated by the dark circle icon to the top left of a dataset node in the left panel and in the list of the dataset's tables.

All sampled tables in the same dataset will be sampled on the same variable with the same group of values (so joining two tables in the same dataset with 1% samples will still result in a 1% sample).

To switch to the full dataset, click the Sample button in the top right of the menu bar when you have the dataset selected.

Your downstream transforms and tables will become stale, since an upstream change has been made. You can run these nodes individually to update their contents, or use the run all functionality by clicking on the workflow's name in the top menu bar.

Dataset versions

You can view version diffs and select whichever version you want to use here.

After updating, your downstream transforms and tables will become stale. You can run these nodes individually to update their contents, or use the run all functionality by clicking on the workflow's name in the top menu bar.

Workflows as a data source

All workflow data sources are linked to their original workflow and will automatically update with any changes made to the original.

Data source node states

As you work in a workflow, node colors and symbols will change in the tree view to help you keep track of your work progress.

State
Display
Details

Dataset type

Dataset icon in the middle of the circle

Workflow type

Workflow icon in the middle of the circle

Sampled

Black circle with 1% icon

Outdated version

Purple background on version number

Incomplete access

All black background, or dashed borders

You don't have full access to the node. Click on the Incomplete access button in the top bar to begin applying for access to the relevant datasets.

Step: Aggregate

Overview

The Aggregate step collapses rows that are identical across a set of variables, optionally creating new aggregate variables in the process.

Example starting data:

/*---------+------------+---------+--------*
 | test    | date       | student | score  |
 +---------+------------+---------+--------+
 | quiz    | 2020-04-01 | jane    | 83     |
 | quiz    | 2020-04-01 | pat     | 35     |
 | midterm | 2020-05-01 | jane    | 74     |
 | midterm | 2020-05-01 | pat     | 62     |
 *---------+------------+---------+--------*/

Example output data

Collapsing on variables test and date, and creating a new variable average_score to aggregate data from the score variable.

/*---------+-------------+---------------*
 | test    | date        | average_score |
 +---------+-------------+---------------|
 | quiz    | 2020-04-01  | 59            |
 | midterm | 2020-05-01  | 68            |
 *---------+-------------+---------------*/

Step structure

  • There is one collapse block where we will define how the data will be reshaped. On execution:

    • Your data will be cut to only include the variables chosen in this block.

    • Duplicate records across the collapsed variables will be dropped.

  • There can be one or more aggregation blocks where we can capture aggregate information in a newly created variable.

    • You can successfully collapse your table without creating any new variables in aggregation blocks.

    • Each aggregation block in the step represents one new variable in your output data.

    • Aggregation blocks are how you can capture information about records dropped in your collapse block.

Field definitions

Collapse block:

Field
Description

Variables to collapse on

All variables you want to include in your output.

Aggregation block(s):

Field
Description

Name

The name of the new variable being created.

Aggregation method

How the new variable will be summarized (e.g. SUM or COUNT).

[Method fields]

The additional inputs required by your chosen aggregation method (for example, the variable to aggregate).

Examples

Example 1: Basic collapse

We have test score data recorded per test, student, and date. However we want to know the average score on each test overall.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Variables to collapse on: Of the variables in our table, the only ones we want in our final output are test and date, so we select these here. We leave out student because that information doesn't matter to us anymore, and we leave out score because we are creating a new variable to replace it with aggregated information.

  • Name: We give our new variable a descriptive name average_score.

  • Aggregation method: We want to average all values in the score variable per test, so we choose Average.

  • Variable to aggregate: Here is where we choose score as the variable containing the data we want to Average. If we had chosen a different Aggregation method, we might have different input fields here to answer.

Execution:

All variables are removed that aren't collapsed on or used in aggregation.

/*---------+-------+------------*
 | test    | score | date       |
 +---------+-------+------------+
 | quiz    | 83    | 2020-04-01 |
 | quiz    | 35    | 2020-04-01 |
 | quiz    | 89    | 2020-04-01 |
 | midterm | 74    | 2020-05-01 |
 | midterm | 62    | 2020-05-01 |
 | midterm | 93    | 2020-05-01 |
 | final   | 77    | 2020-06-01 |
 | final   | 59    | 2020-06-01 |
 | final   | 92    | 2020-06-01 |
 *---------+-------+------------*/

The average of score is computed across rows where test and date (our collapsed-on variables) have the exact same values.

/*---------+-------+------------+---------------*
 | test    | score | date       | average_score |
 +---------+-------+------------+---------------+
 | quiz    | 83    | 2020-04-01 | 69            |
 | quiz    | 35    | 2020-04-01 | 69            |
 | quiz    | 89    | 2020-04-01 | 69            |
 | midterm | 74    | 2020-05-01 | 76.3333       |
 | midterm | 62    | 2020-05-01 | 76.3333       |
 | midterm | 93    | 2020-05-01 | 76.3333       |
 | final   | 77    | 2020-06-01 | 76            |
 | final   | 59    | 2020-06-01 | 76            |
 | final   | 92    | 2020-06-01 | 76            |
 *---------+-------+------------+---------------*/

The score variable is removed, since it was not collapsed on.

/*---------+------------+---------------*
 | test    | date       | average_score |
 +---------+------------+---------------+
 | quiz    | 2020-04-01 | 69            |
 | quiz    | 2020-04-01 | 69            |
 | quiz    | 2020-04-01 | 69            |
 | midterm | 2020-05-01 | 76.3333       |
 | midterm | 2020-05-01 | 76.3333       |
 | midterm | 2020-05-01 | 76.3333       |
 | final   | 2020-06-01 | 76            |
 | final   | 2020-06-01 | 76            |
 | final   | 2020-06-01 | 76            |
 *---------+------------+---------------*/

Then all exact duplicate records are dropped, to create the output.

Output data:

/*---------+-------------+---------------*
 | test    | date        | average_score |
 +---------+-------------+---------------|
 | quiz    | 2020-04-01  | 69            |
 | midterm | 2020-05-01  | 76.3333       |
 | final   | 2020-06-01  | 76            |
 *---------+-------------+---------------*/
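Since every transform step is fundamentally SQL, the collapse and aggregation above are roughly equivalent to the following query (a sketch, using the _source_ reference described in the SQL query step):

SELECT
    test,
    date,
    AVG(score) AS average_score
FROM _source_
GROUP BY test, date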

Example 2: Multiple aggregation variables

We can build on our previous example; say we additionally want to know how many students took each test.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

2nd aggregation block

  • Name: We want to give the new variable a descriptive name unique from our other block (and other variables in our table), in this case test_count.

  • Aggregation method: We want to count all values in the test variable per date, so we choose Count.

  • Variable to count: We choose test as the variable containing the data we want to Count. Since none of our variables have null entries, we could choose any variable here and get the same result. If we did have nulls, they would not be included in the Count. Conceptually we do not want to count only distinct values, so we leave that option off.

Output data:

/*---------+-------------+---------------+-------------*
 | test    | date        | average_score | test_count  |
 +---------+-------------+---------------+-------------+
 | quiz    | 2020-04-01  | 69            | 3           |
 | midterm | 2020-05-01  | 76.3333       | 3           |
 | final   | 2020-06-01  | 76            | 3           |
 *---------+-------------+---------------+-------------*/
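As a rough SQL sketch (again assuming the _source_ reference), the second aggregation block simply adds another aggregate to the same GROUP BY:

SELECT
    test,
    date,
    AVG(score) AS average_score,
    COUNT(test) AS test_count
FROM _source_
GROUP BY test, date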

Example 3: Drop duplicates

This step can be used to drop duplicate records even in cases where no aggregation happens.

Let's say we have data that we know has duplicate records that we don't need.

Starting data:

/*---------+-----------+------------*
 | test    | questions | date       |
 +---------+-----------+------------+
 | quiz    | 10        | 2020-04-01 |
 | quiz    | 35        | 2020-04-01 |
 | quiz    | 10        | 2020-04-01 |
 | midterm | 20        | 2020-05-01 |
 | midterm | 20        | 2020-05-01 |
 | midterm | 20        | 2020-05-01 |
 | final   | 45        | 2020-06-01 |
 | final   | 45        | 2020-06-01 |
 | final   | 45        | 2020-06-01 |
 *---------+-----------+------------*/

Input fields:

Variables to collapse on: To drop all records that are an exact duplicate across all variables, we just need the collapse block with no aggregation blocks. We need to select all variables here and can do so by typing all of them out, or inputting *.

Output data:

/*---------+-----------+------------*
 | test    | questions | date       |
 +---------+-----------+------------+
 | quiz    | 10        | 2020-04-01 |
 | midterm | 20        | 2020-05-01 |
 | final   | 45        | 2020-06-01 |
 *---------+-----------+------------*/
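A collapse with no aggregation blocks behaves like a SELECT DISTINCT over the chosen variables; a minimal sketch:

SELECT DISTINCT test, questions, date
FROM _source_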

Transforms

Overview

A transform node in a Redivis workflow provides a powerful querying tool to reshape one or more tables into a new output table.

You will build your transform by choosing steps one at a time based on the actions you want to take on your data. At any point you can run the transform to execute each step sequentially and view the resulting table.

The most commonly used transform steps can: Aggregate a table, Create variables based on existing data, Filter out rows you no longer need, and Join additional tables.

Transform node states

As you work in a workflow, node colors and symbols will change on the tree view to help you keep track of your work progress.

State
Display
Details

Empty

White background

A run has never been attempted.

Executed

Grey background

A run has successfully completed and no changes have been made since then.

Invalid

Black exclamation icon

A run is not possible. This might be because you haven't finished building the steps or have input invalid information.

Errored

Red exclamation icon

A run has finished unsuccessfully. This might be due to an incorrect input you've set that our validator can't catch. Or something might have gone wrong while executing and you'll just need to rerun it.

Edited

Grey hash marks

Changes have been made to the contents of the transform since it was last run. You can either Run this transform or Revert to its previously run state to resolve it. Editing a transform makes the downstream table stale.

Stale

Yellow background

An upstream change has been made. This means the content of the node does not match the content of the node above it.

Queued

Static circular icon

Nodes that are waiting for an upstream run to finish before running themselves. When you run multiple nodes they will be executed in a logical order. Once upstream dependencies are finished, queued nodes will automatically start running.

Running

Spinning circular icon

A run is currently in progress. You can cancel running (or queued) transforms by clicking the Run menu in the top bar and selecting Cancel. If a node is currently running, it might not be possible to cancel it, depending on what point in the process it's at.

Incomplete access

All black background, or dashed borders

You don't have full access to the node. Click on the Incomplete access button in the top bar to begin applying for access to the relevant datasets.

Step: Create variables

Overview

A Create variables block uses methods to make new variables based on existing data. The method selected will dictate how the block operates.

Example starting data:

Example output data:

Creating a new variable for what letter grade each student got on their test.

Step structure

  • There will be at least one new variable block where you will define and complete a new variable method.

  • When multiple blocks exist, the variables will be created in sequence and can reference each other.

Field definitions

Analytic methods

Analytic methods are a special category of method that compute a value for each row individually, rather than collapsing rows. When using an analytic method, new tools become available:

  • A partition segments data based on values in the selected partition variables, and computes the analytic method within those segments separately.

  • A window defines which records are used to compute the analytic method. Usually this would be accompanied by an order clause.

Examples 3 and 4 below go into more detail for analytic methods.

Examples

Example 1: Date extract

A simple new variable method is extracting one part of a variable into a new one. In our data, we have a full date including year, month, and day, but we want to extract the month for use elsewhere.

Starting data:

Input fields:

  • Name: We name our new variable test_month.

  • Method: Once we choose Date extract as our method, new fields appear:

    • Variable to extract: The variable with the data we want to extract from (date).

    • Date part: The part of the variable we want to extract. Since date is a Date type variable the information about date part is stored in the format and can be easily extracted. We choose Month since that is what we want to create.

Output data:

Example 2: Case method

The Case method (if/else) allows us to specify one or more conditions to create the values of the new variable.

For example, we can create a variable capturing the grade of each test in our data.

Starting data:

Input fields:

  • When run, each row of the table will be evaluated in each section of the case statement (If, Else if, Else set to) until it matches something and is set to the corresponding value.

    • After the first section where a row meets the criteria, the new variable value will be set and no other sections will be evaluated.
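Expressed as SQL, the case logic behaves like a CASE expression evaluated top to bottom; the sketch below assumes illustrative letter-grade cutoffs (90/80/70), which this example does not specify:

SELECT
    *,
    CASE
        WHEN score >= 90 THEN 'A'   -- the first matching condition wins
        WHEN score >= 80 THEN 'B'
        WHEN score >= 70 THEN 'C'
        ELSE 'F'                    -- fallback when no condition matches
    END AS letter_grade
FROM _source_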

Output data:

Example 3: Partitioned analytic methods

We can use an analytic method to compute a value for each row (rather than the entire table). By using a partition we can define groups of rows to calculate across.

For example, in our grades data we can calculate the average score of each test.

Starting data:

Input fields:

  • Method: We want to calculate the average so select this analytic method.

  • Partition: This is where we define the groups in which the average will be calculated.

    • If one or more variables are selected here, the average for our new score_average variable will be computed across all rows that are the same in the selected variable(s).

    • If no variables are entered here, then the average will be computed across the entire table.

    • We want to average scores from the same test so we select test here.

  • Variable to aggregate: This variable contains the data we want to average.

  • Window: We don't want to define a window in this average so we leave it on All rows in the partition.

Output data:

To create the new variable, for each row with the same value in test, the values in score will be averaged together. So there are 3 rows with the value quiz in test which average to 69. So all rows with a quiz will show the same average of 69 in our new variable.
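In SQL terms, a partitioned analytic method corresponds to a window function with a PARTITION BY clause; a minimal sketch:

SELECT
    *,
    AVG(score) OVER (PARTITION BY test) AS score_average
FROM _source_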

Example 4: Windowed analytic methods

We can use a window to calculate a moving average. A window will define how many rows before or after the current row to use when calculating the average.

Example data:

Input fields:

  • Partition: We choose not to use a partition in this example since our data does not need to be segmented.

  • Variable to aggregate: The numbers we want to average are in the score variable so we choose that here.

  • Window: We want to create a moving average based on one entry before and after the current row, so we select Rows here.

  • Order by: Our data is already ordered by date in this table, but if it wasn't we would definitely need to order on the relevant variable here.

  • Rows preceding / Rows following: This is where we define how many rows to include in the average.

Output data:

For each row, one row preceding and one row following are used to compute score_average. So for the first row we average 10 and 10 (since there is no preceding row, only the current and following rows are used). For the second row 10, 10, and 40 are averaged. This process repeats until the end of the table is reached.

Note that we could also use Range instead of Rows for our window if our question was time based (e.g. average score over 1 month preceding and 1 month following).
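As a sketch, the moving average above corresponds to a window function with an explicit row frame (no partition, ordered by date):

SELECT
    *,
    AVG(score) OVER (
        ORDER BY date
        ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
    ) AS score_average
FROM _source_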

Example 5: Windowed & partitioned analytic methods

Continuing the previous example, if we had included a partition then this same process would be completed separately for the values in each partition. So if we had partitioned on test in this example, our outcome would look different.

Output data:

Since quiz2 is in a separate partition, those rows are averaged separately.
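Combining both tools, the partitioned moving average can be sketched as:

SELECT
    *,
    AVG(score) OVER (
        PARTITION BY test
        ORDER BY date
        ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
    ) AS score_average
FROM _source_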


There are two mechanisms for working with data in workflows: transforms and notebooks. Understanding when to use each tool is key to taking full advantage of the capabilities of Redivis, particularly when working with big datasets.

The majority of building a transform is choosing and completing your data transformation steps. Steps represent one specific action (e.g., filter rows, join tables) and are completed in the Redivis transform interface. Every step is fundamentally a piece of SQL code that can be examined and could be reproduced by running in any environment with the same data.

The final task in constructing a transform is to select which variables you want to have in your output table. You will use the variable selector in the bottom pane of the transform to choose which to keep.

Output tables don't just exist to store data; you can use the variable statistics generated in the table node to make sure your transform did what you expected it to do.

Derivative tables in a workflow behave quite similarly to dataset tables throughout Redivis, where you can preview cells, view summary statistics, and run quick queries.

All table nodes have one upstream parent. You can view the table's data and metadata similarly to other tables on Redivis. You cannot edit or update the metadata here.

To sanity check the contents of a table node, you can inspect its general characteristics, check the summary statistics of different variables, look at the table's cells, or create a notebook for more in-depth analysis.

If a table is archived, you can still see the name, row count, and variable names/types. To view cells, see summary statistics, or run transforms downstream of an archived table, you'll have to reconstitute the table by re-running upstream transforms.

If you've added a dataset to a workflow that contains files (storage for unstructured data) you will see a table with a File index label in that dataset's list of tables. This is an automatically generated table, where every row represents one file. You can work with this table just like any other in the workflow, but the file_id variable will remain linked to the files in that dataset for use in a notebook.

Click the Export table button in the table's right panel to see your exporting options and manage any access restrictions associated with data this table is derived from.

This means that you are using a 1% sample of the data. When a dataset has a sample, it will automatically default to it when added to a workflow. You can switch between the sample and the full dataset at any time in the dataset node.

A data source node contains data from across Redivis that you want to work with and have access to. These are usually datasets but can also be other workflows. These nodes display overview information about the dataset or workflow they represent, and a list of the tables they contain.

Datasets are the most common data source you will add to your workflow. They contain the data you want to work with in the original state curated by the data owner. Usually datasets will contain one or more tables you can choose to transform or analyze in a notebook.

Some large datasets have 1% samples which are useful for quickly testing querying strategies before running transforms against the full dataset.

When a new version of a dataset is released by an administrator, the corresponding dataset node on your workflow tree will become purple. To upgrade the dataset's version, click the "Version" button in the top right of the menu bar when you have a dataset selected.

Workflows can be added to another workflow in order to build off existing analysis. You might want to continue an analytical pipeline that you've built elsewhere, or elaborate on someone else's analysis. You will have access to all tables this workflow contains.

This data source is a copy of a dataset on Redivis.

This data source is a copy of another workflow on Redivis.

Only possible for dataset source nodes. This means that you are using a 1% sample of the data. When a dataset has a sample, it will automatically default to it when added to a workflow. You can switch between the sample and the full dataset at any time in the dataset node.

Only possible for dataset source nodes. For datasets this means that you are not using the latest version. This means that you have either intentionally switched to using an older version, or that this dataset's administrator has released a new version that you can upgrade to.

After you select your aggregation method, you will be prompted to input the information your chosen method needs to execute. You can see more specifics in the reference section for each method.


The transform interface translates your interface inputs to SQL code that you can view and edit. You can also intermix SQL code you've written between interface steps, or build an entire transform of your own if you wish.

See our Reshape tables in transforms guide for a step-by-step walkthrough of using transforms on Redivis.

Field
Definition

Name

The name of the variable being created. This must follow all naming standards.

Method

The method that will be used to create the new variable. Once selected, additional fields specific to that method will appear. Some methods are only available for certain variable types, so you might need to retype variables before you can use them in the method you've chosen. More methods to choose from can be found in the Variable creation methods reference.

Comparison statements used here will operate the same as they do in the Filter step and can be nested in the same way.

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 | sam     | 74     |
 | pat     | 62     |
 *---------+--------*/
/*---------+--------+-------*
 | student | score  | grade |
 +---------+--------+-------+
 | jane    | 83     | B     |
 | neal    | 35     | F     |
 | sam     | 74     | C     |
 | pat     | 62     | D     |
 *---------+--------+-------*/
/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/
/*---------+-------+---------+------------+--------------*
 | test    | score | student | date       | test_month   |
 +---------+-------+---------+------------+--------------+
 | quiz    | 83    | jane    | 2020-04-01 | April        |
 | quiz    | 35    | pat     | 2020-04-01 | April        |
 | quiz    | 89    | sam     | 2020-04-01 | April        |
 | midterm | 74    | jane    | 2020-05-01 | May          |
 | midterm | 62    | pat     | 2020-05-01 | May          |
 | midterm | 93    | sam     | 2020-05-01 | May          |
 | final   | 77    | jane    | 2020-06-01 | June         |
 | final   | 59    | pat     | 2020-06-01 | June         |
 | final   | 92    | sam     | 2020-06-01 | June         |
 *---------+-------+---------+------------+--------------*/
/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/
/*---------+-------+---------+------------+----------------*
 | test    | score | student | date       | letter_grade   |
 +---------+-------+---------+------------+----------------+
 | quiz    | 83    | jane    | 2020-04-01 | B              |
 | quiz    | 35    | pat     | 2020-04-01 | F              |
 | quiz    | 89    | sam     | 2020-04-01 | B              |
 | midterm | 74    | jane    | 2020-05-01 | C              |
 | midterm | 62    | pat     | 2020-05-01 | F              |
 | midterm | 93    | sam     | 2020-05-01 | A              |
 | final   | 77    | jane    | 2020-06-01 | C              |
 | final   | 59    | pat     | 2020-06-01 | F              |
 | final   | 92    | sam     | 2020-06-01 | A              |
 *---------+-------+---------+------------+----------------*/
/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/
/*---------+-------+---------+------------+---------------*
 | test    | score | student | date       | score_average |
 +---------+-------+---------+------------+---------------+
 | quiz    | 83    | jane    | 2020-04-01 | 69            |
 | quiz    | 35    | pat     | 2020-04-01 | 69            |
 | quiz    | 89    | sam     | 2020-04-01 | 69            |
 | midterm | 74    | jane    | 2020-05-01 | 76.3333       |
 | midterm | 62    | pat     | 2020-05-01 | 76.3333       |
 | midterm | 93    | sam     | 2020-05-01 | 76.3333       |
 | final   | 77    | jane    | 2020-06-01 | 76            |
 | final   | 59    | pat     | 2020-06-01 | 76            |
 | final   | 92    | sam     | 2020-06-01 | 76            |
 *---------+-------+---------+------------+---------------*/
/*-------+---------+------------*
 | score | test    | date       |
 +-------+---------+------------+
 | 10    | quiz1   | 2020-01-01 |
 | 10    | quiz1   | 2020-02-01 |
 | 40    | quiz1   | 2020-03-01 |
 | 30    | quiz2   | 2020-04-01 |
 *-------+---------+------------*/
/*-------+---------+------------+---------------*
 | score | test    | date       | score_average |
 +-------+---------+------------+---------------+
 | 10    | quiz1   | 2020-01-01 | 10            |
 | 10    | quiz1   | 2020-02-01 | 20            |
 | 40    | quiz1   | 2020-03-01 | 26.6667       |
 | 30    | quiz2   | 2020-04-01 | 35            |
 *-------+---------+------------+---------------*/
/*-------+---------+------------+---------------*
 | score | test    | date       | score_average |
 +-------+---------+------------+---------------+
 | 10    | quiz1   | 2020-01-01 | 10            |
 | 10    | quiz1   | 2020-02-01 | 20            |
 | 40    | quiz1   | 2020-03-01 | 25            |
 | 30    | quiz2   | 2020-04-01 | 30            |
 *-------+---------+------------+---------------*/


Step: Order

Overview

The Order step sorts the table based on one or more variables.

Example starting data:

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 | sam     | 74     |
 | pat     | 62     |
 *---------+--------*/

Example output data:

Order by score descending

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | sam     | 74     |
 | pat     | 62     |
 | neal    | 35     |
 *---------+--------*/

Ordering large tables (>1GB) across variables with a large number of unique values requires substantial memory and isn't parallelizable. Order clauses in such cases may significantly slow your transform or cause it to fail.

Step structure

  • There will be at least one order block where you will define a variable and a sort order.

  • When multiple blocks exist, the variables will be ordered and then sub-ordered in sequence.

Input field definitions

Field
Definition

Order by

The variable containing the values that will be sorted.

Sort

A choice of how all values in the Order by variable will be sorted: ASC (nulls first), ASC (nulls last), DESC (nulls first), or DESC (nulls last). Note that variables of the geography data type aren't sortable.

Examples

Example 1: Basic order

We can sort a table to quickly see the highest scores.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 100   | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 100   | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields

  • Order by: The score variable has the data we want to sort on, so we select it here

  • Sort: We want the data to go from largest to smallest values, so we choose DESC. There are no null values in this table, so we can choose either nulls first or nulls last and get the same result.

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | midterm | 100   | sam     | 2020-05-01 |
 | final   | 100   | sam     | 2020-06-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | quiz    | 83    | jane    | 2020-04-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 *---------+-------+---------+------------*/
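As a rough SQL equivalent (a sketch using the _source_ reference):

SELECT *
FROM _source_
ORDER BY score DESC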

Example 2: Ordering on multiple variables

Let's say instead we want to sort first by the highest score, and then by the most recent date.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 100   | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 100   | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

First block

  • Order by: The first variable we want the data sorted on is score so we choose it in the first block.

  • Sort: We want the highest scores first, so we choose descending. There are no null values in this variable, so it doesn't matter whether nulls appear first or last; we choose DESC (nulls last).

Second block

  • Order by: The second variable we want to sort on is date so we put it here.

  • Sort: Since we want the most recent (highest) values first, we want it to be descending. There are no null values in this variable so where we put the nulls does not matter. We choose DESC (nulls first).

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | final   | 100   | sam     | 2020-06-01 |
 | midterm | 100   | sam     | 2020-05-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | quiz    | 83    | jane    | 2020-04-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 *---------+-------+---------+------------*/
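The two order blocks correspond to a single ORDER BY clause with the primary sort listed first; a sketch:

SELECT *
FROM _source_
ORDER BY score DESC NULLS LAST, date DESC NULLS FIRST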

Step: Filter

Overview

The Filter step will select for rows that meet a certain set of conditions.

Example starting data:

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 | sam     | 74     |
 | pat     | 62     |
 *---------+--------*/

Example output data:

Filter out rows with scores less than 70.

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | sam     | 74     |
 *---------+--------*/

Step structure:

Basic state

  • A filter step will be made up of one or more filter blocks with a completed condition.

  • When there are multiple filter blocks in a step, conditions in all blocks must be met for rows to be kept.

  • If you have a more complex filter statement that is dependent on multiple nested conditions you can press the + button to expand the filter block.

Expanded state

  • When multiple conditions are needed in a block, you must specify how they relate to each other (AND vs OR)

  • Any nested conditions are resolved before higher level conditions.

Field descriptions

Variable(s)

The variable whose values will be evaluated.

Depending on the operator selected, it may contain multiple variables. Use wildcard characters (e.g. *) for bulk selection.

[Operator]

How the two sides of the condition will be evaluated.

Value(s) or Variable(s)

The literal value(s), variable(s), or value list(s) that the left side of the condition will be compared against.

Examples

Example 1: Basic filter

Let's say we want to reduce our table to only contain information about results from the final test.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Variable(s): The variable test contains the value (final) that we want to evaluate on.

  • [Operator]: We want fields that exactly match, so we choose =.

  • Value(s) or variable(s): We want to only keep rows where final is present, so we put that here.

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/
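This filter is roughly equivalent to a simple WHERE clause (a sketch):

SELECT *
FROM _source_
WHERE test = 'final'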

Example 2: Multiple conditions

Let's say we don't just want values from the final but only those from the final with a score above 60.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • We input the data as in the above example, but since we now have two conditions, we have to decide how they relate to each other. In this case we want data that meets all the conditions, so we select All conditions must be satisfied (AND).

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Example 3: Nested conditions

Let's say we want to keep all data from the final greater than 60, or any scores above 85.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • We want to keep all rows where scores are over 85 OR are from the final and over 60. So we set up the final and over 60 conditions under an AND umbrella, and nest that under the OR umbrella alongside our condition about scores over 85.

  • When executing, the nested conditions (scores from the final over 60) will be evaluated first to be true or false. Then the higher level condition (scores over 85 OR (scores from the final and over 60)) will be evaluated. Any rows that meet this higher level condition will be kept in the output, and any that do not will be discarded.

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/
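The nesting above maps directly onto parentheses in a WHERE clause; a sketch:

SELECT *
FROM _source_
WHERE score > 85
    OR (test = 'final' AND score > 60)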

Example 4: Comparing variables and value lists

Let's say we only want to keep rows where scores are greater than the average score, and that are for our selected students. Right now our selected students are Jane and Pat but we know that might change in the future.

Starting data:

/*---------+-------+---------+------------+---------------*
 | test    | score | student | date       | score_average |
 +---------+-------+---------+------------+---------------+
 | quiz    | 83    | jane    | 2020-04-01 | 69            |
 | quiz    | 35    | pat     | 2020-04-01 | 69            |
 | quiz    | 89    | sam     | 2020-04-01 | 69            |
 | midterm | 74    | jane    | 2020-05-01 | 76.3333       |
 | midterm | 62    | pat     | 2020-05-01 | 76.3333       |
 | midterm | 93    | sam     | 2020-05-01 | 76.3333       |
 | final   | 77    | jane    | 2020-06-01 | 76            |
 | final   | 59    | pat     | 2020-06-01 | 76            |
 | final   | 92    | sam     | 2020-06-01 | 76            |
 *---------+-------+---------+------------+---------------*/

Input fields:

  • For our first condition, we want to check whether the value in student matches our value list of selected students (jane and pat).

    • With multiple inputs on either side of a condition, the condition is evaluated as an OR. So this condition will evaluate to true if the value in student equals any value on the list (jane or pat).

    • While we could have input jane and pat into the right side of this condition and gotten the same result, using a list makes it easy in the future to change our selected students centrally and update it everywhere the list is used.

  • For our second condition, we want to evaluate how two values in the same row compare to each other.

    • When executed, this condition will look at each row to check if the value in score is greater than the value in score_average. If so, then the condition is met.

Output data:

/*---------+-------+---------+------------+---------------*
 | test    | score | student | date       | score_average |
 +---------+-------+---------+------------+---------------+
 | quiz    | 83    | jane    | 2020-04-01 | 69            |
 | final   | 77    | jane    | 2020-06-01 | 76            |
 *---------+-------+---------+------------+---------------*/
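As a sketch, the two conditions combine a membership test against the value list (written inline here for illustration; in the SQL query step a value list would be referenced with its @ parameter) with a row-by-row comparison of two variables:

SELECT *
FROM _source_
WHERE student IN ('jane', 'pat')
    AND score > score_average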

Reference: Comparison statements

A comparison always evaluates to either TRUE or FALSE.

Comparisons are made up of one or more comparison rows. A comparison row always evaluates to TRUE or FALSE; multiple rows can be nested together with a logical AND/OR.

A comparison row is made up of three components:

Left expression

The left expression can contain variable parameters from the source table, joined tables, as well as any newly created variables. All referenced new variables must be "upstream" from the current comparison, with the exception of joins, which may reference new variables that are constructed by any variables upstream of that join.

Depending on the operator selected, the left expression may contain multiple variables. In this case, each left expression will be evaluated against the right expression(s), and logically joined via an OR. If you want to take the logical AND of multiple variables, create a separate comparison row for each variable.

Multiple left hand values are only supported for = and like operators.

Where multiple variables can be entered in a comparison, you can use * in the interface to select all matching results.

For example, typing DIAG* and pressing enter will add all variables beginning with "DIAG" to this field.

Operator

Redivis supports the following operators:

=, !=

Checks if any of the value(s) in the left expression are (not) equal to any values in the right expression. NULLs are treated as equivalent (NULL == NULL -> TRUE and NULL != NULL -> FALSE)

>, >=, <, <=

Checks if the value in the left expression is less than, greater than, etc. the right expression. String comparisons are lexicographically ordered, other data types are based on the numeric / temporal order of that value. Comparisons between NULL values will always be false.

like / !like

Checks if the string(s) in the left hand expression matches specified pattern(s) in the right hand expression. The pattern may contain the following characters:

  • A percent sign "%" matches any number of characters

  • An underscore "_" matches a single character

  • You can escape "\", "_", or "%" using one backslash. For example, "\%"
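A brief sketch of like patterns in a filter, using the example tables above:

SELECT *
FROM _source_
WHERE student LIKE 'ja%'   -- "%" matches any number of characters (e.g. jane)
    OR test LIKE '_uiz'    -- "_" matches exactly one character (e.g. quiz)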

Right expression

The right expression can contain any variables allowed in the left expression, as well as literal values and lists. The comparison row will evaluate to TRUE when the left hand expression matches any of the right hand expressions, except for the != and !like comparators, where the comparison will evaluate to true if all values are not equal to / not like the left expression.

To match against a null datum (empty cell), you must specify the special literal value NULL here.

Step: Rename

Overview

The Rename step changes the name of an existing variable.

Example starting data:

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 | sam     | 74     |
 | pat     | 62     |
 *---------+--------*/

Example output data:

Rename score to final_score

/*---------+--------------*
 | student | final_score  |
 +---------+--------------+
 | jane    | 83           |
 | neal    | 35           |
 | sam     | 74           |
 | pat     | 62           |
 *---------+--------------*/

Step structure

  • There will be at least one rename block where you will define a variable and a new name.

Input field definitions

Field
Definition

Source variable

The variable that you want to rename.

New name

Example

We can change the name of a variable to give more meaning to the data.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 100   | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 100   | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Source variable: The variable we want to rename is date so we select that here.

  • New name: date_taken would be a better name for this variable, so we enter it here.

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date_taken |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 100   | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 100   | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/
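Renaming corresponds to a simple aliased selection; a sketch:

SELECT test, score, student, date AS date_taken
FROM _source_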

Step: SQL query

Overview

Redivis supports direct SQL queries for advanced functionality and for users who prefer to work in a programmatic interface. You can add one or more SQL query steps to any transform.

This step accepts SQL code and will execute in the order it has been placed within the rest of the steps.

Query syntax

Referencing the source table

The transform's source table is referenced as _source_. For example:

SELECT * 
FROM _source_
ORDER BY mean_drg_cost
LIMIT 1000

Joining tables

You can join any other table in your workflow by referencing it. For example:

SELECT t0.id, t0.name, t1.latitude, t1.longitude
FROM _source_ AS t0
INNER JOIN `demo.ghcn_daily_weather_data:7br5:v1_1.stations:g2q3` AS t1
    ON t0.id = t1.id

The query editor will automatically qualify referenced tables and value lists as you reference them, which ensures your query will continue to work if tables get renamed. In most cases, you can just reference the table by name, and the fully qualified reference will be inserted for you.

Referencing upstream variables

You can mix and match SQL steps with other step types within your transform. If you choose to do so, this will introduce some additional language semantics, since transforms allow variables with the same name to co-exist at any given step, delineated in the UI with the t1, t2, ... and v prefixes. The v prefix denotes any newly created variable, while the tN prefix denotes variables that have been joined from other tables.

In order to reference these variables, you must include the prefix in your SQL code before the variable name. The only exception to this rule is for variables prefixed by t0 — these should be referenced normally, allowing you to ignore these rules if you aren't mixing the SQL interface with other step types.

For example:

SELECT 
-- These variables are from the transform's source table, denoted by `t0` in the UI
    id, name, 
-- These variables were joined upstream, denoted by `t1` in the UI
    t1_latitude, t1_longitude, 
-- This variable was created upstream, denoted by `v` in the UI
    v_extracted_year
FROM _source_

Referencing value lists

Value lists are referenced with an @ prefix and can be expanded with UNNEST. For example:

SELECT provider_id 
FROM _source_
WHERE provider_state IN UNNEST(@`states:8a2h`) 
    -- If a value list is in the incorrect type, use the following code to convert it
    OR provider_fips IN UNNEST(
        (
            SELECT ARRAY_AGG(CAST(val AS STRING))
            FROM UNNEST(@`fips_codes:134a`)val
        )
    )
ORDER BY mean_drg_cost
LIMIT 1000

-- @states = ['CA', 'TX', etc...]
-- @fips_codes = [6071, 2999, 3242]

Limitations

  • DDL / DML queries are not supported

  • All queries must return fields of a valid Redivis type. The following BigQuery types are not supported in the result set, though they may be used internally within the query:

    • TIMESTAMP

    • NUMERIC

    • BIGNUMERIC

    • BYTES

    • INTERVAL

    • JSON

    • ARRAY

    • STRUCT

  • BigQuery ML syntax is not supported

  • BigQuery scripting language is not supported

  • Statements may only reference tables that exist in your workflow

  • Unnamed variables in the result are not supported

  • Duplicate variable names in the result are not supported

Errors

If your query is invalid, you will see an invalid icon and the error message displayed above your code. Transforms can't be run while any step is invalid.

Step: Limit

Overview

The Limit step reduces the table to a set number of rows.

Example starting data:

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 | sam     | 74     |
 | pat     | 62     |
 *---------+--------*/

Example output data:

Limit to 2 rows

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 *---------+--------*/

Step structure

  • There will be one limit block where you will define your limit.

Field definitions

Limit

The number of rows you would like your output data to contain.

The limit step is non-deterministic if the result set isn't ordered. Different rows may be returned from subsequent queries.

Example

Let's say we are doing initial exploratory work on data. We did a number of complex steps and want to quickly execute the query so we can iterate.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Limit: We enter 4 so that only 4 rows are returned.

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/
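A sketch of the equivalent SQL; note that without an ORDER BY the specific rows returned are not guaranteed:

SELECT *
FROM _source_
LIMIT 4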

Step: Stack

Overview

The Stack step adds or removes rows based on another table that aligns with your existing variable structure.

Example starting data:

Source (t0)             table1 (t1)
/*---------+--------*   /*---------+--------+
 | student | score  |    | student | score  |
 +---------+--------+    +---------+--------+
 | jane    | 83     |    | tom     | 83     |
 | kim     | 35     |    | sherri  | 92     |
 | sam     | 74     |    | pat     | 48     |
 | zay     | 62     |    | jade    | 87     |
 *---------+--------*/   *---------+--------*/

Example output data:

Union rows from table1.

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | kim     | 35     |
 | sam     | 74     |
 | zay     | 62     |
 | tom     | 83     |
 | sherri  | 92     |
 | pat     | 48     |
 | jade    | 87     |
 *---------+--------*/

Step structure

  • There will be one or more stack blocks, each defining a stacked table and how its rows will be combined with the source table.
Field descriptions

Field
Description

Stacked table

The table containing the rows you wish to add.

Stack type

How the rows will be added. (More information below.)

Variable alignment

How you will define the variable structure between the source and the stacked table.

All matching variables: Will automatically match variables and only include variables with a matching name; the rest will be dropped.

All variables: Will automatically match variables and keep all variables in the output. Values will be marked null where there is no match for an existing variable.

Manually match variables: Will require you to align variables between the two tables.

Retype variables

Variables will be retyped in order to be aligned.

Only keep distinct rows

If any added rows are an exact duplicate of an existing row it will be dropped from the output.

Create variable for table name (Union only)

In order to record in the data where new rows came from you can opt to create a new variable with table information.

Stack types

The most common stack type is a Union, which can be particularly useful when combining data which has been broken up into multiple tables with the same structure.

Union

Appends all of the stacked table’s rows to the source table.

Except

Keep only those distinct rows in the source that do not match rows in the stacked table.

Intersect

Keep only those distinct rows that appear in both the source and the stacked table.

Examples

Example 1: Basic union

Let's say I have data broken out into two tables with the same structure. I want to add all rows together as they are.

Starting data:

Source (t0)             table1 (t1)
/*---------+--------*   /*---------+--------+
 | student | score  |    | student | score  |
 +---------+--------+    +---------+--------+
 | jane    | 83     |    | tom     | 83     |
 | kim     | 35     |    | sherri  | 92     |
 | sam     | 74     |    | pat     | 48     |
 | zay     | 62     |    | jade    | 87     |
 *---------+--------*/   *---------+--------*/

Input fields:

  • Stacked table: Our data is in table1 so we select that here.

  • Stack type: We want to add all rows so we select Union here.

  • Variable alignment: Somehow our variable names shifted between tables so we can't automatically align tables. We choose Manually match variables here and then fill in the variable names under Source variable and Stacked table variable.

    • Where the variable names are different, the output will keep the name of the Source variable when it exists.

Output data:

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | kim     | 35     |
 | sam     | 74     |
 | zay     | 62     |
 | tom     | 83     |
 | sherri  | 92     |
 | pat     | 48     |
 | jade    | 87     |
 *---------+--------*/   
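A union stack corresponds to UNION ALL (or UNION DISTINCT when "Only keep distinct rows" is checked); a sketch, writing the stacked table simply as table1 in place of its fully qualified reference:

SELECT student, score FROM _source_
UNION ALL
SELECT student, score FROM table1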

Example 2: Bulk union and variable misalignment

To continue the previous example, let's say we have a third table of data, and it has additional information that was not present in the first two: the date. We want to keep this information and also keep track of which rows came from which table.

Starting data:

Source (t0)             table1 (t1)             table2 (t2)
/*---------+--------*   /*---------+--------*   /*---------+--------+------------*
 | student | score  |    | student | score  |    | student | score  | date       |
 +---------+--------+    +---------+--------+    +---------+--------+------------+
 | jane    | 83     |    | tom     | 83     |    | barb    | 46     | 2020-01-01 |
 | kim     | 35     |    | sherri  | 92     |    | mitch   | 79     | 2020-01-01 |
 | sam     | 74     |    | pat     | 48     |    | oleg    | 68     | 2020-01-01 |
 | zay     | 62     |    | jade    | 87     |    | maria   | 85     | 2020-01-01 |
 *---------+--------*/   *---------+--------*/   *---------+--------+------------*/

Input fields:

  • We create one block for the table1 union, and a second block for the table2 union

  • Variable alignment: Since we want to keep all variables no matter if they have matches, we select All variables.

  • Create variable for table name: We check this box since we want to keep track of which table each row came from.

    • Note that Source table name is automatically populated with the name of this transform's source table.

Output data:

/*---------+--------+------------+---------------*
 | student | score  | date       | source_table  |
 +---------+--------+------------+---------------+
 | jane    | 83     | null       | cohort        |
 | kim     | 35     | null       | cohort        |
 | sam     | 74     | null       | cohort        |
 | zay     | 62     | null       | cohort        |
 | tom     | 83     | null       | table1        |
 | sherri  | 92     | null       | table1        |
 | pat     | 48     | null       | table1        |
 | jade    | 87     | null       | table1        |
 | barb    | 46     | 2020-01-01 | table2        |
 | mitch   | 79     | 2020-01-01 | table2        |
 | oleg    | 68     | 2020-01-01 | table2        |
 | maria   | 85     | 2020-01-01 | table2        |
 *---------+--------+------------+---------------*/  

Example 3: Except

Let's say we have a table with information about students and their test scores, and we have identified some students that have dropped the course and that we no longer need in the original table. We have gathered those students in table1.

Starting data:

Source (t0)             table1 (t1)
/*--------+-------*      /*------+-------*
 | name   | score |       | name | score |
 +--------+-------+       +------+-------+
 | jane   | 83    |       | kim  | 35    |
 | kim    | 35    |       | pat  | 48    |
 | sam    | 74    |       *------+-------*/
 | pat    | 62    | 
 | tom    | 83    |
 | sherri | 92    |
 *--------+-------*/       

Input fields:

  • Stacked table: The students we want to remove are listed in table1, so we select it here.

  • Stack type: We want to remove matching rows from the source, so we select Except.

Output data:

/*--------+-------*
 | name   | score |
 +--------+-------+
 | jane   | 83    |
 | sam    | 74    |
 | tom    | 83    |
 | sherri | 92    |
 *--------+-------*/       

The rows that were in both tables are now removed.

Example 4: Intersect

Let's say we have two tables full of students: one of those who took our fall semester class, and one of those who took the spring semester class. We want to keep only rows for students who took both classes, i.e. those who are present in both tables.

Source (t0)             table1 (t1)
/*--------+-------*      /*------+-------*
 | name   | id    |       | name | id    |
 +--------+-------+       +------+-------+
 | jane   | 101   |       | kim  | 104   |
 | kim    | 104   |       | pat  | 108   |
 | sam    | 105   |       | mae  | 109   |       
 | pat    | 108   |       | zay  | 110   | 
 | tom    | 112   |       | jade | 111   |
 | sherri | 117   |       | tom  | 112   |
 *--------+-------*/      *------+-------*/      

Input fields

Output data:

/*------+-------*
 | name | id    |
 +------+-------+
 | kim  | 104   |
 | pat  | 108   |
 | tom  | 112   |
 *------+-------*/     

Only rows that were in both tables now remain.
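
As a rough SQL analogue, this Intersect behavior corresponds to an INTERSECT DISTINCT between the two tables. The sketch below uses inline stand-in data matching the example above and is illustrative only.

WITH fall AS (
  SELECT 'jane' AS name, 101 AS id UNION ALL
  SELECT 'kim', 104 UNION ALL
  SELECT 'sam', 105 UNION ALL
  SELECT 'pat', 108 UNION ALL
  SELECT 'tom', 112 UNION ALL
  SELECT 'sherri', 117
), spring AS (
  SELECT 'kim' AS name, 104 AS id UNION ALL
  SELECT 'pat', 108 UNION ALL
  SELECT 'mae', 109 UNION ALL
  SELECT 'zay', 110 UNION ALL
  SELECT 'jade', 111 UNION ALL
  SELECT 'tom', 112
)
-- Keep only rows present in both tables
SELECT * FROM fall
INTERSECT DISTINCT
SELECT * FROM spring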

Step: Pivot

Overview

The Pivot step rotates data values into new variables using aggregation.

Example starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 *---------+-------+---------+------------*/

Example output data:

Pivot on the test variable.

/*---------+------+---------+-------*
 | student | quiz | midterm | final |
 +---------+------+---------+-------+
 | jane    | 83   | 74      | 77    |
 | pat     | 35   | 62      | 59    |
 | sam     | 89   | 93      | NULL  |
 *---------+------+---------+-------*/

Step structure

  • There is one pivot block where we will define the pivot variable and values.

  • There are one or more aggregation blocks where we will define how the table will reshape and what data populates the newly created variables.

    • There must always be at least one aggregate block.

    • The total number of new variables in your output table will be the number of Pivot values defined multiplied by the number of aggregation blocks.

Input field definitions

Pivot block:

Field
Definition

Pivot variable

The variable containing the values that will become new variables.

Variables to collapse on

All variables you want to include in your output, except for variables defined elsewhere in the pivot operation (pivot variable, variable to aggregate on).

Pivot value

A value from your pivot variable which will become a new variable in your output table. You can add multiple values. You do not need to include every value of this variable.

New variable name

An option to rename any of the new variables created in the pivot value field. Leaving this field blank will persist the value name into the variable name.

Aggregation block:

Field
Definition

Alias

A value that will be prefixed to your new variable names. If you are aggregating one variable this field is optional, but if you are aggregating multiple then you'll need to specify a value for each one.

Aggregation method

How the variable to aggregate will be summarized (e.g. SUM, COUNT).

Variable to aggregate

Which variable's data will populate the new variables you are creating in the pivot value field.

Examples

Example 1: Basic pivot

Let's say we've recorded data about tests given and student scores and we want to pivot in order to see information sorted by student.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | quiz    | 74    | jane    | 2020-04-15 |
 | quiz    | 83    | sam     | 2020-04-15 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 91    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Pivot variable: Since we want to create new variables based on values in the student variable, we select that as our Pivot variable.

  • Variables to collapse on: We want to include all variables here that will form the final shape of our pivoted table and that are not already defined as our pivot variable or variable to aggregate. We include test and date in this example.

  • Pivot value: We need to choose one or more values from our selected pivot variable (student) to become new variables in our output table. This does not have to include every value in this variable, even though we use all of them in this example.

  • New variable name: We don't want to rename any of these new variables, so we can leave this blank.

  • Alias (optional): Since we only have one variable to aggregate, we can leave this blank.

  • Aggregation method: We need to choose how our variable to aggregate (score) will be aggregated. In this example data we don't have duplicated values so what we choose here won't matter. So we choose SUM to validate the process.

  • Variable to aggregate: This is the variable (score) that will be removed from this table, and its values will be redistributed to our newly created variables (jane, pat, and sam) with the aggregation method we selected handling any duplicate values (in this example, SUMming them).

Output data:

/*---------+------------+------+------+------*
 | test    | date       | jane | pat  | sam  |
 +---------+------------+------+------+------+
 | quiz    | 2020-04-01 | 83   | 35   | 89   |
 | quiz    | 2020-04-15 | 74   | NULL | 83   |
 | midterm | 2020-05-01 | 74   | 62   | 93   |
 | final   | 2020-06-01 | 77   | NULL | 91   |
 *---------+------------+------+------+------*/
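
For reference, a roughly equivalent query using BigQuery's PIVOT operator is sketched below. The inline scores table stands in for the source table; column naming and ordering in the actual transform output may differ slightly, so treat this as illustrative only.

WITH scores AS (
  SELECT 'quiz' AS test, 83 AS score, 'jane' AS student, DATE '2020-04-01' AS date UNION ALL
  SELECT 'quiz', 35, 'pat', DATE '2020-04-01' UNION ALL
  SELECT 'quiz', 89, 'sam', DATE '2020-04-01' UNION ALL
  SELECT 'quiz', 74, 'jane', DATE '2020-04-15' UNION ALL
  SELECT 'quiz', 83, 'sam', DATE '2020-04-15' UNION ALL
  SELECT 'midterm', 74, 'jane', DATE '2020-05-01' UNION ALL
  SELECT 'midterm', 62, 'pat', DATE '2020-05-01' UNION ALL
  SELECT 'midterm', 93, 'sam', DATE '2020-05-01' UNION ALL
  SELECT 'final', 77, 'jane', DATE '2020-06-01' UNION ALL
  SELECT 'final', 91, 'sam', DATE '2020-06-01'
)
-- test and date remain as grouping columns; jane, pat, and sam become new columns
SELECT *
FROM scores
PIVOT (SUM(score) FOR student IN ('jane', 'pat', 'sam'))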

Example 2: Pivot with collapse

Continuing this example, let's say we want to average scores from the two quizzes. We can collapse the table by not including the date variable. Without that variable there would be two identical values in test (quiz) which will be collapsed into one.

Starting table:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | quiz    | 74    | jane    | 2020-04-15 |
 | quiz    | 83    | sam     | 2020-04-15 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 91    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

In this case it does matter what we select for Aggregation method, since the duplicate values for quiz will get aggregated. We select Average here.

Pivoted output table:

/*---------+------+------+------*
 | test    | jane | pat  | sam  |
 +---------+------+------+------+
 | quiz    | 78.5 | 35   | 86   |
 | midterm | 74   | 62   | 93   |
 | final   | 77   | NULL | 91   |
 *---------+------+------+------*/

Example 3: Multiple aggregation variables

We can add another variable aggregation block to gather more information about the variables being aggregated. Continuing this example, let's say we not only want to average the quiz scores, but count the number of averaged quizzes.

Starting table:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | quiz    | 74    | jane    | 2020-04-15 |
 | quiz    | 83    | sam     | 2020-04-15 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 91    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

Note that the Alias field is no longer optional since we have two aggregation blocks.

We make sure to select score for our Variable to count since we specifically want to know how many scores were included in our average. Since we have some null scores, selecting a different variable here might give us a different result.

Pivoted output table:

/*---------+---------------+--------------------+--------------+----------------*
 | test    | jane_scoreavg | jane_scorecount    | pat_scoreavg | pat_scorecount |
 +---------+---------------+--------------------+--------------+----------------+
 | quiz    | 78.5          | 2                  | 35           | 1              |
 | midterm | 74            | 1                  | 62           | 1              |
 | final   | 77            | 1                  | NULL         | 1              |
 *---------+---------------+--------------------+--------------+----------------*/

The jane_scoreavg variable contains data from the scoreavg aggregation block's method of Average for the values of student Jane, while the jane_scorecount variable contains data from the scorecount aggregation block.

Step: Join

Overview

A Join step will combine data from two tables based on a join condition so it can be queried together.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      *---------+-------*/

Example output data:

Inner join where student = student

/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 *---------+----------*-------*/

Step structure

  • There will be at least one join block where you will define a table to join and a join condition for that table.

  • When multiple blocks exist, the tables will be joined in sequence.

Field descriptions

Field
Description

Join table

The table containing the data you wish to combine with this transform's data. Once selected, this table's variables will be marked t1 (or t2, etc. if other joins exist).

Join type

The way that the tables will be combined. (More information below.)

Source variable

The variable in your source table that you want to join on. All records in this variable will be matched against all records in the Joined table variable.

Joined table variable

The variable in your selected Join table that you want to join on. All records in this variable will be matched against all records in the Source variable.

Join types

Inner join

If a row in either the source or join table doesn’t have a match, it will be dropped. If a row is matched multiple times, it will be multiplied in the output.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      *---------+-------*/

INNER JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 *---------+----------*-------*/

Left join

If a row in the source doesn’t have a match, all joined variables will be null for that row. If a row in the join table doesn’t have a match, it will be dropped. If a row is matched multiple times, it will be multiplied in the output.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
 | zay     | 2        |       | toni    | 30    |
*----------+----------*/      *---------+-------*/

LEFT JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 | zay     | 2        | NULL  |
 *---------+----------*-------*/

Right join

If a row in the source doesn’t have a match, it will be dropped. If a row in the join table doesn’t have a match, all source variables will be null for that row. If a row is matched multiple times, it will be multiplied in the output. This is the same as a left join, but reversed.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
 | zay     | 2        |       | toni    | 30    |
*----------+----------*/      *---------+-------*/

RIGHT JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 | toni    | NULL     | 30    |
 *---------+----------*-------*/

Full join

If a row in either the source or join table doesn’t have a match, the variables from the other table will be null for that row, including the variable being joined upon. If a row is matched multiple times, it will be multiplied in the output.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
 | zay     | 2        |       | toni    | 30    |
*----------+----------*/      *---------+-------*/

FULL JOIN on student

Output:
/*-----------+-----------+----------+--------*
 | student_A | student_B | absences | score  |
 +-----------+-----------+----------+--------+
 | jane      | jane      | 0        | 85     |
 | sam       | sam       | 6        | 64     |
 | pat       | pat       | 1        | 88     |
 | zay       | NULL      | 2        | NULL   |
 | NULL      | toni      | NULL     | 30     |
 *-----------+-----------+----------+--------*/

Cross join

Every row in the source table will be combined with every row in the joined table. This might be used to perform a join on a new variable that will be created downstream (such as in a geospatial join). You will almost always need to use a filter after this join for the query to successfully execute.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
*----------+----------*/      *---------+-------*/

CROSS JOIN

Output:
/*---------+----------+---------+---------*
 | student | absences | student | score   |
 +---------+----------+---------+---------+
 | jane    | 0        | jane    | 85      |
 | jane    | 0        | sam     | 64      |
 | sam     | 6        | jane    | 85      |
 | sam     | 6        | sam     | 64      |
 *---------+----------*---------+---------*/

Join structures

Depending on the structure of your data, joins might multiply rows.

1-to-1 joins

If the variable you are joining on has no duplicate values in either the source table or the joined table, each row can match at most one row in the other table, so the join will never multiply rows: your output table will have at most as many rows as both tables' row counts added together.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      *---------+-------*/

INNER JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 *---------+----------*-------*/

1-many joins

If the variable you are joining on has duplicate values in one of the tables, the matching row from the other table will be repeated for each duplicate.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      | pat     | 86    |
                              *---------+-------*/

INNER JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 | pat     | 1        | 86    |
 *---------+----------*-------*/

Many-to-many joins

If the variable you are joining on has duplicate values in both tables, every duplicate row in one table will be matched against every duplicate row in the other, multiplying rows in the output.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
 | pat     | 3        |       | pat     | 68    |
*----------+----------*/      *---------+-------*/

INNER JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 | pat     | 1        | 68    |
 | pat     | 3        | 88    |
 | pat     | 3        | 68    |
 *---------+----------*-------*/

Examples

Example 1: Simple join condition

Let's say our information about student absences and test scores is in separate tables and we want to join them together.

Starting data:

Source table (t0)             Table 1 (t1)    
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      *---------+-------*/

Input fields:

  • Join table: The data we want to join is in Table 1 so we select it here.

  • Join type: Since we only care about students that have both absence and score information we choose inner join which will drop any rows without a match in both tables.

  • Source variable / Joined table variable: We want to match on the variable student which is present in both tables. So we select our source table (t0)'s student on the left, and our joined table (t1)'s student variable on the right.

    • Since this variable has the same name in both tables, they will be combined into one variable in the output table.

Output data:

/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 *---------+----------*-------*/
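
A roughly equivalent BigQuery SQL sketch of this inner join is shown below (inline stand-in data; illustrative only, since the transform generates the query for you).

WITH absences AS (
  SELECT 'jane' AS student, 0 AS absences UNION ALL
  SELECT 'sam', 6 UNION ALL
  SELECT 'pat', 1
), scores AS (
  SELECT 'jane' AS student, 85 AS score UNION ALL
  SELECT 'sam', 64 UNION ALL
  SELECT 'pat', 88
)
-- Keep only students present in both tables
SELECT t0.student, t0.absences, t1.score
FROM absences AS t0
INNER JOIN scores AS t1
  ON t0.student = t1.student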

Example 2: More specific join condition

Let's say that instead of a table with aggregated absences we have a daily timeliness chart. We want to join each score to the corresponding attendance information, matching on both the student and the date in question.

Starting data:

Source table (t0)                        Table 1 (t1)    
/*---------+------------+---------*      /*---------+------------+-------*
 | student | date       | on_time |       | student | date       | score |
 +---------+------------+---------+       +---------+------------+-------+
 | jane    | 2020-01-01 | TRUE    |       | jane    | 2020-01-01 | 85    |
 | sam     | 2020-01-01 | FALSE   |       | sam     | 2020-01-01 | 65    |
 | jane    | 2020-02-01 | FALSE   |       *---------+------------+-------*/
*----------+------------+--------*/     

Input fields:

Join type: Since we want to keep our absence data whether a test was taken or not, we do a Left join.

Join condition: We join on fields where BOTH the values in the student and date variables are a match by selecting both in our join condition.

Output data:

/*---------+------------+---------+-------*
 | student | date       | on_time | score |
 +---------+------------+---------+-------+
 | jane    | 2020-01-01 | TRUE    | 85    |
 | sam     | 2020-01-01 | FALSE   | 65    |
 | jane    | 2020-02-01 | FALSE   | NULL  |
*----------+------------+---------+-------*/     
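
Below is a rough SQL sketch of the same left join on both student and date (inline stand-in data; illustrative only).

WITH attendance AS (
  SELECT 'jane' AS student, DATE '2020-01-01' AS date, TRUE AS on_time UNION ALL
  SELECT 'sam', DATE '2020-01-01', FALSE UNION ALL
  SELECT 'jane', DATE '2020-02-01', FALSE
), tests AS (
  SELECT 'jane' AS student, DATE '2020-01-01' AS date, 85 AS score UNION ALL
  SELECT 'sam', DATE '2020-01-01', 65
)
-- Keep every attendance row; score is NULL when no test matches both keys
SELECT t0.student, t0.date, t0.on_time, t1.score
FROM attendance AS t0
LEFT JOIN tests AS t1
  ON t0.student = t1.student AND t0.date = t1.date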

Example 3: Complex join conditions with cross joins

Starting data:

Source table (t0)                Table 1 (t1)    
/*------------+----------*      /*------------+-------*
 | student_id | absences |       | student_id | score |
 +------------+----------+       +------------+-------+
 | 01         | 0        |       | 01         | 85    |
 | 02         | 6        |       | 02         | 64    |
 | 03         | 1        |       | 03         | 88    |
*-------------+----------*/      *------------+-------*/

Input fields:

Output data:

/*------------+----------+-------*
 | student_id | absences | score |
 +------------+----------+-------+
 | 01         | 0        | 85    |
 | 02         | 6        | 64    |
 | 03         | 1        | 88    |
 *------------+----------*-------*/
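
A rough SQL sketch of this cross join + retype + filter pattern is shown below (inline stand-in data; SAFE_CAST plays the role of the retype and the WHERE clause plays the role of the filter step).

WITH t0 AS (
  SELECT 1 AS student_id, 0 AS absences UNION ALL   -- integer ids
  SELECT 2, 6 UNION ALL
  SELECT 3, 1
), t1 AS (
  SELECT '01' AS student_id, 85 AS score UNION ALL  -- string ids
  SELECT '02', 64 UNION ALL
  SELECT '03', 88
)
SELECT t0.student_id, t0.absences, t1.score
FROM t0
CROSS JOIN t1
-- Retype the joined table's id, then keep only matching pairs
WHERE t0.student_id = SAFE_CAST(t1.student_id AS INT64)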

Variable creation methods

Every method is documented here with what information the Redivis interface needs, as well as a link to the underlying BigQuery documentation for more details.

Common elements

Parameters

When creating a new variable, you'll use a variety of inputs to specify the parameters required to fully define a given method. Each parameter is one of the following types:

variable

literal

enum

boolean

Refers to values true or false.

Null

Note that this is different than if the cell contained a string with the characters NULL. In that case the cell does have contents and will be treated as such.

Format elements for dates and time methods

Some new variable methods (PARSE... and FORMAT...) allow for manipulation of Date, Time, or DateTime variable types. To do so, you'll need to define how the data is formatted using format elements.

For example, to work with dates in a mm/dd/yy format (common in the U.S.; e.g. 03/22/89 ), we would specify the format string %m/%d/%y . For a DateTime displayed as Mon Oct 17 2016 17:32:56 , we would specify %a %b %d %Y %H:%M:%S
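
As a hedged illustration, these format strings behave like the format arguments to BigQuery's parse and format functions:

SELECT
  PARSE_DATE('%m/%d/%y', '03/22/89') AS parsed_date,                                -- 1989-03-22
  PARSE_DATETIME('%a %b %d %Y %H:%M:%S', 'Mon Oct 17 2016 17:32:56') AS parsed_dt,  -- 2016-10-17 17:32:56
  FORMAT_DATE('%m/%d/%y', DATE '1989-03-22') AS formatted_date                      -- '03/22/89'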

Below is a complete list of format elements and descriptions:

Accessor elements for JSON methods

Some new variable methods (JSON extract..., JSON scalar, etc) allow for accessing subsets of the data contained within JSON-formatted strings. To do so, you'll need to define how the data is accessed using a JSONPath element.

For example, to access the firstName attribute within a JSON-formatted string { "firstName": "John", "lastName": "Doe" }, you'd specify a JSONPath element $.firstName.
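
As a hedged illustration, this is how a JSONPath element is used by the BigQuery JSON functions that back these methods:

SELECT
  JSON_VALUE('{ "firstName": "John", "lastName": "Doe" }', '$.firstName') AS first_name,  -- John
  JSON_QUERY('{ "name": { "first": "John" } }', '$.name') AS name_object                  -- {"first":"John"}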

Aggregate

Any value

ANY_VALUE(@variable)

Returns any value from the input, or NULL if there are zero input rows.

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

true

-

Average

AVG(@variable)

Returns the average of all non-null values.

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

true

-

Count

COUNT([@distinct ]@variable)

Returns the count of all non-null values.

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

false

(*)

@distinct

true

-

Logical and

LOGICAL_AND(@variable)

Returns the logical AND of all non-NULL expressions.

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

true

-

Logical or

LOGICAL_OR(@variable)

Returns the logical OR of all non-NULL expressions.

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

true

-

Max

MAX(@variable)

Returns the maximum value of all non-null inputs.

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

true

-

Min

MIN(@variable)

Returns the minimum value of all non-null inputs.

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

true

-

String aggregate

STRING_AGG([@distinct ]@variable[, @delimiter][ LIMIT @limit])

Returns a string obtained by concatenating all non-null values.

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

true

-

@delimiter

false

(",")

@distinct

true

-

@limit

false

-

Sum

SUM(@variable)

Returns the sum of all values, ignoring nulls.

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

true

-
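
As a hedged illustration of how a few of these aggregation methods behave, the sketch below groups some inline example data by student (illustrative only; the transform generates the equivalent logic for you):

WITH scores AS (
  SELECT 'jane' AS student, 83 AS score UNION ALL
  SELECT 'jane', 74 UNION ALL
  SELECT 'pat', 35
)
SELECT
  student,
  AVG(score) AS avg_score,                              -- Average
  COUNT(score) AS n_scores,                             -- Count
  MAX(score) AS max_score,                              -- Max
  STRING_AGG(CAST(score AS STRING), ', ') AS all_scores -- String aggregate
FROM scores
GROUP BY student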

Variable selection

Overview

In the bottom pane of the transform you will need to select the variables you'd like to keep in your output table. One or more variables must be kept in order to run the transform.

Usage

Click on a variable and use the center buttons to move it between the two lists. You can use command-/control- and shift-click to select multiple variables at once.

For larger variable lists, clicking the search icon will allow you to quickly filter the available variable options down based on search terms.

Distinct

You can select the Distinct box to drop any records that are an exact copy of another record across values in all variables in your output table.

Source indicators

Hovering on any variable will show a tooltip with information about where the variable came from, its type, and any assigned label.

You will also see an icon to the left of each variable in the list showing the variable's source. For variables originating in the source table, the icon will show t0; for variables from subsequently joined tables it will show t1, t2, etc. For variables created in this transform it will show v.

Selection options

Optimization and errors

Query optimization

The Redivis transform connects to a highly performant, parallelized data store. Queries on terabytes of data can complete in seconds, often utilizing the resources of thousands of compute nodes. These are some best practices that can help you increase performance.

Limit output table size

Table writes are generally much slower than table reads — if your output table is exceptionally large, it may take the querying engine several minutes to materialize the output. Try restricting the number of rows returned by applying row filters to your transforms when possible, and be cognizant of joins that may substantially increase your row count. Avoid keeping variables that aren't needed.

Reduce the number of new variables

A common anti-pattern is to construct numerous boolean CASE new variables, and then use the result of these new variables in the row filter(s). If possible, it is far more efficient to inline the CASE logic within the row filters, or within fewer new variables, as this allows for improved efficiency in logical short-circuiting.
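
As an illustrative sketch (the events data and the year and state variables below are hypothetical), the same row logic can be expressed both ways; the second form lets the engine short-circuit without computing an extra boolean for every row.

-- Less efficient: create a boolean new variable, then filter on it.
WITH events AS (SELECT 2014 AS year, 'CA' AS state UNION ALL SELECT 2018, 'CA')
SELECT *
FROM (
  SELECT *, CASE WHEN year >= 2015 AND state = 'CA' THEN TRUE ELSE FALSE END AS keep_row
  FROM events
)
WHERE keep_row;

-- More efficient: inline the same logic directly in the row filter.
WITH events AS (SELECT 2014 AS year, 'CA' AS state UNION ALL SELECT 2018, 'CA')
SELECT *
FROM events
WHERE year >= 2015 AND state = 'CA';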

Optimize join patterns

While the query optimizer can determine which table should be on which side of the join, it is still recommended to order your joined tables appropriately.

If all of your joins are INNER joins, the join order will have no impact on the final output. If your query leverages combinations of left / right / inner joins, the join order may affect your output; be careful in these cases.

Common errors

The Redivis transform will prevent you from initiating a run when any steps are invalid, and errors should be rare. However, some errors can only be detected as the query is run — in these cases the job will fail as soon as it encounters an error, logging the error message to the top of the transform.

If you come across an error message not on this list, please email support@redivis.com for further assistance.

Resources exceeded during query execution

This error occurs when a query utilizes too much memory, yet it is often easily resolvable and due to unintended behavior within the transform. This error is caused by a certain component of the query not being parallelizable, which often occurs when combining and / or ordering many distinct values. We recommend investigating the following culprits:

  1. Investigate any order clauses in your transform — either at the bottom of the transform or in a partitioned query. Often, attempts to order on hundreds of millions or billions of distinct values will fail. Note that ordering rows at the bottom of the transform does not affect your output, and is only useful in preparing your data for export.

  2. Confirm that none of your aggregation methods are creating massive cells. For example, using the String aggregate method on an exceptionally large partition can collapse and concatenate many values into one record — if the cell becomes too big, this error will be thrown.

Cast / type conversion errors

When converting between variable types, all values must be appropriately formatted for conversion to the new type. For example, the value "1,000" is not a valid integer and will throw an error when being converted from a string to an integer.

There are several options for getting around cast errors:

  1. Choose the "If invalid for type, set to null" option in your retype blocks. Note that this will set all invalid values to NULL, potentially causing unintended side effects. Use with caution.

  2. Filter out all records that have incompatible values.

  3. Create a new variable, using the Case method to convert any invalid values to something that can be appropriately cast.
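
A minimal sketch of these options, using the "1,000" example above (the amount variable and inline data are hypothetical):

WITH src AS (
  SELECT '250' AS amount UNION ALL
  SELECT '1,000'
)
SELECT
  SAFE_CAST(amount AS INT64) AS set_null_if_invalid,               -- option 1: '1,000' becomes NULL
  SAFE_CAST(REPLACE(amount, ',', '') AS INT64) AS clean_then_cast  -- option 3 (variant): clean the value, then cast
FROM src
-- option 2: drop incompatible records before casting
-- WHERE NOT REGEXP_CONTAINS(amount, r'[^0-9]')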

Maximum table size

If an output table is more than 1TB, it cannot exceed the size of the sum of all source tables, plus 10%. Very large output tables that substantially exceed their inputs are typically the result of a misconfigured join that generates a cross-product across a one-to-many or many-to-many relationship between tables. If you encounter this error, try to apply filter and aggregation steps first, and validate that your join conditions are appropriately specific.

Too many tables, views and user-defined functions for query: Max: 1000

To work around this issue, you can create a transform that simply selects all variables from the unreleased table, materializing the result in your workflow. This output table will now only count as a single table in your query, avoiding this error.

This error will also no longer be an issue once the version is released, as the table is materialized shortly after a version is released.

Other errors

If you come across any other errors or issues while using the transform please contact us directly at support@redivis.com

Value lists

Overview

Value lists allow you to persist a group of values in a list, for use throughout a workflow. You can centrally view and edit this list, and any changes to it will be pushed to all transforms where it is being used.

This may be useful in a scenario where a list of diagnosis codes is entered in multiple filters and may change over time as you build your analysis, or when you have a single parameter used in multiple places that you want to be able to quickly change everywhere.

Building lists

To create and update value lists, click the Lists button at the top left of the workflow tool's top bar.

You can create, modify, and delete lists from this interface.

For bulk importing you can copy/paste values from any standard spreadsheet program, or import them from a csv or json file. Please note that string list values should not be quoted — learn more about list types below.

You can then reference the list in transforms throughout your workflow anywhere that you could input multiple values.

Importing lists

Values can be copied into lists from any spreadsheet program by directly using the clipboard. For bulk or scripted actions, you may also import multiple lists as a JSON or CSV file. Note that any list exports will follow the same format.

CSV specification

Each variable in your CSV will represent one list, with the first row containing the name of the list. E.g.:

JSON specification

The same set of value lists can be imported via JSON, with the following format:

List types

Redivis will automatically determine the most specific type for all the values in a list, falling back to the "universal type" of string.

For example, the following combinations of values will yield the corresponding specific types, and can be compared to variables of the listed types:

Note list values should never be quoted, unless you specifically want to match against a quote character.

The way that the new variable will be created. Choosing this will bring up additional fields to complete, specific to the chosen method. See all methods here.

The variable or value which the previously chosen Variable(s) will be evaluated against. Depending on the operator selected, it may contain multiple variables, values, or a value list. Use wildcard characters (e.g. *) for bulk selection.

As a first step, we create a value list called selected_students with the values jane and pat in it. Then we can select it to compare against student.

The name that your selected source variable will be given. Note that this input needs to follow Redivis variable naming constraints.

Redivis supports the BigQuery Standard SQL syntax and features, with certain limitations. All queries must reference the SQL step's source, represented as _source_ in the query. If the SQL step is the first step in the transform, the source represents the transform's source table. Otherwise, the source represents the output of the previous step in the transform.

You can also reference any other table in the current workflow from your SQL step, following the same reference rules as those in the Redivis API. After typing the backtick character (`), available tables will be presented as autocomplete options.

You may also use value lists in your query — they can be referenced as a SQL parameter, prefixed with the @ symbol. Note that all value lists are explicitly typed, and you will need to cast the value list if there is a type mismatch.

It is strongly recommended to use a filter step with a cross join to avoid a massively expanded table.

You can use a filter step to build more complex join conditions.

In some cases you might need to do a more complex join, such as using an operator other than =, or joining on a condition that uses a variable from your joined table that needs to be retyped or created first. This is common in geospatial joins but might come up in any situation.

To use a more complex join condition you can use a cross join, followed by a filter. The cross join will make all possible combinations of rows between the two tables, and the filter will then limit the result to rows matching your condition. Effectively this will execute the same as if you had done your initial join with a more complex condition.

Let's say in the example below we want to do an inner join where t0 student_id = t1 student_id, but in the source table student_id is an integer variable and in Table 1 it is a string. We want to retype the variable in Table 1 before we can join on it, but we can't retype it until we have joined in Table 1. So we will do a cross join, then a retype, then a filter to narrow down to only records meeting our condition.

When using a Create variables step (or when creating aggregate variables in an Aggregate or Pivot step) you will need to select a method to specify how the variable will be created.

Refers to a Redivis variable, often limited to a specific subset of types. When using variable parameters in a given transform, you'll be able to select variables from the source table, variables in tables referenced by joins, or new variables created upstream of the parameter.

Refers to a string, integer, boolean, or floating point value. When using literal parameters, you'll be able to type any constant value (e.g., false, 1.0, 'test'), or use a value list to reference the same set of values anywhere in your workflow. If using a literal parameter alongside another parameter, e.g., in a filter comparison, you may have to match the type of literal with that parameter.

Refers to a set of distinct values, usually of a homogenous type. For example, if a required parameter used a "timezone" enum with 3 options (PST, MST or EST), you would have to select one of the three options (PST) for the value of that parameter.

In many places throughout the transform you will have an option in the system menu to set a cell value(s) to NULL. Conceptually this means that the cell is empty and contains no value. If you look at the cells view of a table, you will see NULL values shown in a grey color to indicate that these cells are empty.

You will see throughout the transform references to how nulls are included or excluded for steps (such as the Order step) and in calculating summary statistics.

Note that different format elements are used for the Retype step (CAST method), which are detailed elsewhere.

More information is available in BigQuery's format string documentation.


This can be useful in situations where you are using analytic methods or if your data needs to be in a particular format. Otherwise this might slow down your transform run time unnecessarily.

Some steps will add or remove variables as options for selection. Join and Merge will occasionally bring in variables from other tables. Create variables will make new variables. Aggregate will remove all variables that are not included in the collapse, but has the potential to add newly created aggregate variables.

When you are performing initial exploration, you may consider using a limit step in your transform to reduce the output table size.

Each new variable adds to the computational complexity of the query; the new variable must be computed for every row in the table.

When your query utilizes a join step, consider the order in which you are joining the data. The best practice is to place the largest table first, followed by the smallest, and then by decreasing size.

This error may occur when running queries on tables belonging to an unreleased version, particularly when these tables are made up of hundreds of independent uploads. Under the hood, these unreleased tables are represented as a logical view that stitches the various uploads together into a single table. If your query references multiple unreleased tables, with each approaching the 500 per-table upload limit, it's possible to exceed the total allowed number of tables referenced by a query.

If you update the values within a list, all transforms that reference the list will be marked as edited.

All variables in Redivis have a defined type. When using lists to filter rows, the list type needs to be the same as (or coercible to) the type of the variable being filtered on.


Format element

Description

%A

The full weekday name.

%a

The abbreviated weekday name.

%B

The full month name.

%b or %h

The abbreviated month name.

%C

The century (a year divided by 100 and truncated to an integer) as a decimal number (00-99).

%c

The date and time representation.

%D

The date in the format %m/%d/%y.

%d

The day of the month as a decimal number (01-31).

%e

The day of month as a decimal number (1-31); single digits are preceded by a space.

%F

The date in the format %Y-%m-%d.

%G

The ISO 8601 year with century as a decimal number. Each ISO year begins on the Monday before the first Thursday of the Gregorian calendar year. Note that %G and %Y may produce different results near Gregorian year boundaries, where the Gregorian year and ISO year can diverge.

%g

The ISO 8601 year without century as a decimal number (00-99). Each ISO year begins on the Monday before the first Thursday of the Gregorian calendar year. Note that %g and %y may produce different results near Gregorian year boundaries, where the Gregorian year and ISO year can diverge.

%H

The hour (24-hour clock) as a decimal number (00-23).

%I

The hour (12-hour clock) as a decimal number (01-12).

%j

The day of the year as a decimal number (001-366).

%k

The hour (24-hour clock) as a decimal number (0-23); single digits are preceded by a space.

%l

The hour (12-hour clock) as a decimal number (1-12); single digits are preceded by a space.

%M

The minute as a decimal number (00-59).

%m

The month as a decimal number (01-12).

%n

A newline character.

%P

Either am or pm.

%p

Either AM or PM.

%R

The time in the format %H:%M.

%r

The 12-hour clock time using AM/PM notation.

%S

The second as a decimal number (00-60).

%s

The number of seconds since 1970-01-01 00:00:00. Always overrides all other format elements, independent of where %s appears in the string. If multiple %s elements appear, then the last one takes precedence.

%T

The time in the format %H:%M:%S.

%t

A tab character.

%U

The week number of the year (Sunday as the first day of the week) as a decimal number (00-53).

%u

The weekday (Monday as the first day of the week) as a decimal number (1-7).

%V

The week number of the year (Monday as the first day of the week) as a decimal number (01-53). If the week containing January 1 has four or more days in the new year, then it is week 1; otherwise it is week 53 of the previous year, and the next week is week 1.

%W

The week number of the year (Monday as the first day of the week) as a decimal number (00-53).

%w

The weekday (Sunday as the first day of the week) as a decimal number (0-6).

%X

The time representation in HH:MM:SS format.

%x

The date representation in MM/DD/YY format.

%Y

The year with century as a decimal number.

%y

The year without century as a decimal number (00-99), with an optional leading zero. Can be mixed with %C. If %C is not specified, years 00-68 are 2000s, while years 69-99 are 1900s.

%%

A single % character.

%E#S

Seconds with # digits of fractional precision.

%E*S

Seconds with full fractional precision (a literal '*').

%E4Y

Four-character years (0001 ... 9999). Note that %Y produces as many characters as it takes to fully render the year.


Fruits,Vegetables,Item_codes
Strawberry,Broccoli,1
Tomato,Asparagus,2
Apple,Spinach,3

[
    { 
        "name": "Fruits",
        "values": ["Strawberry", "Tomato", "Apple"]
    },
    { 
        "name": "Vegetables",
        "values": ["Broccoli", "Asparagus", "Spinach"]
    },
    { 
        "name": "Item_codes",
        "values": ["1", "2", "3"]
        // This is identical:
        // "values": [1, 2, 3]
    }
]

Values

Type

Can compare to

1, 2, 3

Integer

Integer, Float, String

1, 2.0, 3

Integer

Integer, Float, String

1, 1.0, 1.1

Float

Float, String

1, 2.0, a

String

String

1971-01-01, 1972-02-02, 1973-03-03

Date

Date, String

1971-01-01, a, b

String

String

1970-01-01 12:00:00, 1970-01-01 12:30:00

DateTime

DateTime, Date, String



Step: Retype

Overview

The Retype step converts a variable of a given type to another type.

Example starting data

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83     |
 | neal    | 35     |
 | sam     | 74     |
 | pat     | 62     |
 *---------+--------*/

Example output data:

Retype score from integer to float.

/*---------+--------*
 | student | score  |
 +---------+--------+
 | jane    | 83.0   |
 | neal    | 35.0   |
 | sam     | 74.0   |
 | pat     | 62.0   |
 *---------+--------*/

Step structure

  • There will be at least one retype block where you will define a variable and a new type.

Field definitions

Field
Definition

Source variable

The variable that you want to retype. Note that you can see its current type by locating it in the variable selector at the bottom of the page, or hovering over the variable in this selection menu to see a tooltip with more information.

New type

If invalid for type, set to null

Whether failing type conversions should be converted to a null value. By default, failed type conversions will throw an error. Note that this might significantly change the content of your data and we suggest using this option with full understanding of how it might affect your outcome and verifying the output results.

Specify custom format

Examples

Example 1: Basic usage

We can convert score data currently stored as an integer type into a float to use in another formula elsewhere that only accepts the float type.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Source variable: The variable we want to convert is score so we select it here.

  • New type: We want this to be a float, and since the score variable is currently an integer it is compatible with conversion to the float type, so we can select Float here.

  • If invalid for type, set to null: Since there are no incompatible values in this variable it doesn't matter what we choose. We leave it unchecked to validate that we understand our data and to confirm that this transform will execute without failing.

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83.0  | jane    | 2020-04-01 |
 | quiz    | 35.0  | pat     | 2020-04-01 |
 | quiz    | 89.0  | sam     | 2020-04-01 |
 | midterm | 74.0  | jane    | 2020-05-01 |
 | midterm | 62.0  | pat     | 2020-05-01 |
 | midterm | 93.0  | sam     | 2020-05-01 |
 | final   | 77.0  | jane    | 2020-06-01 |
 | final   | 59.0  | pat     | 2020-06-01 |
 | final   | 92.0  | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Our output table looks as it should when we look at the cells in the output table, and we can confirm the new type by clicking on the variable in the output table to check the type.

Example 2: Handling invalid conversions

Let's say instead that in our initial data, the score variable was stored as a string. Converting this to a float would be a bit trickier since the data entry wasn't as clean.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35%   | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Input fields:

  • Source variable: Same as above example.

  • New type: Same as above example.

  • If invalid for type, set to null: Since this data has 35% as a value, this can't be converted to a float. If we leave this box unchecked our transform will fail. Checking it will set that value to null.

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83.0  | jane    | 2020-04-01 |
 | quiz    | null  | pat     | 2020-04-01 |
 | quiz    | 89.0  | sam     | 2020-04-01 |
 | midterm | 74.0  | jane    | 2020-05-01 |
 | midterm | 62.0  | pat     | 2020-05-01 |
 | midterm | 93.0  | sam     | 2020-05-01 |
 | final   | 77.0  | jane    | 2020-06-01 |
 | final   | 59.0  | pat     | 2020-06-01 |
 | final   | 92.0  | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/

Note that while this retype was successful, our data might not be what we want, and in this case removes information.
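
For context, the "If invalid for type, set to null" option behaves roughly like BigQuery's SAFE_CAST, which returns NULL instead of raising an error:

SELECT SAFE_CAST('35%' AS FLOAT64) AS score   -- NULL rather than a failed run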

Example 3: Parsing dates

Let's say we want to convert our date variable, which is currently a string, to a Date variable type, but the starting format does not cleanly translate.

Starting data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 04/01/2020 |
 | quiz    | 35    | pat     | 04/01/2020 |
 | quiz    | 89    | sam     | 04/01/2020 |
 | midterm | 74    | jane    | 05/01/2020 |
 | midterm | 62    | pat     | 05/01/2020 |
 | midterm | 93    | sam     | 05/01/2020 |
 | final   | 77    | jane    | 06/01/2020 |
 | final   | 59    | pat     | 06/01/2020 |
 | final   | 92    | sam     | 06/01/2020 |
 *---------+-------+---------+------------*/

Input fields:

  • Source variable: The variable we want to convert is date so we select it here.

  • New type: We want this to be a date so we can select Date here.

  • If invalid for type, set to null: Since there are no incompatible values in this variable it doesn't matter what we choose. We leave it unchecked to validate that we understand our data and to confirm that this transform will execute without failing.

  • Specify custom format: Since our data does not fit the standard format (%Y-%m-%d, e.g. 2020-10-01) we need to specify what format it is in. We can use the reference table at the bottom of this page to specify our format: MM/DD/YYYY

Output data:

/*---------+-------+---------+------------*
 | test    | score | student | date       |
 +---------+-------+---------+------------+
 | quiz    | 83    | jane    | 2020-04-01 |
 | quiz    | 35    | pat     | 2020-04-01 |
 | quiz    | 89    | sam     | 2020-04-01 |
 | midterm | 74    | jane    | 2020-05-01 |
 | midterm | 62    | pat     | 2020-05-01 |
 | midterm | 93    | sam     | 2020-05-01 |
 | final   | 77    | jane    | 2020-06-01 |
 | final   | 59    | pat     | 2020-06-01 |
 | final   | 92    | sam     | 2020-06-01 |
 *---------+-------+---------+------------*/
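
For reference, the custom format in this example corresponds roughly to a CAST with a FORMAT clause in BigQuery SQL (illustrative only; the Retype step builds the equivalent for you):

SELECT CAST('04/01/2020' AS DATE FORMAT 'MM/DD/YYYY') AS date   -- 2020-04-01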

Reference: Type conversion

Starting type
Possible destinations
Notes

String

Integer Float Boolean Date DateTime Time Geography

To integer: A hex string can be cast to an integer. For example, 0x123 to 291 or -0x123 to -291. To float: Returns x as a floating point value, interpreting it as having the same form as a valid floating point literal. To boolean: Returns TRUE if x is "true" and FALSE if x is "false". All other values of x are invalid and throw an error instead of casting to a boolean. A string is case-insensitive when converting to a boolean. To date, dateTime, or time: Uses the canonical format by default (see information below)

Integer

String

Float Boolean

To float: Returns a close but potentially not exact floating point value.

To boolean: Returns FALSE if x is 0, TRUE otherwise.

Float

String Integer

To integer: Returns the closest integer value. Halfway cases such as 1.5 or -0.5 round away from zero.

Boolean

String Integer

To string: Returns true if x is true, false otherwise. To integer: Returns 1 if x is true, 0 otherwise.

Date

String

DateTime

String Date Time

To date, dateTime, or time: Uses the canonical format by default (see information below)

Time

String Date DateTime

To date, dateTime, or time: Uses the canonical format by default (see information below)

Geography

String

Reference: Canonical representation

When retyping between a String type variable and a Date, DateTime, or Time type variable it is presumed that the data will be in the format below.

Layout
Example

Date

(Four digit year)- (1 or 2 digit month)- (1 or 2 digit date)

2023-01-01 2023-1-1

Time

(1 or 2 digit hour): (1 or 2 digit minute): (1 or 2 digit second). (Up to 6 fractional seconds)

01:01:01.123456 6:2:9 22:19:3

DateTime

(Date specification) (space or T or t) (Time specification)

2023-01-01 01:01:01.123456 2023-1-1T6:2:9

If it is not in this canonical format you can click the Specify custom format field and use Format elements (below) to indicate otherwise.

Reference: Format elements

Since Date, DateTime, and Time variable types contain structured information you can use format strings to indicate how you want different pieces of date and time information translated to and from string format when retyping.

For example when converting a date to a string you can choose whether it will become JAN 1 2023 or 2023-01-01. When converting from a string to a DateTime you'll need to outline how the information is structured in your data so it can be read in correctly.

You can use these elements in the Specify custom format field.

Element
Return

YYYY

Four (or more) digit year

Input: 2023-01-01 Output: 2023

Input: 23-01-01 Output: 0023

Input: 20000-01-01 Output: 20000

YYY

Last three digit year

Input: 2023-01-01 Output: 023

Input: 23-01-01 Output: 023

YY

Two digit year

Input: 2023-01-01 Output: 23

Input: 2-01-30 Output: 02

Y

Last one digit of year

Input: 2023-01-01 Output: 3

MM

Two digit month

Input: 2023-01-01 Output: 01

MON

Three character month: JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, DEC

Input: 2023-01-01 Output: JAN

MONTH

Month name

Input: 2023-01-01 Output: JANUARY

DDD

Three digit day of the year

Input: 2023-01-01 Output: 001

DD

Two digit day of the month

Input: 2023-01-01 Output: 01

D

Day of the week (1-7) with Sunday being 1

Input: 2023-01-01 Output: 1

DAY

Day of the week. Spaces are padded on the right side to make the output size exactly 9.

Input: 2023-01-01 Output: SUNDAY

DY

Three character day: MON, TUE, WED, THU, FRI, SAT, SUN

Input: 2023-01-01 Output: SUN

HH

Two digit hour of the day (valid values from 00 to 12)

Input: 20:10:15 Output: 08

HH12

Hour of the day (valid values from 00 to 12)

Input: 20:10:15 Output: 08

HH24

Two digit hour (valid values from 00 to 24)

Input: 20:10:15 Output: 20

MI

Two digit minute

Input: 20:10:15 Output: 10

SS

Two digit second

Input: 20:10:15 Output: 15

SSSSS

Five digit second

Input: 20:10:15 Output: 72615

FFn

(Replace n with a value from 1 to 9. For example, FF5.)

Fractional part of the second, n digits long. The fractional part of the second is rounded to fit the size of the output.

FF1

Input: 20:10:15 Output: 1

FF2 Input: 20:10:15 Output: 15

FF3 Input: 20:10:15 Output: 015

A.M. or AM P.M. or PM

A.M. (or AM) if the time is less than 12, otherwise P.M. (or PM). The letter case of the output is determined by the first letter case of the format element.

AM

Input: 09:10:15 Output: AM

A.M. Input: 20:10:15 Output: P.M.

PM Input: 09:10:15 Output: AM

PM Input: 20:10:15 Output: PM

TZH

Hour offset for a time zone. This includes the +/- sign and 2-digit hour.

Input: 2008-12-25 05:30:00+00 Output: −08

TZM

Minute offset for a time zone. This includes only the 2-digit minute.

Input: 2008-12-25 05:30:00+00 Output: 00

A space

Produces a space in the output

Input: (space) Output: (space)

- . / , ' ; :

Same character in the output

Input: -./,';: Output: -./,';:

"Text"

Output is the value within the double quotes. To preserve a double quote or backslash character, use the \" or \\ escape sequence. Other escape sequences are not supported.

Input: "abc" Output: abc

Input: "a\"b\\c" Output: a"b\c

Case (if/else)

CASE
  WHEN @condition THEN @result
  [ ... ]
  [ ELSE @else_result ]
  END
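For example, a filled-in expression (the variable and category names here are purely illustrative) could assign an age group:

  CASE
    WHEN age >= 65 THEN 'senior'
    WHEN age >= 18 THEN 'adult'
    ELSE 'minor'
    END

Conditions are evaluated in order, and the first one that evaluates to true determines the result.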

Return type

dynamic (input-dependent)

Parameters

JSON

JSON query

JSON_QUERY(@expression, @expression_2)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

true

-

@expression_2

true

-

JSON value

JSON_VALUE(@expression, @expression_2)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

true

-

@expression_2

false

($)
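For example, with a hypothetical JSON string, JSON_VALUE returns a single scalar with its quotes removed, while JSON_QUERY returns a JSON-formatted fragment:

  JSON_VALUE('{"name": "Jane", "age": 32}', '$.name')   -- Jane
  JSON_QUERY('{"name": "Jane", "age": 32}', '$.name')   -- "Jane" (keeps the JSON quotes)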

DateTime

Current DateTime

CURRENT_DATETIME([@time_zone])

Return type

dateTime

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@time_zone

any of: ACDT, ACST, ACT, ACT, ACWST, ADT, AEDT, AEST, AFT, AKDT, AKST, AMST, AMT, AMT, ART, AST, AST, AWST, AZOST, AZOT, AZT, BDT, BIOT, BIT, BOT, BRST, BRT, BST, BST, BST, BTT, CAT, CCT, CDT, CDT, CEST, CET, CHADT, CHAST, CHOT, CHOST, CHST, CHUT, CIST, CIT, CKT, CLST, CLT, COST, COT, CST, CST, CST, CT, CVT, CWST, CXT, DAVT, DDUT, DFT, EASST, EAST, EAT, ECT, ECT, EDT, EEST, EET, EGST, EGT, EIT, EST, FET, FJT, FKST, FKT, FNT, GALT, GAMT, GET, GFT, GILT, GIT, GMT, GST, GST, GYT, HDT, HAEC, HST, HKT, HMT, HOVST, HOVT, ICT, IDLW, IDT, IOT, IRDT, IRKT, IRST, IST, IST, IST, JST, KALT, KGT, KOST, KRAT, KST, LHST, LHST, LINT, MAGT, MART, MAWT, MDT, MET, MEST, MHT, MIST, MIT, MMT, MSK, MST, MST, MUT, MVT, MYT, NCT, NDT, NFT, NPT, NST, NT, NUT, NZDT, NZST, OMST, ORAT, PDT, PET, PETT, PGT, PHOT, PHT, PKT, PMDT, PMST, PONT, PST, PST, PYST, PYT, RET, ROTT, SAKT, SAMT, SAST, SBT, SCT, SDT, SGT, SLST, SRET, SRT, SST, SST, SYOT, TAHT, THA, TFT, TJT, TKT, TLT, TMT, TRT, TOT, TVT, ULAST, ULAT, UTC, UYST, UYT, UZT, VET, VLAT, VOLT, VOST, VUT, WAKT, WAST, WAT, WEST, WET, WIT, WST, YAKT, YEKT

false

(UTC (default))

New DateTime

DATETIME(@year, @month, @day, @hour, @minute, @second)

Return type

dateTime

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@year

true

-

@month

true

-

@day

true

-

@hour

true

-

@minute

true

-

@second

true

-

DateTime add

DATETIME_ADD(@dateTime_expression, INTERVAL @integer_expression @dateTime_part)

Return type

dateTime

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@integer_expression

true

-

@dateTime_part

any of: year, quarter, month, week, day, hour, minute, second, millisecond, microsecond

true

((e.g., days))

@dateTime_expression

true

-
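For example (visit_datetime is a hypothetical dateTime variable), to shift a timestamp forward by 30 days:

  DATETIME_ADD(visit_datetime, INTERVAL 30 day)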

DateTime diff

DATETIME_DIFF(@dateTime_expression, @dateTime_expression_2, @dateTime_part)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@dateTime_expression_2

true

-

@dateTime_expression

true

-

@dateTime_part

any of: year, quarter, month, week, day, hour, minute, second, millisecond, microsecond

true

((e.g., days))
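For example (both variables are hypothetical), to compute the number of hours between two timestamps, with the second argument subtracted from the first:

  DATETIME_DIFF(discharge_datetime, admission_datetime, hour)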

DateTime subtract

DATETIME_SUB(@dateTime_expression, INTERVAL @integer_expression @dateTime_part)

Return type

dateTime

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@integer_expression

true

-

@dateTime_part

any of: year, quarter, month, week, day, hour, minute, second, millisecond, microsecond

true

((e.g., days))

@dateTime_expression

true

-

DateTime truncate

DATETIME_TRUNC(@dateTime_expression, @dateTime_truncate_part)

Return type

dateTime

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@dateTime_expression

true

-

@dateTime_truncate_part

any of: hour, minute, second, millisecond, microsecond, day, month, quarter, year, week(sunday), week(monday), week(tuesday), week(wednesday), week(thursday), week(friday), week(saturday), ISOquarter, ISOyear

true

((e.g., minutes))

DateTime extract

EXTRACT(@dateTime_part FROM @date_expression)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@dateTime_part

any of: YEAR, QUARTER, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND, DAYOFYEAR, DAYOFWEEK, ISOYEAR, ISOWEEK

true

((e.g., hour))

@date_expression

true

-
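For example, to pull just the hour out of a hypothetical event_datetime variable:

  EXTRACT(HOUR FROM event_datetime)   -- 20 for 2023-01-01 20:10:15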

Format dateTime

FORMAT_DATETIME(@format_string, @dateTime_expression)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@dateTime_expression

true

-

@format_string

true

(e.g., %Y-%m-%d %H:%M:%S)

Parse dateTime

[@safe]PARSE_DATETIME(@format_string, @dateTime_string)

Return type

dateTime

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@dateTime_string

true

-

@format_string

true

(e.g., %Y-%m-%d %H:%M:%S)

@safe

true

-
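For example, parsing a hypothetical string variable whose values look like 2023-01-01 20:10:15:

  PARSE_DATETIME('%Y-%m-%d %H:%M:%S', raw_timestamp)

Note that this method uses the %-style format strings shown in the placeholder, not the retype format elements described earlier.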

Navigation

First value

FIRST_VALUE(@variable[ @ignore_null NULLS])

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

true

-

@ignore_null

any of: IGNORE, RESPECT

false

(Ignore nulls (default))

Lag

LAG(@variable, @literal[, @default_value])

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

true

-

@literal

false

(1)

@default_value

false

(NULL)
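For example (blood_pressure is a hypothetical variable), to bring forward the value from the previous row, substituting 0 when no previous row exists:

  LAG(blood_pressure, 1, 0)

The row ordering (and any partitioning) comes from the analytic window this method runs over.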

Last value

LAST_VALUE(@variable[ @ignore_null NULLS])

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

true

-

@ignore_null

any of: IGNORE, RESPECT

false

(Ignore nulls (default))

Lead

LEAD(@variable, @literal[, @default_value])

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

true

-

@literal

false

(1)

@default_value

false

(NULL)

Nth value

NTH_VALUE(@variable, @literal[ @ignore_null NULLS])

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

true

-

@literal

true

-

@ignore_null

any of: IGNORE, RESPECT

false

(Ignore nulls (default))

Percentile (continuous)

PERCENTILE_CONT(@variable, @literal[ @ignore_null NULLS])

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

true

(variable)

@literal

true

(Value between [0, 1])

@ignore_null

any of: IGNORE, RESPECT

false

(Ignore nulls (default))
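For example (income is a hypothetical variable), the median within each window partition, with linear interpolation:

  PERCENTILE_CONT(income, 0.5)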

Percentile (discrete)

PERCENTILE_DISC(@variable, @literal[ @ignore_null NULLS])

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

true

(variable)

@literal

true

(Value between [0, 1])

@ignore_null

any of: IGNORE, RESPECT

false

(Ignore nulls (default))

Date

Current date

CURRENT_DATE([@time_zone])

Return type

date

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@time_zone

any of: ACDT, ACST, ACT, ACT, ACWST, ADT, AEDT, AEST, AFT, AKDT, AKST, AMST, AMT, AMT, ART, AST, AST, AWST, AZOST, AZOT, AZT, BDT, BIOT, BIT, BOT, BRST, BRT, BST, BST, BST, BTT, CAT, CCT, CDT, CDT, CEST, CET, CHADT, CHAST, CHOT, CHOST, CHST, CHUT, CIST, CIT, CKT, CLST, CLT, COST, COT, CST, CST, CST, CT, CVT, CWST, CXT, DAVT, DDUT, DFT, EASST, EAST, EAT, ECT, ECT, EDT, EEST, EET, EGST, EGT, EIT, EST, FET, FJT, FKST, FKT, FNT, GALT, GAMT, GET, GFT, GILT, GIT, GMT, GST, GST, GYT, HDT, HAEC, HST, HKT, HMT, HOVST, HOVT, ICT, IDLW, IDT, IOT, IRDT, IRKT, IRST, IST, IST, IST, JST, KALT, KGT, KOST, KRAT, KST, LHST, LHST, LINT, MAGT, MART, MAWT, MDT, MET, MEST, MHT, MIST, MIT, MMT, MSK, MST, MST, MUT, MVT, MYT, NCT, NDT, NFT, NPT, NST, NT, NUT, NZDT, NZST, OMST, ORAT, PDT, PET, PETT, PGT, PHOT, PHT, PKT, PMDT, PMST, PONT, PST, PST, PYST, PYT, RET, ROTT, SAKT, SAMT, SAST, SBT, SCT, SDT, SGT, SLST, SRET, SRT, SST, SST, SYOT, TAHT, THA, TFT, TJT, TKT, TLT, TMT, TRT, TOT, TVT, ULAST, ULAT, UTC, UYST, UYT, UZT, VET, VLAT, VOLT, VOST, VUT, WAKT, WAST, WAT, WEST, WET, WIT, WST, YAKT, YEKT

false

(UTC (default))

New date

DATE(@year, @month, @day)

Return type

date

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@year

true

-

@month

true

-

@day

true

-

Date add

DATE_ADD(@date_expression, INTERVAL @integer_expression @date_part)

Return type

date

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@integer_expression

true

-

@date_part

any of: year, month, quarter, week, day

true

((e.g., days))

@date_expression

true

-

Date diff

DATE_DIFF(@date_expression, @date_expression_2, @date_part)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@date_expression_2

true

-

@date_expression

true

-

@date_part

any of: day, week, month, quarter, year

true

((e.g., days))
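For example (both variables are hypothetical), the number of days from start_date to end_date:

  DATE_DIFF(end_date, start_date, day)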

Date subtract

DATE_SUB(@date_expression, INTERVAL @integer_expression @date_part)

Return type

date

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@integer_expression

true

-

@date_part

any of: day, week, month, quarter, year

true

((e.g., days))

@date_expression

true

-

Date truncate

DATE_TRUNC(@date_expression, @date_truncate_part)

Return type

date

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@date_expression

true

-

@date_truncate_part

any of: day, month, quarter, year, week(sunday), week(monday), week(tuesday), week(wednesday), week(thursday), week(friday), week(saturday), ISOquarter, ISOyear

true

((e.g., month))

Date extract

EXTRACT(@date_part FROM @date_expression)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@date_part

any of: DAYOFWEEK, DAY, DAYOFYEAR, WEEK, MONTH, QUARTER, YEAR, ISOWEEK, ISOYEAR

true

((e.g., month))

@date_expression

true

-

Format date

FORMAT_DATE(@format_string, @date_expression)

Return type

string

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@date_expression

true

-

@format_string

true

(e.g., %Y-%m-%d)
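For example, to turn a hypothetical enrollment_date into a year-month string such as 2023-01:

  FORMAT_DATE('%Y-%m', enrollment_date)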

Parse date

[@safe]PARSE_DATE(@format_string, @date_string)

Return type

date

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@date_string

true

-

@format_string

true

(e.g., %Y-%m-%d)

@safe

true

-

Math

Absolute value

ABS(@variable)

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

true

-

Arithmetic

(@expression @operator @expression_2)

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

true

-

@operator

any of: +, -, *, /

true

(E.g., +)

@expression_2

true

-

Ceiling

CEILING(@variable)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

true

-

Integer divide

DIV(@expression, @expression_2)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

true

-

@expression_2

true

-

e ^ x

EXP(@expression)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

true

-

Floor

FLOOR(@variable)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

true

-

Greatest

GREATEST(@expression)

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

true

-

Is infinity

IS_INF(@variable)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

true

-

Is NaN

IS_NAN(@variable)

Return type

boolean

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

true

-

Least

LEAST(@expression)

Return type

dynamic (input-dependent)

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

true

-

Log

LOG(@expression[, @literal])

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

true

-

@literal

false

-

Mod

MOD(@expression, @expression_2)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

true

-

@expression_2

true

-

Power

POW(@expression, @expression_2)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

true

-

@expression_2

true

-

Random

RAND()

Return type

float

Parameters

Round

ROUND(@expression[, @literal])

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

true

-

@literal

false

-

Safe divide

SAFE_DIVIDE(@expression, @expression_2)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

true

-

@expression_2

true

-
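For example (both variables are hypothetical), dividing without risking a division-by-zero error:

  SAFE_DIVIDE(total_cost, visit_count)   -- NULL when visit_count is 0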

Sign

SIGN(@variable)

Return type

integer

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@variable

true

-

Sqrt

SQRT(@expression)

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

true

-

Truncate

TRUNC(@expression[, @literal])

Return type

float

Parameters

Name
Type
Allowed values
Required
Placeholder (in UI)

@expression

true

-

@literal

false

-

Geography

Angle

Return type

float

Parameters

Area

Return type

float

Parameters

As GeoJSON

Return type

string

Parameters

As text

Return type

string

Parameters

Azimuth

Return type

float

Parameters

Boundary

Return type

geography

Parameters

Buffer

Return type

geography

Parameters

Buffer with tolerance

Return type

geography

Parameters

Centroid

Return type

geography

Parameters

Closest point

Return type

geography

Parameters

Contains

Return type

boolean

Parameters

Convex hull

Return type

geography

Parameters

Covered by

Return type

boolean

Parameters

Covers

Return type

boolean

Parameters

Difference

Return type

geography

Parameters

Dimension

Return type

integer

Parameters

Disjoint

Return type

boolean

Parameters

Distance

Return type

float

Parameters

DWithin

Return type

boolean

Parameters

Endpoint

Return type

geography

Parameters

Equals

Return type

boolean

Parameters

Exterior ring

Return type

geography

Parameters

Geo from

Return type

geography

Parameters

Geo from GeoJSON

Return type

geography

Parameters

Geo from text

Return type

geography

Parameters

Geo from wkb

Return type

geography

Parameters

Geogpoint

Return type

geography

Parameters

Geogpoint from Geohash

Return type

geography

Parameters

Geohash

Return type

string

Parameters

Geometry type

Return type

string

Parameters

Intersection

Return type

geography

Parameters

Intersects

Return type

boolean

Parameters

Intersects box

Return type

boolean

Parameters

Is collection

Return type

boolean

Parameters

Is empty

Return type

boolean

Parameters

Length

Return type

float

Parameters

Make line

Return type

geography

Parameters

Make polygon

Return type

geography

Parameters

Max distance

Return type

float

Parameters

Num geometries

Return type

integer

Parameters

Num points

Return type

integer

Parameters

Perimeter

Return type

float

Parameters

Point N

Return type

geography

Parameters

Simplify

Return type

geography

Parameters

Snap to grid

Return type

geography

Parameters

Start point

Return type

geography

Parameters

Touches

Return type

boolean

Parameters

Union

Return type

geography

Parameters

Within

Return type

boolean

Parameters

X Min

Return type

float

Parameters

X Max

Return type

float

Parameters

Point X

Return type

float

Parameters

Point Y

Return type

float

Parameters

Y Min

Return type

float

Parameters

Y Max

Return type

float

Parameters

Centroid aggregate

Return type

geography

Parameters

Geo aggregate

Return type

geography

Parameters

DBSCAN clustering

Return type

integer

Parameters

any

any ,

any

any

any

any

any

any

any

any

any

any

any ,

The type that the Source variable will be converted to. Can be any of the supported variable types on Redivis; see the conversion rules in this reference.

Informs how the data will be read by the method. Only relevant for conversions of strings to date, time, and dateTime. See the format elements reference above.

Note that you can find more information on Redivis variable types elsewhere in this documentation.

These format elements will only work in the Retype step (or CAST method in SQL). Format elements for other date parsing or formatting methods are detailed elsewhere and might be useful if your data is not coercible using the format elements described here.

This common method utilizes if-then-else logic, assigning a result when the corresponding condition evaluates to true, otherwise assigning the final ("else") result –>

Name
Type
Allowed values
Required
Placeholder (in UI)

@condition

(nested) conditions – like those used in a

any

true

-

@result

or

any

true

(Variable or value)

@else_result

or

any

true

(Variable or value)

Extracts a JSON value and converts it to a SQL JSON-formatted STRING. –>

or

any

any

Extracts a JSON scalar value and converts it to a SQL JSON-formatted STRING, removing outermost quotes and un-escaping return values –>

or

any

any

Returns the current dateTime –>

Constructs a dateTime from a year, month, day, hour, minute, and second –>

or

any

or

any

or

any

or

any

or

any

or

any

Add a period of time to a dateTime –>

or

any

or

any

Calculate distance between two dateTimes –>

or

any

or

any

Subtract a period of time from a dateTime –>

or

any

or

any

Truncates a DateTime to the nearest boundary –>

or

any

Extracts the date or time part (e.g., hour) from a DateTime –>

or

any

Returns a formatted string from a dateTime –>

or

any

any

More details about format strings

Parses a dateTime from a string –>

or

any

any

any

More details about format strings

Returns the value of a variable for the first row in a given analytic window. –>

any

Returns the value of a variable on a preceding row within the analytic window –>

any

any

any

Returns the value of a variable for the last row in a given analytic window. –>

any

Returns the value of a variable on a subsequent row within the analytic window –>

any

any

any

Returns the value at the Nth row of a given window frame. –>

any

any

Computes the specified percentile value for a variable within an ordered partition, with linear interpolation. –>

any ,

any

Computes the specified percentile value for a variable within an ordered partition. Returns the first sorted value with cumulative distribution greater than or equal to the percentile. –>

any ,

any

Returns the current date –>

Constructs a date from a year, month, and day –>

or

any

or

any

or

any

Add a period of time to a date –>

or

any

or

any

Calculate the distance between two dates –>

or

any

or

any

Subtract a period of time from a date –>

or

any

or

any

Truncates a date to the nearest boundary –>

or

any

Extracts the date part (e.g., month) from a date –>

or

any , ,

Returns a formatted string from a date –>

or

any

any

More details about format strings

Parses a date from a string –>

or

any

any

any

More details about format strings

Returns the absolute value of a variable –>

any ,

Compute simple arithmetic (+, -, *, /) –>

or

any ,

or

any ,

Returns the smallest integral value that is not less than the provided value –>

any

Divide two integer values, rounding down any remainder –>

or

any

or

any

Compute the natural exponential of a value –>

or

any ,

Returns the largest integral value that is not greater than the provided value –>

any

Find the largest of several values –>

s or s

any , , , , , ,

Return true if the value is positive or negative infinity, false otherwise. Returns NULL for NULL inputs –>

any ,

Determines whether input value is not a number (NaN) –>

any ,

Find the smallest of several values –>

s or s

any , , , , , ,

Compute the logarithm of a value to a provided base; generates an error if the variable is <= 0. If no base is provided, defaults to natural log –>

or

any ,

any ,

Modulo: compute the remainder of the division of two integers –>

or

any

or

any

Raises a value to a power –>

or

any ,

or

any ,

Generate a pseudo-random float between [0, 1) –>

Rounds a value to the nearest integer (or, if specified, to the provided number of decimal places) –>

or

any ,

any

Equivalent to the division operator, but returns null if an error occurs, such as a division by zero error –>

or

any ,

or

any ,

Returns the sign (-1, 0, +1) of a numeric variable –>

any ,

Compute the square root of a value; generates an error if the variable is less than 0 –>

or

any ,

Similar to round, but rounds to the nearest integer whose absolute value is not greater than the absolute value of the provided variable (always rounds towards zero) –>

or

any ,

any

Takes three point GEOGRAPHY values, which represent two intersecting lines. Returns the angle between these lines. Point 2 and point 1 represent the first line and point 2 and point 3 represent the second line. The angle between these lines is in radians, in the range [0, 2pi). The angle is measured clockwise from the first line to the second line. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns the area in square meters covered by the polygons in the input GEOGRAPHY –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns the RFC 7946 compliant GeoJSON representation of the input GEOGRAPHY –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns the WKT representation of an input GEOGRAPHY –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Takes two point GEOGRAPHY values, and returns the azimuth of the line segment formed by points 1 and 2. The azimuth is the angle in radians measured between the line from point 1 facing true North to the line segment from point 1 to point 2. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns a single GEOGRAPHY that contains the union of the boundaries of each component in the given input GEOGRAPHY. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns a GEOGRAPHY that represents the buffer around the input GEOGRAPHY. This function is similar to ST_BUFFERWITHTOLERANCE, but you specify the number of segments instead of providing tolerance to determine how much the resulting geography can deviate from the ideal buffer radius. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns a GEOGRAPHY that represents the buffer around the input GEOGRAPHY. This function is similar to ST_BUFFER, but you provide tolerance instead of segments to determine how much the resulting geography can deviate from the ideal buffer radius. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns the centroid of the input GEOGRAPHY as a single point GEOGRAPHY. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns a GEOGRAPHY containing a point on Geography 1 with the smallest possible distance to Geography 2. This implies that the distance between the point returned by ST_CLOSESTPOINT and Geography 2 is less than or equal to the distance between any other point on Geography 1 and Geography 2. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns TRUE if no point of Geography 2 is outside Geography 1, and the interiors intersect; returns FALSE otherwise. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns the convex hull for the input GEOGRAPHY. The convex hull is the smallest convex GEOGRAPHY that covers the input. A GEOGRAPHY is convex if for every pair of points in the GEOGRAPHY, the geodesic edge connecting the points are also contained in the same GEOGRAPHY. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns FALSE if Geography 1 or Geography 2 is empty. Returns TRUE if no points of Geography 1 lie in the exterior of Geography 2. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns FALSE if Geography 1 or Geography 2 is empty. Returns TRUE if no points of Geography 2 lie in the exterior of Geography 1. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns a GEOGRAPHY that represents the point set difference of Geography 1 and Geography 2. Therefore, the result consists of the part of Geography 1 that does not intersect with Geography 2. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns the dimension of the highest-dimensional element in the input GEOGRAPHY. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns TRUE if the intersection of Geography 1 and Geography 2 is empty, that is, no point in Geography 1 also appears in Geography 2. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns the shortest distance in meters between two non-empty GEOGRAPHYs. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns TRUE if the distance between at least one point in Geography 1 and one point in Geography 2 is less than or equal to the Distance argument, otherwise, returns FALSE –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns the last point of a linestring geography as a point geography. Returns an error if the input is not a linestring or if the input is empty. Use the SAFE prefix to obtain NULL for invalid input instead of an error. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns TRUE if Geography 1 and Geography 2 represent the same GEOGRAPHY value. More precisely, this means that one of the following conditions holds: + ST_COVERS(geography, geography_2) = TRUE and ST_COVERS(geography_2, geography) = TRUE + Both Geography 1 and Geography 2 are empty. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns a linestring geography that corresponds to the outermost ring of a polygon geography. If the input geography is a polygon, gets the outermost ring of the polygon geography and returns the corresponding linestring. If the input is the full GEOGRAPHY, returns an empty geography. Returns an error if the input is not a single polygon. Use the SAFE prefix to obtain NULL for invalid input instead of an error. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Converts an expression for a STRING or BYTES value into a GEOGRAPHY value. If expression represents a STRING value, it must be a valid GEOGRAPHY representation in one of the following formats: WKT, WKB, GeoJSON –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns a GEOGRAPHY value that corresponds to the input GeoJSON representation. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns a GEOGRAPHY value that corresponds to the input WKT representation. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Converts an expression for a hexadecimal-text STRING or BYTES value into a GEOGRAPHY value. The expression must be in WKB format –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Creates a GEOGRAPHY with a single point. ST_GEOGPOINT creates a point from the specified FLOAT64 longitude and latitude parameters and returns that point in a GEOGRAPHY value. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns a GEOGRAPHY value that corresponds to a point in the middle of a bounding box defined in the GeoHash –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns a GeoHash representation of geography_expression. The resulting GeoHash will contain at most Max chars characters. Fewer characters corresponds to lower precision (or, described differently, to a bigger bounding box) –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns the Open Geospatial Consortium (OGC) geometry type that describes the input GEOGRAPHY as a STRING. The OGC geometry type matches the types that are used in WKT and GeoJSON formats and printed for ST_ASTEXT and ST_ASGEOJSON. ST_GEOMETRYTYPE returns the OGC geometry type with the "ST_" prefix. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns a GEOGRAPHY that represents the point set intersection of the two input GEOGRAPHYs. Thus, every point in the intersection appears in both Geography 1 and Geography 2 –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns TRUE if the point set intersection of Geography 1 and Geography 2 is non-empty. Thus, this function returns TRUE if there is at least one point that appears in both input GEOGRAPHYs. If ST_INTERSECTS returns TRUE, it implies that ST_DISJOINT returns FALSE. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns TRUE if geography intersects the rectangle between [lng1, lng2] and [lat1, lat2]. The edges of the rectangle follow constant lines of longitude and latitude. lng1 and lng2 specify the westmost and eastmost constant longitude lines that bound the rectangle, and lat1 and lat2 specify the minimum and maximum constant latitude lines that bound the rectangle. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns TRUE if the total number of points, linestrings, and polygons is greater than one. An empty GEOGRAPHY is not a collection. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns TRUE if the given GEOGRAPHY is empty; that is, the GEOGRAPHY does not contain any points, lines, or polygons. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns the total length in meters of the lines in the input GEOGRAPHY. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Creates a GEOGRAPHY with a single linestring by concatenating the point or line vertices of each of the input GEOGRAPHYs in the order they are given. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Creates a GEOGRAPHY containing a single polygon from a linestring input, where the input linestring is used to construct a polygon ring. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns the longest distance in meters between two non-empty GEOGRAPHYs; that is, the distance between two vertices where the first vertex is in the first GEOGRAPHY, and the second vertex is in the second GEOGRAPHY. If Geography 1 and Geography 2 are the same GEOGRAPHY, the function returns the distance between the two most distant vertices in that GEOGRAPHY. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns the number of geometries in the input GEOGRAPHY. For a single point, linestring, or polygon, ST_NUMGEOMETRIES returns 1. For any collection of geometries, ST_NUMGEOMETRIES returns the number of geometries making up the collection. ST_NUMGEOMETRIES returns 0 if the input is the empty GEOGRAPHY. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns the number of vertices in the input GEOGRAPHY. This includes the number of points, the number of linestring vertices, and the number of polygon vertices. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns the length in meters of the boundary of the polygons in the input GEOGRAPHY. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns the Nth point of a linestring geography as a point geography, where N is the index. The index is 1-based. Negative values are counted backwards from the end of the linestring, so that -1 is the last point. Returns an error if the input is not a linestring, if the input is empty, or if there is no vertex at the given index. Use the SAFE prefix to obtain NULL for invalid input instead of an error. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns a simplified version of geography, the given input GEOGRAPHY. The input GEOGRAPHY is simplified by replacing nearly straight chains of short edges with a single long edge. The input geography will not change by more than the tolerance specified by tolerance_meters. Thus, simplified edges are guaranteed to pass within tolerance_meters of the original positions of all vertices that were removed from that edge. The given tolerance_meters is in meters on the surface of the Earth. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns the input GEOGRAPHY, where each vertex has been snapped to a longitude/latitude grid. The grid size is determined by the grid_size parameter which is given in degrees. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns the first point of a linestring geography as a point geography. Returns an error if the input is not a linestring or if the input is empty. Use the SAFE prefix to obtain NULL for invalid input instead of an error. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns TRUE provided the following two conditions are satisfied: (1) Geography 1 intersects Geography 2 and (2) the interior of Geography 1 and the interior of Geography 2 are disjoint. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns a GEOGRAPHY that represents the point set union of all input GEOGRAPHYs. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns TRUE if no point of Geography 1 is outside of Geography 2 and the interiors of Geography 1 and Geography 2 intersect. Given two geographies a and b, ST_WITHIN(a, b) returns the same result as ST_CONTAINS(b, a). Note the opposite order of arguments. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns a float representing the west-most constant longitude line that bounds the geometry –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns a float representing the east-most constant longitude line that bounds the geometry –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns the longitude in degrees of the single-point input GEOGRAPHY –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns the latitude in degrees of the single-point input GEOGRAPHY –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns a float representing the minimum constant latitude line that bounds the geometry. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns a float representing the maximum constant latitude line that bounds the geometry. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Computes the centroid of the set of input GEOGRAPHYs as a single point GEOGRAPHY. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Returns a geography variable that represents the point set union of all input geographies. –>

Name
Type
Allowed values
Required
Placeholder (in UI)

Identifies high-density geography clusters and marks outliers in low-density areas of noise –>

Name
Type
Allowed values
Required
Placeholder (in UI)
variable
Redivis type
variable
integer
float
variable
Redivis type
boolean
boolean
variable
boolean
variable
boolean
variable
Redivis type
variable
Redivis type
variable
string
literal
Redivis type
boolean
boolean
literal
Redivis type
variable
integer
float
ST_ANGLE(@geography, @geography_2, @geography_3)
ST_AREA(@geography)
ST_ASGEOJSON(@geography)
ST_ASTEXT(@geography)
ST_ANGLE(@geography, @geography_2)
ST_BOUNDARY(@geography)
ST_BUFFER(@geography, @buffer_radius[, num_seg_quarter_circle => @num_seg_quarter_circle][, endcap => @endcap][, side => @side])
ST_BUFFERWITHTOLERANCE(@geography, @buffer_radius, @tolerance_meters[, endcap => @endcap][, side => @side])
ST_CENTROID(@geography)
ST_CLOSESTPOINT(@geography, @geography_2)
ST_CONTAINS(@geography, @geography_2)
ST_CONVEXHULL(@geography)
ST_COVEREDBY(@geography, @geography_2)
ST_COVERS(@geography, @geography_2)
ST_DIFFERENCE(@geography, @geography_2)
ST_DIMENSION(@geography)
ST_DISJOINT(@geography, @geography_2)
ST_DISTANCE(@geography, @geography_2)
ST_DWITHIN(@geography, @geography_2, @distance)
[@safe]ST_ENDPOINT(@geography)
ST_EQUALS(@geography, @geography_2)
[@safe]ST_EXTERIORRING(@geography)
ST_GEOGFROM(@expression)
ST_GEOGFROMGEOJSON(@geojson[, @make_valid])
ST_GEOGFROMTEXT(@wkt[, @oriented][, @planar][, @make_valid])
ST_GEOGFROMWKB(@wkb)
ST_GEOGPOINT(@longitude, @latitude)
ST_GEOGPOINTFROMGEOHASH(@geohash)
ST_GEOHASH(@geography, @maxchars)
ST_GEOMETRYTYPE(@geography)
ST_INTERSECTION(@geography, @geography_2)
ST_INTERSECTS(@geography, @geography_2)
ST_INTERSECTSBOX(@geography, @lng1, @lat1, @lng2, @lat2)
ST_ISCOLLECTION(@geography)
ST_ISEMPTY(@geography)
ST_LENGTH(@geography)
ST_MAKELINE(@geography, @geography_2)
ST_MAKEPOLYGON(@geography)
ST_MAXDISTANCE(@geography, @geography_2)
ST_NUMGEOMETRIES(@geography)
ST_NUMPOINTS(@geography)
ST_PERIMETER(@geography)
[@safe]ST_POINTN(@geography, @index)
ST_SIMPLIFY(@geography, @tolerance_meters)
ST_SNAPTOGRID(@geography, @grid_size)
[@safe]ST_STARTPOINT(@geography)
ST_TOUCHES(@geography, @geography_2)
ST_UNION(@geography, @geography_2)
ST_WITHIN(@geography, @geography_2)
ST_BOUNDINGBOX(@geography).xmin
ST_BOUNDINGBOX(@geography).xmax
[@safe]ST_X(@geography)
[@safe]ST_Y(@geography)
ST_BOUNDINGBOX(@geography).ymin
ST_BOUNDINGBOX(@geography).ymax
ST_CENTROID_AGG(@geography)
ST_UNION_AGG(@geography)
ST_CLUSTERDBSCAN(@geography, @epsilon, @minimum_geographies)
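As a combined illustration (all variable names are hypothetical), the point constructor and a predicate can be chained to test whether two coordinate pairs fall within 10 kilometers of each other, with the distance argument given in meters:

  ST_DWITHIN(
    ST_GEOGPOINT(home_longitude, home_latitude),
    ST_GEOGPOINT(work_longitude, work_latitude),
    10000
  )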
variable
literal
string
literal
string
variable
literal
string
literal
string
enum
variable
literal
integer
variable
literal
integer
variable
literal
integer
variable
literal
integer
variable
literal
integer
variable
literal
integer
variable
literal
integer
enum
variable
literal
dateTime
variable
literal
dateTime
variable
literal
dateTime
enum
variable
literal
integer
enum
variable
literal
dateTime
variable
literal
dateTime
enum
enum
variable
literal
dateTime
variable
literal
dateTime
literal
string
variable
literal
string
literal
string
boolean
boolean
variable
Redivis type
enum
variable
Redivis type
literal
integer
literal
Redivis type
variable
Redivis type
enum
variable
Redivis type
literal
integer
literal
Redivis type
variable
Redivis type
literal
integer
enum
variable
integer
float
literal
float
enum
variable
integer
float
literal
float
enum
enum
variable
literal
integer
variable
literal
integer
variable
literal
integer
variable
literal
integer
enum
variable
literal
date
variable
literal
date
variable
literal
date
enum
variable
literal
integer
enum
variable
literal
date
variable
literal
date
enum
enum
variable
literal
date
dateTime
time
variable
literal
date
literal
string
variable
literal
string
literal
string
boolean
boolean
variable
integer
float
variable
literal
integer
float
enum
variable
literal
integer
float
variable
float
variable
literal
integer
variable
literal
integer
variable
literal
integer
float
variable
float
variable
literal
integer
float
date
dateTime
time
string
boolean
variable
integer
float
variable
integer
float
variable
literal
integer
float
date
dateTime
time
string
boolean
variable
literal
integer
float
literal
integer
float
variable
literal
integer
variable
literal
integer
variable
literal
integer
float
variable
literal
integer
float
variable
literal
integer
float
literal
integer
variable
literal
integer
float
variable
literal
integer
float
variable
integer
float
variable
literal
integer
float
variable
literal
integer
float
literal
integer
Redivis type
variable
literal
Redivis type
variable
literal
Redivis type

@geography

true

-

@geography_2

true

-

@geography_3

true

-

@geography

true

-

@geography

true

-

@geography

true

-

@geography

true

-

@geography_2

true

-

@geography

true

-

@geography

true

-

@buffer_radius

true

-

@num_seg_quarter_circle

true

-

@endcap

any of: ROUND, FLAT

false

(Round (default))

@side

any of: BOTH, LEFT, RIGHT

false

(Both (default))

@geography

true

-

@buffer_radius

true

-

@tolerance_meters

true

-

@endcap

any of: ROUND, FLAT

false

(Round (default))

@side

any of: BOTH, LEFT, RIGHT

false

(Both (default))

@geography

true

-

@geography

true

-

@geography_2

true

-

@geography

true

-

@geography_2

true

-

@geography

true

-

@geography

true

-

@geography_2

true

-

@geography

true

-

@geography_2

true

-

@geography

true

-

@geography_2

true

-

@geography

true

-

@geography

true

-

@geography_2

true

-

@geography

true

-

@geography_2

true

-

@geography

true

-

@geography_2

true

-

@distance

true

-

@geography

true

-

@safe

true

-

@geography

true

-

@geography_2

true

-

@geography

true

-

@safe

true

-

@expression

true

-

@geojson

true

-

@make_valid

true

-

@wkt

true

-

@oriented

true

-

@planar

true

-

@make_valid

true

-

@wkb

true

-

@longitude

true

-

@latitude

true

-

@geohash

true

-

@geography

true

-

@maxchars

true

-

@geography

true

-

@geography

true

-

@geography_2

true

-

@geography

true

-

@geography_2

true

-

@geography

true

-

@lng1

true

-

@lat1

true

-

@lng2

true

-

@lat2

true

-

@geography

true

-

@geography

true

-

@geography

true

-

@geography

true

-

@geography_2

true

-

@geography

true

-

@geography

true

-

@geography_2

true

-

@geography

true

-

@geography

true

-

@geography

true

-

@index

true

-

@geography

true

-

@safe

true

-

@geography

true

-

@tolerance_meters

true

-

@geography

true

-

@grid_size

true

-

@geography

true

-

@safe

true

-

@geography

true

-

@geography_2

true

-

@geography

true

-

@geography_2

true

-

@geography

true

-

@geography_2

true

-

@geography

true

-

@geography

true

-

@geography

true

-

@safe

true

-

@geography

true

-

@safe

true

-

@geography

true

-

@geography

true

-

@geography

true

-

@geography

true

-

@geography

true

-

@epsilon

true

-

@minimum_geographies

true

-


any

any

any

any

any

any

any

any

any

any

or

any ,

or

any ,

any

or

any ,

or

any ,

any

any

any

any

any

any

any

any

any

any

any

any

any

any

any

any

any

any

any

or

any ,

any

any

any

any

any

any

or

any

or

any

any

or

any

any

any

any

or

any

or

any ,

or

any ,

or

any

any

or

any

any

any

any

any

any

any

or

any ,

or

any ,

or

any ,

or

any ,

any

any

any

any

any

any

any

any

any

any

any

or

any

any

any

any

or

any ,

any

or

any ,

any

any

any

any

any

any

any

any

any

any

any

any

any

any

any

any

any

any

any

any ,

any

variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
literal
integer
float
variable
literal
integer
float
enum
enum
variable
geography
variable
literal
integer
float
variable
literal
integer
float
enum
enum
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
literal
integer
float
variable
geography
boolean
boolean
variable
geography
variable
geography
variable
geography
boolean
boolean
variable
literal
string
variable
literal
string
boolean
boolean
variable
literal
string
boolean
boolean
boolean
boolean
boolean
boolean
variable
literal
string
variable
literal
integer
float
variable
literal
integer
float
variable
literal
string
variable
geography
variable
literal
integer
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
literal
integer
float
variable
literal
integer
float
variable
literal
integer
float
variable
literal
integer
float
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
literal
integer
variable
geography
boolean
boolean
variable
geography
variable
literal
integer
float
variable
geography
variable
literal
integer
float
variable
geography
boolean
boolean
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
boolean
boolean
variable
geography
boolean
boolean
variable
geography
variable
geography
variable
geography
variable
geography
variable
geography
literal
integer
float
literal
integer