Redivis is a platform for data distribution and analysis that prioritizes open access and FAIR data practices. We believe that supporting open science principles drives better research outcomes, collaboration, and scientific impact.
Redivis is creating tools not just for data distribution and analysis, but also to help drive forward standards for transparent research that is accessible to all.
Every Redivis dataset has a versioned history that is automatically generated as the data owner uploads and edits data contents. This history is fully exposed to end users, and the dataset can be accessed at any version. Citations are generated for every version of a dataset and include the ORCID iDs of all creators. Data owners can opt to create a Digital Object Identifier (DOI) for each version of their dataset, which resolves to a persistent URL along with the required metadata describing the content at that point in time.
As users analyze data in a Redivis workflow, each step they take is recorded in a visual tree that can be navigated sequentially. Every step has a full history that can be viewed at any point in time.
Datasets in any format can be deposited into Redivis and made available immediately to the world via open access licenses at no cost to the end user.
All Redivis datasets and analysis workflows are available to anyone on the internet without a Redivis account unless the owner decides to add restrictions. In this case someone wishing to view the work would need to create a free Redivis account to request access from the owner.
The stable technical infrastructure and dedicated Redivis team are funded by a coalition of academic institutions paying license fees for large-scale data distribution. This diverse group of funders ensures the long-term availability of the datasets.
Retaining data is vital to establishing long-lasting data linkages and making research available. Redivis' data retention policy is aligned with the goal of keeping all data available and accessible whenever possible.
Data owners always have the ability to remove their data from Redivis or make it inaccessible if they choose to do so.
Redivis also has policies in place in case of an accidental deletion:
All data is fully backed up, with point-in-time recovery, over a 7-day rolling window. Deletions and modifications to data are permanent after 7 days.
All metadata is backed up with point-in-time recovery over a 7-day rolling window, with daily backups of metadata extending to 1 year.
Data storage that exceeds the free tier must be paid for by the data owner or other sponsor. Upon payment failure, every effort will be made to contact the data owner and re-establish payment.
If unsuccessful, data may be destroyed after 30 days of non-payment, though all metadata and permanent landing pages will be persisted.
The ability to view, reproduce, and build upon other works is a core tenet of the scientific process that Redivis is built to uphold and facilitate. Redivis tools for data storage and analysis comply with federal mandates for open access and automatically generate artifacts documenting all steps in the data process, ensuring reproducibility.
In August 2022, the United States Office of Science and Technology Policy (OSTP) released a memorandum on public access to federally funded research (known as the “Nelson Memo”) which outlined new requirements affecting both faculty and students who conduct research using federal funding:
Make publications and their supporting data resulting from federally funded research publicly accessible without an embargo on their free and public release.
Enact transparent procedures that ensure scientific and research integrity is maintained in public access policies.
Ensure equitable delivery of federally funded research results and data.
Redivis is built to make data sharing as easy and safe as possible, defaulting to fully accessible data, metadata, and documentation. However, some data must be restricted for privacy and to comply with data licensing agreements. In these cases, Redivis follows an "as much as securely possible" sharing philosophy.
Multiple levels of access allow data owners to grant access to documentation, metadata, samples, and data separately, so that as much access as possible can be granted.
When research is done using restricted data, analysis code is made available when possible.
Viewing public data and analysis workflows on Redivis is completely public and does not require a Redivis account.
If any piece of analysis or data is restricted, a user will need a Redivis account to apply for access. Redivis accounts are completely free with no restrictions.
Redivis systems are built to automatically capture and document any work done on datasets and analyses in standardized formats, while giving researchers and administrators the ability to supplement or override these when necessary.
Redivis offers a no-code interface for building transform queries that compile to SQL code. This code is available to the analyst and any viewers.
Workflows are self-documenting. Every step taken in an analysis workflow is recorded sequentially in a visual format that is easy to follow.
Workflows are version controlled. Every step is versioned and time-stamped, allowing users to revert to a previous iteration, or a viewer to understand how queries might have changed over time.
Workflows can be forked, preserving relationships between workflows and datasets.
Redivis is free and accessible to any researcher, reviewer, or casual data browser. By centralizing multiple complex data storage and analysis systems behind a clear UI, full workflows can be understood without specialized technical knowledge.
Redivis uses transparent, open-source formats, allowing data, analyses, and workflows to be exported and used within other systems.
No-code interfaces allow researchers at any skill level to build and understand research steps.
Common, open-source coding languages (SQL, Python, R) form the foundation of Redivis tools and are supported in analysis workflows.
There is no cost to researchers to store personal data, or run data queries.
Redivis is designed to support data-driven research throughout the research lifecycle, and can serve as a permanent repository for your data, analytical workflows, and data derivatives.
In May 2022, the Subcommittee on Open Science (SOS) of the United States Office of Science and Technology Policy (OSTP) released a document outlining the "desirable characteristics of data repositories".
These characteristics are intended to help agencies direct Federally funded researchers toward repositories that enable management and sharing of research data consistent with the principles of FAIR data practices. Various agencies have adopted these guidelines, including the NIH.
Redivis is specifically designed to meet these desirable characteristics of a data repository, outlined below:
Curation and Quality Assurance
Assigns datasets a citable, unique persistent identifier, such as a digital object identifier (DOI) or accession number, to support data discovery, reporting, and research assessment. The identifier points to a persistent landing page that remains accessible even if the dataset is de-accessioned or no longer available.
Data can be uploaded to datasets within an organization, where every version of that dataset is assigned a DOI through the organization's DOI-issuing credentials. DOIs will always resolve to the URL of the dataset page. If a dataset is restricted or deleted, base metadata will remain available.
Has a plan for long-term management of data, including maintaining integrity, authenticity, and availability of datasets; building on a stable technical infrastructure and funding plans; and having contingency plans to ensure data are available and maintained during and after unforeseen events.
Redivis uses highly-available and redundant Google Cloud infrastructure to ensure data is stored securely and to the highest technical standards. Redivis maintains a formal disaster recovery and business continuity plan that is regularly exercised to ensure our ability to maintain availability and data durability during unforeseen events.
Redivis undergoes annual security audits and penetration testing by an external firm, and utilizes a formalized software development and review process to maintain the soundness of its technical infrastructure.
Funding for Redivis is provided by recurring annual subscriptions from its member academic institutions. This model provides consistent annual revenue to support the ongoing maintenance of the platform. Redivis is an employee-owned company without any external investors with an equity stake, allowing us to focus solely on the needs of our customers, our employees, and our mission of improving accessibility in research data science.
Ensures datasets are accompanied by metadata to enable discovery, reuse, and citation of datasets, using schema that are appropriate to, and ideally widely used across, the community(ies) the repository serves. Domain-specific repositories would generally have more detailed metadata than generalist repositories.
All Redivis datasets contain extensive metadata and documentation. Some metadata fields (including summary statistics for all variables) are automatically generated. Some fields are optional depending on the editor's insight and preference. Every dataset has space for short and long-form text, supporting files, and links, alongside variable labels and descriptions.
As a generalist repository, the metadata schema is intentionally broad and flexible, but specific groups on Redivis can choose to enforce more specific metadata standards within their datasets.
Provides, or has a mechanism for others to provide, expert curation and quality assurance to improve the accuracy and integrity of datasets and metadata.
All datasets on Redivis are owned either by an Organization (curated by any administrator) or an individual User. Additional users may be added as editors to a dataset, so as to provide further curation and quality assurance.
It is ultimately up to the editors of a dataset to provide curation, though Redivis is designed to support this process as much as possible. Redivis automatically computes checksums and runs fixity checks on all uploaded files, and computes univariate summary statistics of all variables to aid in the quality assurance process. Metadata completeness is also reported to editors, encouraging them to provide as much information as possible.
Provides broad, equitable, and maximally open access to datasets and their metadata free of charge in a timely manner after submission, consistent with legal and ethical limits required to maintain privacy and confidentiality, Tribal sovereignty, and protection of other sensitive data.
Datasets and workflows can be explored on Redivis without any requirement to have an account or log in. If someone needs to apply for access to restricted data, or to analyze and download data, they will need to make an account to do so. All individual accounts are completely free and require no specific affiliation. Data access restrictions are set and maintained by the data owner.
Makes datasets and their metadata available with broadest possible terms of reuse; and provides the ability to measure attribution, citation, and reuse of data (i.e., through assignment of adequate metadata and unique PIDs).
The data owner can choose to publish any dataset publicly or set appropriate access restrictions based on the sensitivity of the data. Redivis imposes no additional limits on the availability and reuse of data. Any dataset that has a DOI can be cited and tracked in publications, and any reuse on Redivis is displayed on the dataset's usage tab.
Provides accompanying documentation describing terms of dataset access and use (e.g., particular licenses, need for approval by a data use committee).
Dataset owners can describe any usage agreements in their access requirements and have space to document any additional usage rules. Redivis imposes no additional limits on the availability and use of data.
Has documented measures in place to meet generally accepted criteria for preventing unauthorized access to, modification of, or release of data, with levels of security that are appropriate to the sensitivity of data.
Redivis is SOC2 certified and prioritizes technical security. There are multiple layers of administrative controls to make it clear what actions data owners are taking, and all actions, whether by administrators or researchers, are automatically logged for review.
Redivis is well-designed to handle workflows around sharing of sensitive and high-risk data. It provides technical mechanisms to support the reuse of sensitive data when allowed, while enabling the enforcement of appropriate guardrails and access controls defined by data administrators.
Has documented capabilities for ensuring that administrative, technical, and physical safeguards are employed to comply with applicable confidentiality, risk management, and continuous monitoring requirements for sensitive data.
Redivis is SOC2 certified and prioritizes technical security. All data is encrypted in transit and at rest, and stored on Google Cloud infrastructure that maintains robust technical and physical access controls.
In addition to an annual security audit, Redivis also undergoes annual penetration testing by an outside firm to further ensure the soundness of its security posture.
Allows datasets and metadata downloaded, accessed, or exported from the repository to be in widely used, preferably non-proprietary, formats consistent with those used in the community(ies) the repository serves.
Data and metadata on Redivis can be imported and exported in multiple common formats. Data analysis is performed in common, generally open-source programming languages (SAS and Stata being available exceptions). Redivis does not introduce any of its own proprietary formats or programming languages.
Has mechanisms in place to record the origin, chain of custody, and any modifications to submitted datasets and metadata.
All datasets contain Provenance documentation, which is automatically populated based on administrator actions. This information can be further edited or supplemented with additional related identifiers.
All modifications to a dataset are tracked in the administrative logs.
Provides documentation on policies for data retention within the repository.
Redivis publishes a data retention policy. Data is always owned by the user or organization who uploads the dataset, and they have control over a dataset's presence and availability. The dataset owner may apply additional policies towards data retention.
Uses documented procedures to restrict dataset access and use to those that are consistent with participant consent and changes in consent.
Redivis provides extensive options to limit access on sensitive or restricted data, including allowing access on different levels (e.g. metadata, sample, data). Access is granted and revoked on Redivis instantly, allowing for immediate changes in access based on changing circumstances.
Access rules on Redivis are defined and enforced by the dataset owner.
Uses documented procedures to communicate and enforce data use restrictions, such as preventing reidentification or redistribution to unauthorized users.
Redivis has built-in data restrictions available to administrators. Data download or export can be restricted completely, or permitted only upon administrator approval, in order to prevent redistribution.
Data administrators can also communicate and collect formal acknowledgement of other use restrictions through access requirements. Moreover, data administrators can easily audit the use of restricted data in the audit logs to further check for and limit any non-compliance or other misuse.
Implements and provides documentation of measures (for example, tiered access, credentialing of data users, security safeguards against potential breaches) to protect human subjects’ data from inappropriate access.
Redivis has a built-in tiered access system that allows administrators to restrict access to sensitive data. These controls apply to data derivatives as well, where any data output inherits the access rules of the source dataset(s) used to create that output. These controls on derivative data allow researchers to share analyses with colleagues without worrying about accidentally leaking sensitive information, since all collaborators will need to comply with the access rules in order to view those outputs.
Additional technical safeguards ensure data privacy. Data users can establish their identity through their institutional identity provider, and Redivis undergoes regular security audits to ensure the soundness of its security posture.
Has security measures that include a response plan for detected data breaches.
Redivis has detailed internal security protocols and documented security breach plans which are regularly exercised by technical personnel via tabletop exercises. All systems are continuously monitored for potential breaches, with immediate alert pathways and clear escalation protocols to respond to any breach.
Controls and audits access to and download of datasets (if download is permitted).
Redivis has built-in data restrictions available to administrators. Data download or export can be restricted completely, or limited to specific external systems and/or administrator approval. All data downloads are logged for subsequent audit and review. Downloads are only available to authenticated users who have access to the underlying data.
Has procedures for addressing violations of terms-of-use by users and data mismanagement by the repository.
Violations of the terms-of-use may lead to account suspension or revocation, as outlined in the terms. Additional technical controls are in place to prevent abuse or misuse of Redivis's systems.
As a matter of policy, Redivis aims to be as permissive as possible, recognizing that often misuse is the result of accidental behavior or misunderstanding. These controls are designed to protect the system for all users, and are not intended to ever be punitive towards good-faith actors.
Makes use of an established and transparent process for reviewing data access requests.
Administrators can define access requirements on any restricted dataset. These access requirements are transparent to all users, and researchers must apply and be approved for a given set of requirements in order to gain access. Requirements also have space for both administrators and applicants to leave comments specifically in the context of the data application.
Data is findable when: (1) data and metadata are assigned a globally unique and persistent identifier, (2) data are described with rich metadata, (3) metadata clearly and explicitly include the identifier of the data they describe, and (4) metadata and data are registered or indexed in a searchable resource.
At Redivis:
Data is accessible when: (1) data and metadata are retrievable by their identifier using a standardized communications protocol, (2) the protocol is open, free, and universally implementable, (3) the protocol allows for an authentication and authorization procedure, where necessary, (4) metadata are accessible, even when the data are no longer available.
At Redivis:
Data can be explored through any web browser.
Researcher accounts to apply for data access or do analyses are always free.
Data is interoperable when: (1) data and metadata use a formal, accessible, shared, and broadly applicable language for knowledge representation, (2) data and metadata use vocabularies that follow FAIR principles, (3) data and metadata include qualified references to other data and metadata.
At Redivis:
Data is reusable when: (1) data and metadata are richly described with a plurality of accurate and relevant attributes, (2) data and metadata are released with a clear and accessible data usage license, (3) data and metadata are associated with detailed provenance, and (4) data and metadata meet domain-relevant community standards
At Redivis:
Many dataset, table, and variable metadata fields are automatically populated based on user action, with the ability to be manually adjusted.
All datasets have a Usage tab with information on how it has been viewed and used across Redivis
Redivis's cloud-based analysis tools encourage data users to analyze data in place rather than downloading it and breaking linkages.
Organization administrators upload, catalogue, and grant access to datasets.
Researchers discover, query, and analyze data.
Redivis makes it easy for everyone to:
This documentation is broken up into Guides, Examples, and Reference sections.
We recommend that new users begin by following the relevant guides and examples, while the reference section can serve as a comprehensive overview of everything that Redivis can do.
Redivis is a platform that allows researchers to seamlessly discover, access, and analyze data. This brief guide will walk you through the basics to get up and running and provide a launching point for exploring other resources in this documentation.
Many datasets on Redivis are public, and you can browse them without creating an account. However, you'll need an account in order to analyze data, as well as to apply for access to restricted datasets.
Click the Create account button in the top right of any page to sign up. You can use your academic institution's login credentials or any Google account.
Once you create an account, you'll be navigated to your Workspace, which is your private area for creating workflows to work with data, uploading your own datasets, and managing your Redivis account.
The main entity on Redivis is a dataset. Most datasets on Redivis are uploaded by organizations.
You might already know the organization that you’ll be working with, or you can browse organizations by clicking the Find organizations button on the Organizations tab of your Workspace. You might know your organization's URL already, or see it listed in the Recent section of the Dashboard if you were already viewing that page.
Once you join an organization, you'll be able to apply for any of their restricted datasets, and also see a link on the left bar of your workspace to quickly get back to the organization's home page.
The best place to find datasets on Redivis is to search on an organization's home page. Here you can click on the Datasets tab to browse all datasets and filter by their metadata. All searches perform a full-text search across a dataset, its documentation, tables, variables, and rich metadata content.
Click on any dataset title to go to that Dataset page where you can view the data and metadata it contains. The data for this dataset will be available on the Tables tab, which you can explore further.
You will also probably come across restricted datasets. For these you will need to click the Apply for access button on the top right of the Dataset page and complete the requirements to gain approval from the dataset's administrators.
Once you've found a dataset that you want to work with, you can add it to a Workflow. Workflows are the fundamental analysis interface on Redivis, where you can query, merge, reshape, and analyze any dataset that you have access to — all from within your web browser.
Add a dataset to a workflow by clicking the Add to workflow button on the top right of the dataset page.
In a workflow we can create a Transform by selecting any table and clicking the +Transform button, which allows us to combine and reshape our data into a final output table that best serves our analysis. These transforms use a powerful SQL engine under the hood, allowing us to query incredibly large tables - even billions of records - in seconds.
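For a sense of what runs under the hood, a simple transform that keeps two variables and filters on a condition might compile to SQL roughly like the following (the table and variable names here are hypothetical placeholders, not real dataset identifiers):

```sql
-- Hypothetical example: keep two variables and filter records
SELECT person_id, total_charge
FROM charges
WHERE total_charge > 1000
```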
After creating output tables for analysis, we can create a computational Notebook in R, Python, Stata, or SAS to further analyze our data and develop our final figures. Select any table and click the +Notebook button to get started. The notebook will initialize and pull in the table you've selected (or a sample of the table if it is a large table).
We can also export and query this table from external environments by clicking the Export table button on the right side of any table, allowing us to use whatever analytical tools best suit our research question. Here we can download the table in a number of common formats, interface with it programmatically via the API, or export the table to supported integrations.
Augment your data analysis in Redivis by uploading your own datasets, with the option to share with your collaborators (or even the broader research community).
Administer your organization
Organizations allow for research groups and centers to securely distribute data to their research community. Organization administrators can create datasets, manage access, review logs, and create customized reports of their data utilization.
Most data on Redivis is uploaded through an Organization, and that is the best place to find Datasets to work with. Some datasets will be public, while others will require certain steps before you can get full access to the data.
From an Organization's home page, click on the Datasets tab to browse all datasets and filter by their metadata.
You can use filters on the left bar to narrow down your results. All searches perform a full-text search across a dataset, its documentation, tables, variables, and rich metadata content.
Click on any dataset title to go to that Dataset page.
You can view the metadata (including an abstract, documentation, citation information, and a version history) and dig into the data directly on the Tables and/or Files tab.
Each table contains a data browser to view cells, generate summary statistics, and display metadata for this table and each variable. Each file can be previewed and downloaded. Both tables and files can be added to workflows for analysis.
Many datasets on Redivis are public, while others have requirements for certain levels of access enforced by the data owner.
On the right side of the dataset, you can see your current access level. If you have Metadata access you can see the variable information and summary statistics, but you'll need to gain Data access in order to view the data contents or work with this data.
On any restricted dataset you can click the Apply for access button in the top right of the page to see a list of steps required for you to gain different access levels to this dataset.
The first step is to become a member of this organization. When you go to apply you will be prompted to become a member of this organization using the credentials on your account.
Once you're a member of this organization, you might need to fill out some requirements that administrators have set up, or request access directly from them.
Requirements are forms that you will need to complete on this page and then submit. Once these are approved by your organization's administrators you will gain access to the dataset.
Once you have data you're interested in, bring it into a workflow to transform and analyze it leveraging lightning fast tools from your browser.
Augment your data analysis in Redivis by uploading your own datasets, with the option to share with your collaborators (or even the broader research community).
Datasets are the core entity on Redivis, and finding the dataset(s) you want to work with is generally one of your first steps. All datasets are either uploaded by an organization or a user. If a user has uploaded a dataset they will need to share it with you directly, but you can browse datasets uploaded by any organization on their home page.
There are many ways to find your organization. If you know what you're looking for you can type it into the "Go to" bar on the top of the page. To browse all organizations you can go to the Organizations tab of your workspace and click Find organizations.
Note that some organizations may be hidden from public view, in which case you will need to contact someone at the organization to get the direct link.
On the organization home page you can see all of the organization's datasets that you have at least Overview access to. If you're looking for a dataset you expect to be here and you don't see it, you might need to contact the organization administrators to get access.
You can see the datasets that this organization has featured on the left section of the page, type a term into the search bar, or click the Datasets tab to browse everything. By default these datasets are sorted by popularity but you can reorder them however you'd like to find what you want.
You can also click the Workflows tab to see workflows featured by this organization which showcase how their data can be used. More on workflows later!
Once on the dataset tab you can use multiple filters along with search terms to find what you'd like. All searches take a comprehensive look at information from the dataset documentation as well as variable metadata, including value labels.
You can also use the filters to find specific information you're looking for. If you know the name of the variable or year of the data you're interested in, narrow your search here.
Each dataset that appears here matches all search and filter configurations. You can click on the dataset tile to see more information, including a preview of all of the tables this dataset contains.
Click on any of these tables to dig into it further. As long as you have metadata access you can see the variable information and summary statistics.
Once you find a dataset you want to explore further, click on the title to go to the dataset page.
This dataset page has all of the information you'll need to familiarize yourself with the data and get started working with it further. Add this dataset to your dataset library to easily find it later on by clicking the button alongside the dataset title.
Overview
This tab has a short summary of the dataset along with different sections filled out by the administrators such as the Methodology, Documentation, and Provenance information. This Provenance section will also have citation information and a DOI if one has been created. You can also see your current access level on the right side of the page.
Tables
This tab contains all of the dataset's tabular data. You can see each table alongside information about its contents. Click on each one to dig deeper into the data.
Files
This tab contains all of the dataset's unstructured data. You can preview each file by clicking on it.
Usage
This tab has information about how other people have used this dataset on Redivis. You can see the most popular variables across all tables to get a sense for which ones might be a good starting place for understanding the structure of the data.
If you don't have data access to this dataset you'll need to apply for access before you can work with it further
Add this dataset to a workflow to transform and analyze it leveraging lightning fast tools from your browser.
make data and workflows available to researchers and administrators using their own systems.
Assigns a persistent identifier (PID), such as a DOI
Identifier points to a persistent landing page
Plan for long-term data management
Maintain integrity, authenticity, and availability of datasets
Stable technical infrastructure
Stable funding plans
Contingency plans to ensure data are available and maintained
Datasets accompanied by metadata
Aids in the easy discovery, reuse, and citation of datasets
Schema appropriate to relevant data communities
Provide or allow others to provide expert curation
Quality assurance for accuracy and integrity of datasets and metadata
Broad, equitable, and maximally open access to datasets and metadata
Access is free of charge in a timely manner consistent with privacy
Makes datasets and metadata available to reuse
Provides ability to measure attribution, citation, and reuse of data
Provides documentation for access and use
Prevents unauthorized access, modification, and release of data
Ensures administrative, technical, and physical safeguards
Continuous monitoring of requirements
Download, access, and export available in non-proprietary formats
Ability to record the origin, chain of custody, and modification of data or metadata
Provides policy for data retention
Restricts dataset access and use consistent with participant consent
Enforces data use restrictions
Implements measures to protect data from inappropriate access.
Has a response plan for detected data breaches.
Controls and audits access to and download of datasets.
Has procedures for addressing violations and data mismanagement.
Process for reviewing data access requests.
Metadata is available in various machine-readable formats, such as schema.org and DataCite JSON, and can be viewed through the interface or downloaded via the API.
Redivis is built on FAIR data principles to support data discovery and reusability at all levels. Data practices adhering to these principles emphasize data that is:
DOIs can be issued for datasets, facilitating authoritative citation and attribution.
All datasets have a Provenance section of documentation which auto-populates with administrator actions and allows for additional linking to other artifacts that were part of the dataset creation process.
Redivis has comprehensive search tools that index all aspects of a dataset, including its documentation, tables, variable names, and variable documentation.
Public data and analyses can be explored without an account.
Researcher accounts can be linked to .
Robust APIs support interoperability with other tools.
Analysis tools use common languages such as SQL, Python, R, Stata, and SAS.
Data and metadata are available for export in multiple common formats.
Datasets are automatically versioned.
Workflows containing analyses are self-documenting and capture the full analysis pipeline.
Anyone can fork a workflow they have access to in order to continue an analysis.
Redivis is a data platform for academic research. It provides organizations with a central hub where researchers can easily discover, access, and analyze their data.
Interested in bringing your organization on to Redivis?
Certain users may want to dive into the more technical .
If your organization is part of an institution, you can also go to the institution page to search all datasets in your institution.
Learn more in the guide.
Learn more in the guide.
Learn more in the guide.
Contact an existing administrator to add you to their organization, or to set up a new organization. Learn more in the guide.
If your organization is part of an institution, you can also go to the institution page to search all datasets in your institution.
Not finding what you need? You can also upload your own datasets to Redivis through your workspace to augment a research project you're working on and to share with other researchers. Or if you represent a research group, center, or institution, get in touch about setting up your own Organization page.
Learn more in the guide.
Learn more in the guide.
Learn more in the guide.
Learn more in the guide.
If your organization is part of an institution, you can also go to the institution page to browse all organizations in that institution.
Learn more in the guide.
Learn more in the guide.
We've started building a library of common workflow types and analytic tasks that might be useful when you're getting started working with data.
Please let us know if there are other actions or concepts you would like us to provide examples for by reaching out or emailing contact@redivis.com
Many datasets on Redivis are public, while others have requirements for certain levels of access enforced by the data owner. If the dataset is public, you can skip over this section, otherwise you'll need to gain access before you can fully utilize the data.
On the right side of all restricted dataset pages is a box which displays your current access level. If this doesn't say Data access, then your next step will be to apply for access.
Click the Apply for access button on the top of the page to open the access modal. This modal allows you to manage your access to this particular dataset throughout the platform.
Dataset access has five levels:
Overview: the ability to see a dataset and its documentation.
Metadata: the ability to view variable names and summary statistics.
Sample: the ability to view and query a dataset's 1% sample. This will only exist for datasets that have a sample configured.
Data: the ability to view and query a dataset's tables, and work with them in workflows.
Edit: the ability to edit the dataset and release new versions.
Access levels are cumulative. For example, in order to gain data access you will need to have gained metadata access as well.
Usage rules are restrictions placed on the way you can use this data once you've gained access. For example, data exports may be restricted to certain export environments, and/or require administrator approval.
While no action is required when you first apply for access to a dataset, it's helpful to review these restrictions to understand what limitations, if any, they might pose to your research.
The access modal for the dataset will include all steps designated by this dataset's administrators to gain access to work with data.
To apply for access to datasets hosted by a Redivis organization, you will first need to Join the organization.
To join an organization, you will need to provide information about your identity via an email authentication, usually the email associated with your Redivis account.
After submitting your membership application, it will either be automatically approved or will be marked as pending while an administrator reviews your submission. You'll receive a notification if your submission is approved or rejected.
Some datasets have requirements as part of their access applications. Requirements are global to an organization and usually contain some type of form. Once all requirements for a particular level of access are approved, you will gain access to the relevant dataset(s).
In the access modal, click the Apply button on any requirement to fill out and submit the form.
When you submit a requirement, an administrator of the organization will be alerted and can review your submission for approval (in some cases, the requirement may be configured for auto-approval upon submission). You will receive a notification if your submission is rejected or approved, or in the future when it's about to expire. You can also leave a comment for an organization's administrators here, and they can reply to you.
There are two types of requirements - those you fill out on behalf of yourself (member requirements) and those you fill out on behalf of a study (study requirements). If the dataset you are interested in has a study requirement, you'll need to create a study and select it in the access modal in order to submit the requirement. Additionally, any workflows that utilize this dataset will need to be a part of the approved study.
Sometimes you don't need to fill out any requirements and need to simply Request access at a specific level. You can do so in this modal, and will receive a notification when you are granted access by the dataset owner.
For some membership and requirement submissions (and all direct access requests), you will need to wait for the data owner to respond to your submission. If the dataset is owned by a user, they will need to approve your request; if it is owned by an organization, any administrator at that organization can approve it.
The data owner will receive a notification as soon as the requirement is submitted, and you will receive a notification when it is approved, rejected, or for any requests for additional information. By default, you will receive an email as well as a notification within Redivis. You can customize this notification behavior on your workspace settings.
If you are accessing restricted data hosted by an organization, you may have to re-authenticate after a period of inactivity anytime you perform an action governed by the "Data" access level.
After a period of inactivity (set between 15 minutes and 3 days by the organization administrator), a popup will ask you to re-authenticate anytime you view data, query tables in a transform, or export data from Redivis.
Add this dataset to a workflow to transform and analyze it leveraging lightning fast tools from your browser.
Learn more in the Work with data in a workflow guide.
It's important to make sure that you fully cite any datasets used in your work. Each dataset page includes citation information in the provenance section in multiple format options.
For example the Global Historical Climatology Network dataset hosted by the Stanford Center for Population Health Sciences would be cited in the APA format as:
Stanford Center for Population Health Sciences. (2022). GHCN Daily Weather Data (Version 1.2) [Data set]. Redivis. https://doi.org/10.57761/h9ff-vy04
Workflows across Redivis also include citation information in the provenance section in multiple format options.
Datasets and workflows on Redivis can be supplemented with related identifiers that situate a resource in its broader context. You can view the bibliography for a dataset or workflow from the corresponding provenance section.
To cite Redivis in publications use:
Redivis (RRID:SCR_023111). https://redivis.com
If referencing Redivis inline, the abbreviated citation is also acceptable:
Redivis (RRID:SCR_023111)
To cite Redivis, you can use the following BibTeX:
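```bibtex
% Illustrative entry based on the citation above; the key and field layout can be adapted as needed
@misc{redivis,
  title        = {Redivis},
  howpublished = {\url{https://redivis.com}},
  note         = {RRID:SCR\_023111}
}
```

Adapt the entry key and fields to your reference manager's conventions.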
You can also refer to the Redivis Re3data identifier
Studies are a way for one or more collaborators to work with data on the same conceptual topic. You can add multiple collaborators, datasets, and workflows to a study to organize your workflow. In addition, some restricted data requires you to submit access applications as a study via study requirements.
On the studies tab of your workspace you can create a new study and see all studies you are a part of. This study might represent a topic or group you are working with, maybe to investigate a similar cluster of research questions. You can give the study a name and description to reflect that purpose.
If you are working with anyone else on this study you can add them as a collaborator. Anyone who is a collaborator on this study will be able to view and edit the study just as you can. Once added, this study will appear in their workspace.
One person on the study can be designated as the PI. This designation has no functional meaning within Redivis interfaces; it is there to inform other group members, and any administrators you are applying to for access, of the group structure.
You can add any dataset to this study that you plan on working with, and remove it later if your topic shifts. This space is intended for you to gather resources that you are using in one place to make it easier to create workflows or manage data access. Datasets can be added to any number of studies.
You can create a new workflow in this study or move an existing one in. All workflows can be in at most one study, and you can see what study a workflow is in on the workflow's overview page.
One of the options for sharing this workflow is to grant view or edit access to all members of the study it's in. This can be an easy way to give everyone in a group access to a workflow you are working on.
As with all workflows, even though someone might have access to the workflow they will need to have independent access to all the datasets it contains in order to view or query them.
If you're working with restricted data you might come across a dataset that has a study requirement as part of its access requirements. You can identify this by the study icon next to the requirement name.
Study requirements are filled out and submitted on behalf of an entire study, rather than one person. An approved study requirement will be valid for all collaborators on the study, unlike member requirements where each individual needs to complete the requirement on their own account.
To submit a study requirement you can navigate to a restricted dataset and open the access modal. You will need to select your study from the dropdown menu, and a submit button will appear. One member of your study will fill this out and submit it on behalf of your group. Once approved, all study collaborators will see an approved requirement in their access modal.
When restricted data has a study requirement, the data administrator can approve your application submission for use of the data within that study. In order to query data in a workflow, that workflow will need to be in the approved study. If it is not, you will see a badge noting Limited access and you will not be allowed to run transforms or query this data in a notebook.
Once you've gained access and set up a study it's time to add your datasets to a workflow to transform and analyze them leveraging lightning fast tools from your browser.
Workflows are where you work with data on Redivis. In a workflow you can query, merge, reshape, and analyze any data that you have access to, all from within your web browser.
In a workflow, you can construct reproducible data transformations and analyses, and share and collaborate with your peers in real time.
Add a dataset to a new or existing workflow from any Dataset page where you have "Data access" by clicking the Analyze in workflow button.
Within workflows you can navigate between entities on the left side of the screen, and inspect them further on the right panel. You can inspect your dataset further by clicking on any table to see its cells and summary statistics.
To add more data to this workflow you can click the Add data button in the top left of the workflow toolbar. It is also possible to add other linked workflows to this workflow. This is useful as you develop more complex analyses that you want to segment into discrete pieces of work that you can link together.
You can find this workflow later by going back to your workspace.
Transforming tables is a crucial step in working with data on Redivis. Conceptually, transforms execute a query on source table(s), whose results are materialized in a new output table. In most cases you'll want to use transforms to reshape your data to contain the information you're interested in, before analyzing that table in a notebook or exporting it for further use.
To create a Transform, select a table in this dataset and click the +Transform button. You can get started here building a query through the point and click interface or writing SQL code by adding a SQL step.
For all transforms you will need to select which variables you want to keep in your output table. The rest of the steps are up to you. Some common operations you can get started with include joining tables, filtering records, creating new variables, and aggregating data.
Once you've built your query, execute it by clicking the Run button in the top right of the transform. This will create a new output table where you can inspect the output of your query by clicking on the table beneath the transform in the map and making sure it contains the data you would expect.
From here you can create a new transform from this output table to continue reshaping your data, or go back to your original transform to make changes and rerun it.
As you become more familiar with transforms, you can start doing more advanced work such as geospatial joins, complex aggregations, and statistical analyses.
Once you have a table you're ready to analyze, you can select any table and click the + Notebook button to create a notebook that references this table.
The default notebook configuration is free, and provides access to 2 CPUs and 32GB working memory, alongside a 60GB (SSD) disk and gigabit network. The computational power of these default notebooks is comparable to most personal computers, and will be more than enough for many analyses.
From here it’s all up to you in how you want to analyze and visualize your data. Once you’ve finalized your notebook, you can easily export it in different formats to share your findings!
You can share your in-progress work or finished results with collaborators by sharing this workflow.
Researchers can work side by side in this workflow in real-time. Leave comments to communicate, and see a visual cue for what each person is working on. You can even collaborate within a running notebook at the same time.
If any of the data in your workflow is restricted, your collaborator must also have access to those datasets in order to view their derivatives within your workflow.
Redivis workflows are built for collaboration and include real-time visuals to see where collaborators with edit access are working in the workflow, and a comments interface to discuss changes asynchronously.
Redivis workflows excel at working with large tables, whether it's filtering and joining, complex aggregation and date manipulation, or visualization and analysis.
Learn more in the guide.
Join in any other dataset table or output table in this workflow
Filter records to match defined parameters
Create new variables or value lists
Aggregate data
Learn more in the guide.
Notebooks are available in Python and R, as well as Stata or SAS (with a corresponding license). Notebooks come pre-installed with common libraries in the data science toolkit, but you can also customize the notebook's configuration and startup script to create a custom, reproducible analysis environment that meets your needs.
If you're working with larger tables, creating an ML model, or performing other particularly intensive tasks, you may choose to configure additional compute resources for the notebook. This will cost an hourly rate based on your chosen environment, and will require you to purchase compute credits on your account.
Notebooks come pre-populated with some starter code you can use to import data, and the client library docs contain comprehensive documentation and further examples.
To learn more about analyzing data, see our guide.
Share your workflow to work with collaborators in real time, and make it public so that others can fork off of and build upon your work.
If you'd like to export data to a different system, you can download it in various file formats, reference it programmatically in Python / R, or visualize it in tools such as Google Data Studio.
Learn more in the guide.
Learn more in the guide.
Transforming tables in a workflow is a crucial step when working with data on Redivis. Conceptually, transforms execute a query on a source table and results are materialized in a new output table. They are optimized to run on billions of records in seconds, and create a transparent, reproducible record of data transformation steps you've taken in your workflow.
In most cases you'll want to use transforms to create an output table containing all the information you're interested in before analyzing that table in a notebook or exporting it for further use.
Once you've created a workflow and added a dataset to it that you'd like to work with, get started by creating a Transform. You can do this by clicking on any dataset or table.
To build this transform, you will add steps, each of which takes an action to shape the data in the output table. You can choose from many named steps with a point-and-click interface that compiles to SQL code, or you can add a SQL query step to write code directly. You can always view the code that your step is generating and switch to code if you'd like to edit it.
While SQL might not be a familiar language, it is optimized for data cleaning procedures and allows transforms to execute extremely quickly. It also allows you to write your data transformations in a declarative, reproducible manner.
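As a rough sketch, a SQL query step that narrows and cleans a table might look something like the following (all table and variable names here are placeholders, not identifiers from a real dataset):

```sql
-- Placeholder names throughout; substitute your own tables and variables
SELECT
  patient_id,
  admission_date,
  UPPER(state) AS state_code
FROM admissions
WHERE admission_date >= DATE '2015-01-01'
```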
The first thing you'll want to do is decide if you have all the information you need in this table or if you'd like to join in an additional table or tables. You can reference any other table from either 1) this dataset, 2) another dataset, or 3) an output table from a different transform in this workflow. Any table you want to reference needs to be in this workflow, so you can click the Add dataset button to add a new dataset to the workflow if it's not here already.
To join a table, add a Join step and select the table you'd like to join. You'll then need to select what type of join it will be, and build your join condition. In most cases your join condition will be linking two variables of the same name and type together (e.g. join all records where id = id).
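In SQL terms, a join step like this corresponds to a query along the following lines (the table and variable names are hypothetical):

```sql
-- Inner join two hypothetical tables on a shared id variable
SELECT a.id, a.total_charge, b.birth_date
FROM charges AS a
INNER JOIN demographics AS b
  ON a.id = b.id
```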
Learn more in the Joins reference section.
You might want to generate new columns in this table by creating a new variable. You can do so by adding a Create variables step. Start by giving this variable a name and then choosing what method you want to use to create it.
Some methods are restricted to the type of variable you are working with. For example there are options to add or subtract years from a date variable, or concatenate string variables.
One of the most common new variable methods is Case (if/else). This allows you to set up a statement that looks at the content of each record and evaluates it based on conditions you've set up to generate the value. For example you can say that in the new variable you're creating, if the amount in column A is greater than 1000, the value of this new variable will be set to "high" and if not, then it will be set to "low."
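The compiled SQL for an if/else variable like the example above would look roughly like this (the column and table names are hypothetical):

```sql
-- Derive a new variable "charge_level" from a hypothetical column_a
SELECT
  *,
  CASE
    WHEN column_a > 1000 THEN 'high'
    ELSE 'low'
  END AS charge_level
FROM charges
```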
You can create any number of new variables, and they will execute sequentially, allowing you to reference variables you've created in other variables and subsequent sections.
Learn more in the Create variables reference section.
You'll probably want to reduce the number of records in this table to exclude any that aren't relevant to the final output table you're creating. This will allow you to execute transforms quickly and get a better understanding of the number of relevant records you have.
To filter records, add a Filter step to your transform and start building the conditions that records will need to meet in order to stay in your output table. These statements can be nested and allow you to reference any variables or record values.
If you find yourself working with a standard list of values you're using in multiple places, this might also be a place to save time and enhance reproducibility by creating and referencing a value list.
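Under the hood, a filter step compiles to a WHERE clause. A condition like "admissions from 2015 onward in two states of interest" might look like this (hypothetical names, with the IN list standing in for where a value list could be referenced):

```sql
-- Keep only records that satisfy every condition
SELECT *
FROM admissions
WHERE admission_date >= DATE '2015-01-01'
  AND state IN ('CA', 'NY')
```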
Learn more in the Filters reference section.
Depending on the structure of your data you might want to aggregate the data to collapse multiple records down to one, while preserving some information about what was dropped. Conceptually this might look like aggregating a table of charges from one record per charge into one record per person including a variable for the total charge per person.
To get started, add an Aggregate step and select the variables that you want to aggregate on. These will be the variables that exist in your output data after the aggregation is finished. All records that match exactly on these selected variables will be collapsed into a single record.
You can also capture information about records being dropped by creating a new aggregate variable. For example, maybe you are aggregating a table with multiple test scores per person down to a table with just one record per person. You can create an aggregate variable with the average test score, or the count of the number of tests each person took.
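The test score example above corresponds to a grouped query along these lines (the table and variable names are hypothetical):

```sql
-- Collapse to one record per person, keeping the average score and a test count
SELECT
  person_id,
  AVG(score) AS avg_score,
  COUNT(*) AS num_tests
FROM test_scores
GROUP BY person_id
```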
Learn more in the Aggregate reference section.
Finally, before running your transform you'll always need to select which variables you want to keep in your output table. Perhaps you referenced a variable in this transform to create a new one, and now you don't need it anymore. Cutting variables means faster execution, and it is easy to add any variables back later and re-run the transform if you realize you need them, so in general try to keep this list as short as possible.
To propagate variables forward into your output table, they need to be in the right-hand box at the top of the transform labeled Keep. You can select any variable, or set of variables and click the > arrow button to move them over.
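Conceptually, the Keep list becomes the SELECT clause of the compiled query; keeping just three variables might compile to something like this (hypothetical names):

```sql
-- Only the kept variables are propagated to the output table
SELECT person_id, avg_score, num_tests
FROM aggregated_scores
```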
Learn more in the Variable selection reference section.
When you're ready to execute the transform click the Run button in the top right of the toolbar.
If this button is disabled, it might be because the transform is invalid for some reason. Hovering on this button will give you more information, or you can look for the alert symbol (!) to see where you'll need to fix the transform in order to make it valid to run.
Successfully running a transform will create a new output table. Click on this table to see your output and verify that the transform executed as you would expect. Does it have the approximate number of records you expect to see? If you click on a variable you can see summary statistics which can help you verify key variable concepts, such as the distinct number of values.
Every time you run a transform you will want to sanity check your output.
If there are any issues with your output table, or if you decide to go in a different direction, or add another step, it is easy to go back to your transform and start making changes.
You'll notice that this transform is now yellow and so is its output table. These edited and stale state indications help you keep track of work you're doing in your workflow, and you can easily revert to the previous successfully executed transform state at any point.
Run this transform again to see changes in the output table. This guide describes all the steps you can take in a single transform but perhaps you want to do one step at a time and run it in between each step to sanity check your output. This system is designed for iteration so make as many changes as you want to experiment and build the output you want.
From here you can continue to create transforms on your output table, tables from the dataset you added, or add any other dataset on Redivis. You might want to fit as many steps as you can into one transform, or make a long chain to more easily track your work and communicate it to others.
As you build more transforms you'll see that sometimes actions you take create stale chains of transforms. You can easily make upstream changes in your workflow (such as upgrading a dataset from the sample to the full dataset, or updating to a new version) and then run all transforms in your workflow with one click (a Run all option is available in the Map button menu).
You can use notebooks in a workflow to analyze data using Python, R, Stata, or SAS. These notebooks run in the browser with no additional configuration and seamless sharing with collaborators.
Learn more in the Work with data in notebooks guide.
If you'd like to export data to a different system, you can download it in various file formats, reference it programmatically in Python / R, or visualize in tools such as Google Data Studio.
Learn more in the Export to other environments guide.
Augment your data analysis in Redivis by uploading your own datasets, with the option to share with your collaborators (or even the broader research community).
Learn more in the Create and populate a dataset guide.
Redivis workflows are built for collaboration and include real-time visuals to see where collaborators with edit access are in the workflow, and a comments interface to discuss changes asynchronously.
Share your workflow to work with collaborators in real time, and make it public so that others can fork off of and build upon your work.
This guide demonstrates using a Redivis workflow to gather key variables from different tables, clean them, and consolidate them into a single table for analysis in Google Looker Studio, all using point and click interfaces.
We'll use Gini coefficient data (a common measure of income distribution) from the American Community Survey (ACS) to build a map that shows the range of income inequality by US county in 2018.
This workflow is on Redivis! We also suggest you recreate this workflow as we go to best learn the process.
We have already uploaded all relevant data for this workflow into a single dataset in the Redivis Demo organization called Example Workflow Dataset, which contains three tables:
ACS Gini coefficient table
ACS table that measures population by county
National Weather Service table that maps each US county to lat/long coordinates.
Add this dataset to your workflow and click on the dataset node to see all of the available tables. Select Gini Index by County and look at the data format by clicking Cells.
We want to map the Gini estimate and population by US county in our final data visualization. To map data points by county, we will need a variable that represents the latitude, longitude coordinates of that county. We also want the state, county name, and five-digit county code.
Therefore, we want our final dataset to include six distinct variables:
Gini Estimate
State
County Name
County Code
Latitude, Longitude
Population
We will get Lat/Long and Population by joining our table with other tables later on in the workflow. For now, we can create a transform to start reshaping our data.
We already have this variable in our dataset: B19083_001E. We will create a Rename step to rename this to gini_estimate so we can keep better track of it.
We want to break up our NAME variable into two separate columns, county and state. So we will add the Create variables step and use the Regexp extract method.
This method allows users to create a new variable from characters that match a specific pattern in the old variable. For example, to create a new variable for county, we can select all characters in NAME preceding the comma. We can add a new block and do a similar process on the NAME variable to get the State.
The last five digits in the GEO_ID variable represent the county code. We create another new variable block and name our new variable County_code. Select the Substring method and select all values starting at index 10 up to a max length of five.
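If it helps to see the equivalent logic outside the transform interface, here is a pandas sketch of the same Regexp extract and Substring operations. The sample rows are invented, and the exact patterns in the workflow may differ slightly.

```python
import pandas as pd

# Hypothetical rows mirroring the ACS NAME and GEO_ID columns
df = pd.DataFrame({
    "NAME": ["Autauga County, Alabama", "Kent County, Rhode Island"],
    "GEO_ID": ["0500000US01001", "0500000US44003"],
})

# Regexp extract: characters before the comma become County, after it become State
df["County"] = df["NAME"].str.extract(r"^([^,]+),", expand=False)
df["State"] = df["NAME"].str.extract(r",\s*(.+)$", expand=False)

# Substring: start at character 10 (1-based), max length 5 -> the 5-digit county code
df["County_code"] = df["GEO_ID"].str.slice(9, 14)
print(df)
```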
Finally, we choose which variables to keep or discard in our transform. We’ll keep all variables except for GEO_ID, NAME (county and state combined), and B19083_001M (the margin of error for the Gini estimate).
Click the run button in the upper right-hand corner to run this transform.
Click on the newly generated output table to investigate the output of the transform. We can see that this table has the same number of records as our initial table and has the four variables we selected. If we click on the County and State variables we can see in the frequency tables that they look like we wanted them to.
Since this table looks like we expect we can move on to the next step! Otherwise we'd need to go back to the initial transform to change our inputs.
We now have a table with the Gini estimate, state, county name, and county code, but we need the latitude and longitude information in order to map each county and we also want to account for the size of each county with a population variable. To do this, we'll perform two joins.
We could continue to work in our initial transform but we are choosing to create new transforms for each join to keep steps separated conceptually and provide output tables to reference along the way.
First, we will match each county code to its lat/long coordinates by joining with the "County to Lat/Long Coordinates" table.
Create a new transform and add a Join step.
A Left Join returns the complete source table (Gini Index by County) and all matching records from the right table (County to Lat/Long). We set the County_code variable in the source table equal to the County_code variable in the right table, which matches a set of lat/long coordinates to each county code (if there is no coordinate, it will return Null).
When you set two variables equal to each other in a Join, they must be of the same type. If, for example, you set a string type variable equal to an integer type variable, you will have to retype the string as an int (or vice versa).
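As an illustration of the join and the retyping requirement, here is a small pandas sketch with invented county codes; the actual step is configured in the Join step rather than in code.

```python
import pandas as pd

# Hypothetical source (t0) and right (t1) tables
gini = pd.DataFrame({"County_code": ["01001", "44003"], "gini_estimate": [0.40, 0.45]})
coords = pd.DataFrame({"County_code": [1001, 44003], "LAT": [32.5, 41.7], "LON": [-86.6, -71.6]})

# Join keys must share a type: retype the integer code to a zero-padded string
coords["County_code"] = coords["County_code"].astype(str).str.zfill(5)

# Left join: keep every source record, matching coordinates where available (else Null)
joined = gini.merge(coords, on="County_code", how="left")
print(joined)
```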
Some county codes, however, map to more than one set of latitude, longitude coordinates, so we must create two new variables in this transform: unique_lat and unique_long.
For unique_lat, we will partition on all kept variables except LAT and take the average of the LAT variable. This says: for each county, compute the average of all possible latitudes and store that average in the variable unique_lat.
We will select all variables from the source table, as well as unique_lat and unique_long, in our output table. Finally, we select "distinct" in our variable selector so that duplicate records are dropped and only one record remains per unique combination of values.
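For intuition, the partitioned average plus the distinct selection behave roughly like this pandas sketch (the sample coordinates are made up):

```python
import pandas as pd

# Hypothetical join output where one county code maps to two coordinate rows
df = pd.DataFrame({
    "County_code": ["01001", "01001", "44003"],
    "LAT": [32.50, 32.54, 41.70],
    "LON": [-86.60, -86.64, -71.60],
})

# Partitioned average: one unique coordinate per county
df["unique_lat"] = df.groupby("County_code")["LAT"].transform("mean")
df["unique_long"] = df.groupby("County_code")["LON"].transform("mean")

# Keep the new variables, drop the originals, then de-duplicate ("distinct")
out = df.drop(columns=["LAT", "LON"]).drop_duplicates()
print(out)
```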
Running this transform outputs the following table, where each county now corresponds to a unique set of latitude and longitude variables.
We now have all the desired variables in our dataset except for population per county, which we will need when we create the final visualization.
Create a new transform and add a Join step.
Using the same steps from the previous join, we will perform a second join to match each county to its total population by incorporating the Population by County table (also in the original dataset). We select a Left Join and set the County_code variable in the source table (Gini Index by County) equal to the County_code variable in the right table (Population by County).
Selecting all variables to remain in our output table and running this transform outputs the following table.
We will be using Google Looker Studio to visualize this data so we will need to update our data to make it work smoothly with their specified format for geographic data. Using their reference sheet, we can see that we will need our latitude and longitude information in the format:
We will need to combine our two separate values and add a comma between them.
We can do that using the Concat method for creating a new variable, but that method only accepts string inputs, so we'll first need to retype the unique_lat and unique_long variables from float to string using the Retype step.
Then we use the Create variables step to make a new variable named latitude_longitude and use the Concat method to concatenate the unique_lat string, a string containing a comma, and the unique_long string.
We can now discard unique_lat and unique_long and keep the combined variable, latitude_longitude.
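The retype-and-concatenate logic is equivalent to something like the following pandas sketch (values invented):

```python
import pandas as pd

# Hypothetical table with separate float coordinates
df = pd.DataFrame({"unique_lat": [32.52, 41.70], "unique_long": [-86.62, -71.60]})

# Retype floats to strings, then concatenate with a comma separator
df["latitude_longitude"] = df["unique_lat"].astype(str) + "," + df["unique_long"].astype(str)

# Discard the separate coordinate columns, keeping the combined variable
df = df.drop(columns=["unique_lat", "unique_long"])
print(df)
```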
Running this transform yields the final table. From here, we can edit variable metadata and/or download our new dataset in the format that we'd like.
The next step would normally be to create a notebook in this workflow and use Python, R, or Stata to analyze this table. However if we want to create a quick visualization and aren't familiar with any of those coding languages we have easy options to export data to other systems such as Google Looker Studio.
For this example we will link this table directly to Google Looker Studio by clicking Export table on this table.
We could download our final table and then re-upload it into Looker Studio, but we choose to link the table through the Redivis connector so that if we come back to this original workflow and make changes, they will be reflected in the visual we're about to make there.
Follow the prompts of the connector to log in and authorize access, then we'll need to indicate the table we'd like to use. For that we'll need:
The owner of the workflow's ID (shown in the top menu bar of the workflow)
The name of the workflow (shown in the middle of the black workflow toolbar)
The name of the table (shown on the title of the table node)
When the table is being imported we will have the option to change the type of each variable. We'll need to change the type of our latitude_longitude variable from string to the Latitude, Longitude geography type.
Then click Create report to get started!
For this example we are going to build a bubble map that shows the size of each county's population against the intensity of Gini disparity in certain regions.
Get started by adding a chart and selecting the bubble map type. We will need to select Redivis as the data source for this map.
Then we will need to define which of our variables map to which parts of the visual. We will also want to change how they are aggregated, from SUM to AVG.
Location: latitude_longitude
Tooltip: County
Size: (AVG) population
Color metric: (SUM) gini_estimate
The resulting map is automatically interactive. Users can hover over each bubble and view the county name, population, and Gini coefficient. The bubble size is determined by the county population and the bubble color is determined by the Gini value. View the interactive report:
You can continue to create a variety of visualizations in this report, including scatter plots, Google maps, and stacked bar charts.
Refer to the Looker Studio Help pages for additional specific guidance!
Perhaps we see something in this workflow we want to tweak, or we want to go back and change some of our data decisions. Workflows are iterative, and at any point you can go back and change your source data, your transform configurations, or your notebooks and rerun them.
Notebooks can also create output tables, which allow you to sanity check the work done in the notebook or create a table to use in another notebook or transform. You can also fork this workflow to work on a similar analysis, or export any table in this workflow for work elsewhere.
This guide demonstrates using a Redivis workflow to gather key variables from different tables, clean them, and consolidate them into a single table for analysis in a notebook.
We want to take weather data collected around the world and use it to understand how precipitation trends in the US have changed over time.
This workflow is on Redivis! We also suggest you recreate this workflow as we go to best learn the process.
All the weather data we need is contained in the Demo organization dataset GHCN Daily Weather Data.
To get started we want to understand this dataset and what information is in each table. We can look at the dataset page to learn more about it, including its overview information, metadata, and variable summary statistics. Since this dataset is public we can also look directly at the data to confirm it has the information we need. Some tables jump out as ones we will want to work with:
This table has nearly 3 billion records and seems to be the table in the dataset with the most information, so we will start here. There is a variable that looks like it will be very helpful for our goal: element.
If we click on the variable name, we can see a searchable frequency table with all this variable's values. One value that jumps out is PRCP, which we can see on the value label represents precipitation. Paired with this is the variable value, which contains a numeric value recording the magnitude of the element. We can learn even more about this by clicking on the data dictionary in the dataset's overview page to see that this value is measured in either mm or inches.
We can see that this table doesn't contain any information about location though, so we'll need to find that somewhere else.
This table contains latitude and longitude variables, as well as a linking variable id which will allow us to link in precipitation information from the Daily observations table, since it has the same id variable.
Since there are two tables we want information from we know that we will need to do a join as part of our data reshaping.
At the top of this dataset page we can click the Analyze in workflow button to get started working with this data.
You can add this dataset to an existing workflow you already have access to, or create a new workflow to start from scratch.
We will use transforms to clean the data, as they are best suited for reshaping data and will quickly output new tables we can continue to work with. Even though we might be more comfortable with Python or R, the table we want to work with is 132GB and we would not be able to work with it in a notebook without specialized equipment.
Filter for precipitation
If we click on the element variable we can see that there are many types of elements recorded in this table. Since we only want information about precipitation (PRCP), we will start by filtering out any records with a different value in the element variable.
Start by creating a new transform on the Daily observations table, and add a Filter step. Configure the filter to select for records where element = PRCP.
Looking at the data, it's clear there are many observations of precipitation at each location over time. We know we want to look at annual trends so we can start by aggregating this data to only have one value per year per location.
Conceptually, we will need to aggregate on the year of each record, but if we look at the date variable in this table it also contains the month and day. For aggregation we will need a field which contains only the year, which we can collapse on.
In this same transform we will add a Create variables step and make a new variable named year using the date extract method, which will pull the year out of the date value.
Then we will add an Aggregate step to aggregate the values based on our new year variable. The first thing we will do is select the variables to aggregate on, that is, the variables on which records with duplicate values will be grouped together. Since we want information on precipitation per station per year, we will aggregate on the id variable (station) and the year variable. When executed, this will group all records with the same combination of values in id and year, drop all other variables from the table, and then collapse duplicate records down to just one record per unique combination of those two variables.
But the most important step here is that we want to gather information about the duplicate records that were dropped! If there were 400 records of precipitation in a year for a particular station, we want to know what those records all add up to. To do this we will create a new aggregate variable within this aggregation step named annual_precip. We want to aggregate the information in the value column, since that contains the numeric amount of the corresponding element variable. Since we want the total amount across all dropped records we will use the Sum method.
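Taken together, the year extraction and aggregation behave roughly like this pandas sketch. The station ids and values are invented, and the real work happens in the transform rather than in code.

```python
import pandas as pd

# Hypothetical daily observations already filtered to element == "PRCP"
obs = pd.DataFrame({
    "id": ["US1FLAL0001", "US1FLAL0001", "US1NYKN0002"],
    "date": pd.to_datetime(["2020-01-03", "2020-07-19", "2020-02-11"]),
    "value": [120, 45, 80],
})

# Create a year variable (date extract), then aggregate on id and year,
# summing the dropped records' values into annual_precip
obs["year"] = obs["date"].dt.year
annual = (
    obs.groupby(["id", "year"], as_index=False)["value"].sum()
       .rename(columns={"value": "annual_precip"})
)
print(annual)
```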
The final step in a transform is selecting which variables we would like to populate the resulting output table. Since we did an aggregation step, this list becomes limited to only the variables that we aggregated on. Since we want both of these variables in our output we make sure both are selected and visible in the right side list of the footer.
With everything in place we will run this transform to create a new table, by pressing the Run button in the top right corner!
Now that we have created a new table, we can inspect it to make sure our steps accomplished what we expected them to.
Click on the output table below the transform to view it. We can see that it contains the two variables we expected it to, based on our variable selection. The table has almost 3 million records, down from the almost 3 billion in our original table, which makes sense given our aggregation step.
We can also inspect each variable further. If we click on the year variable we can see that all values are four-digit years, and from the min and max values in the summary statistics we can see that the years range from 1781 to 2021, which makes sense. Looking at the frequency table we can see that each year appears with high frequency in the records, which is what we expect since there are multiple ids (locations) per year.
If we click on the annual precipitation variable we can see a max value of 792,402 (mm or in) which seems very high but possible for a given year. The minimum value is -148,917 which doesn't seem right. When we look at the data dictionary it doesn't indicate that there should be negative values.
We can investigate our data further by clicking on the Query tab of the table, choosing "Sort on variable" from the templates menu, and sorting on our annual_precip variable ascending (ASC). It looks like there are relatively few stations with negative values.
If we were doing a rigorous analysis of this data we might dig deeper into why these value exist in this dataset, or decide to exclude these outliers. However since this is an example we can decide to leave them in and see how it affects our outputs. We can easily go back and exclude them from the workflow at a later point.
Now that we have a table we are satisfied with that contains annual precipitation information, we want to join in the latitude and longitude information about each station so we can make a geographical visual as part of our analysis.
This geographic information is included in the Stations table, so we will need to do a join which connects records from two tables.
While we could do this transformation step in our previous transform and rerun it, let's create a new transform to make this join so that we will have our previous output table saved to compare the output to.
In the new transform add a Join step. Set the joined table to the Stations table from our original dataset. The join type will depend on which records we want to keep based on how the matching process goes. Since we only want to keep records that have both information from our source table (annual precipitation) AND information from our joined table (station latitude and longitude), we set the join type to Inner join.
Since our identifier in both tables is the variable id, set id from the source table (t0) as the left side of the join condition and id from the joined table (t1) as the right side. This means that for every record in our source table, we look for a corresponding value in the Stations table and join in the values from additional columns of that table.
There are a number of additional variables in the Stations table, but we are only interested in the latitude and longitude variables. In our variable selector we move over all variables from our source table (year, id, annual_precip) as well as latitude and longitude.
With everything set up, run this transform and sanity check the output table! We started with a 132GB table and have narrowed it down to 134MB containing only the data we want in the format we will need.
Next we want to generate an interactive visual to understand annual precipitation events in different locations over time. To do this we will create a Notebook node on our output table.
For this workflow we will use Python but you can also use R, Stata, or SAS if you'd like. When you create the notebook for the first time it will start up. Notebooks must be running to execute code.
Newly created notebooks come with standard code to import the Redivis library and reference the source table in a pandas dataframe within the notebook.
You can use this pandas code or replace it to use the dataframe library of your choice. To use the standard code, click inside this cell and press the run button, or press the Shift + Enter keys.
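For reference, the generated cell typically looks something along these lines; the exact table reference is produced for you by the notebook, so treat the name below as a placeholder.

```python
import redivis

# Placeholder table reference; the notebook generates the correct one for you
df = redivis.table("annual_precipitation_output").to_pandas_dataframe()
df.head()
```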
Now we will create a new cell to organize our work by pressing the + button in the top bar.
In this cell we want to limit our records to only ones whose latitude and longitude values fall in a specific range of the continental United States.
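Continuing from the dataframe loaded above, a cell along these lines would do it; the bounding box below is only an approximation of the continental United States and is an assumption for illustration.

```python
# Approximate continental US bounding box (an assumption; adjust as needed)
conus = df[
    df["latitude"].between(24, 50) & df["longitude"].between(-125, -66)
]
```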
We want to see how our annual precipitation variable has changed at stations in the United States, so for each station we will group by the id variable and compute an average of annual_precip.
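Continuing from the filtered dataframe above, the grouping might look like this (keeping the station coordinates so we can plot them later):

```python
# Average annual precipitation per station, keeping coordinates for plotting
avg_precip = (
    conus.groupby(["id", "latitude", "longitude"], as_index=False)["annual_precip"]
    .mean()
)
```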
Now that we have the information calculated, we can plot it! We'll use Plotly to do the heavy lifting, and point it at our relevant variables.
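A minimal Plotly sketch, assuming the avg_precip dataframe from the previous cell; the styling choices here are illustrative only.

```python
import plotly.express as px

# Plot average annual precipitation by station location
fig = px.scatter_geo(
    avg_precip,
    lat="latitude",
    lon="longitude",
    color="annual_precip",
    scope="usa",
    title="Average annual precipitation by station",
)
fig.show()
```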
Now we've created an interactive figure showing change in precipitation by location in the United States.
Perhaps we see something in this plot we want to investigate further, or we want to go back and change some of our assumptions. Workflows are iterative, and at any point you can go back and change your source data, your transform configurations, or your notebooks and rerun them.
Notebooks can also create output tables, which allow you to sanity check the work done in the notebook or create a table to use in another notebook or transform. You can also fork this workflow to work on a similar analysis, or export any table in this workflow for work elsewhere.
This guide demonstrates using a Redivis workflow to train an ML model on a set of images stored in a Redivis dataset.
This is an example workflow demonstrating image classification via Convolutional Neural Networks. It imports an example dataset containing several thousand test and training images of cats and dogs, with which we can train and evaluate our model.
This workflow is heavily adapted from its initial publication at: https://gsurma.medium.com/image-classifier-cats-vs-dogs-with-convolutional-neural-networks-cnns-and-google-colabs-4e9af21ae7a8
We can go to this dataset to browse its tables and understand the structure of the data it contains.
We see three tables here, and all of them are file index tables. That means that each table contains an index of the files (unstructured data) this dataset contains, sorted by the folder the administrator uploaded them into. We can click on the Files tab of the dataset to see each file individually, and click on it to see a preview.
This dataset has three groupings of files:
Training images (we will use these to build the model)
Test images (images not included in the training set that we can verify the model with)
Example file types (unrelated to this workflow)
If we click on the Tables tab, and click on the training images table, we can see high level information about this set of files. We can see that there are 25,000 files, and when we click the Cells tab, all of the file names end in .jpg. We can hover on these to see a preview of the image, and we can click on the file_id variable to see a preview of the image with more information.
At the top of this dataset page we can click the Analyze in workflow button to get started working with this data.
You can add this dataset to an existing workflow you already have access to, or create a new workflow to start from scratch.
We will use transforms to clean the data, as they are best suited for reshaping data and will quickly output new tables we can continue to work with.
We need to start by defining the training set, which conceptually means the set of images we know are cats and know are dogs to train the model on. Information about whether an image is a cat or dog is in the file name, so we need to pull it out into a new variable we can more easily sort on.
We want to include only some of our training set images in the set we use to train the model, since we want to leave some aside to validate the model. Here we want to include exactly 5000 cat images and 5000 dog images, so we will create a new variable rank and filter on it so that we only keep the first 5000 images of each type.
The final step in the transform is deciding which variables we want in our output table. We will keep our new boolean variable is_cat to use later, along with the file_id and file_name variables.
With everything in place we can run this transform by clicking the Run button in the top right corner.
Now that we have created a new table, we can inspect it to make sure our steps accomplished what we expected them to.
Click on the output table below the transform to view it. We can see that it contains 10,000 records, which is exactly what we expected.
We can also inspect each variable further. If we click on the is_cat variable we can see that there are 5000 true values and 5000 false values, which shows that our filtering was successful. We can also validate that the method we used to determine if an image is a cat or a dog worked by clicking on the Cells tab. Here we can see that records marked True have "cat" in their file name, and when we hover on the file_id value to see a preview, the image clearly contains a cat.
Since this table looks like we expect we can move on to the next step! Otherwise we'd need to go back to the initial transform to change our inputs.
We need to create a set of image files separate from our training set where we know if the image contains a cat or dog. This will be used to validate the model training.
Create a new transform and take all the same steps as we did in the previous transform, but we will change the filter to keep images ranked 5001-7500, rather than 1-5000.
We will keep the same variables as we did in our training model, and then run this transform.
When we run this transform and inspect the output table we see what we expect here as well. There are 5000 total files and we can validate a few of them visually on the Cells tab.
When you create the notebook for the first time it will start up. Notebooks must be running to execute code.
Redivis notebooks come with many common packages preinstalled, and you can install additional packages by clicking the Dependencies button and importing libraries in the code.
Since this notebook contains only public data we can install packages at any time, but for restricted data notebooks do not have internet access and packages can only be installed when they are stopped.
Newly created notebooks come with standard code to import the Redivis library and reference the source table in a pandas dataframe within the notebook. For this example we will remove this sample code to import data according to our library's parameters.
This is where we will heavily rely on our selected libraries to build the model.
This is where we will train the model we just built using the image data we cleaned.
Now we will use the validation set to see how well our model works.
This workflow is on Redivis! We also suggest you recreate this workflow as we go to best learn the process.
All the image data we need is contained in a dataset hosted in the Redivis Demo organization.
Click on the table Training images and create a new transform. This interface is where we will define a query which will run against our source table and create an output table. You can choose to write the query in SQL, but we will use the interface for this example since it is faster and easier to use.
Add a Create variables step and name the new variable is_cat. The method will be Regexp contains, which allows us to easily identify the presence of the string cat in the file_name variable. This new variable will be a boolean variable where true means the image contains a cat and false means it does not.
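For intuition, the Regexp contains logic is equivalent to something like this pandas sketch (the file names are invented):

```python
import pandas as pd

# Hypothetical slice of the training images file index
files = pd.DataFrame({"file_name": ["cat.101.jpg", "dog.42.jpg", "cat.7.jpg"]})

# Regexp contains: True when the file name contains the string "cat"
files["is_cat"] = files["file_name"].str.contains("cat")
print(files)
```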
To do this, click + Add block in the Create variables step and use the Rank method. This method supports partitioning, which means you will partition on the true and false values of is_cat. For each partitioned value (true and false) a rank will be assigned.
Create a new Filter step. Conceptually we will keep records with a value up to 5000 in the rank variable, which means the output will include 5000 true values and 5000 false values.
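The rank-within-partition and filter steps behave roughly like the following pandas sketch; the tiny sample and the value of N are only for illustration (the guide keeps 5000 per class).

```python
import pandas as pd

# Hypothetical index with the is_cat flag already created
files = pd.DataFrame({
    "file_name": ["cat.1.jpg", "cat.2.jpg", "dog.1.jpg", "dog.2.jpg"],
    "is_cat": [True, True, False, False],
})

# Rank within each is_cat partition, then keep the first N of each class
N = 1  # the guide uses 5000; N is small here only so the toy example filters something
files["rank"] = files.groupby("is_cat").cumcount() + 1
training_set = files[files["rank"] <= N]
print(training_set)
```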
Next we want to train and test a model using Python code with the help of various Python libraries. Transforms are more powerful than notebooks, but they are based on SQL and operate linearly, with only a single output table allowed. In order to work in Python, R, Stata, or SAS to generate visuals and other outputs, we will create a Notebook node on our Training data output table.
The main libraries used to create this model are listed in the notebook; you can view their documentation for further details.
Perhaps we see something in this model we want to tweak, or we want to go back and change some of our underlying data. Workflows are iterative, and at any point you can go back and change your source data, your transform configurations, or your notebooks and rerun them.
Notebooks can also create output tables, which allow you to sanity check the work done in the notebook or create a table to use in another notebook or transform. You can also fork this workflow to work on a similar analysis, or export any table in this workflow for work elsewhere.
This guide demonstrates using a Redivis workflow to import an existing LLM and then use relevant data to fine tune it and run it on another similar set of data we are interested in.
Here, we want to fine-tune a pre-trained "foundational" LLM so that it can be used to score reviews. We will leverage an existing dataset that contains a collection of Yelp reviews and their scores to perform the fine-tuning, and then apply this classification model to other reviews (from Reddit) that do not contain an accompanying score. The goal here is to demonstrate how Redivis can be used to leverage, modify, and ultimately apply state-of-the-art LLMs to novel data.
This workflow is on Redivis! We also suggest you recreate this workflow as we go to best learn the process.
For this workflow we'll need our initial data to train the model on (in this case Yelp reviews) and the data we want to apply the model to (Reddit posts). These data are already on Redivis, split across two datasets uploaded to the Redivis Demo organization: Yelp Reviews (Hugging Face) and Reddit.
To get started we want to understand this dataset and what information is in each table. We can look at the dataset page to learn more about it, including its overview information, metadata, and variable summary statistics. Since this dataset is public we can also look directly at the data to confirm it has the information we need.
It looks like there are two tables, one with reviews for testing a model and another with reviews for training a model. Clicking on each table in this interface shows that they both have two variables (label and text), and that the Train table has 650,000 records while the Test table has 50,000 records.
This data seems to be formatted exactly how we'll want to use it, so we don't need to do additional cleaning.
This dataset contains over 150 million Reddit posts along with subreddit information, split across two tables. We can look more closely at the 33 variables in the Reddit posts, including univariate statistics.
For this workflow, we just want to look at reviews from one specific subreddit which reviews mouse traps: MouseReview. If we click on the Subreddit variable name, we can see a searchable frequency table with all this variable's values. If we search for MouseReview we can see that this dataset contains 26,801 posts.
To move forward with this workflow we'll want to train a model on the Yelp dataset, and filter and clean the Reddit table to make it more usable with our model. In order to clean or transform data and do our analysis we'll need to create a workflow.
We want to leverage an existing model that understands language and can generally be used for language classification. There are many open-source models that might meet our needs here; in this example, we'll use Google's BERT-base-cased model.
This model is hosted on Hugging Face, so we could load it directly into our notebook at runtime. However, if our notebook uses restricted data, it might not have access to the external internet, in which case we'll need to load the model into a dataset on Redivis.
The Redivis dataset for this model can be found here. You can also learn more about loading ML models into Redivis datasets in our accompanying guide.
At the top of any dataset page, we can click the Analyze in workflow button to get started working with this data.
You can add this dataset to an existing workflow you already have access to, or create a new workflow to start from scratch.
Add the additional datasets by clicking the + Add data button in the top left corner of the workflow and searching for the dataset by name.
Once we've added all our datasets to the workflow, we can get started. To begin, we'll create a python notebook based on the Yelp reviews training data, by selecting that table and clicking the + Notebook button.
To enable GPU acceleration, before starting the notebook, we'll choose a custom compute configuration with an NVIDIA-L4 GPU, which costs about $0.75 per hour to run (we could use the default, free notebook for this analysis, but it would take substantially longer to execute).
We'll also need to install a few additional dependencies to perform training and inference via Hugging Face python packages:
With that, we can start our notebook! The full annotated notebook is embedded below, and also viewable on Redivis.
The general steps here are as follows:
Load the training and test data from the Yelp reviews dataset
Load the pretrained BERT model
Train this base model on the Yelp data to create a fine-tuned model that can classify text reviews from a score of 0-4.
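The sketch below condenses those steps using the standard Hugging Face fine-tuning pattern. The table reference, training arguments, and output directory are placeholders, and the full annotated notebook on Redivis remains the authoritative version.

```python
import redivis
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Load the Yelp training table into a dataframe (table name is a placeholder)
train_df = redivis.table("train").to_pandas_dataframe()

# Tokenize the review text with the pretrained BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
train_ds = Dataset.from_pandas(train_df).map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

# BERT with a 5-way classification head (review scores 0-4)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=5
)

# Fine-tune; these training arguments are placeholders, tune them for your hardware
args = TrainingArguments(
    output_dir="yelp_bert",
    num_train_epochs=1,
    per_device_train_batch_size=8,
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()
```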
The Yelp data was ready to go as-is, with a simple text field for the review and integer value for the score. For the Reddit data, we just need to run a quick filter to choose posts from the appropriate sub-reddit.
We will use a transform to clean the data, as transforms are best suited for reshaping data at scale. Even though we might be more comfortable with Python or R, this dataset table is 83GB and it will be much easier and faster to filter it in a transform rather than a notebook.
Create a transform
Click on the Posts table in the Reddit dataset and press the + Transform button. This is the interface we will use to build our query.
Add a Filter step. Conceptually we want to keep records that are part of the subreddit we are interested in, and are not empty, deleted, or removed.
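Conceptually, the filter does something like the following pandas sketch. The sample posts and the "[deleted]" / "[removed]" markers are assumptions for illustration; the real step runs inside the transform.

```python
import pandas as pd

# Hypothetical slice of the Reddit posts table
posts = pd.DataFrame({
    "subreddit": ["MouseReview", "aww", "MouseReview"],
    "title": ["Great trap", "Kitten!", "Review"],
    "selftext": ["Caught two mice overnight", "so cute", "[removed]"],
})

# Keep posts from the subreddit of interest whose text is not empty, deleted, or removed
keep = posts[
    (posts["subreddit"] == "MouseReview")
    & (posts["selftext"] != "")
    & (~posts["selftext"].isin(["[deleted]", "[removed]"]))
]
print(keep)
```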
The final step in a transform is selecting which variables we would like to populate the resulting output table. In this case we just need the variables title and selftext.
With everything in place we will run this transform to create a new table, by pressing the Run button in the top right corner.
Finally, we can apply our fine-tuned model to the subset of Reddit comments that we want to analyze. Ultimately, we produce a single output table from the notebook, containing each Reddit post and the associated score generated by our model.
Workflows are iterative, and at any point you can go back and change your source data, your transform configurations, or your notebooks and rerun them. Perhaps we want to look at other subreddits, or run the model on a larger sample of the Yelp data.
You can also fork this workflow to work on a similar analysis, or export any table in this workflow to analyze elsewhere.
We do recommend further familiarizing yourself with the examples and detailed documentation to take full advantage of the capabilities of Redivis notebooks:
Redivis workflows contain powerful tools to reshape and analyze data, but if you prefer to export data into a different environment you can easily do so. Redivis uses open source tools and common formats to make the transition as easy as possible.
Some datasets on Redivis have export restrictions which prohibit or limit removing data from the system.
You can check this by going to the dataset page and clicking the top right button Manage access, or by right-clicking on a dataset in a workflow and selecting the View access option.
The bottom section of this access modal defines any export restrictions in place. It might not have any restrictions, might be completely restricted, or might have some limited options for export.
If there are limited options for export, this section will detail the available locations and any restrictions on using those (such as the size of the table being exported). These restrictions are enforced automatically when you try to take actions within the system.
If there are no options for export, you still have the option to work with your data in a Redivis workflow, where you can reshape and clean your data using transforms, and then analyze it in notebooks.
Learn more in the Work with data in a workflow guide.
If you would like to export the entire table, you can skip this step.
Otherwise, we recommend that you use transforms to cut your data down to a smaller size and reshape it into the table format you need for analysis before exporting it. Especially for large datasets, Redivis tools are created specifically for these purposes and work seamlessly in the browser with no additional setup.
Learn more in the Reshape tables in transform guide.
You can open the Export modal to initiate an export from any table. You can do this on the dataset page by right clicking a table and selecting the Export menu option, or by right clicking on any table in a workflow and selecting the same option.
Here you can see all the options available for your table to export.
If your table has export restrictions set by the data owner, options will be disabled here. In the case where your table does not meet export requirements but the data owner allows exception requests, you will see the option to do so here.
Download
The first tab of this modal gives you different options for formats to download your data. Most common data formats are supported, including csv, json, avro, parquet, SAS, Stata, and SPSS. Select the format you'd like and click the Download button to start downloading the file.
Learn more in the Downloads reference section.
Programmatic reference
You can reference this table from your computer or another computational environment using the Redivis Python and R libraries. This modal gives specific information on how to reference this table, and you can consult our docs for other options.
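A minimal sketch of what that looks like with the redivis-python client, assuming hypothetical identifiers; the modal shows the exact reference for your table, and the R client follows a similar pattern.

```python
import redivis

# All identifiers below are placeholders; copy the exact reference
# shown in the Programmatic tab of the export modal for your table.
table = (
    redivis.user("dataset_owner")
    .dataset("dataset_name")
    .table("table_name")
)
df = table.to_pandas_dataframe()
print(df.head())
```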
Integrations: Google Looker Studio
This is a free Google dashboard visualization program you can directly link your table to. Use their point and click interface to build common visuals that will update as the underlying table updates.
Learn more in the Google Looker Studio reference section.
Integrations: Google Cloud Storage
You can export your table directly into a Google Cloud Storage bucket that you have access to.
Learn more in the Google Cloud Storage reference section.
Integrations: Google BigQuery
You can export your table directly into a Google BigQuery project that you have access to.
Learn more in the Google BigQuery reference section.
If the work you're doing leads to a publication, make sure to reference the dataset pages from datasets you've used for information from the data administrators on how to correctly cite it.
Redivis combines powerful functionality to reshape and analyze data on platform, with an easy export and publishing flow, to ensure the results of your work can be displayed in the format of your choice and shared with your collaborators and research community.
Start by determining the broader picture of the content you'd like to publish and where you'd like to publish it. This will help you determine the desired shape and format of the assets to be exported, and finally the specific sources of the data in your Redivis workspace.
Ask yourself: What story are you trying to tell? Who/where is the audience? What component pieces are crucial to building the narrative? Your ideal package of assets may be very different if you're hoping to share progress on a workflow with a principal investigator, take a snapshot of a single variable distribution for a colleague, publish results in a journal, or build a custom dashboard to highlight multiple trends.
You may use a combination of tables, unstructured files, code snippets, graphs, and descriptive text to be published, so sketching out these component pieces in increasing detail will help define your end product.
With an initial strategy in mind, learning about the different types of Redivis exports will help define your list of assets to generate.
Tabular data containing rows and columns can be exported in a variety of formats, accessed programmatically from many environments via our client libraries, or embedded in a website.
Unstructured data files of any type can be previewed and downloaded in their original format, or accessed programmatically.
Notebooks containing code inputs and corresponding outputs can be exported as an .ipynb file, PDF, or HTML, or (coming soon!) embedded in your site.
Learn more in the Export to other environments guide.
In a workflow, you can generate output tables to capture the result of a set of data transformations, or use notebooks to show a line-by-line data analysis flow and relevant figures.
As a dataset creator, you can upload your own tabular data and unstructured files to create assets to use in your workflows or share with others.
Whether you're manipulating data in a workflow to showcase results or hosting your own dataset, you'll want to build a set of specific tables or notebooks you're trying to share. As you iteratively modify the shape and content of these assets – transform your data in a workflow to output new tables or use notebooks to build analysis flows in Python or R – you'll fine-tune each piece of your final publication.
Augment your data analysis in Redivis by uploading your own datasets, with the option to share with your collaborators (or even the broader research community).
Learn more in the Create & manage datasets guide.
For full customizability, you can publish a static site that accesses Redivis data to power an interactive visual dashboard – a more permanent, web-based approach to highlighting results of your data analysis or generated data.
Learn more in the Build your own site guide.
We've started building a library of common actions that you might take when administering an organization.
Please let us know if there are other actions or concepts you would like us to provide examples for by reaching out or emailing contact@redivis.com
Any type of file can be uploaded to a Redivis dataset, and we support previews for the most common file types. These data can be analyzed in notebooks within workflows.
Note that if you have tabular data we strongly recommend uploading it as a table (rather than a file) so you and your researchers can take advantage of our extensive toolkits for previewing and manipulating tabular data.
This guide assumes you have already started by Creating a dataset.
You can upload data directly from your computer, or import from a linked account.
If importing, you'll want to get the relevant external account configured to your Redivis account before getting started.
The import tools allow for multiple uploads, so it's helpful to have them all in the same place.
When you're ready click the Upload files button on the Files tab of the dataset page to get started and select the location your files will be coming from.
If you select "Computer" then you will get a browser window where you can choose the file, files, or folder you want to upload from. Choosing a different location will bring up the corresponding mechanism to select your files.
Once selected, you will need to choose a Folder to put them in. All files in a dataset live within a folder, and each folder has a corresponding index table to help researchers understand and navigate those files. You can change this later if needed. If this is your first time uploading files to this dataset you will need to create your first folder.
Click the "Upload" button to start the process. If you are uploading from your computer you'll need to wait on this screen until it's complete. Closing your window will end the upload process. If uploading from a linked source, you can close this window to allow this process to continue in the background.
You can click on the file name in the Files tab to preview the file and see its information.
You can also go to the Tables tab to view the index table for the folder you've uploaded files to. View the cells and hover or click on the file ID variable to see the preview here. This table can be used in Transforms within a Workflow to work with files on a large scale.
If you want to rename a folder you can do so on the Files tab by right clicking on the folder name in the right bar.
You can also create new folders and move files between two existing folders from this bar. When moving between folders, you can use a conditional statement to only move some files that match your conditions.
Great metadata makes your dataset useable. Complete your metadata, along with configuring access, creating a sample, and releasing this version.
Learn more in the Create & manage datasets guide.
This guide outlines an example static site hosted by the Redivis team. Follow along at the Observable on Redivis Github repository to see all the details and build your own!
It's easy to make use of Redivis data anywhere on the web, either by interfacing programmatically via our client libraries, visualizing via one of our export integrations, or embedding a table directly in another webpage.
To present Redivis data in a fully customizable way, you may want to build your own dashboard-style site. The following example shows a static site built with Observable Framework, deployed via Github Pages, that connects to Redivis data using our client libraries.
Identify the data sources on Redivis that will provide the content to build your site. You may be interested in presenting facets of a dataset already hosted on Redivis, highlighting tables you uploaded yourself, or showing outputs generated in one of your workflows. In our example, we'll use weather station locations contained in the GHCN Daily Weather dataset, and outputs from a public Redivis workflow that uses the same source data, to visualize locations and aggregate precipitation measurements from around the world.
To reference a Redivis table, choose the "Export table" option and navigate to the "Programmatic" tab (e.g., the GHCN Daily Weather "Stations" table), and see the code snippet for unique identifiers for the owner, dataset/workflow, and table. More details about referencing Redivis resources here.
With chosen data content in mind, the next step is to choose a set of tools to build our custom dashboard and deploy it to the web. Of the numerous choices, Observable Framework provides an approachable option with support for working with data in many programming languages (not just javascript), powerful visualization libraries built in, and excellent documentation. To start with a copy of our example dashboard, you can clone our public Github repo and follow the README.md instructions to develop a similar site in a local environment. To create a brand new project, and to reference additional details and development strategies, see the Observable Getting started guide.
You'll need to generate a Redivis API access token with appropriate permissions. See Authorization in our API documentation to create your own token to read data from Redivis.
To successfully authorize data fetching functionality in your development environment, be sure to export a REDIVIS_API_TOKEN environment variable with the following terminal command:
export REDIVIS_API_TOKEN='MY_API_TOKEN'
IMPORTANT: API access tokens operate like passwords, and can allow another party to access resources on your behalf.
You should never share tokens, and you should avoid committing them to source control where collaborators may have access (either now or in the future). See the Deploy and share your site section below for details on how to use secrets to store the Redivis API token your project needs.
In Observable, a data loader is a file that accesses data and writes it to a standard output channel for consumption by renderers, to be ultimately displayed on your site.
In our example, we'll illustrate use of both javascript and python to pull data from Redivis into our dashboard, using the Redivis API via js and python client libraries.
In a web environment, the redivis-js client library allows for fetching and manipulating data via simple javascript functions.
In our example, we first install the library by specifying the latest version in our package.json and running npm install in our development environment to install or update all specified packages.
Then, in the Observable data loader file redivis-js-precipitation.json.js, we authorize access, and then listVariables and listRows from our specified precipitation table.
In a python environment, the redivis-python client library allows for fetching and manipulating data via simple python functions.
In our example, we'll create a requirements.txt file that specifies the latest version of the client library, both for development and future deployment. Running the following command in your virtual python environment will install pip and then any packages specified in the requirements.txt file.
Then, in the Observable data loader file redivis-python-geodata.json.py, we create a pandas dataframe from our specified stations table and output the records.
With data loaded and parsed into JSON format, it can now be manipulated and presented as you see fit. Among a wealth of modern web visualization tools is Observable Plot, which provides a polished selection of simple visualizations and corresponding functions to group and filter data, and pairs nicely with our development framework.
In our example, we use several Observable Plot functions to display our data on a map (using Plot.plot() with Plot.geo() to show a globe) and in a hex-binned histogram (using Plot.dot() and Plot.hexgrid() for data aggregation). We also add a range input to allow viewers to rotate the globe visualization.
Then, we use the built-in display command to print the JSON-shaped payloads, to give the viewer a quick look at the raw data.
With our dashboard finalized, you can deploy your project anywhere on the web. Observable provides a quick command (npm run build) to generate static files to be hosted on a server of your choice, as well as a deployment service if you'd like to host with Observable.
You can run npm run deploy in the command line and follow the prompts to deploy your project to an address on Observable. Further configuration and permissioning is available through your Observable account.
Github Pages provides a hosting service that's easy to integrate with your Github repository and any Github Actions needed to deploy your project.
In our example, we deployed our project to Github Pages in a few simple steps.
First, we need to specify a REDIVIS_API_TOKEN to support Redivis client library authorization in the live deployment. We used an "action secret" within the Settings of the Github repo, which sets the REDIVIS_API_TOKEN environment variable to our appropriately scoped token and remains completely private while our repository is public.
Next, we write a Github Action to build and deploy the project, which will run anytime we push to the main branch of our repository. The action is specified via the deploy.yaml file, and relies on a set of open-source actions and our own custom code to do the following steps:
Check out the code
Set up a Node.js environment with a specific node version (20.6)
Install our python dependencies (via pip, specified in requirements.txt)
Build the static files for the site
Deploy the files (in the generated dist folder) to Github Pages via a specific branch (gh-pages)
After adding this file to your Github repository, any push to the main branch will trigger a deployment following the steps above, the status of which can be viewed under the Actions tab of the repository.
Finally, Redivis has hosted a number of different team projects at a custom endpoint specified in our organization's Github Pages settings, which you can see at labs.redivis.com.
Redivis datasets are a great place to host data for interrogating in a workflow, but you can also edit the underlying data.
Perhaps you found an issue with the source, or want to restructure it, before making it available to others. Redivis has all the tools you'll need to do this in a versioned, step-by-step process with transparency into the changes you've made.
If you haven't yet, upload the data you want to work with to a Redivis dataset and release the version. A personal dataset or one belonging to an organization works just as well.
Create a new workflow and add this dataset to it.
If you want to share the data transformation process with others for transparency, you can make this workflow public in the share modal. (Due to the way Redivis works, only people with access to the underlying data will be able to see the data in the workflow, even though the workflow itself is public.)
For this guide you can follow along in this workflow Demo tables edits.
Select the table you want to make changes to and create a new transform.
Use this transform to edit this table. Some common actions include:
Create new steps to rename and retype any variables you'd like to update.
Create a new variable with the same name. "Keep" this variable and "Discard" the original variable. Select the method as "Case (if/else)" and create the conditions you want to recode to.
This example shows a new variable with today's date as the "date_uploaded", but you could create any variable you want from the data in the table. Perhaps an aggregation that would be helpful to see alongside this data? Or the sum of multiple other variables?
You can transform this table however you want, using the Graphical interface or SQL code. Make sure to move all variables in this table from the "Discard" section to the "Keep" section except for any variables that have been replaced and will be left behind (in this case the store_and_fwd_flag from the source table).
Validate that this new table looks correct by looking at the output table below this transform!
Make changes to any tables in this dataset in this same workflow.
Go back to the original dataset, and create a new version.
Open a table that you made changes to in your workflow, and click to "Import data."
Choose your merge strategy as "Replace" since you want to replace the existing table with the new one you created.
Select the data source as "Redivis" and type in the information from the workflow where you made your changes. The sequence here will be:
Your user name
The name of the workflow (underscores replace spaces)
The table name (underscores replace spaces)
For example: username.testworkflow.table_name
Once it's done uploading, you can validate that the data looks the way it should.
Now you can close this table and repeat this process for any other tables in this dataset you've edited.
Once this is finished, you can add the link to the workflow into this dataset's documentation or release notes along with a note on the changes. Remember that if you made the workflow public, anyone who has data access to this dataset can view the changes you made.
Now release this version! It contains the edited data, and anyone using this dataset in a workflow will see that a new version is available the next time they open that workflow. You can see the release notes and the updated table of our Demo tables updates live.
Once your dataset is released, bring it into a workflow to transform and analyze it leveraging lightning fast tools from your browser.
Learn more in the Analyze data in a workflow guide.
The following example is illustrated on Redivis in the MarketScan Continuous Enrollment workflow – you'll need access to MarketScan data to view the details.
Many insurance claims datasets on Redivis contain information about enrollment, detailing the periods of time when an individual represented in a dataset was covered by an insurance plan. If you intend to characterize patients based on their insurance coverage (or lack thereof) during certain key events (procedures, diagnoses, etc), it's often important to identify periods of continuous enrollment for each individual – and capture each continuous enrollment period for each patient in a single row.
These claims datasets describe enrollment information in multiple discrete rows per patient, each corresponding to one patient for one month. However, an overall continuous enrollment period may be broken up across rows into months or other non-uniform chunks, so we'll employ the following process to combine multiple sequential rows into a single row with one start date and one end date, describing one continuous period.
In this example, we will process the MarketScan Enrollment Detail table to create a table in which each row describes a single period of continuous enrollment. We show an artificial snapshot of the dataset below, where patient 1 has multiple periods of continuous enrollment due to some gaps in coverage, and patient 2 has a single period of continuous enrollment.
| patient_id | enrollment_start_date | enrollment_end_date |
| --- | --- | --- |
| 1 | 2012-01-01 | 2012-01-31 |
| 1 | 2012-02-01 | 2012-02-28 |
| 1 | 2012-04-01 | 2012-04-30 |
| 1 | 2012-06-01 | 2012-06-30 |
| 1 | 2012-07-01 | 2012-07-31 |
| 1 | 2012-08-01 | 2012-08-31 |
| 2 | 2012-01-01 | 2012-01-31 |
We want to create a final table with 3 rows for patient 1 to account for gaps in enrollment in March and May of 2012, and 1 row for patient 2. Our desired output has a row for each continuous period per patient, shown below:
| patient_id | enrollment_start_date_continuous | enrollment_end_date_continuous |
| --- | --- | --- |
| 1 | 2012-01-01 | 2012-02-28 |
| 1 | 2012-04-01 | 2012-04-30 |
| 1 | 2012-06-01 | 2012-08-31 |
| 2 | 2012-01-01 | 2012-01-31 |
The variable names in this example are not actual MarketScan variable names. But, with appropriate data access, you can see the real variables used in the first transform of the Redivis example workflow by hovering over the (renamed) variables patient_id
, enrollment_start_date
, and enrollment_end_date
in the Keep section.
Throughout this example, we'll create variables that partition the dataset by patient identifier (here, patient_id
) to ensure that each patient is processed individually. But, to account for the fact that a single patient may have many periods of continuous enrollment, each spanning many months, we need to identify the correct start (from enrollment_start_date
) and end (from enrollment_end_date
) of a continuous period out of multiple rows and capture them in a single row.
First, we create a partition variable lag_end_date
using the lag method, which will order each row by enrollment_start_date
in a given patient_id
partition and copy the previous row's enrollment_end_date
value into each row. We also create lead_start_date
using lead, to copy the following row's enrollment_start_date
value into each row.
These methods generate values which tell us how close the preceding and following enrollment periods are with respect to each row's enrollment period.
| patient_id | enrollment_start_date | enrollment_end_date | lag_end_date | lead_start_date |
| --- | --- | --- | --- | --- |
| 1 | 2012-01-01 | 2012-01-31 | NULL | 2012-02-01 |
| 1 | 2012-02-01 | 2012-02-28 | 2012-01-31 | 2012-04-01 |
| 1 | 2012-04-01 | 2012-04-30 | 2012-02-28 | 2012-06-01 |
| 1 | 2012-06-01 | 2012-06-30 | 2012-04-30 | 2012-07-01 |
| 1 | 2012-07-01 | 2012-07-31 | 2012-06-30 | 2012-08-01 |
| 1 | 2012-08-01 | 2012-08-31 | 2012-07-31 | NULL |
| 2 | 2012-01-01 | 2012-01-31 | NULL | NULL |
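For readers who prefer to see the logic in code, the same lag and lead step can be sketched in pandas. This is purely illustrative; in the Redivis workflow these are partitioned new-variable steps in a transform, and the toy data below simply mirrors the example table above.

```python
import pandas as pd

# The example enrollment rows from the table above
df = pd.DataFrame(
    {
        "patient_id": [1, 1, 1, 1, 1, 1, 2],
        "enrollment_start_date": pd.to_datetime(
            ["2012-01-01", "2012-02-01", "2012-04-01", "2012-06-01", "2012-07-01", "2012-08-01", "2012-01-01"]
        ),
        "enrollment_end_date": pd.to_datetime(
            ["2012-01-31", "2012-02-28", "2012-04-30", "2012-06-30", "2012-07-31", "2012-08-31", "2012-01-31"]
        ),
    }
)

# Order rows within each patient partition, then look one row back (lag) and one row ahead (lead)
df = df.sort_values(["patient_id", "enrollment_start_date"])
by_patient = df.groupby("patient_id")
df["lag_end_date"] = by_patient["enrollment_end_date"].shift(1)
df["lead_start_date"] = by_patient["enrollment_start_date"].shift(-1)
print(df)
```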
In a second, downstream transform we'll create new variables, which will use the above lead and lag values to identify which rows correspond to the beginning and end of a continuous enrollment period.
First, we'll compare enrollment_start_date and lag_end_date to find the difference (in days) between the start of each period and the end of the previous period, and compare enrollment_end_date and lead_start_date to find the difference between the end of each period and the start of the following period.
We see via diff_lag_end_enrollment_start
(created using the date diff method) which rows describe an enrollment period directly following the previous period, and which rows describe an enrollment period with a larger gap since the previous period. We also create diff_enrollment_end_lead_start
to identify gaps between an enrollment period and the next period.
| patient_id | enrollment_start_date | enrollment_end_date | lag_end_date | lead_start_date | diff_lag_end_enrollment_start | diff_enrollment_end_lead_start |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2012-01-01 | 2012-01-31 | NULL | 2012-02-01 | NULL | 1 |
| 1 | 2012-02-01 | 2012-02-28 | 2012-01-31 | 2012-04-01 | 1 | 31 |
| 1 | 2012-04-01 | 2012-04-30 | 2012-02-28 | 2012-06-01 | 31 | 31 |
| 1 | 2012-06-01 | 2012-06-30 | 2012-04-30 | 2012-07-01 | 31 | 1 |
| 1 | 2012-07-01 | 2012-07-31 | 2012-06-30 | 2012-08-01 | 1 | 1 |
| 1 | 2012-08-01 | 2012-08-31 | 2012-07-31 | NULL | 1 | NULL |
| 2 | 2012-01-01 | 2012-01-31 | NULL | NULL | NULL | NULL |
Next, we'll encode booleans from our difference variables to simplify further filtering. We'll identify rows corresponding to the start of a continuous period as those with a diff_lag_end_enrollment_start
value of either NULL (the row is the first period in the partition) or greater than 1 (the row comes after a gap in enrollment). And we identify rows corresponding to the end of a continuous period as those with a diff_enrollment_end_lead_start
value of either NULL (the row is the last period in the partition) or greater than 1 (the row comes before a gap in enrollment).
We see via boolean variables is_start_continuous_period
and is_end_continuous_period
(created using the case method) if a given row corresponds to the start of a continuous period, the end of a continuous period, or both.
| patient_id | enrollment_start_date | enrollment_end_date | lag_end_date | lead_start_date | diff_lag_end_enrollment_start | diff_enrollment_end_lead_start | is_start_continuous_period | is_end_continuous_period |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2012-01-01 | 2012-01-31 | NULL | 2012-02-01 | NULL | 1 | true | false |
| 1 | 2012-02-01 | 2012-02-28 | 2012-01-31 | 2012-04-01 | 1 | 31 | false | true |
| 1 | 2012-04-01 | 2012-04-30 | 2012-02-28 | 2012-06-01 | 31 | 31 | true | true |
| 1 | 2012-06-01 | 2012-06-30 | 2012-04-30 | 2012-07-01 | 31 | 1 | true | false |
| 1 | 2012-07-01 | 2012-07-31 | 2012-06-30 | 2012-08-01 | 1 | 1 | false | false |
| 1 | 2012-08-01 | 2012-08-31 | 2012-07-31 | NULL | 1 | NULL | false | true |
| 2 | 2012-01-01 | 2012-01-31 | NULL | NULL | NULL | NULL | true | true |
Then, to capture only the start and end of a continuous period, we'll use a filter to keep only rows which are true
for either is_start_continuous_period
or is_end_continuous_period
.
This leaves us with either 1 or 2 rows corresponding to a continuous enrollment period. If a continuous period spans multiple rows (months, in this case), we'll have 2 rows (a start row and an end row). But if a period only spans one row, we'll have both start and end captured by that 1 row. In our example, the row containing the middle (neither start nor end) of the 2012-06-01 to 2012-08-31 enrollment period for patient 1 was dropped. We can also ignore our intermediate lead...
, lag...
, and diff...
variables, since our final processing step will only consider a row's is_start_continuous_period
and is_end_continuous_period
values.
| patient_id | enrollment_start_date | enrollment_end_date | is_start_continuous_period | is_end_continuous_period |
| --- | --- | --- | --- | --- |
| 1 | 2012-01-01 | 2012-01-31 | true | false |
| 1 | 2012-02-01 | 2012-02-28 | false | true |
| 1 | 2012-04-01 | 2012-04-30 | true | true |
| 1 | 2012-06-01 | 2012-06-30 | true | false |
| 1 | 2012-08-01 | 2012-08-31 | false | true |
| 2 | 2012-01-01 | 2012-01-31 | true | true |
Finally, we want to collapse our table to ensure 1 row per continuous enrollment period per patient. Since we have only rows corresponding to the start and end of continuous periods, we create another partition variable lead_end_date (creating it in the same transform is fine, since this step happens after the previous filter) which copies the enrollment_end_date value of the following row onto each row.
We can also use the partition row filter to keep only rows with is_start_continuous_period as true, since lead_end_date has copied the end date of the continuous enrollment period from each row's following row.
We end up with a table where each and every row contains both the start date of the continuous enrollment period and the end date of that continuous enrollment period.
| patient_id | enrollment_start_date | enrollment_end_date | lead_end_date | is_start_continuous_period | is_end_continuous_period |
| --- | --- | --- | --- | --- | --- |
| 1 | 2012-01-01 | 2012-01-31 | 2012-02-28 | true | false |
| 1 | 2012-04-01 | 2012-04-30 | 2012-06-30 | true | true |
| 1 | 2012-06-01 | 2012-06-30 | 2012-08-31 | true | false |
| 2 | 2012-01-01 | 2012-01-31 | NULL | true | true |
A final processing step captures the correct end date of a continuous period. We now have only rows whose enrollment_start_date
value contains the start of a continuous period, but these rows fall into two categories:
First, we have the rows that do not also correspond to the end of an enrollment period (where the is_end_continuous_period
value is false
). For these, we want to look at lead_end_date
, the end date of the next row, which represents the final date of the continuous period, since we filtered out all the intermediate rows above.
Second, we have rows which also correspond to the end of an enrollment period (where the is_end_continuous_period
value is true
) – in this example, if the continuous period was only 1 month. For these, the row defines a period (in this example, 1 month) that also contains the end date, so we just get the enrollment_end_date
from that same row. Note that the lead_end_date
value is incorrect in this case, since the next row contains the start of the next continuous period, or NULL if the period falls at the end of a partition.
We capture the above logic in a new variable enrollment_end_date_continuous
, created in an additional transform, since our previous operation involved a partition.
We end up with the final table below, containing the patient identifier, the start date (renamed to enrollment_start_date_continuous for consistency), and the end date of each continuous enrollment period.
| patient_id | enrollment_start_date_continuous | enrollment_end_date_continuous |
| --- | --- | --- |
| 1 | 2012-01-01 | 2012-02-28 |
| 1 | 2012-04-01 | 2012-04-30 |
| 1 | 2012-06-01 | 2012-08-31 |
| 2 | 2012-01-01 | 2012-01-31 |
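Continuing the illustrative pandas sketch from the lag/lead step above, the remaining steps (day differences, boundary flags, the boundary filter, and the final collapse) look roughly like this; in the Redivis workflow these are the new-variable steps and row filters described above rather than code:

```python
# Day gaps between each period and its neighbors
df["diff_lag_end_enrollment_start"] = (df["enrollment_start_date"] - df["lag_end_date"]).dt.days
df["diff_enrollment_end_lead_start"] = (df["lead_start_date"] - df["enrollment_end_date"]).dt.days

# A row starts a continuous period if it is first in its partition (NULL gap) or follows a gap (> 1 day);
# it ends a continuous period if it is last in its partition or precedes a gap
df["is_start_continuous_period"] = (
    df["diff_lag_end_enrollment_start"].isna() | (df["diff_lag_end_enrollment_start"] > 1)
)
df["is_end_continuous_period"] = (
    df["diff_enrollment_end_lead_start"].isna() | (df["diff_enrollment_end_lead_start"] > 1)
)

# Keep only boundary rows, then copy the following row's end date onto each remaining row
boundaries = df[df["is_start_continuous_period"] | df["is_end_continuous_period"]].copy()
boundaries["lead_end_date"] = boundaries.groupby("patient_id")["enrollment_end_date"].shift(-1)

# Keep only start rows; a row that is both start and end already holds its own end date
starts = boundaries[boundaries["is_start_continuous_period"]].copy()
starts["enrollment_end_date_continuous"] = starts["enrollment_end_date"].where(
    starts["is_end_continuous_period"], starts["lead_end_date"]
)

result = starts.rename(columns={"enrollment_start_date": "enrollment_start_date_continuous"})[
    ["patient_id", "enrollment_start_date_continuous", "enrollment_end_date_continuous"]
]
print(result)
```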
Datasets are a core component of Redivis. They are a versioned collection of tables containing data, alongside rich documentation and metadata.
Datasets can be hosted by organizations or individual users, and every dataset has its own Dataset page. Datasets can be shared with other users on Redivis according to their access configuration.
Administrators can create datasets for their organization from the Datasets tab of their organization's Administrator panel. These datasets can be seen and managed by any administrator in the organization. When released, they will be visible on the organization's home page to anyone who has overview access to the dataset.
Alternatively, anyone with a Redivis account can create a dataset on the Datasets tab of their Workspace. These datasets are by default only visible to their owner, and have simplified options to support sharing with your collaborators.
When you first create a dataset, it will be unpublished and only visible to other editors. This means you can edit the dataset and validate everything before releasing it. You can also validate and reconfigure its access rules before release.
At the core of every dataset is the data it contains, so we recommend starting here.
All data in a dataset is stored in tables. You can create a new table on the Tables tab of your dataset and start importing data. Redivis can upload data from your computer or another location you’ve linked, such as Box, Google Drive, AWS, and Google Cloud.
Once your data is finished importing you can validate that this table looks as you expect it to.
You can create more tables here if this dataset has multiple separate tables.
However, if your data is split across multiple files that all follow the same structure (such as a different file for each state, or each year of data but with generally the same variables) you will want to import all of these files to the same table, where they will be automatically appended together.
Learn more in the Upload tabular data as tables guide.
Metadata is essential to helping your researchers find and utilize your dataset. While some metadata will be generated automatically, such as variable summary statistics and counts, other metadata will require additional input.
On the Overview tab of the dataset there are multiple suggested sections to help break down information. You can fill out any of these that apply, such as the methodology, tags, contact information, etc.
Redivis will automatically generate citation and provenance information based on what we know about the dataset, but you can update this information with anything more specific.
If you have additional information that you want to include that doesn't fit one of these headers, you can create a custom section. Custom sections can also be set to be visible only to certain access levels if you have sensitive information.
You should also populate the metadata on each table. Tables can have a description, as well as an entity field that defines what each row in the table represents. You can also define the temporal and geographic range on the table, when relevant.
Each variable within a table has its own metadata. The variable name and type will be pre-determined from your data, but you should add a short label and longer description to each variable to help researchers understand what that variable measures.
Additionally, some variables will contain coded values, in which case you should provide value labels that represent the human-readable term for each code.
Learn more in the Create and populate a dataset guide.
Once you are ready to make your dataset available to others, you'll need to release it. You can click the Review and publish or Review and release button in the top right of the page.
You should also confirm the access settings you set up when creating this dataset.
Once this dataset is released, it will become visible and available to anyone who would qualify for access.
You can use tools right on Redivis to create new versions of your dataset.
Learn more in the Edit data in a dataset guide.
Administer your organization
Organizations allow for groups, centers, and institutions to more easily work with data by providing administrators with tools to effectively version and distribute data from a central location.
Contact us to set up a new organization, and learn more in the Administer an organization guide.
Datasets are a core component of Redivis. Consisting of documentation, metadata, and tables, datasets allow you to store, version, and distribute a wide variety of data.
Anyone with a Redivis account can create a dataset in their workspace, and organization administrators can upload one to an organization via the administrator panel.
You can create a new dataset by navigating to the Datasets tab of your workspace or administrator panel and clicking the New dataset button.
All datasets must have a name (unique to datasets for the user / organization).
You can set up your dataset in whatever order you'd like, but we recommend following the order below when getting started.
This data might be in a tabular format (.csv, .tsv, .sas etc.) or more rarely in unstructured data such as images and text files.
All tabular data is associated with a table and each dataset can have one or more tables. While you may release a dataset without any tables, this will be of limited use to other researchers, as Redivis provides numerous tools for understanding, querying, and generally working with tabular data.
If you haven't already worked with data in a workflow, we strongly recommend exploring that before creating a dataset so you can understand how researchers will work with your data.
When you're ready to upload data, we have broken that out into a separate guide.
Learn more in the Upload tabular data as tables guide.
Unstructured data
For unstructured data, go to the Files tab where you can upload files from your computer or another location your data is stored via an integration. You can put these into folders and create index tables to better keep track of them.
Note that any files uploaded here can't be transformed in the workflow tool or queried across Redivis (which require the table format).
Make sure any files you upload here contain this dataset's data. Any files with information about the data (such as data dictionaries or usage guides) should be uploaded as documentation on the Overview tab.
Learn more in the Upload unstructured data as files guide.
It's easy to feel "done" after uploading your data, but documentation and metadata are essential to the usability of your dataset. Moreover, rich metadata will improve the discoverability of your dataset by providing more information and terms to the Redivis search engine.
Metadata can always be updated after your dataset has been released. While good metadata is essential, it can be a time-consuming and iterative process, so you might prefer to provide some basic content initially, and then improve it over time.
On the overview tab of the dataset, you can provide an abstract, detailed documentation blocks, supporting files and links, and subject tags for the dataset.
The abstract should be a brief overview of the dataset, while the rest of the documentation can be as thorough as you'd like. Each documentation block has a header for suggested content, and any you don't fill out won't be shown on the dataset page. These blocks contain a rich text editor, complete with support for embedded images. Most of this information will be visible to anyone with overview access, though you can also create custom documentation sections that require a higher level of access.
Make sure to audit your data's provenance information to give attribution to whoever is working on the data. If this dataset is part of an organization you can configure a DataCite account to issue a DOI for each dataset. Note that if your organization is configured to issue DOIs, then one will automatically be issued for this dataset when you first publish it.
To help users understand what each table represents, you should update the description, entity, and temporal range for each table in the dataset. The entity should define what each row in a table represents: is it a person? an event? a charge? The temporal range can be tied to a specific variable (using the min/max of that variable), or defined explicitly.
The tables in your dataset are made up of named variables, though rarely is this name enough to understand what the variable measures. On any table, click "Edit variable metadata" in order to populate the variable metadata.
On each variable, Redivis supports a label, description, and value labels. The label is the most essential item; think of it as a more human-readable variable name. The description should contain more detailed information, everything from caveats and notes to collection methodology. Value labels are only applicable when the variable is encoded with keys (often integers or short strings) that map to the actual value — for example, a survey response might be encoded as 0: "No", 1: "Yes", 2: "Don't know", 3: "Declined to answer".
Editing variable metadata can be a tedious process, but Redivis does support the ability to import metadata from a file, and will also automatically extract metadata if it's present in the uploaded data files (e.g., Stata or SAS upload types).
Learn more in the Documentation reference section.
If your dataset is particularly large, or if you want to control access to a sample of the data separately from the whole dataset, you should configure sampling on your dataset. This will allow researchers to work with a 1% sample of the data during initial exploration, and allow you to grant access to the sample independently of the full dataset.
To update the dataset's sample configuration, click on any table, and then click "Configure sample". When configuring the sample, you can generate a random sample for each table, or sample on a particular variable that is common across tables. If researchers will be joining tables across your dataset, it is highly recommended that you sample on that common join variable so that researchers can obtain a consistent 1% sample as they work with your data.
Learn more in the Dataset sampling reference section.
Before releasing your dataset, it is important to define who can access the dataset and what the procedures are for applying and gaining access. Click the Configure access button on the top of the page to set up the access configuration.
Datasets owned by organizations have more options for access than datasets owned by users.
Dataset access has five levels:
Overview: the ability to see a dataset and its documentation.
Metadata: the ability to view variable names and summary statistics.
Sample: the ability to view and query a dataset's 1% sample. This will only exist for datasets that have a sample configured.
Data: the ability to view and query a dataset's tables, and work with them in workflows.
Edit: the ability to edit the dataset and release new versions.
Access levels are cumulative. For example, in order to gain data access you will need to have gained metadata access as well.
Even with data access, you may want to limit what other users can do with your dataset. Currently, you can configure export restrictions that limit:
The download location (e.g., to prevent researchers from downloading to their personal computer)
The download size, in bytes and/or rows
Enforce admin approval before any export
You may also add additional dataset editors to help upload data and provide metadata content. These editors will be able to create and release new versions, and will have full access to the underlying data, though they cannot add other users, modify the access configuration, or bypass the dataset usage rules.
If the dataset is hosted by an organization, all administrators of the organization will be able to edit the dataset as well as its access configuration.
If the dataset is hosted by an organization, you will have additional options for configuring access to the dataset. The dataset can be assigned to a permission group to help standardize access procedures, and this permission group can contain requirements that help data managers fulfill contractual requirements and gather relevant information about the research being done on the dataset.
Learn more in the Configure access systems guide.
Congratulations! Your dataset is ready to be released and utilized by the research community. But first, it is highly recommended that you validate and audit your dataset beforehand. Take a look at the number of rows, variables, and uploads in each table. Validate some of the variable summary statistics against what you expect. And to be truly thorough, add the dataset to a workflow and run some queries as if you were a researcher. Catching a mistake now will prevent headaches down the line if researchers uncover unexpected discrepancies in the data.
Once a version has been released, the data can no longer be edited. While you can unrelease a version within 7 days, this should generally be avoided; you'll need to release a new version to modify the data.
When you're confident that you're ready to go, click the "Release" button on the top of the page. If the button is disabled, hover over it to understand what issues are currently preventing you from releasing.
After clicking the button, you'll be presented with a final checklist of tasks. When you click the Release version button, the dataset will be immediately released and available to all users with access.
This dataset is now also considered Published. If you need to pause all activity and access to this dataset, you can return to this page in the future and Unpublish it temporarily.
Once a dataset is released, you can return to it to make changes at any time. Changes to datasets are tracked in Redivis as versions. Anyone with access to a dataset can view and work with any of its versions.
How to work with versions when updating a dataset:
Any edits to the data content in tables will need to be released as a new version.
Edits to the dataset information, table information, or variable metadata can be made on the current version (or historic versions) and will be live as soon as they're saved.
Edits to the dataset name and access configuration will always affect all versions.
All data within a dataset is encapsulated in discrete, immutable versions. Every part of the dataset except for the name and access settings are versioned. All tables in a dataset are versioned together.
After releasing the first version of the dataset, you can choose to create a new version at any time by clicking the button in the top right "Create next version". This version will be created as vNext
, and you may toggle between this and historic versions at any time.
Subsequent versions always build on the previous version of the dataset, and changes made in the next version will have no effect on previous versions. Alongside modifications to the dataset's metadata, you may create, update, or delete any of the previous version's tables.
When uploading data to a table from the previous version, you can choose whether you want to append these new uploads to your existing data, or replace the entire table with the new data.
Redivis computes row-level diffs for each version, efficiently storing the complete version history in one master table. This allows you to regularly release new versions and maintain a robust version history without ballooning storage costs.
Learn more in the Usage and limits for users, and Billing for organizations reference sections.
Once your dataset is released, bring it into a workflow to transform and analyze it leveraging lightning fast tools from your browser.
Learn more in the Analyze data in a workflow guide.
Organizations are where administrators centrally manage datasets and who can access them. All organization administration happens from the Administrator panel which administrators can access from the Organization home page.
If you're interested in using Redivis for your research group, center, or institution, contact us to get an organization set up.
You can add any Redivis account to be an administrator of your organization in the Administrator panel from the Settings tab.
Note that all administrators will have full access to the organization and its datasets, including the ability to modify access rules, approve access requests, and add or remove other administrators, including you.
To help brand your organization and best communicate the organization’s purpose to others, you'll want to fully populate and brand the Organization home page.
In the Administrator panel on the Settings tab / Public profile section you can customize:
The organization's full name
An organization description which will appear at the top of your about section, and in any institution searches
A custom brand color
The organization's logo
The data portal's cover photo
A header image that will appear in the top left of users' windows when on your data portal
Header links to other resources, such as organization-specific documentation and events.
Rich text and images that will display in the main content section of the data portal. This is a place to provide high-level information about your organization and how researchers should use your data portal.
You'll need to create datasets to upload and distribute data to your researchers. You can create datasets from the Datasets tab of your Administrator panel.
Here are some pieces to consider before getting started:
It's worth thinking through your broader data management strategy. Assess the data that you currently have — what format are the files in? Where are they located? How are they organized?
Consider what datasets you'll want to create. Datasets on Redivis are made up of one or more semantically related tables, and each table can contain data from one or more uploaded files. Moreover, all permissions happen at the dataset level — if you grant a user access to a dataset, they will have access to all of its tables.
If you'd like to issue DOIs for your organization's datasets, you can configure your organization to issue DOIs in the Advanced section of the Settings tab. DOIs will make it easier for researchers to cite your data when they publish, and allow you to track publications using your data as well as their broader impact.
Learn more in the Create & manage datasets guide.
Before you can configure access to a dataset, you'll want to consider and build out the relevant access systems for your organization as a whole.
If you have a smaller organization where you know everyone working with your data, you might want to consider directly granting access to datasets to specific members, or granting access to everyone who is a member.
This can be set up on any dataset individually or you can create a Permission group on the Permission group tab to save the entire configuration of access levels and assign it to multiple datasets.
In many cases, however, you'll want to develop more process-driven access control systems and gather additional information from your members. To do so, you can set up requirements that must be completed by members or their studies, and approved by an administrator, before gaining access.
You can build requirements on the Requirements tab of the administrator panel, and save groups of them together as Permission groups on the Permission groups tab which you can assign to multiple datasets.
Export restrictions
Finally, you may want to control how your data is exported from Redivis, if at all. While members with data access will always be able to work with data in a Redivis workflow, you can limit their ability to export data. For example, you may choose to allow exports only to certain environments, or only upon administrator approval, or not at all.
Learn more in the Configure access systems guide.
As members and studies apply to access data, you may need to take action to approve (or reject) their access requests. You can find all of your pending access requests wherever you see an alert on the administrator panel.
As users join your organization and apply to work with data, they will appear in the Members tab of your administrator panel. You can also invite anyone to become a member by adding them to the members list.
Studies
Studies allow for researchers to organize their work around a conceptual study, or research project, and for you to grant access in the context of that study. Any studies that are working with your organization's data will appear in the Studies tab of the administrator panel automatically, and you can approve any requirements for studies here.
Learn more in the Grant access to data guide.
Redivis provides tools to help you keep an eye on your organization and how its data is being utilized.
You can filter, sort, and change the columns for any list in your administrator panel, and download the corresponding report for a snapshot of current information. You can also dig into the logs and generate analytical reports for more information.
If you are part of a larger institution, you can contact us about getting an institution-wide Redivis page to help users discover new organizations and datasets.
One of the strengths of Redivis is that researchers can apply for, and gain access to, data on the same page where they discover the dataset. Moreover, all access applications are tracked, both for administrators' reference and to provide transparency and timely notifications to researchers.
Responding to access requests in a timely manner creates a great environment for researchers and takes full advantage of these automatic systems.
You can configure multiple email addresses to receive an email when a request for access is made to your organization, as well as the frequency you receive them. Especially if you are a smaller organization that isn't getting access requests every day, we very strongly recommend that you have at least one administrator receiving emails so you know when to respond to a request.
Learn more in the Settings reference section.
Users must be approved as members of your organization before any other access requests can be approved.
When a membership request is pending you will see an Alert (!) on the Members tab of your administrator panel. Click on any member with an alert to see their pending submission awaiting approval.
Pending members will have an Approve / Reject button as soon as you click on their page, alongside the authentication information they submitted when requesting to be a member. It is important to verify that this person has a genuine email that seems consistent with who they say they are.
As users ask for approval you will receive an alert (!) in the administrator panel where your attention is needed.
Member requirements and export restriction exceptions will appear on the Members tab.
Study requirements will appear on the Studies tab.
Direct access requests will appear on the Datasets tab.
When all of the pending requests are resolved, the alert will disappear.
For each request you will have the option to approve or reject the submission. You can always come in later to revoke the approval if needed. You might also see that the member has submitted updates to their existing approval. In that case the approval is still valid while updates are pending, unless it is otherwise revoked or expired.
Learn more in the Approving access reference section.
Both administrators and members have the option to leave a comment on any requirement submission. The act of leaving a comment will trigger an alert for administrators, or a notification for members.
Especially when you are rejecting or revoking a requirement submission, we recommend leaving a comment to let the member know what they did wrong, or what they still need to do to gain access.
It's often a good idea to double check which researchers have access to your datasets.
In order to audit a particular member's access to your organization's datasets, double click on that member, and navigate to the Access overview tab. This will list all of your organization's datasets, and the member's corresponding access to each.
In order to audit all members that have access to a particular dataset, click the filter icon to the right of the member search bar. This will allow you to filter all members by their access to, and usage of, your organization's datasets.
If you are part of a larger institution, you can contact us about getting an institution-wide Redivis page to help users discover new organizations and datasets.
All activity surrounding your organization and its data usage is logged and accessible to organization administrators. This can be helpful for year-end reports or any potential data security concerns!
You can easily organize, visualize, and download any of your organization's information by generating a report in your administrator panel.
Navigate to the Reports tab of your administrator panel and click the + New report button.
In the future you can return to this page to see your existing reports and create duplicates to further edit.
Start by choosing the entity, or topic, of the report. Whatever you choose here will define what each entry in your report represents.
You can choose a Resource such as Members, Workflows, Studies, or Datasets which will allow you to get a sense of the named entities your organization owns or has interacted with.
Or you can choose a Usage event such as Queries, Notebooks, Exports (or all of these combined) which will allow you to see how your organization's data has been used.
From here, define the other criteria of your report:
If you choose a usage event as your entity you will need to set a time frame to define which events will be included. (Resources don't have a time associated with them since they are not events.)
If you chose a usage event you can optionally choose to aggregate events based on a resource or period of time. Perhaps you want to see all queries grouped by member, or want to see all notebooks per day. You can also create custom aggregation groups by using member labels and dataset labels with colons. E.g. labeling members with school:medicine
and school:business
would allow you to aggregate a report by school
.
You can additionally choose a filter that will restrict your report to only events or resources that meet your chosen criteria. For example, if you want to see all usage events that referenced a particular dataset.
Learn more in the Reports reference section.
Finally, you'll need to select what fields or topics are present in the report. Every field selected here will become a column in your resulting report table. Different fields are available based on what entity you have selected.
Some fields are selected by default that might be helpful but you can make your own selections. For example, maybe you have a report of data usage and you want to include the average number of tables referenced in each notebook session, or the max amount of compute used in queries per day.
Click the Create button to generate this report! This will generate a table with entries that match all the criteria you've defined above. You can browse this table to understand the data and make sure it looks how you'd expect.
If your report contains any numeric data you can click the Visualize tab to see a visual representation of 1-2 of your numeric fields. The type of graph will be chosen by default based on the shape of your data. You can hover on this graphic to learn more about each entry.
You can update this report by clicking the Edit button. Changing any criteria will refresh the table and visual with new contents based on your new criteria.
This report is now saved in your organization's reports and you can access it in the future.
Any time you edit the report or come back to it after more than one hour has elapsed, the table will be automatically regenerated. Based on the time frame you've selected this might drastically change the contents.
If you'd like to work with the data in a different environment, you can download the report table as a CSV by clicking the Download button.
If you are part of a larger institution, you can contact us about getting an institution-wide Redivis page to help users discover new organizations and datasets.
One of the main tasks of organization administrators is to manage who has access to datasets. Redivis has tools to support any level of complexity in data systems for your restricted data and makes it easy for organization administrators to define, manage, and audit access rules across these various use cases.
Do you mostly have public datasets? Will you have a small group of researchers with access to pretty much every dataset? Will you need to gather information from researchers before they use your datasets? Do you have high risk data that can't ever leave the system? Or some combination of the above?
Redivis supports access systems for all of these configurations – for organizations of varying sizes – but you'll want to use different tools to achieve these very different goals.
For example, if you have few datasets, you might find that permission groups for bulk managing access rules aren't particularly helpful. Or if you have a small group of researchers who you all know closely, you might not need to set up process-driven access rules through requirements, and can instead grant access to certain researchers on an individual basis.
We'll walk through all the tools available to organizations for managing access, but you'll likely want to pick and choose from what's available to meet your organization's specific needs.
All interactions with data on Redivis require the user to have the appropriate access level to the data for a given action. Ideally your data would be as permissive as possible to allow for the greatest exploration from researchers before they need to start applying for access.
Dataset access has five levels:
Overview: the ability to see a dataset and its documentation.
Metadata: the ability to view variable names and univariate summary statistics, but not to retrieve any identifiable information or multivariate relationships from the data.
Sample: the ability to view and query a dataset's 1% sample. This will only exist for datasets that have a sample configured.
Data: the ability to view and query a dataset's tables, and work with them in workflows.
Edit: the ability to edit the dataset and release new versions.
Access levels are cumulative. For example, in order to gain data access you will need to have gained metadata access as well.
We strongly recommend making your dataset's metadata as open as possible. This will reveal variable names and aggregate summary statistics, but will not allow researchers to view, query, or export the raw data in any way.
Being able to see metadata greatly improves researchers' discovery experience, allows them to better assess a dataset's utility upfront, and can even reduce your administrative workload. If researchers can understand a dataset before applying for access, they'll submit fewer access applications to datasets that are ultimately a dead end.
Learn more in the Access levels reference section.
Membership
Anyone wanting to apply for access to restricted datasets in your organization must first be a member. You can configure whether memberships are restricted to certain identity providers (such as your institutional login), and whether they are approved automatically or require administrator review. You also have the option to configure access to datasets to "All members."
Direct access
Permission granted directly to a researcher to instantly gain access to a dataset at a specific level. Researchers can also request access to datasets with this configuration.
Example usage: a dataset that will only be shared with a small number of people who are already known to administrators
Member requirements
A form for members to fill out. This can be set to require approval from an administrator or be automatically approved. It can also have an expiration date. These are global to your organization and when assigned to multiple datasets a user will only fill it out one time.
Example usage: a demographic form gathering researcher personal information, or a data use agreement signed PDF
Study requirements
Similar to requirements, but instead of each user needing to fill them out individually, only one requirement needs to be completed for the entire study, which can include multiple researchers. A single researcher may also have multiple studies. Each study working with the dataset will need to fill out its own study requirement, and any queries or exports of the data will be tied to that study.
Example usage: a funding proposal for a research project
Data export restrictions
A rule defining that a dataset can only be exported to a specific export environment, as configured on the Administrator panel Settings tab.
Example usage: limiting exports to a specific server environment
Learn more in the Configuring access reference section.
If you want to work with requirements, you'll want to get started making them and planning out how they will work across datasets.
Perhaps you want one requirement for all members to fill out about their field of research, which is necessary to gain access to any of your datasets, but another 4 requirements with different data use agreements that will apply only to their specific datasets.
To get started, go to the Requirements tab of the administrator panel and click the New requirement button. You will need to start by selecting if this will be a Member requirement or a Study requirement.
You can use the form builder to collect different information, including standard form responses, file uploads, and e-signatures.
Learn more in the Requirements reference section.
A permission group is an access configuration that can be assigned to multiple datasets and managed centrally.
You don't need to use permission groups, and it might not make sense to do so if each of your datasets has a different access configuration and you aren't using requirements. But if you have any overlap between datasets and want to enforce consistency, or want to use requirements, you'll want to make one.
To get started, go to the Permission groups tab of the administrator panel and click the New permission group button.
This interface requires you to set an access paradigm for each access level of the dataset.
Perhaps you will set the overview access level to be public, the metadata access level to be available to all members, and data access level to be direct access (meaning you will have to directly grant access to users, or respond to their requests for access).
Or perhaps you will set the overview access level to public, and assign multiple requirements to the metadata and data access levels (meaning that anyone who is approved for all of the requirements will automatically gain that access level).
For any case you can assign data export restrictions here and choose whether you want to manage access to the dataset's sample (if it exists) differently than the full data.
If overview access to a dataset isn't public, non-approved users will not be able to see the dataset or its name in any way. In some cases, this may be the intended behavior, but remember there will be no way for researchers to apply for these datasets.
Instead, for these hidden datasets, an administrator will need to first explicitly grant overview access before researchers can view the dataset and request further access.
Learn more in the Permission groups reference section.
Finally you'll need to apply these access permissions to actual data!
Open any of your datasets and click the Configure access button on the top of the page. This configuration setup will look very similar to configuring the permission group.
You can either create a custom configuration here, or you can assign this dataset to one of the Permission groups by clicking the dropdown menu in the top right corner of this modal.
This is also where you will manage any direct access requests for the dataset.
As an administrator of this organization, you will have access to all datasets no matter what your access configuration is or what requirements you have filled out.
We recommend checking that your access system works as you expect by either looking at a dataset while you are logged out or in your browser's incognito mode, or by making a second Redivis account using a different email address that is not linked to your administrator account.
Grant access to data
You have some shiny access systems, but they won't work if you don't approve user requests for access.
Learn more in the Grant access to data guide.
If you are part of a larger institution, you can contact us about getting an institution-wide Redivis page to help users discover new organizations and datasets.
Redivis notebooks provide a performant, flexible environment for analysis, allowing you to analyze and visualize data in workflows in Python, R, Stata, or SAS. With the notebook computation happening on Redivis, you don't need to configure an environment on a local machine or server, or export data from Redivis. This makes for easy iteration and collaboration, not to mention ensuring better security and data throughput.
Once you have a table that you're ready to analyze, you can create a notebook by clicking the + Notebook button at any time. You'll need to name it and choose a kernel (Python, R, Stata, or SAS).
Notebooks can only reference tables within their workflow, so we recommend keeping all related work together in the same workflow.
The default notebook configuration is free, and provides access to 2 CPUs and 32GB working memory, alongside a 60GB (SSD) disk and gigabit network. The computational power of these default notebooks is comparable to most personal computers, and will be more than enough for many analyses.
Clicking Edit compute configuration button in the start modal or the toolbar will allow you to choose from different preconfigured machine types. The notebook will then default to this compute configuration each time it starts up.
Notebook nodes need to be started in order to edit or execute cells. When first clicking on a notebook node, you will see a read-only view of its contents (including cell outputs). Click the Start notebook button in the toolbar to connect this notebook to compute resources.
When you create a notebook for the first time it will start automatically.
To do meaningful work in your notebook, you'll want to bring the tabular and/or unstructured data that exists in your workflow into the notebook.
Notebooks come pre-populated with templated code that pulls in data from the notebook's source table. You will need to run this cell to pull the data into the notebook, and you can see that it worked because this code will print a preview of the loaded data.
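The templated cell looks roughly like the sketch below; the table name is a placeholder, and the exact template and client methods should be checked against the Redivis Python client reference:

```python
import redivis

# Load the notebook's source table into a pandas DataFrame
# ("example_table" is a placeholder for the source table in this workflow)
df = redivis.table("example_table").to_pandas_dataframe()

# Preview the loaded data
df.head()
```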
Any files with unstructured data stored in Redivis tables can be referenced by their globally unique file_id
. You can also reference these file_id's in any derivative tables, allowing you to query and download specific subsets of files.
When working with large files, you'll want to consider saving the files to disk and/or working with the streaming interfaces to reduce memory overhead and improve performance.
At this point, you have all the tools you need to work with your data in your chosen language. The Python, R, Stata, and SAS ecosystems contain myriad tools and libraries for performing sophisticated data analysis and visualization.
Notebooks can produce an output table, which you can sanity check and further analyze in your workflow by including in other notebooks or exporting to other systems.
All Redivis notebooks support real-time collaboration, allowing multiple editors to edit and run cells in a running notebook. When another editor is active in a notebook, you will see a colored cursor associated with them (much like a Google Doc).
Your Redivis account establishes your identity on Redivis, and provides a home for various forms of data-driven investigation and collaboration.
Because researchers regularly move between institutions, and often want to be able to reference and access historic work, it is strongly encouraged that you:
Only have a single Redivis account, much as you would only have a single Google Scholar account.
When you apply for access to an organization's datasets, they will be able to see some information pertinent to your membership in their organization:
Full name
Contact email
Email / institutional identifier
Affiliation, if provided by your institution's login provider
For more extensive full-workflow video walkthroughs, see our Events and press page:
Or read through our Example workflows:
By creating an account on Redivis, you can gain access to hundreds of restricted datasets and powerful querying tools. There is no cost to creating an account, and you may permanently delete your account at any time.
Your account tracks all of your workflows, queries, collaborations, and data approvals with different organizations. After creating an account, you will be able to apply for membership with various organizations on Redivis and request access to their restricted data.
E.g.: I want the date of the first and last hospital admission for every patient in my table
Let's take an example table which contains all hospital visits in one facility over many years. Each record contains information about what happened in the visit, including a patient id to identify who was present. An individual patient might be represented in one or many records in this table since each record represents a separate hospital visit.
Let's say that for our research we want to find the first and last encounter — or more formally, the min and max encounter date for each patient.
In this example we will use a partition to create a new variable since we want the rest of the data to remain unchanged.
Conceptually we will define our aggregation variable as the date of the encounter since that is where we want to look for the min and max values.
We will define our partition as the patient id since we want to find the min or max for each patient. This will limit the min or max value scan to all values of our aggregation variable across each unique value of our partitioned variable (patient id).
For this particular case, we want to find the MIN and MAX of the encounter_date
of each patient; in other words, partitioned on patient_id
. This will create two new variables, min_encounter_date
and max_encounter_date
for each patient.
When we run this query, note that no records will be dropped; rather, these new variables will have consistent values for any given patient_id
.
Now, let's say we only want to keep records which contain each patient's first and last encounter. We can do this easily by creating a new row filter:
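As an illustration outside of Redivis, the equivalent logic can be sketched in pandas (in the transform this is a partitioned new-variable step followed by a row filter; the toy data below is purely illustrative and follows the column names used in this example):

```python
import pandas as pd

# Illustrative stand-in for the source table: one row per hospital visit
visits = pd.DataFrame(
    {
        "patient_id": [1, 1, 1, 2, 2],
        "encounter_date": pd.to_datetime(
            ["2015-03-02", "2016-07-19", "2019-11-05", "2014-01-10", "2018-06-21"]
        ),
    }
)

# Partitioned aggregation: min/max encounter date per patient, attached to every row
by_patient = visits.groupby("patient_id")["encounter_date"]
visits["min_encounter_date"] = by_patient.transform("min")
visits["max_encounter_date"] = by_patient.transform("max")

# Row filter: keep only records matching each patient's first or last encounter
first_last = visits[
    (visits["encounter_date"] == visits["min_encounter_date"])
    | (visits["encounter_date"] == visits["max_encounter_date"])
]
print(first_last)
```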
We can also generalize this approach to find the Nth encounter for each patient — take a look at the RANK, DENSE_RANK, and ROW_NUMBER methods.
And if we want to be more specific on our partition (e.g., for the first / last encounter in a given calendar year), we can always apply additional variables (year
) when defining our partition.
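For instance, a pandas analogue of ROW_NUMBER (continuing the illustrative sketch above) numbers each patient's encounters in date order, which makes it easy to pull out the Nth encounter:

```python
# Number each patient's encounters in ascending date order (1 = first encounter)
visits["encounter_number"] = (
    visits.sort_values("encounter_date").groupby("patient_id").cumcount() + 1
)

# e.g. keep each patient's second encounter
second_encounters = visits[visits["encounter_number"] == 2]
```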
Redivis has some built-in methods for helping you find and manage all of your organization's members, workflows, datasets, and access systems that you can use to ease your workflows.
In this example we will imagine that we need to get in touch with everyone working with a specific dataset.
We can go to the Datasets tab of the panel and search or filter to locate our dataset.
We can right click on this dataset and hover on View related to see all of the options relating to this dataset. In this case we want to see all workflows that have used this dataset.
Clicking this option navigates us to the Workflows tab, with a filter active for this dataset. This list is showing all workflows that contain this dataset.
Click the Select all checkbox to select all of these workflows. With these selected we can right click on any of them (or click the Actions button that appears in the top right) to View related activity.
In this menu we should select Members in workflows. This will navigate us to the Members tab filtered by all members who are owners or collaborators on the workflows we had selected.
Now that we have narrowed it down to the members we are interested in, we can right click on the header (or the options menu in the top right of the list) to edit the columns shown. In this menu we can add the Contact email column.
The quickest way to get this information out of Redivis is to go back to that same menu and click Download CSV to begin downloading a CSV of the list we are looking at exactly as you've configured it (all columns, filters, and sorting will be preserved).
We can use these similar steps to view:
Queries run on a specific set of datasets
Exports made from a workflow
Requirements a member has filled out
And many more!
Notebooks on Redivis offer a performant and highly flexible environment for data analysis. This includes the ability to run state-of-the-art machine learning (ML) models: training new models, fine-tuning existing ones, and using them to perform inference and generate novel outputs.
This guide is focused on a number of common use cases when running ML workloads on Redivis. Here we generally focus on using the Hugging Face + PyTorch ecosystem in Python, though these examples are broadly applicable to other ML libraries and languages.
For a detailed example of using Redivis to fine-tune a large-language model, see the complementary example.
Training ML models and running inference can require a substantial amount of compute capacity, depending on various factors such as your model and dataset size, usage parameters, and performance goals.
The default, free notebook on Redivis offers 2 CPUs and 32GB of RAM. While this may work for initial exploration, running practical machine learning workflows typically requires a GPU. When creating your notebook, you can choose a custom compute configuration to match your needs.
Redivis offers a number of custom compute configurations, mapping to the various machine types available on Google Cloud. We recommend starting with a more modest GPU for initial exploration, and then upgrading as needed when computational or performance bottlenecks are reached. For this example, we'll use the NVIDIA L4 GPU, which provides solid performance at a reasonable cost.
The Redivis python notebook is based off the jupyter-pytorch notebook image, with PyTorch, CUDA bindings, and various common data science libraries pre-installed. However, if you require additional dependencies for your work, you can specify them under the "dependencies" section of your notebook.
Models can also be stored within a dataset on Redivis, in which case you won't need to pre-load the model, but can rather import it directly from the Redivis dataset (see step 3 below)
If your notebook references export-restricted data, for security reasons, internet will be disabled while the notebook is running. This can present a challenge for some common approaches to ML in python, such as downloading a model or dataset from Hugging Face. For example, we might reference a model as follows:
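For instance, a typical Hugging Face loading call might look like the following (a minimal sketch, assuming the sentence-transformers package is installed):

```python
# Typical pattern that fetches model weights from Hugging Face at runtime.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
```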
This code typically downloads the model from Hugging Face and caches it to our local disk. However, if internet is disabled, this command will hang and ultimately fail. Instead, we need to download the model during notebook startup, before the internet is disabled, as part of the post_install.sh script under the notebook's dependencies:
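A minimal sketch of such a preload step, assuming the huggingface_hub package is available as a notebook dependency; in practice this could be invoked from post_install.sh (for example via a small Python script):

```python
# Hypothetical preload script, run during notebook startup (before internet is disabled).
# Assumes the huggingface_hub package is installed as a notebook dependency.
from huggingface_hub import snapshot_download

# Downloads the model files into the default Hugging Face cache (~/.cache/huggingface/hub).
snapshot_download("sentence-transformers/all-MiniLM-L6-v2")
```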
This will download the model weights and other files to the default Hugging Face cache directory, ~/.cache/huggingface/hub.
Now, within our notebook, we can load the cached model. Make sure to set local_files_only=True, so that Hugging Face doesn't try to connect to the internet to check for a newer version of the model.
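A minimal sketch of loading from the local cache (shown here with the transformers library; other Hugging Face-based libraries accept a similar local_files_only argument):

```python
# Load the model from the local cache only; no network access is attempted.
from transformers import AutoModel, AutoTokenizer

model_id = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id, local_files_only=True)
model = AutoModel.from_pretrained(model_id, local_files_only=True)
```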
Some libraries may require you to provide a full path to the model files. For the model mentioned above, this would be ~/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2/snapshots/<snapshot_id>.
To find the appropriate path, list the contents of the ~/.cache/huggingface/hub directory and its descendants.
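For example, a quick way to list the cache contents from within the notebook:

```python
# Print every file under the Hugging Face cache to locate the snapshot path.
import pathlib

cache_dir = pathlib.Path.home() / ".cache" / "huggingface" / "hub"
for path in sorted(cache_dir.rglob("*")):
    print(path)
```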
Machine learning models can be stored directly within a Redivis dataset as unstructured files. For example, this dataset contains the various files that make up the bert-base-cased model on Hugging Face. We can then download the model to our notebook's local filesystem:
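A rough sketch of what this download might look like with the redivis-python client; the owner, dataset, and table names below are placeholders, and the exact file-download method should be confirmed against the redivis-python documentation:

```python
# Sketch: download all files from a dataset's file index table to local disk.
# The owner/dataset/table names and the download_files signature are assumptions;
# consult the redivis-python documentation for the exact API.
import redivis

table = (
    redivis.organization("demo")        # hypothetical owner
    .dataset("bert_base_cased_model")   # hypothetical dataset name
    .table("model_files")               # hypothetical file index table
)
table.download_files(path="./bert-base-cased")
```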
And then reference this as a local model. E.g.:
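For example (a sketch, using the local path from the previous step):

```python
# Load the downloaded files as a local model.
from transformers import AutoModel, AutoTokenizer

local_path = "./bert-base-cased"  # wherever the model files were downloaded
tokenizer = AutoTokenizer.from_pretrained(local_path)
model = AutoModel.from_pretrained(local_path)
```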
When using models stored on Redivis, we don't have to worry about whether our notebook has internet access, nor do we need to rely on the future availability of that particular model on Hugging Face.
If your notebook has internet access, you can also use any other models that may be available on the internet. For example, we can load the same bert-base-cased model directly from Hugging Face:
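A minimal sketch using the transformers library:

```python
# With internet access, models can be pulled straight from the Hugging Face hub.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")
```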
Note that if you are working in a notebook with disabled internet, this approach won't work, and you'll need to use the methods mentioned in either 2a or 3a above.
As a final step, you'll likely want to load data into your notebook to either fine-tune the model, perform inference, or otherwise experiment. There are thousands of datasets on Redivis, and you can upload your own data as well. You can learn more about loading data into a notebook in the python notebooks documentation. As a quick example:
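A quick sketch, following the redivis-python convention of referencing the notebook's source table (adjust the table reference to match your workflow):

```python
# Sketch: load the notebook's source table into a pandas DataFrame.
import redivis

df = redivis.table("_source_").to_pandas_dataframe()
df.head()
```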
Of course, assuming your notebook has access to the external internet, you can also call various APIs to load external data sources.
At this point, you have all the tools at your disposal to perform cutting-edge ML research. What you do next is up to you, but we recommend further familiarizing yourself with the examples and detailed documentation to take full advantage of the capabilities of Redivis notebooks.
In order to create a Redivis account, you must authenticate through your academic institution, through any Google account, or via passwordless login. Over 2,500 academic institutions from around the world are currently supported, with more being added regularly.
Redivis will request some basic information when you authenticate, such as your name, email, and any relevant affiliations; it will not have access to any other information through your Google or institutional account.
It is strongly encouraged that each individual has only one Redivis account, and that your identity on Redivis maps to your real-world identity. However, it is common to have multiple emails that you want to associate with your Redivis account: a personal email, a university email, a visiting-scholar university email, etc.
When you join an organization, your membership in that organization will be associated with a particular login. In order to work with that organization's restricted data, you must have recently authenticated with the login associated with your membership. If you sign in to Redivis using a different login (for example, you signed in with your personal email, while your membership is associated with your academic institution's credentials), you'll be prompted to re-authenticate with the relevant login before you can work with restricted data.
If you lose access to the login originally associated with your membership — for example, if you change institutions — you can request that your membership be updated to associate it with a new login. An administrator of the organization will then need to approve this request before access through your membership will be restored.
If you have more than one Redivis account, we strongly encourage you to merge them. In order to merge multiple accounts, you should perform the following steps:
Identify which account should be your "primary" Redivis account once the process is complete.
Log in to your non-primary account.
Transfer ownership of all datasets, workflows, and studies to your primary account. Alternatively, you can delete any datasets, workflows, and/or studies that are no longer relevant.
Log in to your primary account, and add the authentication(s) that were previously associated with your non-primary account to your primary account.
Redivis offers extensive tools for previewing tabular data and transforming it in workflows, but the data needs to be uploaded correctly in a tabular format for researchers to utilize these tools.
If importing, you'll want to get the relevant external account configured to your Redivis account before getting started.
The import tools allow for multiple uploads, so no need to combine files together before importing them, but it's helpful to have them all in the same place.
On your newly created dataset, the first step to uploading data is to create one or more tables that the data will be uploaded to.
The data files you currently have may or may not be how you want to store them on this dataset, so it's important to think about your data's structure before getting started.
For example, if you have multiple files that all follow the same schema, we strongly recommend uploading them as one table (for example, if you have a separate table for each year, or a separate table for each state, but the structure of each is the same). In the example of one file per state, this would allow researchers to query across all states, skipping the first step of doing up to 50 joins. Additionally, you generally shouldn't split out tables for performance reasons — even when querying billions of records, Redivis will execute in seconds.
When you're ready, click the Create new table button on the Tables tab of the dataset page and name your table to get started.
To get started uploading, choose the data source. By default this is your computer, but you can choose any option from the dropdown menu.
Next, choose the file(s) or enter the paths of the file(s) you want to import.
If you select multiple files here, they will be automatically appended into this single table on upload, based on common variable names. If a variable is missing in some of the files, that's ok; it will just be recorded as null for all records in that file.
Once your files are selected, click the Import button. If the files are coming from your computer, you might need to wait until they are finished uploading to the browser before they can be imported into Redivis.
As you upload files, you will see an overview of each file's progress, and you can click to view each file's data and additional information.
Once all uploads have completed, you can inspect the table (representing the concatenation of all of your uploads). Make sure to check the summary statistics and other analytical information to validate that the data are as you expected.
If you have more files to upload you can click the Manage imports button on the right side of the table at any time (up until releasing this version of the dataset).
Great metadata makes your dataset usable. Next, complete your metadata, configure access, create a sample, and release this version.
When working with non-public data on Redivis, it's often important for users to be able to authoritatively attest to their affiliation with a given institution or other entity. To this end, Redivis supports Single Sign-On (SSO) through most academic institutions, as well as the ability to establish identity through a Google account or validated email address.
In most cases, logging in with your institution will "just work". However, due to inconsistencies in how certain standards are applied around the world, you may run into issues when logging in through your institution. These issues can often be resolved with a quick ticket with your IT support desk – we recommend that you direct them to this page and copy support@redivis.com so that we may provide further technical information if needed.
Some common issues are outlined below:
This error will occur after you've logged in with your institution, upon being redirected back to Redivis. In this case, the authentication request completed successfully, but your institution didn't provide enough information for Redivis to know who you are (which is important in order for you to apply for restricted data, so that the data distributor can be confident of who they're granting access to!).
While uncommon, it's certainly possible that other errors might occur when logging in through your institutional credentials. If you do encounter one, please contact support@redivis.com and we'll be happy to help you troubleshoot.
Redivis also supports the ability to sign in via any Google account. This can be a personal Gmail account, or an account from your organization if it supports Google single sign-on. When you sign in with Google, your name, email, and an opaque persistent identifier will be shared with Redivis.
If your institution supports Google sign-on, but is also listed as a SAML identity provider (see above), the SAML SSO will be preferred. If you try logging in via Google, you will be redirected to your institution's login page.
If your institution isn't listed and doesn't support SSO through Google (e.g., many .gov email addresses), you can also sign in via any email address.
Redivis will send a unique code to this email every time you log in, making it such that the account owner continuously "proves" their ownership of the given email address.
Before working with a notebook you'll want to get started by creating a workflow and adding data. You can then create a notebook off of any table in your workflow.
If you are working with very large tables (>10GB is a good rule of thumb), it's always a good idea to first reshape and reduce the data via transforms, since they can be significantly more performant for large data operations than running code in Python, R, Stata, or SAS.
Python notebooks come pre-installed with a variety of common scientific packages for python.
R notebooks come pre-installed with a variety of common scientific packages for R.
Stata notebooks are based off of python notebooks, but offer affordances for moving data between Python and Stata.
SAS notebooks are based off of python notebooks, but offer affordances for moving data between Python and SAS.
All notebooks come with a number of common packages pre-installed, depending on the notebook type. But if there is something specific you'd like to include, you can add dependencies or write a post-install script by clicking the Edit dependencies button in the start modal or the toolbar.
Learn more in the reference section.
If you're working with larger tables, creating an ML model, or performing other particularly intensive tasks, you may choose to configure additional compute resources for the notebook. This will cost an hourly rate to run based on your chosen environment, and will require you to purchase compute credits on your account.
Learn more in the reference section.
You can reference any other tables in this workflow by replicating this script and executing it with a different table reference, as sketched below. As a rule of thumb, notebooks will easily support interactive analysis of tables up to ~1GB; if your table is larger, try reducing it first by creating a transform, or make sure to familiarize yourself with the tools for working with larger tables in the notebook's programming language.
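A minimal sketch of what such a reference might look like (the table name is a placeholder):

```python
# Sketch: pull a different table from the same workflow into the notebook.
import redivis

other_df = redivis.table("other_table_name").to_pandas_dataframe()
```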
The notebook interface is based off of Jupyter, and has similar capabilities. You can also export a read-only copy of your notebook as an .ipynb, PDF, or HTML file.
Learn more in the reference section.
Share your workflow to work with collaborators in real time, and make it public so that others can fork off of and build upon your work.
If the work you're doing leads to a publication, make sure to reference the dataset pages of the datasets you've used for information from the data administrators on how to cite them.
Add multiple authentications (e.g., university credentials, personal Google account) to your Redivis account, to ensure that you can always access it.
Redivis will never, ever distribute your personal information to a third party. We will send data access notifications to your contact email, though this can be disabled in your account settings. You may opt in to receive occasional product updates, though this is turned off by default.
To create a new account, click the Create account button at the top of any page. You can create an account using your academic institution's login credentials (e.g., you@your-university.edu), by logging in with any Google account, or via passwordless authentication through any email. Your account can then be managed from your workspace settings.
It is strongly encouraged that each individual has only one Redivis account, and your identity on Redivis should map to your real-world identity. You can (and should) add multiple authentications to one account (e.g., personal email, university email, visiting-scholar university email).
Computing a min or max value is an aggregate operation that scans all records and chooses the highest or lowest value. To compute the min or max value of a variable we will want to use an analytic method. We can do this either by aggregating our table (which would drop records and/or variables) or by using a partition to calculate a new variable without changing the rest of the data.
You can see more examples of using partitions with analytic methods on the transform step reference page.
In order to add an email to your account, navigate to your account settings and click Add authentication in the "Authentications" tab. A new browser window will open requesting the relevant credentials, after which this new email will be associated with your account.
Delete the non-primary account by clicking "Delete account" within your account settings. WARNING: This action cannot be undone.
This guide assumes you have already started by creating a dataset.
You can upload files directly from your computer, or import them from a linked account.
When uploading files to tables, remember that every row in a table should represent the same "thing", or entity; we wouldn't want to combine county-level and state-level observations in one table.
If you haven't already, we very strongly recommend experimenting with the reshaping and analytic tools in a workflow, since that is how researchers will work with your dataset. Knowing how they will work with it might inform how you structure it during this setup process, and can save time for everyone. You can even add your unreleased dataset to a workflow for testing: click on "View dataset page" from the dataset overview, and then add the dataset to a workflow.
For a full list of supported file types, as well as advanced functionality (such as wildcard imports) and error handling techniques, consult the reference documentation.
Learn more in the reference section.
Learn more in the guide.
Redivis is a registered service provider within the US-based InCommon federation, which in turn is part of eduGAIN, enabling secure, authoritative SSO across thousands of universities around the world via the SAML 2.0 protocol. If you are a member of an academic institution (as well as certain other research enterprises), you can search for your institution by name and log in to Redivis through your institution's sign-in page.
If, when choosing your institution to log in, you are immediately presented with an error page (before you can type in your password), this likely means that your institution needs to add Redivis to a "service provider allowlist". As a registered service provider within InCommon / eduGAIN, most institutions will automatically accept login requests from Redivis – but some require manual configuration. In this case, your IT desk will need to take a quick action to enable Redivis – it will likely be helpful to direct them to Redivis's SAML metadata, found here:
Redivis requires all institution identity providers to provide some minimal information about the individual, such as name, email, and a persistent identifier. These are codified as the "Research & Scholarship" attribute bundle. If your institution uses OpenAthens for SSO, you can consult their documentation to learn more about releasing these attributes.
Example source table (one row per hospital encounter):

patient_id    encounter_date
1             2012-01-01
1             2011-01-01
1             2010-01-01
1             2013-01-01
2             2009-01-01
2             2008-01-01
2             2015-01-01
3             2014-02-01
Result after computing the min and max encounter_date partitioned on patient_id (no rows are dropped):

patient_id    encounter_date    min_encounter_date    max_encounter_date
1             2012-01-01        2010-01-01            2013-01-01
1             2011-01-01        2010-01-01            2013-01-01
1             2010-01-01        2010-01-01            2013-01-01
1             2013-01-01        2010-01-01            2013-01-01
2             2009-01-01        2008-01-01            2015-01-01
2             2008-01-01        2008-01-01            2015-01-01
2             2015-01-01        2008-01-01            2015-01-01
3             2014-02-01        2014-02-01            2014-02-01
Datasets have metadata fields that you'll want to complete to make them more discoverable and useful. The dataset abstract is required, but the rest of the documentation is optional and will be hidden from the public when empty. You can edit any of these sections by clicking on them in the dataset editor.
This metadata can be viewed on the dataset page and can also be downloaded in different formats. If your organization has configured DOIs, this metadata will also populate the DOI listing.
The abstract is limited to 256 characters and will show up in previews and search results for the dataset. This should be a concise, high-level summary of this dataset.
This section is intended to display information about where this dataset came from and how it came to be in its current form. Redivis will auto-populate fields where possible but you can add additional information or override it.
This field should be the individual(s) or organization(s) responsible for creating the content of this dataset. This will be linked with the appropriate ORCID iD or ROR if the individual or organization has attached them to their Redivis account. You can also add individuals and organizations that don't have Redivis accounts and include their identifiers.
This field is to attribute the work that different individuals have done to get the dataset into its current state. Redivis will automatically add anyone who edits the dataset to this field. If they have attached their ORCID iD to their Redivis account, that linkage will be shown as well. The contributor type options are aligned with DataCite standards.
If your dataset belongs to an organization, you can issue a DOI (Digital Object Identifier) by configuring your organization to issue DOIs. Any DOI issued for this dataset will remain in a draft status for seven days, to allow for a version to be unreleased. After seven days the DOI will become permanent.
This section shows the automatically generated citation for this dataset in your chosen format. This can be copied or downloaded for use elsewhere.
Changes made to the "Creators" field will be reflected in this citation. Any DOI issued for this dataset will automatically be included in this citation.
If this dataset must be cited in a specific way, you can make that clear in this field. Citation instructions will be included beneath the citation wherever it is shown.
This section is for linking other identifiers (URLs or DOIs) that are relevant to this dataset. The related identifier type options are aligned with DataCite standards. You might want to include identifiers for DMPs, papers, and more.
You can launch a bibliography which displays the citation of this dataset and every related identifier listed above.
These documentation sections are intended for more extensive information and can contain formatted text and images. These will be indexed for the Redivis dataset search.
Files of any type and up to 100MB can be uploaded to the dataset page where anyone with access can download them. These should not contain any data for this dataset, as access to them is managed separately.
Links can be added with display names to direct someone to another URL with more information.
This is where you can add the license information about your dataset's redistribution policies. If this data is governed by a common redistribution license you can select it here from the menu of standard licenses. If you want to reference a license that isn't listed here you can include the link, or upload a custom license. This will be displayed on the dataset front page to let others know how they can use your data. This information will be included on the dataset's DOI.
If this dataset was funded by an institution you'd like to recognize, this is the section where you can include information about funder(s). You'll need the funding organization's name and ROR, as well as an award number if applicable. You can add multiple funders to each dataset. This information will be included on the dataset's DOI.
This section should be used to let someone viewing this dataset know how to get in touch if there is any issue or question.
You can create documentation sections with their own titles and assign them custom access levels.
By default, all dataset documentation is visible to anyone with overview access to the dataset. However, there may be some content in the documentation that is sensitive — for example, information about named variables that would require metadata access.
To protect this information you can create a custom documentation section with a more restrictive access level. Users without the appropriate level of access will only see a placeholder for that section of the documentation.
In addition to documentation, you may add up to 25 tags to your dataset, which will help researchers discover and understand the dataset.
Additionally, information about the dataset's size and temporal range will be automatically computed from the metadata on its tables. Additional table documentation, as well as the variable metadata, will be indexed and surfaced as part of the dataset discovery process.
The information on the usage tab of the dataset page is automatically generated based on how researchers are working with the dataset on Redivis.
Basic information about this dataset's usage
Views: How many times this dataset page was loaded. These are only counted for logged-in users, and a user is only recounted after 1hr of not viewing the dataset.
Workflows: How many workflows include this dataset.
Featured workflows: How many workflows featured by organizations include this dataset.
The featured variables table shows all variables across all tables in this dataset, sorted by their popularity. This popularity is calculated based on the number of times a variable is referenced when its table is used in workflows, taking into account joining on a variable, propagating it downstream in transforms, and querying it directly.
The workspace is the home of your content on Redivis. It is created when you make a Redivis account and is accessible by clicking the My workspace button at the right of the header bar from anywhere on the site. Your workspace is only visible to you and is where you will manage your work on Redivis.
Your workspace dashboard has links and information to help orient you in the work you are doing on Redivis. Beyond the dashboard, you will see pages listed on the left bar covering different aspects of your account.
Any workflows that you've created, or that have been shared with you, will show up here. You can create a workflow by clicking the + New workflow button on this page in order to start working with data. Clicking on a workflow here will open up the workflow interface.
Any datasets that you've uploaded or can edit will appear here. You can create a dataset by clicking the + New dataset button on this page in order to start uploading and sharing your data. Clicking on a dataset in this list will open the dataset editor which will allow you to work on the dataset.
This page lists all organizations that you're a member of. If you have applied for access to any restricted data, a copy of your submissions will be listed under the organization you applied to. You can search for new organizations to join by clicking the + Find new organization button.
Studies allow users to more easily work in groups. You can create a study by clicking the + New study button on the studies tab of your workspace. Any study you are a part of will appear in your workspace. Any edits you make will appear for all study collaborators.
This section contains any dataset that you have bookmarked. You can manually bookmark a dataset by clicking on the bookmark button next to the dataset title on any dataset page. Redivis will also automatically bookmark datasets that you add to workflows, or where you have applied for access. You can remove these at any time by right clicking on a dataset in this list and selecting the Remove from library option.
All activity you take when working with data on Redivis is logged. You can visit the Logs page of your account to view the details of your Query, Notebook, and Export actions.
You can specify the following information, public to all users and organizations on Redivis:
Name: This is your full name displayed across Redivis. This will appear when collaborators search for you to share their workflows, and the name administrators will see when interacting with any access steps you've submitted. It does not need to be unique.
Username: A unique handle for your account.
Profile picture: A picture visible next to your name throughout the platform.
ORCID iD: Link your ORCID iD so organizations can see research you are working on. See orcid.org for more information on setting up an account.
Disciplines: Areas of interest and study that best represent your work and research.
You can use any email listed here to log in to your account. If you have more than one email address, the one marked as 'Primary' will be the one to receive any emails. You can authenticate with additional institutional or Google emails by clicking the Add authentication button. If the email you add is associated with an institution or organization, the authenticated information we receive from them will be listed here. A single authentication will be used by organizations to verify your identity when you apply for membership.
To allow you to import your data from wherever it resides, Redivis supports multiple integrations with outside services.
Redivis will only ever read data from these sources; it will never modify or overwrite content. Additionally, Redivis will only access and transfer content when you specify a location in one of those sources as part of the data upload process.
You can create and manage tokens here for working with the Redivis API. See the API documentation for more information.
You can use compute credits to purchase advanced compute environments for notebook nodes within workflows.
This tab allows configuration of the following:
Contact email: Specify a contact email to receive email notifications for any update to your data access applications. If you apply to access data hosted by an organization, that organization will be able to see your contact email. By default, this is the email you used initially to create your Redivis account.
Communications: Allows you to configure email notifications for any update to your data access applications, as well as occasional product update emails regarding changes to Redivis. Note that, regardless of your email notification settings, you will still see a comprehensive list of notifications in the web platform.
Security: Shows all current sessions of Redivis across devices and browsers, and provides the ability to log out of any of these sessions.
Delete account: Allows permanent deletion of your account, including all workflows you have created.
Note you cannot delete your account if you are the only remaining administrator of an organization, or if you still have any non-deleted datasets.
In order to enable Stata notebooks for your account, you can provide your Stata license information here (alternatively, you'll be able to use Stata if one of your organizations provides a license). Specifically, you'll need the license "code", "authorization", and "serial number", which should all have been provided as part of purchasing Stata.
In order to enable SAS notebooks for your account, you can provide your license information here (alternatively, you'll be able to use SAS if one of your organizations provides a license). Because of how SAS handles cloud deployments, the steps to enable SAS are a bit more complicated than for Stata. Specifically, you'll need to complete the following steps:
Run the SAS Deployment Wizard to install SAS Studio on a supported Linux 64-bit operating system. During the installation, change the default location for the SAS Studio installation to /usr/local/SASHome.
For details about how to install SAS Studio, refer to the installation and configuration instructions provided by the product.
Create a TAR file that includes the SASHome directory (e.g., tar -cvf SASHome.tar -C /usr/local SASHome).
Once this tar file has been created, please reach out to contact@redivis.com to share the file so that we can enable SAS for your account.
Datasets are a core component of Redivis. Datasets contain various metadata and documentation, as well as one or more tables and/or files containing data.
All datasets have their own persistent URL and are uploaded by either a user or an organization. Datasets can be added to workflows to analyze and combine with other datasets across Redivis. Datasets are automatically versioned and you can always update the version you are viewing or working with.
Some components of a dataset may not be available to you until you are granted access. In order to see the existence of a dataset, you must at least have overview access.
You can create your own datasets to use in your workflows and share with colleagues, or create datasets within any organization that you administer.
A study is a group of users working together with a common goal. Studies allow users to apply for access to study requirements on datasets as a group, and they also make sharing workflows and user-created datasets easier.
You can create a study by clicking the + New button on the studies tab of your workspace. Any study you are a part of will appear in your workspace. Any edits you make will appear for all study collaborators.
You can add collaborators to your study. Collaborators can edit the study and will by default have edit access to all workflows created in this study.
When you create a study, you are assigned as the study's PI in the collaborators list. Organization administrators use this field to better track studies applying for their data.
If you add more users to a study you can change the PI assignment. Hover over a user's name and click Make PI to transfer the role. Note that once you change the PI you will no longer have the ability to change the PI assignment.
You can also give a study a description, which will help administrators and other study members better understand its goals.
Any workflow created in this study will share edit access with all other collaborators in this study by default.
The list of datasets will be automatically populated as you use datasets in workflows in this study. You can also add datasets to it directly in order to make the access process easier later on.
Compute credits can be used to provision advanced compute environments for notebooks.
The default notebook configuration is a free resource on Redivis, and has access to 2 CPUs and 32GB working memory, alongside a 60GB SSD disk and gigabit networking. This is similar to most personal computers, and for many analyses should be plenty of compute power!
If you are working with larger tables, creating an ML model, or need to use a GPU you can choose to configure more advanced compute resources when setting up your notebook.
Advanced compute environments will cost a number of compute credits per hour to run depending on the machine type. This amount will be clear when configuring and starting the notebook. In order to start a notebook with an advanced compute configuration you must have enough compute credits to cover at least 15 minutes of running time.
Credits cost $0.10 each and can be purchased in increments of 100 ($10), 200 ($20), 500 ($50), or 1000 ($100). Click the Purchase credits button and choose the amount you'd like to purchase.
You can purchase credits immediately using a credit card or bank account, or through an invoice.
When clicking Checkout you will be rerouted to a Stripe payment processing page. The card you enter here will be processed by Stripe and never seen by Redivis or stored in our databases.
You can choose to generate an invoice for your credit purchase. Any information you enter in the custom fields section will appear on the invoice and might be required by the organization paying the invoice. Once generated the invoice will be downloaded as a PDF to your computer. You can also return to view the invoice and copy the link to the invoice to pay electronically. Once the invoice is paid the credits will appear in your account.
You can set up your account to purchase credits every time your account dips below 10 compute credits. This ensures that your notebooks will never halt mid-session due to a lack of compute credits. Here, you can select the amount of credits you'd like to purchase every time this condition is met and the card you would like to be charged for it. If you'd like to change the amount or cancel auto-purchase you can do so by returning to this screen.
Credits on Redivis never expire and can be refunded at any time. If you would like to refund the compute credits on your account please contact us to initiate the refund process.
Redivis uses Stripe for all payment processing. If you'd like to view or edit your card on file or view previous purchases you can do so in the Stripe billing dashboard, which is linked from this page.
User accounts are currently limited to 10GB of storage. You can see your current usage on the Settings page of your Workspace, under the Compute Credits and Billing tab.
If you'd like to increase that limit and host larger datasets, Contact us about creating an Organization!
A table is where tabular data is stored on Redivis. A table can be thought of as a grid made up of variables (columns) and rows (records). You'll see tables:
On the Tables tab of datasets
Clicking the header or any cell in a column will display the variable's summary statistics. You can also right click a cell to filter by a specific value, or sort on that variable.
For any Geography type variables, you can hover on the cell or click to see a preview.
Choose one of the templates or type your own SQL query, and click the Run button to see the results. You can select any column in the query results to view the summary statistics for that column.
If you update the query after execution, it will become "stale" (marked in yellow), which is a reminder that the results no longer match the query you are typing. The previous results will still be shown to help inform the new query you are writing.
On table nodes in a workflow
Every table has a table viewer where you can explore the data, given you have the appropriate access level. There are three tabs where you can view the data and interact with it in different ways:
Variables view (requires metadata access)
Cells view (requires data access)
Query view (requires data access)
The Variables tab provides a searchable list of the table's variables. In order to view this tab, you will need metadata access to the corresponding table.
Clicking any row will display the variable's statistics panel, which shows some common summary statistics calculated from the variable's data.
The Cells tab allows you to quickly preview and explore the table's data as a grid, regardless of the table's size. In order to view this tab, you will need data access to the corresponding table.
The order of records within a table is arbitrary and non-deterministic, unless the table is the result of a transform or query that specifies an ordering.
If the order of records is considered to be part of the "data" in your table, you should create an additional "record number" variable that can be sorted on. Alternatively, you can upload your source files as unstructured files, in which case an exact copy of the original file will be stored.
Workflows provide the most robust and reproducible mechanism for querying tables on Redivis. However, you can execute SQL queries directly on a table when you want to perform a quick, one-off analysis. In order to perform a query, you will need data access to the corresponding table.
The Query tab on a table contains a SQL editor with scaffolding to reference that table. The query syntax follows the same SQL rules used in transforms. Additionally, query results may not exceed 10GB – if you need to materialize larger results, use a transform for your query.
Name: The table's name. If in a dataset, it must be unique across all tables for that version of the dataset. If in a workflow, it must be unique across all tables currently in the workflow.
Description: Optional. A free-form description of the table's contents. May not exceed 5000 characters.
Variable count: Total number of variables in the table.
Row count: Total number of rows, or records, in the table.
Size: Total size of the table, in bytes.
Entity: Optional. Documents the concept that one record in this table represents. For example, the table's entity might represent a unique patient, or a specific hospitalization, or a prescription.
Temporal range: Optional. The range of time that this table covers. This can either be set manually, or calculated from the min/max of a particular variable. If calculated from a variable, that variable must have type date, dateTime, or integer. If the variable is an integer, its values will be assumed to represent a year and must be in the range [0, 9999].
Sample: If this table is sampled, you will see a marker for whether you are looking at the full dataset or the 1% sample. To interact with sampled tables, add the dataset to a workflow.
All datasets on Redivis are owned by either an individual user or an organization.
To create a dataset owned by you, navigate to your workspace, and under Datasets, choose to create a new dataset.
To create a dataset owned by an organization that you administer, navigate to the organization's administrator panel, and under Datasets, choose to create a new dataset.
For step-by-step guides for creating and editing datasets, see the accompanying Create & manage datasets guide.
When first creating a dataset, you will be able to specify certain access rules for the dataset. However, these access rules will only take effect once the dataset is published, and can be modified before the dataset's publication.
While in the unpublished state, only dataset editors will be able to view the dataset, and the dataset won't be accessible to other users nor will it show up in any search results.
If a dataset is owned by a user, that user has full ability to edit and modify the dataset. If a dataset is owned by an organization, all administrators of the organization will have such rights.
You can also add other editors to the dataset by selecting Configure access at the top right of the dataset, and adding the specific individuals as editors. Editors will be able to upload data, modify metadata, and release new versions of the dataset, but they cannot modify any of the dataset's access rules or change the dataset's published status.
Dataset editors will be able to find this dataset in their workspace on the Datasets tab.
Under the dataset settings, additional options for the dataset can be configured:
You can rename the dataset here (or by clicking on the dataset title within the dataset editor).
These notes are only visible to editors of the dataset, and may provide helpful documentation for administrative processes around maintaining the dataset.
Control whether the dataset is published. If a dataset is unpublished, only editors will have access to the dataset, though you can configure access for other users such that they'll gain access when the dataset becomes published.
All datasets are initially unpublished when created, and become published when the first version is released. If you ever need to quickly remove a dataset from circulation, you can unpublish the dataset.
[Organization datasets only]. Whether the dataset should be featured on your organization's landing page.
[Organization datasets only]. If your organization has configured a DOI provider, you can enable DOI issuance for your organization's datasets. Based on your organization's settings, this will either be default-enabled or default-disabled for new datasets.
If you enable DOI issuance on an existing dataset, DOIs will be back-issued for all non-deleted versions of the dataset. These DOIs will become permanent after 7 days.
If you disable DOI issuance on an existing dataset, any draft DOIs (less than 7 days old, or on an unreleased version) will be deleted. Any future versions will not be issued a DOI.
Learn more about dataset DOIs below.
You can permanently delete the dataset here. Once deleted, the dataset will no longer be discoverable, though it will still show up in users' workflows that reference the dataset, and bookmarked URLs and DOIs will still resolve to the dataset's landing page.
To ensure future reproducibility, dataset metadata and documentation are preserved upon deletion. However, all data will be fully expunged, and the dataset will no longer be queryable.
The dataset's public access level will be persisted in its deleted state – meaning that if the dataset was previously visible, it will still be visible (but less discoverable) once deleted. Additionally, any users who explicitly had overview or metadata access to the dataset prior to deletion will have their access persisted upon deletion.
DOIs are issued for all released versions of a dataset. For datasets with more than one version, a "canonical" DOI for the dataset will be issued as well.
When issued, DOIs remain in a draft state for 7 days, and will be deleted if the dataset's DOI configuration is disabled while still a draft. After 7 days, DOIs become permanent (this 7 day counter only begins once a version is released; unreleased versions always have a draft DOI).
If a dataset is not publicly visible, its DOI will be registered, but it won't be findable – meaning that web crawlers won't index the DOI, and it won't be part of search results on platforms like DataCite. In such a case, the dataset's metadata will be redacted as well (see below).
When a dataset or version is deleted, any draft DOIs will be deleted. All other DOIs will be moved to the registered (non-findable) state, and they will continue to resolve to appropriate "tombstone" page for the dataset.
DOIs are more than just persistent identifiers — they are accompanied by rich metadata that allows other tools to surface your datasets and link them to the broader body of academic work. Every field on the dataset page maps to corresponding entries in the DataCite metadata; you can view these metadata by clicking Metadata -> DataCite on the right of the "Overview" section of the dataset page. Redivis will automatically sync the DataCite metadata whenever your dataset is updated.
Redivis will only ever publish metadata to DataCite that is publicly visible. This means that if your dataset is completely hidden from public view, no metadata will be propagated to DataCite (including the dataset's name). Instead, a placeholder mentioning that "This dataset has been hidden from public view" will be used in place, and only the name of the dataset owner, and the dataset's contact information (if present), will be published.
While this situation isn't ideal, it allows DOIs to still be used for disambiguation when discussing the dataset, and provides a potential pathway for future investigators to contact the data owner.
Note that metadata associated with your dataset will automatically be updated if you change the public access level of your dataset.
Redivis creates a DOI for every version of a dataset, allowing for authoritative references and reproducibility of data used in researchers' work. Once a dataset has more than one version, Redivis will also issue a DOI for the dataset as a whole, whose metadata will reflect the latest version of the dataset. All version DOIs will point to this canonical DOI, and also link to each other (as previous / next versions) in the DataCite metadata. This structure improves discoverability and disambiguation when referencing versioned datasets.
If you've uploaded your DOI issuing credentials in the Settings tab, all new datasets will by default be issued a DOI upon publishing and on every version released. If you would like to disable issuing DOIs for a specific dataset you can do so in the Settings tab of that dataset, where you'll see a switch you can turn off to stop issuing DOIs. If the dataset is already published, it will keep the DOIs for the dataset and all published versions; these will not be removed. If you turn this switch back on at any point in the future, all versions will be back-issued DOIs.
Geospatial data on Redivis behaves quite similarly to tabular data: each feature is ingested as a single row within a table, alongside any metadata for this feature. This approach mirrors tools like PostGIS, R spatial features, and geopandas, allowing you to query and join your geospatial data, at scale.
Redivis supports importing geospatial data from several common GIS formats: geojson, shp, shp.zip, kml, and parquet. Parquet files with geospatial metadata (often referred to as GeoParquet) are the most performant and robust option, though as a newer standard, these files are less common. For other geospatial file types, Redivis first converts the file to a geojson representation (using the relevant ogr2ogr driver), and then imports the geojson into a table.
Each feature will be imported as one row, with the geometry column containing the WKT representation for that feature. Additional feature properties will be mapped to variables in your table, with any nested properties flattened using the "." separator.
Note that Redivis only supports 2-dimensional, unprojected (WGS84) geometries. Other projections might cause the import to fail, and any extra dimensions will be stripped during ingest. If you are uploading a .shp.zip that contains projection information, the geometries will automatically be reprojected as part of the import process.
.parquet: GeoParquet
.geojson: GeoJSON
.geojsonl, .ndgeojson, .geojsons: Same as the .geojson specification outlined above, except each feature is given its own line. Importing .geojsonl (as opposed to .geojson) will be significantly faster.
.kml: Keyhole Markup Language
.shp: Shapefile
.shp.zip: Zipped ESRI shapefile directory
In addition to uploading geospatial data using one of the formats listed above, you can also import geographic data encoded within a text-delimited file (e.g., a csv). In this case, the geographic data should be encoded as strings using the Well-Known Text (WKT) representation; for example, a cell might contain POINT(-122.42 37.77). This is also the same format used when exporting geography variables as a CSV. WKT in CSVs will be auto-detected during data ingest.
Limits for upload file size, max variables, and other parameters are specified here.
You can inspect an individual file by clicking on its name to launch the file viewer. Any file can be downloaded or referenced within a notebook, and many file types can be previewed directly within Redivis, including:
3D models
Audio files
CIF + PDB files (molecular + protein structures)
FITS files (common in astronomy)
DICOM
HDF5
HTML
Images
PDFs
Videos
Text/code
TIFFs
ZIPs
TEI
All files are assigned to a "folder" within the dataset. You can click on the folder name in the right grey bar to filter this dataset's files by the folder they are in.
Each folder has a corresponding index table that is present on the Tables tab of the dataset. These will match the folder name and have a File index label.
In these index tables, each row of the table represents a file in that folder.
The characteristics of a variable help researchers understand what that variable measures and how it was collected. All characteristics (except for variable type) are indexed by the Redivis search engine, so complete and accurate metadata will help researchers find your dataset!
Histogram: A frequency chart of observations, sorted into 64 buckets, clamped to three standard deviations of the mean on either side. To switch between a linear and logarithmic y-axis, click on the bottom left corner of the chart. Only shown for continuous variables with more than 64 distinct values.
Box plot: A visual display of the distribution of values by frequency, showing the minimum, 25th percentile, median, 75th percentile, and maximum value. To include or exclude outliers in the calculations, click on the label on the bottom left corner of the chart. Only shown for continuous variables with a meaningful number of discrete values.
Frequency table: A table showing the frequency of common values for that variable, limited to 10,000 values. If a variable's values are highly heterogenous, no frequency table will be displayed. You can right click on values to get a quick entry point to filter on that value in the query tab.
Map: For variables of type geography, a heatmap showing the geospatial dispersion of that variable's values.
To edit a variable's label, description, or value labels, click the Edit metadata button on the right of any table. This will allow you to edit any of the variables in this table, or you can navigate to the "All tables" tab to edit the same variable across multiple tables.
In the metadata editor, you can edit a variable's metadata fields as you would in any spreadsheet, and save changes when you're done.
All variables in Redivis have a type associated with them. All types in Redivis support NULL values; that is, an empty cell.
A string can be used to store any textual data (UTF-8 encoded). Moreover, a string is a "universal" data type — any other type can be converted to a string without error.
Size: 2 bytes + UTF-8 encoded string size (1 byte for ASCII characters).
A 64-bit signed integer. Supports any integer value between -9,223,372,036,854,775,808 and 9,223,372,036,854,775,807.
Size: 8 bytes per cell.
A 64-bit (double precision) floating point number.
Size: 8 bytes per cell.
A representation of either TRUE or FALSE.
Size: 1 byte per cell.
Represents a calendar date independent of time zone. Supports any value between 0001-01-01 and 9999-12-31.
Size: 8 bytes per cell
Format:
'YYYY-[M]M-[D]D'
YYYY: Four digit year
[M]M: One or two digit month
[D]D: One or two digit day
Represents a year, month, day, hour, minute, second, and subsecond, independent of time zone. Supports any value between 0001-01-01 00:00:00 and 9999-12-31 23:59:59.999999.
Size: 8 bytes per cell
Format:
'YYYY-[M]M-[D]D( |T)[H]H:[M]M:[S]S[.DDDDDD]'
YYYY: Four digit year
[M]M: One or two digit month
[D]D: One or two digit day
( |T): A space or a T separator
[H]H: One or two digit hour (valid values from 00 to 23)
[M]M: One or two digit minutes (valid values from 00 to 59)
[S]S: One or two digit seconds (valid values from 00 to 59)
[.DDDDDD]: Up to six fractional digits (i.e. up to microsecond precision)
Represents a time, independent of a specific date. Supports values between 00:00:00 and 23:59:59.999999.
Size: 8 bytes per cell
Format:
'[H]H:[M]M:[S]S[.DDDDDD]'
[H]H: One or two digit hour (valid values from 00 to 23)
[M]M: One or two digit minutes (valid values from 00 to 59)
[S]S: One or two digit seconds (valid values from 00 to 59)
[.DDDDDD]: Up to six fractional digits (i.e. up to microsecond precision)
Geography values represent points, lines, and polygons on the Earth's surface (see the geospatial data section above).
Size: cell-content dependent
Format: WKT (Well-Known Text) string
Some types can be implicitly converted within a query (e.g., 1 (integer) < 2.2 (float)); in other circumstances, you will need to explicitly convert the type of a variable before performing certain operations (e.g., "1.0" (string) < 2.2 (float)).
Redivis supports a wide variety of data types, sources, and upload methodologies. When beginning to upload data, you should first ask:
What type of data is it?
Where is the data?
A computer
Google Drive
Box
Google Cloud
AWS
Another table on Redivis
Where is the metadata?
Embedded in the file
In another file
How do I want to perform the upload?
Through the browser (default)
Redivis will try to detect the file type based on the file extension, though you can manually specify the type as needed.
Uploading compressed (gzipped) files:
Generally, you should upload uncompressed data files to Redivis, as uncompressed files can be read in parallel and thus upload substantially faster.
If you prefer to store your source data in a compressed format, Avro, Parquet, and ORC are the preferred data formats, as these support parallelized compressed data ingestion at the row level.
Redivis will decompress text-delimited files, though the data ingest process may be substantially slower. If your file is compressed, it must have the .gz file extension if you're uploading locally (e.g., my_data.csv.gz), or have its Content-Encoding: gzip header set if served from a URL or cloud storage location.
Most upload types are limited to 5TB per upload. Stata, SPSS, XLS, and Shapefile ZIP directories are limited to 100GB.
A text-delimited file is a file that uses a specific character (the delimiter) to separate columns, with newlines separating rows.
Must be UTF-8 encoded (ASCII is a valid subset of UTF-8)
Quote characters in cells must be properly escaped. For example, if a cell contains the content:
Jane said, "Why hasn't this been figured out by now?"
it must be encoded as:
"Jane said, ""Why hasn't this been figured out by now?"""
The quote character must be used to escape the quote character. For example, the sequence \" is not valid for an escaped quote; it must be "".
Empty strings will be converted to null values.
Specifies whether the first row is a header containing the variable names. This will cause data to be read beginning on the 2nd row. If you don't provide a header in your file, variables will be automatically created as var1, var2, var3, etc...
By default, an upload will fail if an invalid record is encountered. This includes a record that has a mismatched number of columns, or is otherwise not parsable. If this box is checked, the number of skipped records will be displayed on each upload once it has been imported.
Whether to allow rows that contain fewer or more columns than the first row of your file. It is recommended to leave this option unchecked, as jagged rows are generally a sign of a parsing error that should be remedied by changing other options or fixing the file.
Whether newlines exist within specific data cells (e.g., paragraphs of text). If set to Auto, Redivis will determine the value based on analysis of the beginning of the file.
It is best to only set this value to "Yes" if you know your data contain line breaks, as this will slow down the import and may cause incorrect error reporting.
The delimiter will be auto-inferred based upon an analysis of the file being uploaded. In rare cases, this inference may fail; you can specify the delimiter to override this inference.
Specify the character used to escape delimiters. Generally "
, though some files may not have a quote character (in which case, they must not include the delimiter within any cells).
Cells containing a quote character must have that character escaped.
Typically, the escape sequence character is the same as the quote character, but some files may use a different value, such as a backslash (\).
Naming variables
Variable names are automatically inferred from the source data. They can only contain alphanumeric or underscore characters, and must start with a letter or underscore. Any invalid characters will be replaced with an underscore (_
).
Variable names must be unique within the table. If the same variable is found more than once in any given file, it will automatically have a counter appended to it (e.g., "variable_2").
The max number of characters for a variable name is 60. Any names with more characters will be truncated.
Variable type inference
Please note the following rules:
If all values of a variable are null, its type will be string
Numeric values with leading zeros will be stored as string
in order to preserve the leading zeros (e.g., 000583
)
Data stored with decimal values will be stored as a float
, even if that value is a valid integer (e.g., 1.0
).
Temporal data types should be formatted using the canonical types below. Redivis will attempt to parse other common date(time) formats, though this will only be successful when the format is unambiguous and internally consistent.
Date: YYYY-[M]M-[D]D
DateTime: YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.DDDDDD]]
Time: [H]H:[M]M:[S]S[.DDDDDD]
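For example, a quick way to normalize temporal columns before upload is to reformat them with pandas. This is a sketch; the file and column names are hypothetical:

```python
import pandas as pd

df = pd.read_csv("my_data.csv")

# Hypothetical columns: 'visit_date' holds dates, 'admitted_at' holds datetimes.
df["visit_date"] = pd.to_datetime(df["visit_date"]).dt.strftime("%Y-%m-%d")
df["admitted_at"] = pd.to_datetime(df["admitted_at"]).dt.strftime("%Y-%m-%d %H:%M:%S")

df.to_csv("my_data_clean.csv", index=False)
```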
You can create up to 500 uploads per table, per version. Files will automatically be appended to each other based on their variable names (case insensitive), with the goal of creating one continuous table with a consistent schema.
If a variable is missing in some of the files you uploaded, the values for the missing variable will be set to null for all rows in the upload.
Conflicting variable types
A file may fail to import for several reasons; in each case, Redivis endeavors to provide a clear error message to help you fix the error.
When transferring a file from your computer (or more rarely, from other import sources), there may be an interruption to the internet connection that prevents the file from being fully uploaded. In these cases, you should simply try uploading the file again.
The Redivis data import tool has been built to gracefully handle a wide range of data formats and encodings. However, errors can still occur if the source data is "invalid"; some common problems (and their solutions) are outlined below.
If you're still unable to resolve the issue, please don't hesitate to reach out to support@redivis.com; we'd be happy to assist!
Some SQL databases and tutorials will generate invalid CSV escape sequences by default; specifically, they escape embedded quote characters with a backslash (\") rather than doubling them.
The "proper" escape sequence is a doubling of the quote character ("").
If you only have access to the invalid file generated by a previous database dump, you can specify a custom Quote Character of \
in the advanced import options, and Redivis will reformat the file as part of the ingest process (Redivis will also auto-detect this custom escape sequence in many scenarios). Using a custom escape sequence may cause data import processing to take a bit longer.
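If you'd rather repair such a file yourself before uploading, a minimal sketch might look like the following. It assumes the only backslash escape in the file is \" (as produced by a typical default database dump); the file names are hypothetical.

```python
# Minimal sketch: rewrite backslash-escaped quotes as doubled quotes.
# Files that also escape backslashes or other characters need more careful handling.
with open("dump.csv", "r", encoding="utf-8") as src, \
     open("dump_fixed.csv", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line.replace('\\"', '""'))
```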
If your data has paragraphs of text within a particular data cell, and the "Has quoted newlines" advanced option isn't set, the data import may fail. Redivis will automatically set this option to true if it identifies a quoted newline in the top ~1000 records of the file, but if quoted newlines don't occur until later, you'll need to set this option manually for the import to succeed.
While rare, it is always possible that data transfers will be interrupted by the vagaries of networking. If this happens, we recommend simply retrying your upload. If the problem persists, please reach out to support@redivis.com.
This format is a modern standard for working with column-oriented geospatial data. If available, it is the most robust and performant way to ingest geospatial features into Redivis.
Assumes an object with a "Features"
property, containing an array of valid geojson features.
Each feature will be imported as one row, with additional properties mapped to columns in the table. Nested properties will be flattened using the .
separator.
Note that Redivis only supports 2-dimensional, unprojected (WGS84) geometries. Other projections might cause the import to fail, and any extra dimensions will be stripped during ingest.
Will be internally converted to .geojson, and then imported as specified above.
Will be internally converted to .geojson, and then imported as specified above. Note that the shapefile must use the WGS84 (aka EPSG:4326) projection.
If you have additional files associated with your shapefile (e.g., .shx, .proj, .dbf), create a .zip of this folder and import according to the .shp.zip
specification below.
Many shapefiles will be collocated with additional files containing metadata and projection information. These files are often essential to parsing the shapefile correctly, and should be uploaded together.
To do so, create a .zip directory of the folder containing your .shp file and supplemental files. The zip file must end in .shp.zip
. These will then be converted to .geojson, and imported as specified for the .geojson format.
If projection information is available, the source geometries will be reprojected into WGS84. If no projection information is available, your data must be projected as WGS84, or the import will fail. Note that only one layer can be imported at a time. If you have a directory containing multiple shapefiles, create a separate .shp.zip for each layer.
Files are data entities uploaded to datasets on Redivis. Files are used to store non-tabular (a.k.a. unstructured) data of any file type. You can view a dataset's files by clicking on the Files tab of any dataset or in a workflow.
If you have a file type that you think should be supported that isn't, please let us know!
You can view examples on the Redivis Demo organization.
By representing files within an index table, we can query and subset the files while joining them with other tabular metadata. In this way, file index tables behave like any other tabular metadata on Redivis.
To do deeper analysis of files, we can work with them programmatically. Consult the documentation for your preferred programming language to learn more:
Variables are present in all tables on Redivis. Conceptually, they represent something that is being measured in the data. When viewed in the table cells view, variables are represented by the table's columns.
In order to view variables, you must have access to the source dataset(s) for the given table.
These characteristics will only be considered in search results if the user has access to the underlying dataset.
Redivis automatically computes certain statistics for each variable, depending on that variable's type and number of distinct values. To view summary statistics, click on a variable in a table.
The box plot uses BigQuery's approximate quantile computation, which will estimate quantiles for larger datasets. As such, the box plot values should only be used as an approximation.
When working in bulk, you can also .
You can export your metadata at any time (in either of the above formats) by clicking Export file in the variable metadata editor. Variable metadata is also available in JSON format via the .
A 64-bit (double precision) decimal value.
All values in brackets are optional. If your Date is not in this format, you can convert it using a transform. Redivis will attempt to convert other common date formats, when possible.
All values in brackets are optional. If your DateTime is not in this format, you can convert it using a transform. Redivis will attempt to convert other common datetime formats, when possible.
All values in brackets are optional. If your Time is not in this format, you can convert it using a transform. Redivis will attempt to convert other common time formats, when possible.
Represents a Point, MultiPoint, LineString, MultiLineString, Polygon, or MultiPolygon geometry.
You can create or manipulate geographic variables using various new variable methods.
A text representation is used to show cell values, though each geography cell can also be viewed on a map by clicking on that cell within a table.
Variable types can be converted within the interface in a transform. The following conversions are supported:
Getting data into Redivis is the first and most important step in making it accessible for your research community. Redivis is designed to make it easy for you to securely ingest data, at scale. To begin the data upload process, first navigate to the dataset editor.
For a guided walkthrough of uploading data to a dataset, please see the guide.
For a guided walkthrough of how to clean tabular data, please see the guide.
In order to create new tables, or update existing tables with new data, you will need to ingest a valid tabular data file from a supported source. Redivis supports numerous tabular data formats, with robust error handling and type inference. You can also perform scripted imports via the client libraries and REST API.
Tables on Redivis can be made up of one or more tabular file uploads. The multiple-upload case applies when data with the same general schema is broken up across files. For example, a dataset where there is a separate file for each year. In general, it's best to combine such files into a single table, as it is easier for researchers to query a single "tall" table than multiple tables (and Redivis's high-performance query engine keeps these queries quick, even at terabyte scale).
Full documentation on the limits for upload file size, max variables, and other parameters is available in the Redivis limits documentation.
All values of a variable must be compatible with its type. Redivis will automatically choose the most specific valid type for a variable, with string being the default type.
If files have conflicting types across a given variable, the lowest-denominator type for that variable is chosen when the files are combined.
Data invalidity is most common when uploading text-delimited files, though it can happen with any file format. While some data invalidity errors may require further investigation outside of Redivis, others may be due to incorrect options provided in the file upload process. When possible, Redivis will display ~1000 characters that are near the error in the source file, allowing you to identify the potential source of failure.
file_id
A unique, system generated identifier for the file. Use the file_id to reference and download specific file(s) in your analysis.
file_name
The name of the file
size
Size of the file in bytes
added_at
Timestamp for when the file was added to this folder
md5_hash
MD5 checksum of the file contents, encoded as a base64 string
Field
Notes
Name
The name of the variable, limited to 60 characters. Must be unique within the table. Variable names are case-insensitive, can only use alphanumeric characters and underscores, and cannot start with a number.
Type
Required. The data type of the variable. Learn more about variable types.
Label
Optional. A short, human-readable description of the variable name. Limited to 256 characters.
Description
Optional. A longer space for notes about the variable's creation methods, coding, or supplementary information. Limited to 5000 characters.
Value labels
Optional. A map of each value in the data to a longer string of information. For example, a variable with records 0
, 1
, and 2
might have value labels for 0 = No
, 1 = Yes
, 2 = Don't know
.
Value labels will be shown inline on the cells view of a table, alongside the frequency table values in the univariate statistics.
Statistic
Description
Count
The number of non-null values for this variable.
Distinct
The number of unique values in the table. Does not include null as a value. (For example, a variable with values 0, 1, and null will have a distinct of 2.)
Non-null
The percentage of values in the table which are not null. Calculated by dividing the Count by the total number of records in the table.
Min, max
For continuous variables (integer, float, date, dateTime, time): the minimum and maximum of this variable.
For string variables: the minimum and maximum length of all values.
Min lng, Min lat, Max lng, Max lat
For geography variables: the bounding box containing all geospatial data in a variable
μ, σ
The mean and sample (unbiased) standard deviation of the variable. Only available for continuous variables (integer, float, date, dateTime, time).
Integer
Float
Returns a close but potentially not exact FLOAT64 value.
Integer
Boolean
Returns FALSE
if x
is 0
, TRUE
otherwise.
Float
Integer
Returns the closest INT64 value. Halfway cases such as 1.5 or -0.5 round away from zero.
Float
String
Returns an approximate string representation.
Date
DateTime
Returns a DateTime at midnight on the corresponding date. For example, if x
is 1970-01-01
, returns 1970-01-01 00:00:00
DateTime
Date
Returns the part of the DateTime which is the calendar date. Note that this will not round the DateTime to the nearest Date. E.g., if x
is 1970-01-01 23:59:59
, returns 1970-01-01
.
DateTime
Time
Returns the part of the DateTime which is the clock time. E.g., if x
is 1970-01-01 23:59:59
, returns 23:59:59
.
Boolean
Integer
Returns 1
if x
is TRUE
, 0
otherwise.
Boolean
String
Returns "true"
if x
is TRUE
, "false"
otherwise.
String
Float
Returns x
as a FLOAT64 value, interpreting it as having the same form as a valid FLOAT64 literal.
Also supports casts from "inf"
, "+inf"
, "-inf"
, and "nan"
.
Conversions are case-insensitive.
String
Boolean
Returns TRUE
if x
is "true"
and FALSE
if x
is "false"
All other values of x
are invalid and throw an error instead of casting to BOOL.
STRINGs are case-insensitive when converting to BOOL.
String
Geography
Use the ST_GEOGFROM new variable method.
Geography
String
Use the ST_ASTEXT new variable method.
.csv, .tsv, .psv, .dsv, .txt, .tab, *
Delimited
Redivis will auto-infer the delimiter, or you can specify it manually. This is also the default format for files with missing file extensions.
See working with text-delimited files
.jsonl, .ndjson
JSON-lines (a.k.a. Newline-delimited JSON)
A newline-delimited list of JSON objects, with one object on each line. Each object's keys correspond to the variable names in the table.
.json
JSON
Must be a JSON array of objects, where each top-level object represents one row in the table. The keys of each top-level object correspond to the variable names in the table.
Importing newline-delimited JSON (see above) will be significantly faster and is recommended for larger files. If your file is formatted as GeoJSON, but has the ending .json, make sure to explicitly choose "GeoJSON" as the file format.
.avro
Avro format
Compressed data blocks using the DEFLATE and Snappy codecs are supported.
Nested and repeated fields are not supported.
.parquet
Parquet format
Nested and repeated fields are not supported.
.orc
Orc format
Nested and repeated fields are not supported.
.sas7bdat
SAS data file
Default formats will be interpreted to the corresponding variable type, and variable labels will automatically be imported.
User defined formats (.sas7bcat) are not supported.
.dta
Stata data file
Variable labels and value labels will automatically be imported.
.sav
SPSS data file
Variable labels and value labels will automatically be imported.
.xls, .xlsx
Excel file
Only the first sheet will be ingested. The legacy .xls format will have all dates and times represented as dateTimes. Due to the variability of Excel files, and inconsistencies in how Excel internally represents dates, this format is typically not recommended if other options are available.
Google sheets
Sheets file stored in Google Drive
Only the first tab of data will be ingested.
In addition to uploading data through the browser interface, you can leverage the redivis-python library, as well as the generic REST API, to automate data ingest and data release pipelines. These libraries can be used for individual file uploads similar to the interface, as well as for streaming data ingest pipelines.
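As a rough sketch, a scripted upload with redivis-python might look something like the following. The owner, dataset, table, and file names are hypothetical, and the exact method names and parameters shown here are assumptions; confirm them against the redivis-python documentation.

```python
import redivis

# Hypothetical owner / dataset / table names.
dataset = redivis.user("my_username").dataset("my_dataset")

# Assumed workflow: create (or reuse) an unreleased next version, add an upload
# to a table, then release the version. Method names and parameters are assumptions.
dataset.create_next_version(if_not_exists=True)
table = dataset.table("my_table")

with open("my_data.csv", "rb") as f:
    table.upload("my_data.csv").create(f, type="delimited")

dataset.release()
```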
Consult the complete client library documentation for more details and additional examples.
Redivis determines variable names and types during data upload. Additionally, it will automatically parse certain metadata based on the uploaded file format:
SAS (.sas7bdat): labels
Stata (.dta): labels and value labels
SPSS (.sav): labels and value labels
For other file types (e.g., csv), you will need to augment the metadata directly. To apply metadata in bulk, you can upload a file containing metadata information directly from your computer. This file can either be a CSV or JSON.
The CSV should be formatted without a header, with each row corresponding to a variable, with column 1 as the name
, 2 as the label
, 3 as the description
. If the variable doesn't have a label or description, leave these columns empty.
For example:
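A small illustration, following the column order described above (the variable names, labels, and descriptions are made up; note the empty label and description cells):

```
age,Participant age,Age in years at the time of enrollment
income,Annual household income,
zip_code,,Five-digit ZIP code stored as a string to preserve leading zeros
```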
When uploading a JSON file, specify the name
, label
, description
, and valueLabels
using the appropriately named attributes in the object corresponding to each variable. If the variable doesn't have a label, description, or value labels, you don't need to include these attributes.
For example:
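An illustrative example is below. The variable names are made up, and the exact shape shown for valueLabels (a value-to-label map) is an assumption; confirm it against the metadata documentation.

```json
[
  {
    "name": "age",
    "label": "Participant age",
    "description": "Age in years at the time of enrollment"
  },
  {
    "name": "smoker",
    "label": "Smoking status",
    "valueLabels": { "0": "No", "1": "Yes", "2": "Don't know" }
  }
]
```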
All datasets on Redivis are automatically versioned. Any change to a dataset's data in tables or files will require a new version. A version is a locked copy of the data, supporting future reproducibility and confidence in the persistence of researchers' data workflows.
Changes to documentation and metadata do not create a new version, though different versions do have independent documentation and metadata. For example, if a new version contains a new table, or new data, you will likely want to document this information separately from the previous version. However, if you only want to enrich the existing metadata on the current version, you can do so without creating a new version.
If you are a dataset editor and have new data, find a mistake to correct, or would otherwise like to modify the existing content of a dataset, you can create and release a new version.
To help researchers better understand the differences across versions, Redivis uses semantic versioning, of the form v[major].[minor].
The first version of every dataset is v1.0
. For subsequent versions, the tag will augment automatically depending on the changes being released.
Major update: Existing code may not run.
Triggered when variables in the new version are renamed, deleted, or retyped. Also occurs if any tables from the previous version were deleted.
Minor update: Existing code will generally run.
Changes are limited to adding / removing records, recoding variables, adding variables, and adding tables.
On any dataset page, you can view the current version tag next to the dataset title, and click on this tag to view the full version history and switch to a different version of the dataset.
Within a workflow, you can change the version of a dataset by selecting the dataset node and clicking the Version button at top right. If there is a new version available, the dataset node will be highlighted to indicate that you might want to upgrade.
When it's time to update a dataset's data, you'll want to create a new version. To do this, navigate to the dataset editor and click Create next version.
Before this version is released, it will be tagged as next
. Only dataset editors will be able to see the next
version on the dataset page and use it in their workflows.
All versions of a dataset contribute to that dataset's total size, which in turn will count towards your usage quotas or organization billing (depending on whether the dataset is owned by you or an organization).
This total size will be displayed in the dataset editor, alongside the size for the current version. For datasets with one version, this total size may be slightly larger than the current version, as Redivis stores certain metadata behind the scenes to support future versioning.
As new versions are created, Redivis will efficiently compute a row-level difference between the versions — only additions and updates to existing data will contribute to the dataset's total storage size, preventing data that is consistent across versions from being double-counted.
Adding or removing columns (or changing column types) won’t affect row uniqueness, as the underlying storage represents all values as strings. Only the storage size of the new column would be added.
If the most recent version of a dataset has been released in the last 7 days, and there is no Next version already created, you'll have the option to unrelease it.
This will revert the dataset to the exact state it was in before the version was released. If anyone who is not a dataset editor has this version in a workflow, they will lose access to the data, though can revert to a previous version if it exists.
Any version of a dataset can be deleted as long as it is not the most recent released version. Deleting this version will permanently delete all metadata and data associated with it.
This version will no longer be available in any workflows, and researchers will lose access to any data referencing that table in their workflows. In order to continue with the dataset, researchers will need to change their analyses to a non-deleted version.
Deleting a version is permanent and can't be undone.
If you are deleting versions to reduce storage costs, be aware that Redivis stores data efficiently across versions – the storage used by a particular record will be deleted only if it is unique to the deleted version (or, if deleting a series of versions, if that record doesn't exist in any non-deleted version).
When a dataset is first created it will be marked as unreleased.
As you work on the initial version of the dataset, you can see the changes you make to the data in the version history modal. In this unreleased state you can add it to a workflow and analyze the data it contains. Note that the only people who can see it in the workflow are dataset editors. If you make changes to the data or delete it, your workflow will change instantly to reflect those changes.
Once you are ready to make your data available to non-editors, you will need to publish it. This will simultaneously release your version as well as make your dataset discoverable. As soon as it is published it will be accessible to all who meet its access rules.
If this is an organization dataset, it can be unpublished at any time. This will "unlist" the dataset so that it will not be listed on your organization home page and even users who meet the access requirements will not be able to see it. Anywhere it is used in a workflow it will instantly be made unavailable.
Unpublishing might be used if you need to temporarily halt usage of the dataset but don't want to disrupt all of the access rules.
When analyzing data in a workflow you can change any dataset to a previous version to instantly change every table to the corresponding table in that version. To do this, click on the dataset and click the Version button in the toolbar. Select the version you'd like to analyze and confirm. You can always switch versions at any point.
Normally you can only have one copy of a dataset in a workflow, but it's possible to add a second version if you'd like to compare them. In any workflow, click the + Add data button and locate the dataset. Right click on the dataset and choose to add another version to the workflow.
You can create folders on the grey bar on the right side of the Files tab in a dataset. Folder names must be unique. You can click on the ⋮
menu next to any folder name to manage that folder.
To move files between folders you can right click on an individual file to change its folder. If you have multiple folders you can click the Move files button below files in the right bar to move files between folders.
To configure sampling on your dataset, click the Configure sample button on the Tables tab of a dataset page.
As a general rule, you should only use random samples if:
You have one table in your dataset, or
Researchers won't be joining multiple tables in your dataset together
If this isn't the case, consider sampling on a specific variable. Otherwise, as researchers join different tables together, they will start getting samples of a sample, since there is no consistent cohort of records between tables.
For situations when you want researchers to be able to join tables within your dataset, consider generating a sample on a variable that exists in at least some of the tables in your dataset. Every value for this variable will have a 1% chance of being in the output set.
Importantly, this sampling is deterministic. This guarantees that the same values that fall in the 1% sample for one table will also occur in the 1% sample for another table in the same dataset. In fact, these sampled values will be consistent across Redivis, allowing researchers to even merge samples across datasets.
When sampling on a variable, only tables with that variable will be sampled. This is useful for the case when some tables contain supplementary information to your primary cohort. For example, consider the case when your dataset has a "Patients" table, a "Hospitalizations" table, and a "Hospitals" table. We'd likely want to create a sample on the patient_id
variable, which would create a 1% subset of patients and the corresponding hospitalizations for those patients. However, this wouldn't create a sample on the "Hospitals" table — which is what we want, given that the sample of patients is still distributed across a large number of hospitals.
If only some of the dataset's tables are sampled, users with sample access to the dataset will have data access to the sampled tables as well as full data access to the unsampled tables. While this is likely necessary for researchers to meaningfully work with the dataset sample (see paragraph above), it may have ramifications for how you configure your access rules.
When you enable a data source, you'll be prompted to log into the corresponding account. Redivis will only ever read data from these sources when explicitly requested, and it will never modify or overwrite content.
Once configured, you'll see any added data sources appear as an option when uploading data.
You may import any object that you have read access to in GCS by specifying a bucket name and path to that object, in the form /my-bucket/path/to/file
. You may import multiple objects at once by providing a prefix followed by wildcard characters, e.g.: /my-bucket/my-folder/*
.
The following wildcard characters are supported:
* : Match any number of characters within the current directory level. For example, /my-bucket/my-folder/d*
matches my-folder/data.csv
, but not my-folder/data/text.csv
** : Match any number of characters across directory boundaries. For example, my-folder/d**
will match both examples provided above
? : Match a single character. For example, /my-bucket/da??.csv
matches /my-bucket/data.csv
[chars] : Match any of the specified characters once. For example, /my-bucket/[aeiou].csv
matches any of the vowel characters followed by .csv
[char range] : Match any of the range of characters once. For example, /my-bucket/[0-9].csv
matches any number followed by .csv
You may import any object that you have read access to in S3 by specifying a bucket name and path to that object, in the form /my-bucket/path/to/file
. You may import multiple objects at once by providing a prefix followed by a wildcard character, following the same syntax and rules as outlined for Google Cloud Storage above.
You may import any file of valid format that you have stored within your Drive, including Google Sheets. Upon choosing Google Drive as your import source, a modal will open that will allow you to browse and select files from your Google Drive.
You may import any table that you have read access to in BigQuery, including views, materialized views, and external tables. You must specify the table in the form project.dataset.table
. To import multiple tables within a dataset, you may use wildcards. E.g., project.dataset.*
or project.dataset.prefix*
.
You may import any file of valid format that you have stored within Box. Upon choosing Box as your import source, a modal will open that will allow you to browse and select files from your Box account.
Coming soon. Please contact support@redivis.com if this integration would be helpful for your use case so that we can prioritize it.
You must have both data access to, and the ability to export, any table or file that you import.
To export data to BigQuery, click Export, and then fill out the subsequent modal.
You'll need to authenticate a Google account, and then specify the BigQuery project, BigQuery dataset, and BigQuery table.
Once you confirm the export, your data will begin transferring to the specified BigQuery location, and a live-updating status of the transfer will be displayed.
In order to upload unstructured data to a dataset, you'll need to upload it as a file on the Files tab in the dataset uploader.
It is possible to upload tabular data as a file, but it will not be represented as a table. This means it is not possible to preview the variable statistics, cells, or query interfaces, use it in a transform node, or control access at different levels, and is generally not recommended.
From the Files tab of the dataset editor, click the "Upload files" button to import one or many files. You can import files from a variety of sources, or perform programmatic uploads via the client libraries.
Limits for upload file size and max files per dataset are listed in the Redivis limits documentation.
When preparing your upload you will need to select a destination folder. All files must be in one folder. These folders will have a corresponding file index table, which will allow researchers to more easily work with the files in bulk. If you are uploading many files, your folder structure will likely be important – for example, a different folder for different categories of imaging files.
For datasets with large tables, it is often a good idea to include a 1% sample of the data, supporting faster exploratory queries as new researchers work to understand your data. Moreover, if a sample is configured, you will have the ability to grant access to the sample separately from the full dataset.
Sampling is applied independently to each version of a dataset. You may modify the sampling methodology on a version at any time — even after it's been released — though keep in mind that this may affect researchers that are currently working with the dataset sample. As a best practice, it's good to configure and validate your sample before releasing the version.
The simplest form of sampling, this will create a corresponding sample for every table in the dataset (including file index tables). Every record / file will have a 1% chance of occurring in the sample.
If your dataset contains files, you probably want to sample on either the file_name or file_id variables.
Learn more about controlling sample access in the access documentation.
By default, you may upload data from your local computer, a public URL, or from another dataset or workflow on Redivis. However, Redivis supports numerous integrations for data ingest across common sources. You'll need to enable data sources in your settings in order to import data they contain.
You can import any table, which can be particularly helpful with ETL workloads where you want to import a cleaned version of your data. You can also import any file into your table, supporting workflows where tabular data is initially loaded as a file before being loaded into a table.
You can reference any table on Redivis using the form user|organization.dataset|workflow.table
. That is, specify its owner (a user or organization), its containing entity (a dataset or workflow), as well as the table name, separated by periods.
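For instance, a hypothetical reference (all names made up):

```
demo_organization.example_dataset.example_table
```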
All files on Redivis belong to a file index table. To import a file, first specify the index table, followed by a forward slash (/
), and then the file name. E.g.:
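A hypothetical reference (the owner, dataset, index table, and file names below are made up):

```
demo_organization.example_dataset.image_files/scan_001.png
```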
To import multiple files at once, you can use wildcard characters, following the same pattern rules as specified for Google Cloud Storage imports above. E.g.:
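For instance, a hypothetical wildcard reference matching every .png file within one folder (names made up):

```
demo_organization.example_dataset.image_files/scans/*.png
```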
The table export modal Integrations tab allows export of Redivis tables to Google BigQuery.
Redivis is not a closed ecosystem — its foundation is based on open APIs and standards, and it is designed to work and integrate with the wide variety of open source (and proprietary) analytical toolkits used by the research community.
Redivis supports data downloads in numerous formats, allowing you to quickly load a table into the software of your choice. Moreover, its integrations, Python and R client libraries, and REST API allow you to interact with and build upon Redivis data in the tools that are familiar and best suited to your use case.
You can interface with Redivis data programmatically using the Redivis R, Python, and Javascript client libraries, which allow external environments using these programming languages to connect to Redivis, facilitating integration with a wide variety of scientific libraries and paradigms.
These libraries leverage the underlying REST API, which can also be used if you want to connect to Redivis from another programming language.
Note that, if data exports are restricted, you might first need to gain approval to export the table before exporting to an external environment.
The Programmatic tab in the table export modal contains a code snippet to help you get started in working with Redivis data via Python. E.g.:
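A minimal sketch of what such a snippet looks like, using hypothetical owner, dataset, and table names (confirm the exact methods against the redivis-python documentation):

```python
import redivis

# Reference a table by its owner, dataset, and name (hypothetical values).
table = (
    redivis.organization("demo_organization")
    .dataset("example_dataset")
    .table("example_table")
)

# Load the table into a pandas DataFrame for analysis.
df = table.to_pandas_dataframe()
print(df.head())
```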
View the full redivis-python documentation here:
The Programmatic tab in the table export modal contains a code snippet to help you get started in working with Redivis data via R. E.g.:
View the full redivis-r documentation here:
The Redivis REST API provides an intuitive, resource-based approach to programmatically interfacing with Redivis. It opens up the ability to query, manipulate, and export data (and other Redivis resources) from within any programming language, supporting the workflows and tools that are familiar and best suited to your use case.
View the full REST API documentation here:
The table export modal Integrations tab allows export of Redivis tables to Google Cloud Storage.
To export data to Cloud Storage, click Export, and then fill out the subsequent modal.
You'll need to authenticate with a Google account, and then specify the Google Cloud Storage Bucket, Table name, and an optional prefix. In addition, you can choose one of several file types.
Once you confirm the export, your data will begin transferring to the specified Cloud Storage location, and a live-updating status of the transfer will be displayed.
The table export modal Integrations tab allows visualization and analysis of your Redivis tables in Google Data Studio. Data Studio provides comprehensive tooling for creating interactive visualizations and dashboards.
Using the Redivis Data Studio connector, data from your Redivis table is dynamically mirrored in Google Data Studio, for live updating of visual dashboards according to your latest data.
The sample dashboard below was generated using the US Fires dataset hosted on Redivis.
To create a new data source, navigate to the Redivis connector in Data Studio. You can also search for "Redivis" when creating a new data source to select the Redivis connector.
Click the authorize buttons to 1) allow your Google account to interface with the Redivis connector; and 2) allow Data Studio to access your data in Redivis.
Finally, specify the table that you want to import. If the table belongs to a workflow, you should provide the name of the workflow and the username of the workflow owner. If the table belongs to a dataset, you should provide the name of the dataset as well as the username of the dataset owner.
Data Studio allows you to share your report with other users, and even embed your report on external websites. If you are sharing restricted data, it is important that you set the "Data credentials" to "Viewer". This will enforce that every viewer of the report has access to the underlying Redivis table. If the credential is set to "Owner", the data in the report will be visible so long as you (the owner) have access to the data in Redivis.
Finally, dataset owners can assign export restrictions to their datasets. Certain tables may be bound by these export restrictions and cannot be imported into DataStudio (for example, if the data can only be exported to a specific compute environment, or if there are export size limits that the table exceeds). In this case, you can return to Redivis and request an export exception for the table if appropriate.
Data Studio provides a point-and-click interface that can allow you to develop elegant, interactive visualizations in minutes. You can leverage a comprehensive library of default visualizations, or tap in to the growing body of community visualizations.
Some researchers may want to dive right in, though we recommend taking a look at the Data Studio help documentation for some helpful resources on getting started.
Through the table export modal, you can easily export full data in several common file formats, as well as the table metadata.
In order to download a table on Redivis, navigate to that table on the dataset page or within your workflow. Click the Export table button at right to open the table export modal.
The export modal gives you options to download the table and its metadata in common formats, as well as using integrations or other programmatic interfaces. You can also export the entire SQL that was generated to create the table (see More menu in the top right of the workflow), facilitating reproducibility and reference in other systems.
If there are any restrictions around downloading the table or dataset, they will be listed here. If a restriction is preventing you from downloading the data, you will be able to apply for approval, if allowed.
You can download any table on Redivis assuming that: 1) you have data access to the table; 2) there are no export restrictions on the dataset, or your export is not in conflict with these restrictions.
To download the table's data, select your file type and click the Download button. Depending on the size of the table, and whether it was recently downloaded, it may take anywhere from a few seconds to several minutes to generate the file.
If this is an index table that contains unstructured data, you will also be able to download a ZIP of all the files represented within that table.
Redivis supports the following export formats:
CSV
JSON-lines
Avro
Parquet
dta (Stata)
sav (SPSS)
sas7bdat (via CSV + load script)
Exports are subject to the corresponding export quotas and limits.
As a general rule, be cautious when using Excel to open .csv files. Never re-save a .csv that’s been opened in Excel, as the data may be modified.
On every table, you can download the table's metadata as a structured JSON file, containing information about the table as well as all its variables with their names, types, labels, descriptions, and value labels (when present). The schema of this file is the same as that returned by the table.get API endpoint. Use the Download metadata tab of the table export modal, shown below.
If your table is in a workflow, you can also download a SQL file (via the More menu at the top right of the workflow) that contains the complete code used to generate your output. This code represents a full traversal of your workflow tree, and captures the exact statement necessary to generate the output table at that point in time.
Redivis workflows are high-performance, collaborative environments for analyzing data. In a workflow, you can combine datasets across Redivis, and build out your analytical workflow in a reproducible manner. Workflows can contain:
Data source nodes which contain data across Redivis that you want to work with and have access to. These are usually datasets and can also be other workflows.
Transform nodes to join, filter, aggregate, and reshape tabular data. Transforms are particularly important when working with large tables, allowing you to query billions of records in seconds.
Notebook nodes to leverage the variety of powerful tools available in Python, R, Stata, and SAS in a high performance compute environment.
Table nodes created as outputs of your work that can be used to validate your queries, build further analysis, create new Redivis datasets, or be exported to other environments.
Workflows can be shared with collaborators as a version controlled artifact of your research or to collaborate with others in real time.
Navigate to any table on Redivis and click the Export table button, then navigate to the Embed tab.
Here you can copy the embed URL and paste it into any site that supports embed links. You can also copy code to use for embedding this table on your own page.
At any point you can click the View on Redivis link in the bottom right corner to view the original table on its original Redivis dataset page.
You can also view this table fullscreen by clicking the fullscreen icon on the bottom right overlay.
Redivis can be embedded on any site that allows for embed links, which will allow someone to interact with the live table and underlying data (assuming proper data access), including viewing summary statistics, exploring the cells, and querying the data.
Public data will always be accessible when embedded to anyone viewing the page. However, if the data is restricted, normal data access rules will apply.
In order to view restricted data, users must have a Redivis account and be signed in. If they need to sign in, they can do so when viewing the dataset in its embedded context.
Once logged in, if they already have access to the table, it will become available to view and query. If not, they'll need to apply for access on the Redivis dataset page by clicking the Apply for access button.
Workflow documentation is viewed and edited from the workflow overview. To access the overview, click the workflow name in the middle of the black menu bar or click on the white space in the background of the workflow tree.
Rename your workflow by clicking the header at the top of the workflow information view. Workflow names will be visible anywhere this workflow is referenced.
You can add a short description of your workflow here that will be visible in workflow previews across Redivis, including if it is featured on an organization home page.
View and update information about this workflow's creators and contributors, citation, and any related identifiers detected by Redivis automatically or added by a workflow editor. This is also where you can issue a DOI for this workflow.
This section automatically defaults to displaying the owner of the workflow. Workflow editors can add or remove anyone from this list. Anyone included here will be added to the citation generated for this workflow.
This section automatically includes anyone who edited this workflow. Workflow editors can add or remove anyone from this list.
This section shows the automatically generated citation for this workflow in your chosen format. This can be copied or downloaded for use elsewhere.
Changes made to the "Creators" field will be reflected in this citation. Any DOI issued for this workflow will automatically be included in this citation.
This section automatically includes any datasets or workflows referenced by this workflow, including data sources, study collaborations, or the workflow this one was forked from. Workflow editors can add any related identifiers from outside of Redivis through links or DOIs, including DMPs, papers referenced, and more.
You can launch a bibliography which displays the citation of this workflow and every related identifier listed above.
Document the details of your research aim or data cleaning strategy. You can also embed links or images.
You can give other users access to view or edit your workflow, or transfer ownership to another user in the Sharing section. You can also set this workflow's visibility and discoverability. Anyone viewing your workflow will need to have gained data access to any restricted datasets to view the relevant node contents.
You can add your workflow to a study in order to facilitate collaboration with others. For certain restricted datasets, your workflow will need to be part of an approved study in order to run queries.
You can add up to 25 tags to your workflow, which will help researchers discover and understand it.
You can see the date of last workflow activity, and how often the workflow was forked or viewed (if it is a public workflow).
Any workflow editor can issue a DOI (Digital Object Identifier) for a workflow. A DOI is a persistent identifier that can be used to reference this workflow in citations. DOIs are issued through DataCite and do not require any configuration with your own or your organization's DataCite account.
Open the Provenance section and click Issue DOI. Once created, you will be able to see the DOI and view the record on DataCite.
When DOIs are issued they enter a "Draft" status where the identifier is assigned but it has not been permanently created. All DOIs issued for workflows will remain in this draft status for seven days to allow for removal of the DOI.
You can start referencing the DOI immediately while it is still in draft status since the final DOI will not change once it becomes permanent. After the seven day draft period the DOI will automatically become permanent if your workflow is set to be publicly visible.
Since DOIs are intended for public reference, they will not be issued for workflows that remain fully private.
The workflow overview has additional tabs that list all the nodes in the workflow. You can click on any node in one of these lists to navigate directly to it.
These lists can be filtered by their relative location in the workflow, variables they contain, and additional metrics.
A transform node in a Redivis workflow provides a powerful querying tool to reshape one or more tables into a new output table.
You will build your transform by choosing steps one at a time based on the actions you want to take on your data. At any point you can run the transform to execute each step sequentially and view the resulting table.
The most commonly used transform steps can:
Aggregate a table
Create variables based on existing data
Filter out rows you no longer need
Join additional tables
The transform interface translates your interface inputs to SQL code that you can view and edit. You can also intermix SQL code you've written between interface steps, or build an entire transform of your own SQL code if you wish.
As you work in a workflow, node colors and symbols will change in the tree view to help you keep track of your progress.
Empty
White background
A run has never been attempted.
Executed
Grey background
A run has successfully completed and no changes have been made since then.
Invalid
Black exclamation icon
A run is not possible. This might be because you haven't finished building the steps or have input invalid information.
Errored
Red exclamation icon
A run has finished unsuccessfully. This might be due to an incorrect input you've set that our validator can't catch. Or something might have gone wrong while executing and you'll just need to rerun it.
Edited
Grey hash marks
Changes have been made to the contents of the transform since it was last run. You can either Run this transform or Revert to its previously run state to resolve it. Editing a transform makes the downstream table stale.
Stale
Yellow background
An upstream change has been made. This means the content of the node does not match the content of the node above it.
Queued
Static circular icon
Nodes that are waiting for an upstream run to finish before running themselves. When you run multiple nodes they will be executed in a logical order. Once upstream dependencies are finished, queued nodes will automatically start running.
Running
Spinning circular icon
A run is currently in progress. You can cancel running (or queued) transforms by clicking the Run menu in the top bar and selecting Cancel. If a node is currently running, it may not be possible to cancel it, depending on what point in the process it's at.
Incomplete access
All black background, or dashed borders
You don't have full access to the node. Click on the Incomplete access button in the top bar to begin applying for access to the relevant datasets.
A data source node contains data across Redivis that you want to work with and have access to. These are usually datasets but can also be other workflows. These nodes display overview information about the dataset or workflow it represents, and a list of the tables it contains.
You can click on any table to view its contents, or click "Transform" to build a transform on it.
Adding a data source to a workflow will make a copy of that dataset or workflow in the form of a circular node at the top of your workflow tree. You can add data to the workflow by:
Click the + Add data button in the top left of the toolbar within a workflow.
Click the Analyze in workflow button on a dataset page.
Click the Fork button in the toolbar of any workflow.
No workflow can contain two copies of the same version of the same dataset. You can add a different version of the dataset to your workflow by right clicking on the dataset name in the + Add data modal.
Datasets are the most common Data source you will add to your workflow. They contain the data you want to work with in the original state curated by the data owner. Usually datasets will contain 1 or more tables you can choose to transform or analyze in a notebook.
Some large datasets have 1% samples which are useful for quickly testing querying strategies before running transforms against the full dataset.
If a 1% sample is available for a dataset, it will automatically be added to your workflow by default instead of the full dataset. Samples are indicated by the dark circle icon to the top left of a dataset node in the left panel and in the list of the dataset's tables.
All sampled tables in the same dataset will be sampled on the same variable with the same group of values (so joining two tables in the same dataset with 1% samples will still result in a 1% sample).
To switch to the full dataset, click the "Sample" button in the top right of the menu bar when you have a dataset selected.
Your downstream transforms and tables will become stale, since an upstream change has been made. You can run these nodes individually to update their contents, or use the run all functionality by clicking on the workflow's name in the top menu bar.
When a new version of a dataset is released by an administrator, the corresponding dataset node on your workflow tree will become purple. To upgrade the dataset's version, click the "Version" button in the top right of the menu bar when you have a dataset selected.
You can view version diffs and select whichever version you want to use here.
After updating, your downstream transforms and tables will become stale. You can run these nodes individually to update their contents, or use the run all functionality by clicking on the workflow's name in the top menu bar.
Workflows can be added to another workflow in order to build off existing analysis. You might want to continue an analytical pipeline that you've built elsewhere, or elaborate on someone else's analysis. You will have access to all tables this workflow contains.
As you work in a workflow, node colors and symbols will change in the tree view to help you keep track of your progress.
Dataset type
Dataset icon in the middle of the circle
Workflow type
Workflow icon in the middle of the circle
Sampled
Black circle with 1% icon
Outdated version
Purple background on version number
Incomplete access
All black background, or dashed borders
You don't have full access to the node. Click on the Incomplete access button in the top bar to begin applying for access to the relevant datasets.
Table nodes in your workflow can either be generated by running a transform, by running code in a notebook that outputs a table, or be part of a dataset you've added. The main purpose of table nodes in workflows is to store data which you can sanity check and further query.
Derivative tables in a workflow behave quite similarly to dataset tables throughout Redivis, where you can preview cells, view summary statistics, and run quick queries.
All table nodes also have an associated export interface to see different ways that you can use this data off of the Redivis platform.
All table nodes have one upstream parent. You can view the table's data and metadata similarly to other tables on Redivis. You cannot edit or update the metadata here.
You can create multiple transforms to operate on a single table, which allows you to create various branches within your workflow. To create a new transform, select the table or dataset and click the small + icon that appears under the bottom right corner of the node.
After you run a transform, you can investigate the downstream output table to get feedback on the success and validity of your querying operation – both the filtering criteria you've applied and the new features you've created.
Looking at the content of an output table allows you to perform important sanity checks at each step of your research process, answering questions like:
Did my filtering criteria remove the rows I expected?
Do my new variables contain the information I expect?
Does the distribution of values in a given variable make sense?
Have I dropped unnecessary variables?
To sanity check the contents of a table node, you can inspect the general table characteristics, check the summary statistics of different variables, look at the table's cells, or create a notebook for more in-depth analysis.
If you haven't interacted with tables in your workflow for a while, these tables may become archived, which will temporarily limit your ability to view cells and query data in that table. This is done to prevent runaway storage costs, while leveraging the built-in reproducibility of workflows to allow you to unarchive the table and pick up where you last left off.
If a table is archived, you can still see the name, row count, and variable names/types. To access variable summary statistics, view cells, or run transforms downstream of an archived table, you'll have to reconstitute the table by re-running upstream transforms.
Note that if the transform immediately upstream (or any additional upstream transforms, if multiple sequential tables are archived) is invalid, you'll have to resolve the invalid state before un-archiving the table.
If you've added a dataset to a workflow that contains files (storage for unstructured data) you will see a table with a File index label in that dataset's list of tables. This is an automatically generated table, where every row represents one file. You can work with this table just like any other in the workflow, but the file_id
variable will remain linked to the files in that dataset for use in a notebook.
Click the Export table button in the table's right panel to see your exporting options and manage any access restrictions associated with data this table is derived from.
As you work in a workflow, node colors and symbols will change in the tree view to help you keep track of your progress.
Empty
White background
A table node will be empty when it contains no data because the upstream node either has not been executed or its execution resulted in no data.
Executed
Grey background
A table node will be grey when it has data that aligns with the contents of its upstream node.
Stale
Yellow background
A node will be stale when an upstream change has been made. This means the content of the node does not match the content of the node above it.
Sampled
Black circle with 1% icon
Incomplete access
All black background, or dashed borders
You don't have full access to the node. Click on the Incomplete access button in the top bar to begin applying for access to the relevant datasets.
This data source is a copy of a dataset on Redivis.
This data source is a copy of another workflow on Redivis.
Only possible for dataset source nodes. This means that you are using a 1% sample of the data. When a dataset has a sample, it will automatically default to the sample when added to a workflow. You can switch between the sample and the full dataset at any time in the dataset node.
Only possible for dataset source nodes. For datasets this means that you are not using the latest version. Either you have intentionally switched to using an older version, or this dataset's administrator has released a new version that you can update to.
This means that you are using a 1% sample of the data. When a dataset has a sample, it will automatically default to the sample when added to a workflow. You can switch between the sample and the full dataset at any time in the dataset node.
You can use workflows on Redivis to analyze any type of data, at any scale. Workflows allow you to organize your work into discrete steps, where you can easily validate your results and develop well-documented, reproducible analysis artifacts.
To create a workflow, navigate to a dataset that you're interested in working with and press the Analyze data in a workflow button. If you do not have data access to this dataset you may need to Apply for access first.
You can also create a workflow from the "Workflows" tab of your workspace. Once you've created your workflow, you'll be able to add any dataset that you have access to.
In a Redivis workflow, the steps you've taken are represented visually on the left side of the screen in a tree layout, where you can build nodes and easily see their relationships. Each node type can take different actions. The workflow tree automatically expands and creates a layout of all the nodes in your workflow, keeping everything organized as it grows.
Clicking on a node will allow you to see its information and/or execute changes. Your selected node will have a purple border in the tree and the right panel will update to show that node's contents. You can expand or collapse this right panel by dragging the center pane side to side.
Most nodes in a workflow will be connected to others, and these connections are drawn as lines on the workflow tree. Thicker lines indicate a direct source connection, while thinner lines show a join or looser linkage.
When you click on a node to select it, the lines connecting upstream and downstream nodes will turn purple. For larger workflows this can make it easier to trace where the data in your selected node came from, or see what its eventual outputs are.
Nodes that have a direct source will have a large solid line connecting them.
Data source nodes represent datasets or workflows that have been added to your workflow; each is a copy of a data source elsewhere on Redivis.
Table nodes are either tables within a dataset, or the resulting output table of an upstream computation.
Transform nodes are queries that are used to reshape data by creating new tables.
Notebook nodes are code blocks (and their outputs) which are used to analyze data.
The main way to build your workflow is to add and edit nodes. You will start by adding data to your workflow, then add ways to cut, reshape, and analyze it.
You can click the Add data button in the top left corner of the workflow toolbar to select a dataset or another workflow you want to add to this workflow. This will add a copy of the selected data source to the workflow and allow you to reference its tables.
Each data source can only be added to a workflow one time. You can add multiple data sources in bulk from this modal. By default all datasets are added at the current version but you can right click on the dataset in this modal to select a different version to add.
Add data editing nodes
All data cutting, reshaping, and analysis on Redivis happens in either a transform or a notebook node. These nodes must be attached to a source table, so they can only be created after you've added a dataset.
To create a work node, click on a table and select either the transform or notebook button that appears beneath it. If the table already has a downstream node you can press the plus icon beneath it instead.
There are two mechanisms for working with data in workflows: transforms and notebooks. Understanding when to use each tool is key to taking full advantage of the capabilities of Redivis, particularly when working with big datasets.
Transforms are better for:
Reshaping + combining tabular and geospatial data
Working with large tables, especially at the many GB to TB scale
Preference for a no-code interface, or preference for programming in SQL
Declarative, easily documented data operations
Notebooks are better for:
Interactive exploration of any data type, including unstructured data files
Working with smaller tables (though working with bigger data is possible)
Preference for Python, R, Stata, or SAS
Interactive visualizations and figure generation
You can right click on any transform or notebook node in the workflow tree to copy it. Right click on any table node to paste the copied transform or notebook. The table you pasted it onto will become this node's source table. You may need to update the node to reference its new source's variables for it to work properly.
If you would like to place a node between other nodes you can click on a transform or notebook and select Insert transform.
If you have a transform copied to the clipboard you can insert it between other nodes by right clicking on a transform or notebook and selecting Paste copied transform above. This will insert both the transform and its output table into the branch of the workflow you've selected.
All transforms can be split at the step level into two different transforms by clicking Split in any step's menu. Additionally, two transforms can be combined into one by right clicking on a table to Remove it.
You might want to split a transform above a tricky step to see what the output table would look like at that point in the process. This can be a key tool in troubleshooting any issues and understanding what might be going wrong.
After splitting a transform to check an output table, the next logical step might be to combine these two transforms back into one again. Or perhaps you have a string of transforms which you no longer need the output tables for and want to reduce the size of your workflow.
Only nodes that you have intentionally created can be deleted or removed from the workflow. Output tables are created by running code and can't be removed.
To delete a node, right click on the node and select Delete.
When deleting a transform or notebook:
The transform or notebook and its output table will be deleted.
If the workflow tree has additional nodes downstream, the transform or notebook and its output table will be 'spliced' out, i.e. the upstream node nearest the deleted transform will be connected to the downstream node nearest to the deleted output table. This deletion will cause the next downstream transform to receive new input variables from the node that's directly upstream, which will probably cause it to become stale, and possibly invalid, since variables it was previously referencing might no longer exist.
When deleting a dataset:
The dataset and all downstream nodes will be deleted. If additional branches are joined into the branch downstream of the deleted dataset, those branches will be retained up to but not including the transform located in the deleted branch.
Since you can't undo a deletion, you'll receive a warning message before proceeding.
As you make changes in a workflow you will change the status of different nodes connected to it. These changes in status are shown in the left panel of the workflow to help you keep track of any changes.
You can select the Map button on the left side of the workflow toolbar to begin a run of all stale nodes in the workflow. This will execute all transform and notebook nodes in a logical sequence to update the workflow completely.
To shift a node, select the node and click the arrow that appears next to most nodes when selected. Shifting nodes is purely an organizational tool and it has no effect on the data produced in the workflow.
Along with clicking a node to select it, all nodes have a Source and Output navigation button on the right side of the workflow toolbar. You can click this button to jump directly to the immediate upstream or downstream node.
You have a couple options for how this workflow can be reused in other analyses. Click the Fork button in the toolbar to get started.
Add to another workflow
Select this option to choose a workflow you'd like to add this one to. This will be a linked copy that will instantly update as the original workflow is updated. This can be a very powerful tool in building out complex workflows that all reference the same source analysis or cohort.
Clone this workflow
This will create a duplicate copy of the workflow that will have a link back to the original workflow on the overview page. It can be helpful to work from an exact duplicate of a workflow if you're viewing a colleague's workflow and would like to test a similar data manipulation, or if you'll rely on similar data manipulations in a new research effort.
Workflows are continuously saved as you work, and any analyses will continue to run in the background if you close your browser window. You can always navigate back to this workflow later from the "Workflows" tab of your workspace.
A complete version history is available for all transforms and notebooks in your workflow, allowing you to access historic code and revert your work back to a previous point in time.
As you work in a workflow, nodes colors and symbols will change on the tree view to help you keep track of your work progress. More detailed information about each of these states can be found on each node page.
These are the most important workflow-wide concepts to be aware of. Workflows are designed to be iterative and allow you to see outputs and go back to make changes and rerun for new results.
In order to keep track of any edits, transforms with unrun edits will be shown differently on the map than other transforms, with grey diagonal marks.
Any time data is changed, nodes that referenced that data will become stale (yellow background) until they are rerun on the new inputs.
Workflows are made to be a space for collaborative analysis. You can easily share access to your workflow with colleagues.
Any node in a workflow can be annotated with a comment by any workflow editor. Comments are intended to be a space for conversation grounded in a specific area of a workflow. They can be replied to in threads by multiple collaborators and resolved when the conversation is complete.
Multiple users with edit access can be working on a workflow at the same time. When this is the case you will see their picture in the top menu bar alongside your own, and a colored dot on the workflow tree to the right of the node they currently have selected. When a notebook is started you will see any collaborators' code edits in real time.
The Aggregate step collapses rows that are identical across a set of variables, optionally creating new aggregate variables in the process.
Collapsing on variables test and date, and creating new variable average_score to aggregate data from the score variable.
There is one collapse block where we will define how the data will be reshaped. On execution:
Your data will be cut to only include the variables chosen in this block.
Duplicate records across the collapsed variables will be dropped.
There can be one or more aggregation blocks where we can capture aggregate information in a newly created variable.
You can successfully collapse your table without creating any new variables in aggregation blocks.
Each aggregation block in the step represents one new variable in your output data.
Aggregation blocks are how you can capture information about records dropped in your collapse block.
Collapse block:
Aggregation block(s):
We have test score data recorded per test, student, and date. However we want to know the average score on each test overall.
Variables to collapse on: Of the variables in our table, the only ones we want in our final output are test and date, so we select these here. We leave out student because that information doesn't matter to us anymore, and we leave out score because we are creating a new variable to replace it with aggregated information.
Name: We give our new variable a descriptive name, average_score.
Aggregation method: We want to average all values in the score variable per test, so we choose Average.
Variable to aggregate: Here is where we choose score as the variable containing the data we want to Average. If we had chosen a different Aggregation method, we might have different input fields here to answer.
All variables that aren't collapsed on or used in aggregation are removed.
The Average of score is computed wherever test and date (our collapsed-on variables) have the exact same values.
The score variable is removed, since it was not collapsed on.
Then all exact duplicate records are dropped to create the output.
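For reference, this collapse-and-aggregate corresponds to a SQL GROUP BY. A minimal sketch, assuming a hypothetical source table named scores (Redivis generates its own SQL, which you can inspect via View SQL):

```sql
-- Collapse on test and date, creating average_score from score
SELECT
  test,
  date,
  AVG(score) AS average_score
FROM scores
GROUP BY test, date;
```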
We can build on our previous example, but say we want to know additionally how many students took each test.
2nd aggregation block
Name: We want to give the new variable a descriptive name unique from our other block (and other variables in our table), in this case test_count.
Aggregation method: We want to count all values in the test variable per date, so we choose Count.
Variable to count: We choose test as the variable containing the data we want to Count. Since none of our variables have null entries, we could choose any variable here and get the same result. If we did have nulls, they would not be included in the Count. Conceptually we do not want to only include distinct values in our count, so we leave that option off.
This step can be used to drop duplicated records even in cases where no aggregation happens.
Let's say we have data that we know had duplicate records that we don't need.
Variables to collapse on: To drop all records that are an exact duplicate across all variables, we just need the collapse block with no aggregation blocks. We need to select all variables here, and can do so by typing all of them out or inputting *.
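Collapsing on every variable with no aggregation blocks behaves like a SELECT DISTINCT. A minimal sketch, again assuming a hypothetical scores table:

```sql
-- Keep one copy of each fully duplicated row
SELECT DISTINCT *
FROM scores;
```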
A Create variables block uses methods to make new variables based on existing data. The method selected will dictate how the block operates.
Creating a new variable for what letter grade each student got on their test.
There will be at least one new variable block where you will define and complete a new variable method.
When multiple blocks exist, the variables will be created in sequence and can reference each other.
Analytic methods are a special category of method that allow for each row to be computed individually. When using an analytic new variable method new tools become available:
A partition segments data based on values in the selected partition variables, and computes the analytic method within those segments separately.
A window defines which records are used to compute the analytic method. Usually this would be accompanied with an order clause.
Examples 3 and 4 below go into more detail for analytic methods.
A simple new variable format is extracting one part of a variable into a new one. In our data, we have a full date including year, month, and day, but we want to extract the month for use elsewhere.
Name: We name our new variable test_month.
Method: Once we choose Date extract as our method, new fields appear.
Variable to extract: The variable with the data we want to extract from (date).
Date part: The part of the variable we want to extract. Since date is a Date type variable, the information about date part is stored in the format and can be easily extracted. We choose Month since that is what we want to create.
The Case method (if/else) allows us to specify one or more conditions to create the values of the new variable.
For example, we can create a variable capturing the grade of each test in our data.
When run, each row of the table will be evaluated in each section of the case statement (If, Else if, Else set to) until it matches something and is set to the corresponding value.
After the first section where a row meets the criteria, the new variable value will be set and no other sections will be evaluated.
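Under the hood, the Case method corresponds to a SQL CASE expression. A minimal sketch, assuming a hypothetical scores table and illustrative letter-grade cutoffs (the thresholds are assumptions, not values from the example):

```sql
-- Derive a letter grade from score using if / else-if / else logic
SELECT
  *,
  CASE
    WHEN score >= 90 THEN 'A'
    WHEN score >= 80 THEN 'B'
    WHEN score >= 70 THEN 'C'
    ELSE 'F'
  END AS grade
FROM scores;
```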
We can use an analytic method to compute a value for each row (rather than the entire table). By using a partition we can define groups of rows to calculate across.
For example, in our grades data we can calculate the average score of each test.
Method: We want to calculate the average, so we select this analytic method.
Partition: This is where we define the groups in which the average will be calculated.
If one or more variables are selected here, the average for our new score_average variable will be computed across all rows that have the same values in the selected variable(s).
If no variables are entered here, then the average will be computed across the entire table.
We want to average scores from the same test so we select test here.
Variable to aggregate: This variable contains the data we want to average.
Window: We don't want to define a window in this average so we leave it on All rows in the partition.
To create the new variable, for each row with the same value in test, the values in score will be averaged together. So there are 3 rows with the value quiz in test, which average to 69, and all rows with quiz will show the same average of 69 in our new variable.
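In SQL terms, an analytic method with a partition is a window function with a PARTITION BY clause. A minimal sketch, assuming a hypothetical scores table:

```sql
-- Average score within each test, attached to every row
SELECT
  *,
  AVG(score) OVER (PARTITION BY test) AS score_average
FROM scores;
```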
We can use a window to calculate a moving average. A window will define how many rows before or after the current row to use when calculating the average.
Partition: We choose not to use a partition in this example since our data does not need to be segmented.
Variable to aggregate: The numbers we want to average are in the score variable so we choose that here.
Window: We want to create a moving average based on one entry before and after the current row, so we select Rows here.
Order by: Our data is already ordered by date in this table, but if it wasn't we would definitely need to order on the relevant variable here.
Rows preceding / Rows following: This is where we define how many rows to include in the average.
For each row, one row preceding and following is used to compute the score_average. So for the first row we average 10 and 10 (since no preceding row exists, it is excluded). For the second row 10, 10, and 40 are averaged. This process repeats until the end of the table is reached.
Note that we could also use Range instead of Rows for our window if our question was time based (e.g. average score over 1 month preceding and 1 month following).
Continuing the previous example, if we had included a partition then this same process would be completed separately for the values in each partition. So if we had partitioned on test in this example, our outcome would look different.
Since quiz2 is in a separate partition, those rows are averaged separately.
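A moving average like this corresponds to a window frame in SQL. A minimal sketch of both variants, assuming a hypothetical scores table; remove the PARTITION BY line for the unpartitioned version:

```sql
-- Average of the previous, current, and next row, ordered by date
SELECT
  *,
  AVG(score) OVER (
    PARTITION BY test                      -- optional: compute separately per test
    ORDER BY date
    ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
  ) AS score_average
FROM scores;
```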
The Filter step will select for rows that meet a certain set of conditions.
Filter out rows with scores less than 70.
Basic state
A filter step will be made up of one or more filter blocks with a completed condition.
When there are multiple filter blocks in a step, conditions in all blocks must be met for rows to be kept.
If you have a more complex filter statement that is dependent on multiple nested conditions you can press the + button to expand the filter block.
Expanded state
When multiple conditions are needed in a block, you must specify how they relate to each other (AND vs OR).
Any nested conditions are resolved before higher level conditions.
Let's say we want to reduce our table to only contain information about results from the final test.
Variable(s): The variable test is where the final value we want to evaluate on is located.
[Operator]: We want fields that exactly match, so we choose =.
Value(s) or variable(s): We want to only keep rows where final is present, so we put that here.
Let's say we don't just want values from the final but only those from the final with a score above 60.
We input the data as in the above example, but since we now have two conditions, we have to decide how they relate to each other. In this case we want data that meets all the conditions, so we select All conditions must be satisfied (AND).
Let's say we want to keep all data from the final greater than 60, or any scores above 85.
We want to keep all rows where scores are over 85 OR are from the final and over 60. So we set up the final and over 60 conditions under an AND umbrella, and nest that under the OR umbrella alongside our condition about scores over 85.
When executing, the nested conditions (scores from the final over 60) will be evaluated first to be true or false. Then the higher level condition (scores over 85 OR (scores from the final and over 60)) will be evaluated. Any rows that meet this higher level condition will be kept in the output, and any that do not will be discarded.
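Expressed as SQL, the nested conditions become a parenthesized WHERE clause. A minimal sketch, assuming a hypothetical scores table:

```sql
-- Keep rows scoring above 85, or final-exam rows scoring above 60
SELECT *
FROM scores
WHERE score > 85
   OR (test = 'final' AND score > 60);
```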
Let's say we only want to keep rows where scores are greater than the average score, and that are for our selected students. Right now our selected students are Jane and Pat but we know that might change in the future.
With multiple inputs on either side of a condition, the condition is evaluated as an OR. So this condition will evaluate to true if the value in student equals any value on the list (jane or pat).
While we could have input jane and pat into the right side of this condition and gotten the same result, using a list makes it easy in the future to change our selected students centrally and update it everywhere the list is used.
For our second condition, we want to evaluate how two values in the same row compare to each other.
When executed, this condition will look at each row to check if the value in score is greater than the value in score_average. If so, then the condition is met.
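As SQL, this filter combines a membership test with a row-level comparison. A minimal sketch, assuming a hypothetical scores table that already contains the score_average variable; the literal list stands in for the selected_students value list:

```sql
-- Keep selected students whose score beats their test's average
SELECT *
FROM scores
WHERE student IN ('jane', 'pat')
  AND score > score_average;
```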
A comparison always evaluates to either TRUE or FALSE.
Comparisons are made up of one or more comparison rows. A comparison row always evaluates to TRUE or FALSE; multiple rows can be nested together with a logical AND/OR.
A comparison row is made up of three components:
The left expression can contain variable parameters from the source table, joined tables, as well as any newly created variables. All referenced new variables must be "upstream" from the current comparison, with the exception of joins, which may reference new variables that are constructed by any variables upstream of that join.
Depending on the operator selected, the left expression may contain multiple variables. In this case, each left expression will be evaluated against the right expression(s), and logically joined via an OR. If you want to take the logical AND of multiple variables, create a separate comparison row for each variable.
Multiple left hand values are only supported for the = and like operators.
Redivis supports the following operators:
Checks if any of the value(s) in the left expression are (not) equal to any values in the right expression. NULLs are treated as equivalent (NULL == NULL -> TRUE and NULL != NULL -> FALSE).
Checks if the value in the left expression is less than, greater than, etc. the right expression. String comparisons are lexicographically ordered; other data types are compared based on the numeric / temporal order of that value. Comparisons between NULL values will always be false.
Checks if the string(s) in the left hand expression match the specified pattern(s) in the right hand expression. The pattern may contain the following characters:
A percent sign "%" matches any number of characters
An underscore "_" matches a single character
You can escape "\", "_", or "%" using one backslash. For example, "\%"
The right expression can contain any variables allowed in the left expression, as well as literal values and lists. The comparison row will evaluate to TRUE when the left hand expression matches any of the right hand expressions, except for the != and !like comparators, where the comparison will evaluate to true if all values are not equal to / not like the left expression.
To match against a null datum (empty cell), you must specify the special literal value NULL here.
The Limit step reduces the table to a set number of rows.
Limit to 2 rows
There will be one limit block where you will define your limit.
The limit step is non-deterministic if the result set isn't ordered. Different rows may be returned from subsequent queries.
Let's say we are doing initial exploratory work on data. We did a number of complex steps and want to quickly execute the query so we can iterate.
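The Limit step maps directly onto a SQL LIMIT clause. A minimal sketch, assuming a hypothetical scores table; without an ORDER BY, which rows come back is not guaranteed:

```sql
-- Return at most 2 rows for quick iteration
SELECT *
FROM scores
LIMIT 2;
```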
Some methods are only available for certain variable types, so you might need to retype variables before you can use them in the method you've chosen.
More methods to choose from can be found in the Variable creation methods section.
Comparison statements used here will operate the same as they do in the Filter step and can be nested in the same way.
First, we create a value list called selected_students with the values jane and pat on it. Then we can select it as compared to student.
Variables to collapse on
All variables you want to include in your output.
Name
The name of the new variable being created.
Aggregation method
How the new variable will be summarized (e.g. SUM or COUNT).
[Method fields]
After you select your aggregation method, you will be prompted to input the information your chosen method needs to execute. You can see more specifics in the Variable creation methods section.
Name
The name of the variable being created. This must follow all naming standards
Method
The way that the new variable will be created. Choosing this will bring up additional fields to complete specific to the chosen method. See all methods here.
Variable(s)
The variable whose values will be evaluated.
Depending on the operator selected, it may contain multiple variables. Use wildcard characters (e.g. *) for bulk selection.
[Operator]
How the two sides of the condition will be evaluated.
Value(s) or Variable(s)
The variable or value which the previously chosen Variable(s) will be evaluated against.
Depending on the operator selected, it may contain multiple variables, values, or a value list. Use wildcard characters (e.g. *) for bulk selection.
Limit
The number of rows you would like your output data to contain.
A transform is a series of steps detailing a data transformation that are executed in sequential order when the transform is run. These steps and a final variable selection create an output table that can be inspected and further transformed.
The majority of building a transform is choosing and completing your data transformation steps. Steps represent one specific action (e.g. filter rows, join tables) and are completed in the Redivis transform interface. Every step is fundamentally a piece of SQL code that can be examined and reproduced by running it in any environment with the same data.
When running the transform, steps will be executed in order, building off of what directly came before them.
Steps have additional options in the ⋮ menu to help in the building process:
Collapse / expand
Annotate
Disable / enable
Reorder (this might change the outcome)
The final task in constructing a transform is to select which variables you want to have in your output table. You will use the variable selector in the bottom pane of the transform to choose which to keep.
Transforms are iterative and are designed to let you easily check your progress as you work. As long as a transform is valid, you can press the Run button to generate your output table. Click on this output table node in the workflow tree to see your results once the run is complete.
The transform interface will prevent you from running a transform that is invalid. This will be displayed with an invalid icon on the step or the part of a step where the issue is. Usually when you have an invalid alert it means some crucial information is missing and you can follow the invalid icons to find the locations you'll need to update.
Every time you run the transform, a snapshot of the transform at that point in time is saved. You can click on History to view the log of these runs, and click on any snapshot to revert the transform to that point in time.
Your transform will then be in the Edited state and no changes will be made to the output table. Once you run it, this configuration will become a new snapshot at the top of the transform history list.
Output tables don't just exist to store data; you can use the variable statistics generated in the table node to make sure your transform did what you expected it to do.
Common things to check might be total number of rows and variables in your output table, distinct and null counts of a key variable, or even running a query on the Query tab of the table to filter by a particular value.
If you find the table isn't what you want, you can go back to the transform to make changes and rerun it. Transforms were created to be iterative!
By default transforms are named with a number based on the order they were created. We suggest renaming your transforms with descriptive names so you and your collaborators can quickly find them later on. You can rename a transform by clicking on the title of the node, or on the Rename option in the transform ⋮ More menu.
If you have not specifically renamed the connected output table, renaming a transform will also rename this output table.
All transform nodes have one source table. You can join in multiple tables, but the source table designation will affect how the joins execute. You can change the source table from the ⋮ More menu in the transform. Note that if the new source table has different variables your transform might become invalid until you update it.
All transforms can be split at the step level into two different transforms by clicking Split in any step's menu. Additionally, two transforms can be combined into one by right clicking on a table to Remove it.
You might want to split a transform above a tricky step to see what the output table would look like at that point in the process. This can be a key tool in troubleshooting any issues and understanding what might be going wrong.
After splitting a transform to check an output table, the next logical step might be to combine these two transforms back into one again. Or perhaps you have a string of transforms which you no longer need the output tables for and want to reduce the size of your workflow.
As you go, you might find yourself wanting to take certain data cleaning actions multiple times or move them around in your process. You can right click on any transform in the workflow tree to see options for copying it or deleting it. Right click on any table node to paste the copied transform. You can also insert transforms above other transforms.
Steps and parts of steps can also be copied and pasted within the same transform or across multiple transforms.
All transforms generate SQL, which will be executed when the transform is run. To view the code for a particular step, click on View SQL in the step menu. You can also see the SQL for the entire transform from within the ⋮ More menu of the transform.
You can convert this code into a SQL query step and run it to achieve the same outcome, or edit the code directly.
Redivis uses SQL as the basis for data transformation for many reasons. It is designed for working quickly and efficiently with large tables, its concepts and basic operations are easy to learn, and it is highly reproducible.
While Python, R, Stata, and SAS are commonly used in data analysis, they are not well suited to working with large tables. Executing queries on millions of records can take hours and in many cases fail. We have provided a point and click interface for transforms to make working in a perhaps unfamiliar coding language easier.
Tables are data structures containing rows and variables
Variables represent fields in the table that all rows contain values for
Rows represent an individual observation in the table, with values across all variables
A SQL query will always operate on data from one or more tables and output a single data table.
When you are using the interface or SQL code, it is best to start the process by figuring out what you would like your output table to look like. What are the variables? What do you want each row of the table to represent?
Once you figure out what you want your output to look like, you can work backwards and formulate the steps of your query to build the shape you'd like.
The Stack step will add or remove rows from another table that align with your existing variable structure.
Union rows from Table B.
Stacked table
The table containing the rows you wish to add.
Stack type
How the rows will be added. (More information below.)
Variable alignment
How you will define the variable structure between the source and the Stacked table.
All matching variables: Will automatically match variables and only include variables with a matching name, the rest will be dropped.
All variables: Will automatically match variables and keep all variables in the output. Values will be marked null where there is no match for an existing variable.
Manually match variables: Will require you to align variables between two tables.
Retype variables
Variables will be retyped in order to be aligned.
Only keep distinct rows
If any added rows are an exact duplicate of an existing row it will be dropped from the output.
Create variable for table name (Union only)
In order to record in the data where new rows came from you can opt to create a new variable with table information.
The most common stack type is a Union, which can be particularly useful when combining data which has been broken up into multiple tables with the same structure.
Appends all of the stacked table’s rows to the source table.
Keep only those distinct rows in the source that do not match rows in the stacked table.
Keep only those distinct rows that are in the source and stacked table.
Let's say I have data broken out into two tables with the same structure. I want to add all rows together as they are.
Stacked table: My data is in table1 so we select that here.
Stack type: We want to add all rows so we select Union here.
Variable alignment: Somehow our variable names shifted between tables so we can't automatically align them. We choose Manually match variables here and then fill in the variable names under Source variable and Stacked table variable.
Where the variable names are different, the output will keep the name of the Source variable when it exists.
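A Union stack is essentially a SQL UNION ALL with the columns listed in matching order. A minimal sketch, assuming hypothetical source_table and table1 tables; the drifted column names in table1 (student_name, test_name, points) are assumptions for illustration:

```sql
-- Append table1's rows, manually aligning differently named variables
SELECT student, test, score FROM source_table
UNION ALL
SELECT student_name AS student, test_name AS test, points AS score FROM table1;
```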
To continue the previous example, let's say we have a third table of data, and it has additional information that was not present in the first two about state. We want to keep this information and also keep track of which rows came from which table.
We create one block for the table1 union, and a second block for the table2 union.
Variable alignment: Since we want to keep all variables no matter if they have matches, we select All variables.
Create variable for table name: We check this box since we want to keep track of which table each row came from.
Note that Source table name is automatically populated with the name of this transform's source table.
Let's say we have a table with information about students and their test scores, and we have identified some students that have dropped the course and that we no longer need in the original table. We have gathered those students in table1.
The rows that were in both tables are now removed.
Let's say we have two tables full of students, one of students who took our fall semester class and one of students who took the spring semester class. We want to only keep student rows for people who took both classes, so those who are present in both tables.
Only rows that were in both tables now remain.
The Pivot step rotates data values into new variables using aggregation.
Pivot on the test variable.
There is one pivot block where we will define the pivot variable and values.
There are one or more aggregation blocks where we will define how the table will reshape and what data populates the newly created variables.
There must always be at least one aggregate block.
The total number of new variables in your output table will be the number of Pivot values defined * The number of aggregation blocks.
Pivot block:
Pivot variable
The variable containing the values that will become new variables.
Variables to collapse on
All variables you want to include in your output, except for variables defined elsewhere in the pivot operation (pivot variable, variable to aggregate on).
Pivot value
A value from your pivot variable which will become a new variable in your output table. You can add multiple values. You do not need to include every value of this variable.
New variable name
An option to rename any of the new variables created in the pivot value field. Leaving this field blank will persist the value name into the variable name.
Aggregation block:
Alias
A value that will be prefixed to your new variable names. If you are aggregating one variable this field is optional, but if you are aggregating multiple then you'll need to specify a value for each one.
Aggregation method
How the variable to aggregate will be summarized (e.g. SUM, COUNT).
Variable to aggregate
Which variable's data will populate the new variables you are creating in the pivot value field.
Let's say we've recorded data about tests given and student scores and we want to pivot in order to see information sorted by student.
Starting data:
Input fields:
Pivot variable: Since we want to create new variables based on values in the student variable, we select that as our Pivot variable.
Variables to collapse on: We include here all variables that belong in the final shape of our pivoted table and that are not already defined as our pivot variable or variable to aggregate. We include test and date in this example.
Pivot value: We need to choose one or more values from our selected pivot variable (student) to become new variables in our output table. This does not have to include every value in this variable, even though we use all of them in this example.
New variable name: We don't want to rename any of these new variables, so we can leave this blank.
Alias (optional): Since we only have one variable to aggregate, we can leave this blank.
Aggregation method: We need to choose how our variable to aggregate (score) will be aggregated. In this example data we don't have duplicated values so what we choose here won't matter, so we choose SUM to validate the process.
Variable to aggregate: This is the variable (score) that will be removed from this table, and its values will be redistributed to our newly created variables (jane, pat, and sam) with the aggregation method we selected handling any duplicate values (in this example, SUMming them).
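One way to express this pivot in SQL is conditional aggregation (the SQL Redivis actually generates may differ, for instance by using a PIVOT clause). A minimal sketch, assuming a hypothetical scores table:

```sql
-- One new variable per pivot value, populated from score
SELECT
  test,
  date,
  SUM(CASE WHEN student = 'jane' THEN score END) AS jane,
  SUM(CASE WHEN student = 'pat'  THEN score END) AS pat,
  SUM(CASE WHEN student = 'sam'  THEN score END) AS sam
FROM scores
GROUP BY test, date;
```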
Continuing this example, let's say we want to average scores from the two quizzes. We can collapse the table by not including the date variable. Without that variable there would be two identical values in test (quiz), which will be collapsed into one.
In this case it does matter what we select for Aggregation method, since the duplicate values for quiz will get aggregated. We select Average here.
We can add another variable aggregation block to gather more information about the variables being aggregated. Continuing this example, let's say we not only want to average the quiz scores, but count the number of averaged quizzes.
Note that the Alias field is no longer optional since we have two aggregation blocks.
We make sure to select score for our Variable to count since we specifically want to know how many scores were included in our average. Since we have some null scores, selecting a different variable here might give us a different result.
The jane_scoreavg variable contains data from the scoreavg aggregation block's method of Average for the values of student Jane, while the jane_scorecount variable contains data from the scorecount aggregation block.
A Join step will combine data from two tables based on a join condition so it can be queried together.
Inner join where student = student.
There will be at least one join block where you will define a table to join and a join condition for that table.
When multiple blocks exist, the tables will be joined in sequence.
Join table
The table containing the data you wish to combine with this transform's data. Once selected, this table's variables will be marked t1 (or t2, etc. if other joins exist).
Join type
The way that the tables will be combined. (More information below.)
Source variable
The variable in your source table that you want to join on. All records in this variable will be matched against all records in the Joined table variable.
Joined table variable
The variable in your selected Join table that you want to join on. All records in this variable will be matched against all records in the Source variable.
If a row in either the source or join table doesn’t have a match, it will be dropped. If a row is matched multiple times, it will be multiplied in the output.
If a row in the source doesn’t have a match, all joined variables will be null for that row. If a row in the join table doesn’t have a match, it will be dropped. If a row is matched multiple times, it will be multiplied in the output.
If a row in the source doesn’t have a match, it will be dropped. If a row in the join table doesn’t have a match, all source variables will be null for that row. If a row is matched multiple times, it will be multiplied in the output. This is the same as a left join, but reversed.
If a row in either the source or join table doesn't have a match, the variables from the other table will be null for that row. If a row is matched multiple times, it will be multiplied in the output.
Every row in the source table will be combined with every row in the joined table. This might be used to perform a join on a new variable that will be created downstream (such as in a geospatial join). You will almost always need to use a filter after this join for the query to successfully execute.
It is strongly recommended to use a Filter step with a cross join to avoid a massively expanded table.
Depending on the structure of your data, joins might multiply rows.
If the variable you are joining on has no duplicate values in either the source table or the joined table, then your output table will always have the same or fewer rows than the sum of both tables' rows.
If the variable you are joining on has duplicates in either table, any time there is a duplicate row it will be matched for each duplicate.
If the variable you are joining on has duplicates in both tables, any time there is a duplicate row it will be matched for each duplicate.
Let's say our information about student absences and test scores are in separate tables and we want to join them together.
Join table: The data we want to join is in Table 1 so we select it here.
Join type: Since we only care about students that have both absence and score information, we choose inner join, which will drop any rows without a match in both tables.
Source variable / Joined table variable: We want to match on the variable student, which is present in both tables. So we select our source table (t0)'s student variable on the left, and our joined table (t1)'s student variable on the right.
Since this variable has the same name in both tables, they will be combined into one variable in the output table.
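This inner join corresponds closely to its SQL equivalent. A minimal sketch, assuming hypothetical scores (t0) and table1 (t1) tables, where table1 carries an absences variable (an assumption for illustration):

```sql
-- Keep only students present in both tables
SELECT t0.*, t1.absences
FROM scores AS t0
INNER JOIN table1 AS t1
  ON t0.student = t1.student;
```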
Let's say instead of a table with aggregated absences we have a daily timeliness chart. We want to join the scores for the corresponding attendance information for both the student and date in question.
Join type: Since we want to keep our absence data whether a test was taken or not, we do a Left join.
Join condition: We join on rows where BOTH the values in the student and date variables match, by selecting both in our join condition.
In some cases you might need to do a more complex join, such as using an operator other than =
, or joining on a condition that uses a variable from your joined table that needs to be retyped or created first. This is common in geospatial joins but might come up in any situation.
To use a more complex join condition you can use a cross join, followed by filter. The cross join will make all possible joins between these two tables and then the filter will limit it to rows matching your condition. Effectively this will execute the same as if you had done your initial join with a more complex condition.
Let's say in the example below we want to do an inner join where T0 student_id = T1 student_id, but in the source table student_id is an integer type variable and in Table 1 it is a string. We want to retype the variable in Table 1 before we can join on it, but we can't retype it until we have joined in Table 1. So we will do a cross join, then a retype, then a filter to narrow down to only records meeting our condition.
Redivis supports direct SQL queries for advanced functionality and for users who prefer to work in a programmatic interface. You can add one or more SQL query steps to any transform.
This step accepts SQL code and will execute in the order it has been placed within the rest of the steps.
For example:
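A minimal sketch of a SQL query step that aggregates the step's input; _source_ is the documented reference to the step's source, and the variable names are taken from the earlier examples:

```sql
SELECT
  student,
  AVG(score) AS average_score
FROM _source_
GROUP BY student
```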
You can mix and match SQL steps with other step types within your transform. If you choose to do so, this will introduce some additional language semantics, since transforms allow variables with the same name to co-exist at any given step, delineated in the UI with the t1, t2, ... and v prefixes. The v prefix denotes any newly created variable, while the tN prefix denotes variables that have been joined from other tables.
In order to reference these variables, you must include the prefix in your SQL code before the variable name. The only exception to this rule is for variables prefixed by t0 — these should be referenced normally, allowing you to ignore these rules if you aren't mixing the SQL interface with other step types.
DDL / DML queries are not supported
All queries must return fields of a valid Redivis type. The following BigQuery types are not supported in the result set, though they may be used internally within the query:
TIMESTAMP
NUMERIC
BIGNUMERIC
BYTES
INTERVAL
JSON
ARRAY
STRUCT
BigQuery ML syntax is not supported
BigQuery scripting language is not supported
Statements may only reference tables that exist in your workflow
Unnamed variables in the result are not supported
Duplicate variable names in the result are not supported
If your query is invalid, you will see an invalid icon and the error message displayed above your code. Transforms can't be run while any step is invalid.
The Order step sorts the table based on one or more variables.
Example output data:
Order by score descending
Ordering large tables (>1GB) across variables with a large number of unique values requires substantial memory and isn't parallelizable. Order clauses in such cases may significantly slow your transform or cause it to fail.
There will be at least one order block where you will define a variable and a sort order.
When multiple blocks exist, the variables will be ordered and then sub-ordered in sequence.
We can sort a table to quickly see the highest scores.
Order by: The score variable has the data we want to sort on, so we select it here.
Sort: We want the data to go from smallest to largest values, so we choose ASC. There are no null values in this table, so we can choose either nulls first or nulls last and get the same result.
Output data:
Let's say instead we wanted to sort first by the lowest score, then by the most recent date.
First block
Order by: The first variable we want the data sorted on is score so we choose it in the first block.
Sort: We want the lowest values first, so we know we want the information to be ascending. This variable has a null value, so it matters whether we want nulls to appear first or last in the order. Since we want them last, we choose ASC (nulls last).
Second block
Order by: The second variable we want to sort on is date so we put it here.
Sort: Since we want the most recent (highest) values first, we want it to be descending. There are no null values in this variable so where we put the nulls does not matter. We choose DESC (nulls first).
The Rename step changes the name of an existing variable.
Rename score to final_score.
There will be at least one rename block where you will define a variable and a new name.
We can change the name of a variable to give more meaning to the data.
Source variable: The variable we want to rename is date so we select that here.
New name: date_taken would be a better name for this variable, so we enter it here.
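In SQL a rename is just an alias in the select list; BigQuery's SELECT * EXCEPT makes it concise. A minimal sketch, assuming a hypothetical scores table:

```sql
-- Keep every variable, but expose date under the name date_taken
SELECT
  * EXCEPT (date),
  date AS date_taken
FROM scores;
```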
In the bottom pane of the transform you will need to select the variables you'd like to keep in your output table. One or more variables must be kept in order to run the transform.
Click on a variable and use the center buttons to move it between the two lists. You can use command-/control- and shift-click to select multiple variables at once.
For larger variable lists, clicking the search icon will allow you to quickly filter the available variable options down based on search terms.
You can select the Distinct box to drop any records that are an exact copy of another record across values in all variables in your output table.
Hovering on any variable will show a tooltip with information about where the variable came from, its type, and any assigned label.
You will also see an icon to the left of each variable in the list showing the variable's source. For variables originating in the source table, the icon will show t0, and for variables from subsequently joined tables the icon will show t1, t2, etc. For variables created in this transform it will show v.
Every method is documented here with what information the Redivis interface needs, as well as a link to the underlying BigQuery documentation for more details.
Value lists allow you to persist a group of values in a list, for use throughout a workflow. You can centrally view and edit this list, and any changes to it will be pushed to all transforms where it is being used.
This may be useful in a scenario where a list of diagnosis codes is entered in multiple filters and may change over time as you build your analysis, or when you have a single parameter you are using in multiple places that you want to be able to quickly change anywhere.
To create and update value lists, you can click the Lists button at the top left of the workflow's top bar.
You can create, modify, and delete lists from this interface.
For bulk importing you can copy/paste values from any standard spreadsheet program, or import them from a csv or json file. Please note that string list values should not be quoted — learn more about list types below.
You can then reference the list in transforms throughout your workflow anywhere that you could input multiple values.
Values can be copied into lists from any spreadsheet program by directly using the clipboard. For bulk or scripted actions, you may also import multiple lists as a JSON or CSV file. Note that any list exports will follow the same format.
Each variable in your CSV will represent one list, with the first row containing the name of the list. E.g.:
JSON specification
The same set of value lists can be imported via JSON, with the following format:
Redivis will automatically determine the most specific type for all the values in a list, falling back to the "universal type" of string.
For example, these combinations will yield the corresponding specific types and can be compared to the following types:
Redivis supports BigQuery standard SQL syntax and features, with some limitations. All queries must reference the SQL step's source, represented as _source_ in the query. If the SQL step is the first step in the transform, the source represents the transform's source table. Otherwise, the source represents the output of the previous step in the transform.
You can also reference any other table in the current workflow from your SQL step, following the same reference rules as elsewhere on Redivis. After typing the backtick character (`), available tables will be presented as autocomplete options.
You may also use value lists in your query — they can be referenced as a SQL parameter, prefixed with the @ symbol. Note that all value lists are explicitly typed, and you will need to cast the value list if there is a type mismatch.
This can be useful in situations where you are using analytic methods or if your data is in a particular format. Otherwise this might slow down your transform run time unnecessarily.
Some steps will add or remove variables as options for selection to keep or discard in the output table. Joins will occasionally bring in variables from other tables. Create variables steps will make new variables. Aggregate steps will remove all variables that are not included in the collapse, but can add newly created aggregate variables.
When using a Create variables step (or when creating aggregate variables in an Aggregate or Pivot step) you will need to select a method to specify how the variable will be created.
If you update the values within a list, all transforms that reference the list will be marked as stale.
All variables in Redivis have a type. When using lists to filter rows, the list type needs to be the same as (or coercible to) the type of the variable being filtered on.
Order by
The variable containing the values that will be sorted.
Sort
A choice of how all values in the Order by variable will be sorted:
ASC (nulls first)
ASC (nulls last)
DESC (nulls first)
DESC (nulls last)
Note that variables of the geography data type aren't sortable.
Source variable
The variable that you want to rename.
New name
The name that your selected source variable will be given.
Note that this input needs to follow Redivis variable naming constraints.
Fruits        Vegetables    Item_codes
Strawberry    Broccoli      1
Tomato        Asparagus     2
Apple         Spinach       3
Values                                       Type        Can compare to
1, 2, 3                                      Integer     Integer, Float, String
1, 2.0, 3                                    Integer     Integer, Float, String
1, 1.0, 1.1                                  Float       Float, String
1, 2.0, a                                    String      String
1971-01-01, 1972-02-02, 1973-03-03           Date        Date, String
1971-01-01, a, b                             String      String
1970-01-01 12:00:00, 1970-01-01 12:30:00     DateTime    DateTime, Date, String
The Redivis transform connects to a highly performant, parallelized data store. Queries on terabytes of data can complete in seconds, often utilizing the resources of thousands of compute nodes. These are some best practices that can help you increase performance.
Table writes are generally much slower than table reads — if your output table is exceptionally large, it may take the querying engine several minutes to materialize the output. Try restricting the number of rows returned by applying row filters to your transforms when possible, and be cognizant of joins that may substantially increase your row count. Avoid keeping variables that aren't needed.
When you are performing initial exploration, you may consider using a limit step in your transform to reduce the output table size.
Each new variable adds to the computational complexity of the query; the new variable must be computed for every row in the table.
A common anti-pattern is to construct numerous boolean CASE new variables, and then use the result of these new variables in the row filter(s). If possible, it is far more efficient to inline the CASE logic within the row filters, or within fewer new variables, as this allows for improved efficiency in logical short-circuiting.
When your query utilizes a join step, consider the order in which you are joining the data. The best practice is to place the largest table first, followed by the smallest, and then by decreasing size.
While the query optimizer can determine which table should be on which side of the join, it is still recommended to order your joined tables appropriately.
If all of your joins are INNER joins, the join order will have no impact on the final output. If your query leverages combinations of left / right / inner joins, the join order may affect your output; be careful in these cases.
The Redivis transform will prevent you from initiating a run when any steps are invalid, and errors should be rare. However, some errors can only be detected as the query is run — in these cases the job will fail as soon as it encounters an error, logging the error message to the top of the transform.
If you come across an error message not on this list, please email support@redivis.com for further assistance.
This error occurs when a query utilizes too much memory, yet it is often easily resolvable and due to unintended behavior within the transform. This error is caused by a certain component of the query not being parallelizable, which often occurs when combining and / or ordering many distinct values. We recommend investigating the following culprits:
Investigate any order clauses in your transform — either at the bottom of the transform or in a partitioned query. Often, attempts to order on hundreds of millions or billions of distinct values will fail. Note that ordering rows at the bottom of the transform does not affect your output, and is only useful in preparing your data for export.
Confirm that none of your aggregation methods are creating massive cells. For example, using the String aggregate method on an exceptionally large partition can collapse and concatenate many values into one record — if the cell becomes too big, this error will be thrown.
When converting between variable types, all values must be appropriately formatted for conversion to the new type. For example, the value "1,000" is not a valid integer and will throw an error when being converted from a string to an integer.
There are several options for getting around cast errors:
Choose the "If invalid for type, set to null" option in your retype blocks. Note that this will set all invalid values to NULL
, potentially causing unintended side effects. Use with caution.
Filter out all records that have incompatible values.
Create a new variable, using the Case method to convert any invalid values to something that can be appropriately cast (see the sketch below).
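A hedged sketch of those last two options (the column name amount_text is hypothetical; the transform generates the equivalent query from your retype block or new variable step):

    -- Clean values that would otherwise fail the cast, then convert:
    SELECT CAST(REPLACE(amount_text, ',', '') AS INT64) AS amount
    FROM source_table;

    -- Or set any un-castable value to NULL, analogous to the
    -- "If invalid for type, set to null" option:
    SELECT SAFE_CAST(amount_text AS INT64) AS amount
    FROM source_table;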
If an output table is more than 1TB, it cannot exceed the combined size of all source tables by more than 10%. Very large output tables that substantially exceed their inputs are typically the result of a misconfigured join that generates a cross-product across a one-to-many or many-to-many relationship between multiple tables. If you encounter this error, try to apply filter and aggregation steps first, and also validate that your join conditions are appropriately specific.
This error may occur when running queries on tables belonging to an unreleased version, particularly when these tables are made up of hundreds of independent uploads. Under the hood, these unreleased tables are represented as a logical view that stitches the various uploads together into a single table. If your query references multiple unreleased tables, with each approaching the 500 per-table upload limit, it's possible to exceed the total allowed number of tables referenced by a query.
To work around this issue, you can create a transform that simply selects all variables from the unreleased table, materializing the result in your workflow. This output table will now only count as a single table in your query, avoiding this error.
This error will also no longer be an issue once the version is released, as the table is materialized shortly after a version is released.
If you come across any other errors or issues while using the transform, please contact us directly at support@redivis.com.
When creating a new variable, you'll use a variety of inputs to specify the parameters required to fully define a given method. Each parameter is one of the following types:
Refers to a Redivis variable, often limited to a specific subset of types. When using variable parameters in a given transform, you'll be able to select variables from the source table, variables in tables referenced by joins, or new variables created upstream of the parameter.
Refers to a string, integer, boolean, or floating point value. When using literal parameters, you'll be able to type any constant value (e.g., false, 1.0, 'test'), or use a value list to reference the same set of values anywhere in your workflow. If using a literal parameter alongside another parameter, e.g., in a filter comparison, you may have to match the type of literal with that parameter.
Refers to a set of distinct values, usually of a homogeneous type. For example, if a required parameter used a "timezone" enum with 3 options (PST, MST, or EST), you would have to select one of the three options (e.g., PST) for the value of that parameter.
Refers to values true or false.
In many places throughout the transform you will have an option in the system menu to set cell values to NULL. Conceptually this means that the cell is empty and contains no value. If you look at the Cells view of a table, you will see NULL values shown in a grey color to indicate that these cells are empty.
Note that this is different from a cell containing a string with the characters NULL. In that case the cell does have contents and will be treated accordingly.
You will see throughout the transform references to how nulls are included or excluded for steps (such as the Order step) and in calculating summary statistics.
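A minimal sketch of the distinction (the column name value is hypothetical):

    -- An empty cell is NULL; a cell containing the text "NULL" is not.
    SELECT
      value IS NULL  AS is_empty_cell,      -- true only for genuinely empty cells
      value = 'NULL' AS is_the_text_null    -- true only when the cell stores the 4-character string
    FROM source_table;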
Some new variable methods (PARSE ... and FORMAT ...) allow for manipulation of Date, Time, or DateTime variable types. To do so, you'll need to define how the data is formatted using format elements.
For example, to work with dates in a mm/dd/yy format (common in the U.S.; e.g. 03/22/89), we would specify the format string %m/%d/%y. For a DateTime displayed as Mon Oct 17 2016 17:32:56, we would specify %a %b %d %Y %H:%M:%S.
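In a transform you only supply the format string itself, but as an illustration, the same format elements drive BigQuery-style parse and format functions:

    -- Parsing strings into dates/dateTimes with the format strings above:
    SELECT
      PARSE_DATE('%m/%d/%y', '03/22/89') AS parsed_date,  -- 1989-03-22
      PARSE_DATETIME('%a %b %d %Y %H:%M:%S', 'Mon Oct 17 2016 17:32:56') AS parsed_datetime,
      -- Going the other way, formatting a date as a string:
      FORMAT_DATE('%A, %B %d', DATE '2016-10-17') AS formatted_date;  -- "Monday, October 17"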
Below is a complete list of format elements and descriptions:
Format element
Description
%A
The full weekday name.
%a
The abbreviated weekday name.
%B
The full month name.
%b or %h
The abbreviated month name.
%C
The century (a year divided by 100 and truncated to an integer) as a decimal number (00-99).
%c
The date and time representation.
%D
The date in the format %m/%d/%y.
%d
The day of the month as a decimal number (01-31).
%e
The day of month as a decimal number (1-31); single digits are preceded by a space.
%F
The date in the format %Y-%m-%d.
%G
The ISO 8601 year with century as a decimal number. Each ISO year begins on the Monday before the first Thursday of the Gregorian calendar year. Note that %G and %Y may produce different results near Gregorian year boundaries, where the Gregorian year and ISO year can diverge.
%g
The ISO 8601 year without century as a decimal number (00-99). Each ISO year begins on the Monday before the first Thursday of the Gregorian calendar year. Note that %g and %y may produce different results near Gregorian year boundaries, where the Gregorian year and ISO year can diverge.
%H
The hour (24-hour clock) as a decimal number (00-23).
%I
The hour (12-hour clock) as a decimal number (01-12).
%j
The day of the year as a decimal number (001-366).
%k
The hour (24-hour clock) as a decimal number (0-23); single digits are preceded by a space.
%l
The hour (12-hour clock) as a decimal number (1-12); single digits are preceded by a space.
%M
The minute as a decimal number (00-59).
%m
The month as a decimal number (01-12).
%n
A newline character.
%P
Either am or pm.
%p
Either AM or PM.
%R
The time in the format %H:%M.
%r
The 12-hour clock time using AM/PM notation.
%S
The second as a decimal number (00-60).
%s
The number of seconds since 1970-01-01 00:00:00. Always overrides all other format elements, independent of where %s appears in the string. If multiple %s elements appear, then the last one takes precedence.
%T
The time in the format %H:%M:%S.
%t
A tab character.
%U
The week number of the year (Sunday as the first day of the week) as a decimal number (00-53).
%u
The weekday (Monday as the first day of the week) as a decimal number (1-7).
%V
The week number of the year (Monday as the first day of the week) as a decimal number (01-53). If the week containing January 1 has four or more days in the new year, then it is week 1; otherwise it is week 53 of the previous year, and the next week is week 1.
%W
The week number of the year (Monday as the first day of the week) as a decimal number (00-53).
%w
The weekday (Sunday as the first day of the week) as a decimal number (0-6).
%X
The time representation in HH:MM:SS format.
%x
The date representation in MM/DD/YY format.
%Y
The year with century as a decimal number.
%y
The year without century as a decimal number (00-99), with an optional leading zero. Can be mixed with %C. If %C is not specified, years 00-68 are 2000s, while years 69-99 are 1900s.
%%
A single % character.
%E#S
Seconds with # digits of fractional precision.
%E*S
Seconds with full fractional precision (a literal '*').
%E4Y
Four-character years (0001 ... 9999). Note that %Y produces as many characters as it takes to fully render the year.
Some new variable methods (JSON extract ..., JSON scalar, etc.) allow for accessing subsets of the data contained within JSON-formatted strings. To do so, you'll need to define how the data is accessed using a JSONPath element.
For example, to access the firstName attribute within a JSON-formatted string { "firstName": "John", "lastName": "Doe" }, you'd specify a JSONPath element $.firstName.
More information is available in BigQuery's JSONPath format documentation.
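As an illustration (in a transform you only supply the JSONPath, e.g. $.firstName; the SQL below is just to show the behavior):

    -- Extract the firstName attribute; the scalar variant strips the outer quotes,
    -- returning John rather than "John".
    SELECT
      JSON_EXTRACT_SCALAR('{ "firstName": "John", "lastName": "Doe" }', '$.firstName') AS first_name;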
This common method utilizes if-then-else logic, assigning a result when the corresponding condition evaluates to true, otherwise assigning the final ("else") result –> learn more
dynamic (input-dependent)
Returns any value from the input or NULL if there are zero input rows –> learn more
dynamic (input-dependent)
@variable
true
-
Returns the average of all non-null values –> learn more
float
@variable
true
-
Returns the count of all non-null values –> learn more
integer
@variable
false
(*)
@distinct
true
-
Returns the logical AND of all non-NULL expressions –> learn more
boolean
@variable
true
-
Returns the logical OR of all non-NULL expressions –> learn more
boolean
@variable
true
-
Returns the maximum value of all non-null inputs –> learn more
dynamic (input-dependent)
@variable
true
-
Returns the minimum value of all non-null inputs –> learn more
dynamic (input-dependent)
@variable
true
-
Returns a string obtained by concatenating all non-null values –> learn more
string
@variable
true
-
@delimiter
false
(",")
@distinct
true
-
@limit
false
-
Returns the sum of all values, ignoring nulls –> learn more
dynamic (input-dependent)
@variable
true
-
The Retype step converts a variable of a given type to another type.
Retype score from integer to float.
There will be at least one retype block where you will define a variable and a new type.
Source variable
The variable that you want to retype. Note that you can see its current type by locating it in the variable selector at the bottom of the page, or hovering over the variable in this selection menu to see a tooltip with more information.
New type
If invalid for type, set to null
Whether failing type conversions should be converted to a null value. By default, failed type conversions will throw an error.
Note that this might significantly change the content of your data and we suggest using this option with full understanding of how it might affect your outcome and verifying the output results.
Specify custom format
We can convert sales data currently stored as an integer type into a float, to use in another formula elsewhere that only accepts the float type.
Source variable: The variable we want to convert is score, so we select it here.
New type: We want this to be a float, and since the score variable is currently an integer it is compatible with conversion to the float type, so we can select Float here.
If invalid for type, set to null: Since there are no incompatible values in this variable it doesn't matter what we choose. We leave it unchecked to validate that we understand our data and to confirm that this transform will execute without failing.
Output data:
Our output table looks as it should when we look at the cells in the output table, and we can confirm the new type by clicking on the variable in the output table to check the type.
Let's say instead that in our initial data, the sales variable was stored as a string. Converting this to a float would be a bit trickier since the data entry wasn't as clean.
Starting data:
Source variable: Same as above example.
New type: Same as above example.
If invalid for type, set to null: Since this data has 35% as a value, this can't be converted to a float. If we leave this box unchecked our transform will fail. Checking it will set that value to null.
Note that while this retype was successful, our data might not be what we want, and in this case removes information.
Let's say we want to convert our year variable, which is currently a string, to a Date variable type, but the starting format does not cleanly translate.
Source variable: The variable we want to convert is date, so we select it here.
New type: We want this to be a date, so we can select Date here.
If invalid for type, set to null: Since there are no incompatible values in this variable it doesn't matter what we choose. We leave it unchecked to validate that we understand our data and to confirm that this transform will execute without failing.
Specify custom format: Since our data does not fit the standard format (%Y-%m-%d, e.g. 2020-10-01), we need to specify what format it is in. We can use the reference table at the bottom of this page to specify our format: MM/DD/YYYY (see the sketch below).
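A minimal sketch of what this conversion does, assuming a value like 03/22/1989 (the retype block generates the equivalent conversion from your settings):

    -- The custom format MM/DD/YYYY tells the conversion how to read the string:
    SELECT CAST('03/22/1989' AS DATE FORMAT 'MM/DD/YYYY') AS converted_date;  -- 1989-03-22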
Note you can see more information on Redivis variable types here.
String (can convert to: Integer, Float, Boolean, Date, DateTime, Time, Geography)
To integer: A hex string can be cast to an integer. For example, 0x123 to 291 or -0x123 to -291.
To float: Returns x as a floating point value, interpreting it as having the same form as a valid floating point literal.
To boolean: Returns TRUE if x is "true" and FALSE if x is "false". All other values of x are invalid and throw an error instead of casting to a boolean. A string is case-insensitive when converting to a boolean.
To date, dateTime, or time: Uses the canonical format by default (see information below).
Integer (can convert to: String, Float, Boolean)
To float: Returns a close but potentially not exact floating point value.
To boolean: Returns FALSE if x is 0, TRUE otherwise.
Float (can convert to: String, Integer)
To integer: Returns the closest integer value. Halfway cases such as 1.5 or -0.5 round away from zero.
Boolean (can convert to: String, Integer)
To string: Returns true if x is true, false otherwise.
To integer: Returns 1 if x is true, 0 otherwise.
Date (can convert to: String)
DateTime (can convert to: String, Date, Time)
To date, dateTime, or time: Uses the canonical format by default (see information below).
Time (can convert to: String, Date, DateTime)
To date, dateTime, or time: Uses the canonical format by default (see information below).
Geography (can convert to: String)
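A few of the conversions above, shown as a hedged BigQuery-style sketch (purely illustrative; in a transform you configure the equivalent retype block):

    SELECT
      CAST('0x123' AS INT64) AS hex_to_integer,     -- 291
      CAST('true'  AS BOOL)  AS string_to_boolean,  -- true (case-insensitive)
      CAST(1.5     AS INT64) AS float_to_integer,   -- 2 (halfway cases round away from zero)
      CAST(TRUE    AS INT64) AS boolean_to_integer; -- 1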
When retyping between a String type variable and a Date, DateTime, or Time type variable it is presumed that the data will be in the format below.
Date: (Four digit year)-(1 or 2 digit month)-(1 or 2 digit date). Examples: 2023-01-01, 2023-1-1
Time: (1 or 2 digit hour):(1 or 2 digit minute):(1 or 2 digit second).(Up to 6 fractional seconds). Examples: 01:01:01.123456, 6:2:9, 22:19:3
DateTime: (Date specification)(space, T, or t)(Time specification). Examples: 2023-01-01 01:01:01.123456, 2023-1-1T6:2:9
If it is not in this canonical format you can click the Specify custom format field and use Format elements (below) to indicate otherwise.
Since Date, DateTime, and Time variable types contain structured information you can use format strings to indicate how you want different pieces of date and time information translated to and from string format when retyping.
For example when converting a date to a string you can choose whether it will become JAN 1 2023 or 2023-01-01. When converting from a string to a DateTime you'll need to outline how the information is structured in your data so it can be read in correctly.
You can use these elements in the Specify custom format field.
YYYY
Four (or more) digit year
Input: 2023-01-01 Output: 2023
Input: 23-01-01 Output: 0023
Input: 20000-01-01 Output: 20000
YYY
Last three digit year
Input: 2023-01-01 Output: 023
Input: 23-01-01 Output: 023
YY
Two digit year
Input: 2023-01-01 Output: 23
Input: 2-01-30 Output: 02
Y
Last one digit of year
Input: 2023-01-01 Output: 3
MM
Two digit month
Input: 2023-01-01 Output: 01
MON
Three character month: JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, DEC
Input: 2023-01-01 Output: JAN
MONTH
Month name
Input: 2023-01-01 Output: JANUARY
DDD
Three digit day of the year
Input: 2023-01-01 Output: 001
DD
Two digit day of the month
Input: 2023-01-01 Output: 01
D
Day of the week (1-7) with Sunday being 1
Input: 2023-01-01 Output: 1
DAY
Day of the week. Spaces are padded on the right side to make the output size exactly 9.
Input: 2023-01-01 Output: SUNDAY
DY
Three character day: MON, TUE, WED, THU, FRI, SAT, SUN
Input: 2023-01-01 Output: SUN
HH
Two digit hour of the day (valid values from 00 to 12)
Input: 20:10:15 Output: 10
HH12
Hour of the day (valid values from 00 to 12)
Input: 20:10:15 Output: 10
HH24
Two digit hour (valid values from 00 to 24)
Input: 20:10:15 Output: 20
MI
Two digit minute
Input: 20:10:15 Output: 10
SS
Two digit second
Input: 20:10:15 Output: 15
SSSSS
Five digit second
Input: 20:10:15 Output: 15234
FFn (replace n with a value from 1 to 9; for example, FF5)
Fractional part of the second, n digits long. The fractional part of the second is rounded to fit the size of the output.
FF1
Input: 20:10:15 Output: 1
FF2
Input: 20:10:15
Output: 15
FF3
Input: 20:10:15
Output: 015
A.M. or AM / P.M. or PM
A.M. (or AM) if the time is less than 12, otherwise P.M. (or PM). The letter case of the output is determined by the first letter case of the format element.
AM
Input: 09:10:15 Output: AM
A.M.
Input: 20:10:15
Output: P.M.
PM
Input: 09:10:15
Output: AM
PM
Input: 20:10:15
Output: PM
TZH
Hour offset for a time zone. This includes the +/- sign and 2-digit hour.
Input: 2008-12-25 05:30:00+00 Output: −08
TZM
Minute offset for a time zone. This includes only the 2-digit minute.
Input: 2008-12-25 05:30:00+00 Output: 00
A space
Input: Output:
-./,'l;:
Same character in the output
Input: -./,'l;: Output: -./,'l;:
"Text"
Output is the value within the double quotes. To preserve a double quote or backslash character, use the \" or \\ escape sequence. Other escape sequences are not supported.
Input: "abc" Output: abc
Input: "a\"b\\c" Output: a"b\c
Returns the current date –> learn more
date
@time_zone
any of: ACDT, ACST, ACT, ACT, ACWST, ADT, AEDT, AEST, AFT, AKDT, AKST, AMST, AMT, AMT, ART, AST, AST, AWST, AZOST, AZOT, AZT, BDT, BIOT, BIT, BOT, BRST, BRT, BST, BST, BST, BTT, CAT, CCT, CDT, CDT, CEST, CET, CHADT, CHAST, CHOT, CHOST, CHST, CHUT, CIST, CIT, CKT, CLST, CLT, COST, COT, CST, CST, CST, CT, CVT, CWST, CXT, DAVT, DDUT, DFT, EASST, EAST, EAT, ECT, ECT, EDT, EEST, EET, EGST, EGT, EIT, EST, FET, FJT, FKST, FKT, FNT, GALT, GAMT, GET, GFT, GILT, GIT, GMT, GST, GST, GYT, HDT, HAEC, HST, HKT, HMT, HOVST, HOVT, ICT, IDLW, IDT, IOT, IRDT, IRKT, IRST, IST, IST, IST, JST, KALT, KGT, KOST, KRAT, KST, LHST, LHST, LINT, MAGT, MART, MAWT, MDT, MET, MEST, MHT, MIST, MIT, MMT, MSK, MST, MST, MUT, MVT, MYT, NCT, NDT, NFT, NPT, NST, NT, NUT, NZDT, NZST, OMST, ORAT, PDT, PET, PETT, PGT, PHOT, PHT, PKT, PMDT, PMST, PONT, PST, PST, PYST, PYT, RET, ROTT, SAKT, SAMT, SAST, SBT, SCT, SDT, SGT, SLST, SRET, SRT, SST, SST, SYOT, TAHT, THA, TFT, TJT, TKT, TLT, TMT, TRT, TOT, TVT, ULAST, ULAT, UTC, UYST, UYT, UZT, VET, VLAT, VOLT, VOST, VUT, WAKT, WAST, WAT, WEST, WET, WIT, WST, YAKT, YEKT
false
(UTC (default))
Constructs a date from a year, month, and day –> learn more
date
@year
true
-
@month
true
-
@day
true
-
Add a period of time to a date –> learn more
date
@integer_expression
true
-
@date_part
any of: year, month, quarter, week, day
true
((e.g., days))
@date_expression
true
-
Calculate the distance between two dates –> learn more
integer
@date_expression_2
true
-
@date_expression
true
-
@date_part
any of: day, week, month, quarter, year
true
((e.g., days))
Subtract a period of time from a date –> learn more
date
@integer_expression
true
-
@date_part
any of: day, week, month, quarter, year
true
((e.g., days))
@date_expression
true
-
Truncates a date to the nearest boundary –> learn more
date
@date_expression
true
-
@date_truncate_part
any of: day, month, quarter, year, week(sunday), week(monday), week(tuesday), week(wednesday), week(thursday), week(friday), week(saturday), ISOquarter, ISOyear
true
((e.g., month))
Extracts the date part (e.g., month) from a date –> learn more
integer
@date_part
any of: DAYOFWEEK, DAY, DAYOFYEAR, WEEK, MONTH, QUARTER, YEAR, ISOWEEK, ISOYEAR
true
((e.g., month))
@date_expression
true
-
Returns a formatted string from a date –> learn more
string
@date_expression
true
-
@format_string
true
(e.g., %Y-%m-%d)
More details about format strings here
Parses a date from a string –> learn more
date
@date_string
true
-
@format_string
true
(e.g., %Y-%m-%d)
@safe
true
-
More details about format strings here
Extracts a JSON value and converts it to a SQL JSON-formatted STRING. –> learn more
string
@expression
true
-
@expression_2
true
-
Extracts a JSON scalar value and converts it to a SQL JSON-formatted STRING, removing outermost quotes and un-escaping return values –> learn more
string
@expression
true
-
@expression_2
false
($)
dynamic (input-dependent)
dynamic (input-dependent)
dynamic (input-dependent)
dynamic (input-dependent)
dynamic (input-dependent)
float
float
dateTime
dateTime
dateTime
integer
dateTime
dateTime
integer
string
dateTime
dynamic (input-dependent)
dynamic (input-dependent)
float
integer
float
float
dynamic (input-dependent)
boolean
boolean
dynamic (input-dependent)
float
integer
float
float
float
float
integer
float
float
@condition
(nested) conditions – like those used in a
any
true
-
@result
or
any
true
(Variable or value)
@else_result
or
any
true
(Variable or value)
The type that the Source variable will be converted to. Can be any of the variable types on Redivis.
Informs how the data will be read by the method. Only relevant for conversions of strings to date, time, and dateTime.
Returns the value of a variable for the first row in a given analytic window. –>
Returns the value of a variable on a preceding row within the analytic window –>
Returns the value of a variable for the last row in a given analytic window. –>
Returns the value of a variable on a subsequent row within the analytic window –>
Returns the value at the Nth row of a given window frame. –>
Computes the specified percentile value for a variable within an ordered partition, with linear interpolation. –>
Computes the specified percentile value for a variable within an ordered partition. Returns the first sorted value with cumulative distribution greater than or equal to the percentile. –>
Returns the current dateTime –>
Constructs a dateTime from a year, month, day, hour, minute, and second –>
Add a period of time to a dateTime –>
Calculate distance between two dateTimes –>
Subtract a period of time from a dateTime –>
Truncates a DateTime to the nearest boundary –>
Extracts the date or time part (e.g., hour) from a DateTime –>
Returns a formatted string from a dateTime –>
More details about format strings
Parses a dateTime from a string –>
More details about format strings
Returns the absolute value of a variable –>
Compute simple arithmetic (+, -, *, /) –>
Returns the smallest integral value that is not less than the provided value –>
Divide two integer values, rounding down any remainder –>
Compute the natural exponential of a value –>
Returns the largest integral value that is not greater than the provided value –>
Find the largest of several values –>
Return true if the value is positive or negative infinity, false otherwise. Returns NULL for NULL inputs –>
Determines whether input value is not a number (NaN) –>
Find the smallest of several values –>
Compute the logarithm of a value to a provided base; generates an error if the variable is <= 0. If no base is provided, defaults to natural log –>
Modulo: compute the remainder of the division of two integers –>
Raises a value to a power –>
Generate a pseudo-random float between [0, 1) –>
Rounds a value to the nearest integer (or, if specified, to the provided number of decimal places) –>
Equivalent to the divisor operator, but returns null if an error occurs, such as a division by zero error –>
Returns the sign (-1, 0, +1) of a numeric variable –>
Compute the square root of a value; generates an error if the variable is less than 0 –>
Similar to round, but rounds to the nearest integer whose absolute value is not greater than the absolute value of the provided variable (always rounds towards zero) –>
@variable
any Redivis type
true
-
@ignore_null
any of: IGNORE
, RESPECT
false
(Ignore nulls (default))
@variable
any Redivis type
true
-
@literal
any integer
false
(1)
@default_value
any Redivis type
false
(NULL)
@variable
any Redivis type
true
-
@ignore_null
any of: IGNORE
, RESPECT
false
(Ignore nulls (default))
@variable
any Redivis type
true
-
@literal
any integer
false
(1)
@default_value
any Redivis type
false
(NULL)
@variable
any Redivis type
true
-
@literal
any integer
true
-
@ignore_null
any of: IGNORE
, RESPECT
false
(Ignore nulls (default))
@variable
true
(variable)
@literal
any float
true
(Value between [0, 1])
@ignore_null
any of: IGNORE
, RESPECT
false
(Ignore nulls (default))
@variable
true
(variable)
@literal
any float
true
(Value between [0, 1])
@ignore_null
any of: IGNORE
, RESPECT
false
(Ignore nulls (default))
@time_zone
any of: ACDT, ACST, ACT, ACT, ACWST, ADT, AEDT, AEST, AFT, AKDT, AKST, AMST, AMT, AMT, ART, AST, AST, AWST, AZOST, AZOT, AZT, BDT, BIOT, BIT, BOT, BRST, BRT, BST, BST, BST, BTT, CAT, CCT, CDT, CDT, CEST, CET, CHADT, CHAST, CHOT, CHOST, CHST, CHUT, CIST, CIT, CKT, CLST, CLT, COST, COT, CST, CST, CST, CT, CVT, CWST, CXT, DAVT, DDUT, DFT, EASST, EAST, EAT, ECT, ECT, EDT, EEST, EET, EGST, EGT, EIT, EST, FET, FJT, FKST, FKT, FNT, GALT, GAMT, GET, GFT, GILT, GIT, GMT, GST, GST, GYT, HDT, HAEC, HST, HKT, HMT, HOVST, HOVT, ICT, IDLW, IDT, IOT, IRDT, IRKT, IRST, IST, IST, IST, JST, KALT, KGT, KOST, KRAT, KST, LHST, LHST, LINT, MAGT, MART, MAWT, MDT, MET, MEST, MHT, MIST, MIT, MMT, MSK, MST, MST, MUT, MVT, MYT, NCT, NDT, NFT, NPT, NST, NT, NUT, NZDT, NZST, OMST, ORAT, PDT, PET, PETT, PGT, PHOT, PHT, PKT, PMDT, PMST, PONT, PST, PST, PYST, PYT, RET, ROTT, SAKT, SAMT, SAST, SBT, SCT, SDT, SGT, SLST, SRET, SRT, SST, SST, SYOT, TAHT, THA, TFT, TJT, TKT, TLT, TMT, TRT, TOT, TVT, ULAST, ULAT, UTC, UYST, UYT, UZT, VET, VLAT, VOLT, VOST, VUT, WAKT, WAST, WAT, WEST, WET, WIT, WST, YAKT, YEKT
false
(UTC (default))
@year
any integer
true
-
@month
any integer
true
-
@day
any integer
true
-
@hour
any integer
true
-
@minute
any integer
true
-
@second
any integer
true
-
@integer_expression
any integer
true
-
@dateTime_part
any of: year, quarter, month, week, day, hour, minute, second, millisecond, microsecond
true
((e.g., days))
@dateTime_expression
any dateTime
true
-
@dateTime_expression_2
any dateTime
true
-
@dateTime_expression
any dateTime
true
-
@dateTime_part
any of: year, quarter, month, week, day, hour, minute, second, millisecond, microsecond
true
((e.g., days))
@integer_expression
any integer
true
-
@dateTime_part
any of: year, quarter, month, week, day, hour, minute, second, millisecond, microsecond
true
((e.g., days))
@dateTime_expression
any dateTime
true
-
@dateTime_expression
any dateTime
true
-
@dateTime_truncate_part
any of: hour, minute, second, millisecond, microsecond, day, month, quarter, year, week(sunday), week(monday), week(tuesday), week(wednesday), week(thursday), week(friday), week(saturday), ISOquarter, ISOyear
true
((e.g., minutes))
@dateTime_part
any of: YEAR, QUARTER, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND, DAYOFYEAR, DAYOFWEEK, ISOYEAR, ISOWEEK
true
((e.g., hour))
@date_expression
any dateTime
true
-
@dateTime_expression
any dateTime
true
-
@format_string
any string
true
(e.g., %Y-%m-%d %H:%M:%S)
@dateTime_string
any string
true
-
@format_string
any string
true
(e.g., %Y-%m-%d %H:%M:%S)
@safe
any boolean
true
-
@variable
true
-
@expression
true
-
@operator
any of: +, -, *, /
true
(E.g., +)
@expression_2
true
-
@variable
any float
true
-
@expression
any integer
true
-
@expression_2
any integer
true
-
@expression
true
-
@variable
any float
true
-
@expression
true
-
@variable
true
-
@variable
true
-
@expression
true
-
@expression
true
-
@literal
false
-
@expression
any integer
true
-
@expression_2
any integer
true
-
@expression
true
-
@expression_2
true
-
@expression
true
-
@literal
any integer
false
-
@expression
true
-
@expression_2
true
-
@variable
true
-
@expression
true
-
@expression
true
-
@literal
any integer
false
-
Takes three point GEOGRAPHY values, which represent two intersecting lines. Returns the angle between these lines. Point 2 and point 1 represent the first line and point 2 and point 3 represent the second line. The angle between these lines is in radians, in the range [0, 2pi). The angle is measured clockwise from the first line to the second line. –> learn more
float
@geography
true
-
@geography_2
true
-
@geography_3
true
-
Returns the area in square meters covered by the polygons in the input GEOGRAPHY –> learn more
float
@geography
true
-
Returns the RFC 7946 compliant GeoJSON representation of the input GEOGRAPHY –> learn more
string
@geography
true
-
Returns the WKT representation of an input GEOGRAPHY –> learn more
string
@geography
true
-
Takes two point GEOGRAPHY values, and returns the azimuth of the line segment formed by points 1 and 2. The azimuth is the angle in radians measured between the line from point 1 facing true North to the line segment from point 1 to point 2. –> learn more
float
@geography
true
-
@geography_2
true
-
Returns a single GEOGRAPHY that contains the union of the boundaries of each component in the given input GEOGRAPHY. –> learn more
geography
@geography
true
-
Returns a GEOGRAPHY that represents the buffer around the input GEOGRAPHY. This function is similar to ST_BUFFERWITHTOLERANCE, but you specify the number of segments instead of providing tolerance to determine how much the resulting geography can deviate from the ideal buffer radius. –> learn more
geography
@geography
true
-
@buffer_radius
true
-
@num_seg_quarter_circle
true
-
@endcap
any of: ROUND
, FLAT
false
(Round (default))
@side
any of: BOTH
, LEFT
, RIGHT
false
(Both (default))
Returns a GEOGRAPHY that represents the buffer around the input GEOGRAPHY. This function is similar to ST_BUFFER, but you provide tolerance instead of segments to determine how much the resulting geography can deviate from the ideal buffer radius. –> learn more
geography
@geography
true
-
@buffer_radius
true
-
@tolerance_meters
true
-
@endcap
any of: ROUND
, FLAT
false
(Round (default))
@side
any of: BOTH
, LEFT
, RIGHT
false
(Both (default))
Returns the centroid of the input GEOGRAPHY as a single point GEOGRAPHY. –> learn more
geography
@geography
true
-
Returns a GEOGRAPHY containing a point on Geography 1 with the smallest possible distance to Geography 2. This implies that the distance between the point returned by ST_CLOSESTPOINT and Geography 2 is less than or equal to the distance between any other point on Geography 1 and Geography 2. –> learn more
geography
@geography
true
-
@geography_2
true
-
Returns TRUE if no point of Geography 2 is outside Geography 1, and the interiors intersect; returns FALSE otherwise. –> learn more
boolean
@geography
true
-
@geography_2
true
-
Returns the convex hull for the input GEOGRAPHY. The convex hull is the smallest convex GEOGRAPHY that covers the input. A GEOGRAPHY is convex if for every pair of points in the GEOGRAPHY, the geodesic edge connecting the points are also contained in the same GEOGRAPHY. –> learn more
geography
@geography
true
-
Returns FALSE if Geography 1 or Geography 2 is empty. Returns TRUE if no points of Geography 1 lie in the exterior of Geography 2. –> learn more
boolean
@geography
true
-
@geography_2
true
-
Returns FALSE if Geography 1 or Geography 2 is empty. Returns TRUE if no points of Geography 2 lie in the exterior of Geography 1. –> learn more
boolean
@geography
true
-
@geography_2
true
-
Returns a GEOGRAPHY that represents the point set difference of Geography 1 and Geography 2. Therefore, the result consists of the part of Geography 1 that does not intersect with Geography 2. –> learn more
geography
@geography
true
-
@geography_2
true
-
Returns the dimension of the highest-dimensional element in the input GEOGRAPHY. –> learn more
integer
@geography
true
-
Returns TRUE if the intersection of Geography 1 and Geography 2 is empty, that is, no point in Geography 1 also appears in Geography 2. –> learn more
boolean
@geography
true
-
@geography_2
true
-
Returns the shortest distance in meters between two non-empty GEOGRAPHYs. –> learn more
float
@geography
true
-
@geography_2
true
-
Returns TRUE if the distance between at least one point in Geography 1 and one point in Geography 2 is less than or equal to the Distance argument, otherwise, returns FALSE –> learn more
boolean
@geography
true
-
@geography_2
true
-
@distance
true
-
Returns the last point of a linestring geography as a point geography. Returns an error if the input is not a linestring or if the input is empty. Use the SAFE prefix to obtain NULL for invalid input instead of an error. –> learn more
geography
@geography
true
-
@safe
true
-
Returns TRUE if Geography 1 and Geography 2 represent the same GEOGRAPHY value. More precisely, this means that one of the following conditions holds: + ST_COVERS(geography, geography_2) = TRUE and ST_COVERS(geography_2, geography) = TRUE + Both Geography 1 and Geography 2 are empty. –> learn more
boolean
@geography
true
-
@geography_2
true
-
Returns a linestring geography that corresponds to the outermost ring of a polygon geography. If the input geography is a polygon, gets the outermost ring of the polygon geography and returns the corresponding linestring. If the input is the full GEOGRAPHY, returns an empty geography. Returns an error if the input is not a single polygon. Use the SAFE prefix to obtain NULL for invalid input instead of an error. –> learn more
geography
@geography
true
-
@safe
true
-
Converts an expression for a STRING or BYTES value into a GEOGRAPHY value. If expression represents a STRING value, it must be a valid GEOGRAPHY representation in one of the following formats: WKT, WKB, GeoJSON –> learn more
geography
@expression
true
-
Returns a GEOGRAPHY value that corresponds to the input GeoJSON representation. –> learn more
geography
@geojson
true
-
@make_valid
true
-
Returns a GEOGRAPHY value that corresponds to the input WKT representation. –> learn more
geography
@wkt
true
-
@oriented
true
-
@planar
true
-
@make_valid
true
-
Converts an expression for a hexadecimal-text STRING or BYTES value into a GEOGRAPHY value. The expression must be in WKB format –> learn more
geography
@wkb
true
-
Creates a GEOGRAPHY with a single point. ST_GEOGPOINT creates a point from the specified FLOAT64 longitude and latitude parameters and returns that point in a GEOGRAPHY value. –> learn more
geography
@longitude
true
-
@latitude
true
-
Returns a GEOGRAPHY value that corresponds to a point in the middle of a bounding box defined in the GeoHash –> learn more
geography
@geohash
true
-
Returns a GeoHash representation of geography_expression. The resulting GeoHash will contain at most Max chars characters. Fewer characters corresponds to lower precision (or, described differently, to a bigger bounding box) –> learn more
string
@geography
true
-
@maxchars
true
-
Returns the Open Geospatial Consortium (OGC) geometry type that describes the input GEOGRAPHY as a STRING. The OGC geometry type matches the types that are used in WKT and GeoJSON formats and printed for ST_ASTEXT and ST_ASGEOJSON. ST_GEOMETRYTYPE returns the OGC geometry type with the "ST_" prefix. –> learn more
string
@geography
true
-
Returns a GEOGRAPHY that represents the point set intersection of the two input GEOGRAPHYs. Thus, every point in the intersection appears in both Geography 1 and Geography 2 –> learn more
geography
@geography
true
-
@geography_2
true
-
Returns TRUE if the point set intersection of Geography 1 and Geography 2 is non-empty. Thus, this function returns TRUE if there is at least one point that appears in both input GEOGRAPHYs. If ST_INTERSECTS returns TRUE, it implies that ST_DISJOINT returns FALSE. –> learn more
boolean
@geography
true
-
@geography_2
true
-
Returns TRUE if geography intersects the rectangle between [lng1, lng2] and [lat1, lat2]. The edges of the rectangle follow constant lines of longitude and latitude. lng1 and lng2 specify the westmost and eastmost constant longitude lines that bound the rectangle, and lat1 and lat2 specify the minimum and maximum constant latitude lines that bound the rectangle. –> learn more
boolean
@geography
true
-
@lng1
true
-
@lat1
true
-
@lng2
true
-
@lat2
true
-
Returns TRUE if the total number of points, linestrings, and polygons is greater than one. An empty GEOGRAPHY is not a collection. –> learn more
boolean
@geography
true
-
Returns TRUE if the given GEOGRAPHY is empty; that is, the GEOGRAPHY does not contain any points, lines, or polygons. –> learn more
boolean
@geography
true
-
Returns the total length in meters of the lines in the input GEOGRAPHY. –> learn more
float
@geography
true
-
Creates a GEOGRAPHY with a single linestring by concatenating the point or line vertices of each of the input GEOGRAPHYs in the order they are given. –> learn more
geography
@geography
true
-
@geography_2
true
-
Creates a GEOGRAPHY containing a single polygon from a linestring input, where the input linestring is used to construct a polygon ring. –> learn more
geography
@geography
true
-
Returns the longest distance in meters between two non-empty GEOGRAPHYs; that is, the distance between two vertices where the first vertex is in the first GEOGRAPHY, and the second vertex is in the second GEOGRAPHY. If Geography 1 and Geography 2 are the same GEOGRAPHY, the function returns the distance between the two most distant vertices in that GEOGRAPHY. –> learn more
float
@geography
true
-
@geography_2
true
-
Returns the number of geometries in the input GEOGRAPHY. For a single point, linestring, or polygon, ST_NUMGEOMETRIES returns 1. For any collection of geometries, ST_NUMGEOMETRIES returns the number of geometries making up the collection. ST_NUMGEOMETRIES returns 0 if the input is the empty GEOGRAPHY. –> learn more
integer
@geography
true
-
Returns the number of vertices in the input GEOGRAPHY. This includes the number of points, the number of linestring vertices, and the number of polygon vertices. –> learn more
integer
@geography
true
-
Returns the length in meters of the boundary of the polygons in the input GEOGRAPHY. –> learn more
float
@geography
true
-
Returns the Nth point of a linestring geography as a point geography, where N is the index. The index is 1-based. Negative values are counted backwards from the end of the linestring, so that -1 is the last point. Returns an error if the input is not a linestring, if the input is empty, or if there is no vertex at the given index. Use the SAFE prefix to obtain NULL for invalid input instead of an error. –> learn more
geography
@index
true
-
@geography
true
-
@safe
true
-
Returns a simplified version of geography, the given input GEOGRAPHY. The input GEOGRAPHY is simplified by replacing nearly straight chains of short edges with a single long edge. The input geography will not change by more than the tolerance specified by tolerance_meters. Thus, simplified edges are guaranteed to pass within tolerance_meters of the original positions of all vertices that were removed from that edge. The given tolerance_meters is in meters on the surface of the Earth. –> learn more
geography
@geography
true
-
@tolerance_meters
true
-
Returns the input GEOGRAPHY, where each vertex has been snapped to a longitude/latitude grid. The grid size is determined by the grid_size parameter which is given in degrees. –> learn more
geography
@geography
true
-
@grid_size
true
-
Returns the first point of a linestring geography as a point geography. Returns an error if the input is not a linestring or if the input is empty. Use the SAFE prefix to obtain NULL for invalid input instead of an error. –> learn more
geography
@geography
true
-
@safe
true
-
Returns TRUE provided the following two conditions are satisfied: (1) Geography 1 intersects Geography 2 and (2) the interior of Geography 1 and the interior of Geography 2 are disjoint. –> learn more
boolean
@geography
true
-
@geography_2
true
-
Returns a GEOGRAPHY that represents the point set union of all input GEOGRAPHYs. –> learn more
geography
@geography
true
-
@geography_2
true
-
Returns TRUE if no point of Geography 1 is outside of Geography 2 and the interiors of Geography 1 and Geography 2 intersect. Given two geographies a and b, ST_WITHIN(a, b) returns the same result as ST_CONTAINS(b, a). Note the opposite order of arguments. –> learn more
boolean
@geography
true
-
@geography_2
true
-
Returns a float representing the west-most constant longitude line that bounds the geometry –> learn more
float
@geography
true
-
Returns a float representing the east-most constant longitude line that bounds the geometry –> learn more
float
@geography
true
-
Returns the longitude in degrees of the single-point input GEOGRAPHY –> learn more
float
@geography
true
-
@safe
true
-
Returns the latitude in degrees of the single-point input GEOGRAPHY –> learn more
float
@geography
true
-
@safe
true
-
Returns a float representing the minimum constant latitude line that bounds the geometry. –> learn more
float
@geography
true
-
Returns a float representing the maximum constant latitude line that bounds the geometry. –> learn more
float
@geography
true
-
Computes the centroid of the set of input GEOGRAPHYs as a single point GEOGRAPHY. –> learn more
geography
@geography
true
-
Returns a geography variable that represents the point set union of all input geographies. –> learn more
geography
@geography
true
-
Identifies high-density geography clusters and marks outliers in low-density areas of noise –> learn more
integer
@geography
true
-
@epsilon
true
-
@minimum_geographies
true
-