Create and populate a dataset

Overview

Datasets are a core component of Redivis. Consisting of documentation, metadata, and tables, datasets allow you to store, version, and distribute a wide variety of data.

Anyone with a Redivis account can create a dataset in their workspace, and organization administrators can upload one to an organization via the administrator panel.

1. Create the dataset

You can create a new dataset by navigating to the Datasets tab of your workspace or administrator panel and clicking the New dataset button.

All datasets must have a name, which must be unique among the datasets owned by that user or organization.

You can set up your dataset in whatever order you'd like, but we recommend following the order below when getting started.

2. Upload data

Your data might be in a tabular format (.csv, .tsv, .sas, etc.) or, more rarely, in an unstructured format such as images and text files.

Tabular data

All tabular data is associated with a table and each dataset can have one or more tables. While you may release a dataset without any tables, this will be of limited use to other researchers, as Redivis provides numerous tools for understanding, querying, and generally working with tabular data.

If you haven't already worked with data in a workflow, we strongly recommend exploring that before creating a dataset so you can understand how researchers will work with your data.

Uploading tabular data is covered in a separate guide.

Learn more in the Upload tabular data as tables guide.

Unstructured data

For unstructured data, go to the Files tab, where you can upload files from your computer or, via an integration, from another location where your data is stored. You can organize these files into folders and create index tables to keep better track of them.

Note that any files uploaded here can't be transformed in the workflow tool or queried across Redivis (both of which require the table format).

Make sure any files you upload here contain this dataset's data. Any files with information about the data (such as data dictionaries or usage guides) should be uploaded as documentation on the Overview tab.

Learn more in the Upload unstructured data as files guide.

3. Edit metadata

It's easy to feel "done" after uploading your data, but documentation and metadata are essential to the usability of your dataset. Moreover, rich metadata will improve the discoverability of your dataset by providing more information and terms to the Redivis search engine.

Metadata can always be updated after your dataset has been released. While good metadata is essential, writing it can be a time-consuming and iterative process, so you might prefer to provide some basic content initially and then improve it over time.

Dataset metadata

On the Overview tab of the dataset, you can provide an abstract, detailed documentation blocks, supporting files and links, and subject tags for the dataset.

The abstract should be a brief overview of the dataset, while the rest of the documentation can be as thorough as you'd like. Each documentation block has a header for suggested content, and any blocks you don't fill out won't be shown on the dataset page. These blocks use a rich text editor that supports embedded images. Most of this information will be visible to anyone with overview access, though you can also create custom documentation sections that require a higher level of access.

Make sure to audit your data's provenance information to give attribution to whoever has worked on the data. If this dataset is part of an organization, you can configure a DataCite account to issue a DOI for each dataset. Note that if your organization is configured to issue DOIs, then one will automatically be issued for this dataset when you first publish it.

Table metadata

To help users understand what each table represents, you should update the description, entity, and temporal range for each table in the dataset. The entity should define what each row in a table represents: is it a person? an event? a charge? The temporal range can be tied to a specific variable (using the min/max of that variable), or defined explicitly.
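To make the "min/max of a variable" option concrete, here is a minimal Python sketch of how a temporal range could be derived from a date variable. The rows, the variable name `encounter_date`, and the helper `temporal_range` are all hypothetical and purely illustrative; Redivis computes this for you when you tie the range to a variable.

```python
from datetime import date

# Hypothetical rows from a table where each row (the entity) is an encounter.
rows = [
    {"patient_id": 1, "encounter_date": date(2019, 3, 14)},
    {"patient_id": 2, "encounter_date": date(2021, 11, 2)},
    {"patient_id": 1, "encounter_date": None},  # missing values are skipped
    {"patient_id": 3, "encounter_date": date(2018, 7, 30)},
]


def temporal_range(rows, variable):
    """Return (min, max) of a date variable, ignoring missing values."""
    values = [row[variable] for row in rows if row[variable] is not None]
    if not values:
        return None
    return min(values), max(values)


start, end = temporal_range(rows, "encounter_date")
print(f"{start.isoformat()} to {end.isoformat()}")  # 2018-07-30 to 2021-11-02
```

If no single variable captures the period the table covers, defining the range explicitly is the better choice.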

Variable metadata

The tables in your dataset are made up of named variables, though rarely is this name enough to understand what the variable measures. On any table, click "Edit variable metadata" to populate the variable metadata.

On each variable, Redivis supports a label, description, and value labels. The label is the most essential item; think of it as a more human-readable variable name. The description should contain more detailed information, everything from caveats and notes to collection methodology. Value labels are only applicable when the variable is encoded with keys (often integers or short strings) that map to the actual value. For example, a survey response might be encoded as 0: "No", 1: "Yes", 2: "Don't know", 3: "Declined to answer".
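As a rough illustration of what value labels do for a researcher, the Python sketch below decodes encoded survey responses using the mapping from the example above. The `responses` list and the `decode` helper are hypothetical; this is not how Redivis stores or applies value labels internally.

```python
from collections import Counter

# Value labels from the survey example: encoded key -> human-readable value.
value_labels = {0: "No", 1: "Yes", 2: "Don't know", 3: "Declined to answer"}

# Hypothetical encoded responses as they might appear in the raw variable.
responses = [1, 0, 1, 3, 2, 1, None]


def decode(value, labels):
    """Return the label for an encoded value, or the raw value if unlabeled."""
    if value is None:
        return None
    return labels.get(value, value)


decoded = [decode(v, value_labels) for v in responses]

# Tabulating the decoded values is far easier to read than counting 0s and 1s.
counts = Counter(d for d in decoded if d is not None)
print(counts["Yes"])  # 3
```

Without the labels, a researcher would have to consult a separate codebook to interpret each key.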

Editing variable metadata can be a tedious process, but Redivis can import metadata from a file, and will also automatically extract metadata if it's present in the uploaded data files (e.g., Stata or SAS upload types).

Learn more in the Documentation reference section.

4. Create a sample

If your dataset is particularly large, or if you want to control access to a sample of the data separately from the whole dataset, you should configure sampling on your dataset. This will allow researchers to work with a 1% sample of the data during initial exploration, and allow you to grant access to the sample independently of the full dataset.

To update the dataset's sample configuration, click on any table, and then click "Configure sample". When configuring the sample, you can generate a random sample for each table, or sample on a particular variable that is common across tables. If researchers will be joining tables across your dataset, it is highly recommended that you sample on that common join variable so that researchers can obtain a consistent 1% sample as they work with your data.
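The Python sketch below illustrates why sampling on a shared variable keeps joins consistent. It assumes a simple hash-based rule on a hypothetical `patient_id` variable; this is not Redivis's actual sampling algorithm, only a demonstration of the principle that the same key is either in or out of the sample in every table.

```python
import hashlib


def in_sample(key, percent=1):
    """Deterministically decide whether a join-key value falls in the sample."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


# Two hypothetical tables that share the join variable "patient_id".
# Every patient has exactly three visits.
patients = [{"patient_id": i} for i in range(10_000)]
visits = [{"patient_id": n % 10_000, "visit_id": n} for n in range(30_000)]

# Sampling both tables on the shared variable selects the same patients in each.
patient_sample = [row for row in patients if in_sample(row["patient_id"])]
visit_sample = [row for row in visits if in_sample(row["patient_id"])]

# Every sampled visit joins to a sampled patient, so no rows are lost in a join.
sampled_ids = {row["patient_id"] for row in patient_sample}
orphans = [row for row in visit_sample if row["patient_id"] not in sampled_ids]
print(len(orphans))  # 0
```

Had each table been sampled independently at random, only a small fraction of the sampled visits would be expected to match a sampled patient, and joins on the sample would return almost nothing.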

Learn more in the Dataset sampling reference section.

5. Configure access

Before releasing your dataset, it is important to define who can access the dataset and what the procedures are for applying and gaining access. Click the Configure access button on the top of the page to set up the access configuration.

Datasets owned by organizations have more options for access than datasets owned by users.

Access levels

Dataset access has five levels:

Overview: the ability to see a dataset and its documentation.
Metadata: the ability to view variable names and summary statistics.
Sample: the ability to view and query a dataset's 1% sample. This will only exist for datasets that have a sample configured.
Data: the ability to view and query a dataset's tables, and work with them in workflows.
Edit: the ability to edit the dataset and release new versions.

Access levels are cumulative. For example, in order to gain data access you will need to have gained metadata access as well.
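One way to picture cumulative access is as an ordered scale, where holding a level implies holding every level below it. The following Python sketch models this with an enumeration; the `AccessLevel` class and the `can` helper are illustrative names, not part of any Redivis API.

```python
from enum import IntEnum


class AccessLevel(IntEnum):
    """The five access levels, in increasing order, plus NONE for no access."""
    NONE = 0
    OVERVIEW = 1
    METADATA = 2
    SAMPLE = 3  # only meaningful when the dataset has a sample configured
    DATA = 4
    EDIT = 5


def can(user_level, required_level):
    """A user meets a requirement if their level is at or above it."""
    return user_level >= required_level


print(can(AccessLevel.DATA, AccessLevel.METADATA))    # True
print(can(AccessLevel.METADATA, AccessLevel.SAMPLE))  # False
```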

Usage rules

Even with data access, you may want to limit what other users can do with your dataset. Currently, you can configure export restrictions that limit:

The download location (e.g., to prevent researchers from downloading to their personal computer)
The download size, in bytes and/or rows
Whether admin approval is required before any export
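To show how these three kinds of restriction combine, here is a minimal Python sketch that checks a requested export against a set of rules. The `ExportRestrictions` class, the `check_export` function, and the location names are hypothetical; in practice you configure these rules in the Redivis interface, and Redivis enforces them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExportRestrictions:
    allowed_locations: Optional[set] = None  # None means any location is allowed
    max_bytes: Optional[int] = None          # None means no byte limit
    max_rows: Optional[int] = None           # None means no row limit
    require_admin_approval: bool = False


def check_export(rules, location, size_bytes, row_count, approved=False):
    """Return the reasons an export is blocked (an empty list if it is allowed)."""
    problems = []
    if rules.allowed_locations is not None and location not in rules.allowed_locations:
        problems.append(f"location '{location}' is not permitted")
    if rules.max_bytes is not None and size_bytes > rules.max_bytes:
        problems.append("export exceeds the byte limit")
    if rules.max_rows is not None and row_count > rules.max_rows:
        problems.append("export exceeds the row limit")
    if rules.require_admin_approval and not approved:
        problems.append("admin approval is required")
    return problems


# Hypothetical configuration: exports only to an institutional server,
# at most one million rows, and always subject to admin approval.
rules = ExportRestrictions(
    allowed_locations={"institutional_server"},
    max_rows=1_000_000,
    require_admin_approval=True,
)

print(check_export(rules, "personal_computer", 10_000, 500, approved=True))
# ["location 'personal_computer' is not permitted"]
```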

Editors

You may also add dataset editors to help upload data and provide metadata content. These editors will be able to create and release new versions, and will have full access to the underlying data, though they cannot add other users, modify the access configuration, or bypass the dataset usage rules.

If the dataset is hosted by an organization, all administrators of the organization will be able to edit the dataset as well as its access configuration.

Permission groups

If the dataset is hosted by an organization, you will have additional options for configuring access to the dataset. The dataset can be assigned to a permission group to help standardize access procedures, and this permission group can contain requirements that help data managers fulfill contractual requirements and gather relevant information about the research being done on the dataset.

Learn more in the Configure access systems guide.

6. Release the dataset

Congratulations! Your dataset is ready to be released and utilized by the research community. Before you release it, though, it is highly recommended that you validate and audit your dataset. Take a look at the number of rows, variables, and uploads in each table. Validate some of the variable summary statistics against what you expect. And to be truly thorough, add the dataset to a workflow and run some queries as if you were a researcher. Catching a mistake now will prevent headaches down the line if researchers uncover unexpected discrepancies in the data.

Once a version has been released, the data can no longer be edited. While you can unrelease a version within 7 days, this should generally be avoided; you'll need to release a new version to modify the data.

When you're confident that you're ready to go, click the "Release" button on the top of the page. If the button is disabled, hover over it to understand what issues are currently preventing you from releasing.

After clicking the button, you'll be presented with a final checklist of tasks. When you click the Release version button, the dataset will be immediately released and available to all users with access.

This dataset is now also considered Published. If you need to pause all activity and access to this dataset, you can return to this page in the future and Unpublish it temporarily.

7. Make updates as new versions

Once a dataset is released, you can return to it to make changes at any time. Changes to datasets are tracked in Redivis as versions. Anyone with access to a dataset can view and work with any of its versions.

How to work with versions when updating a dataset:

Any edits to the data content in tables will need to be released as a new version.
Edits to the dataset information, table information, or variable metadata can be made on the current version (or historic versions) and will be live as soon as they're saved.
Edits to the dataset name and access configuration will always affect all versions.

Creating the next version

All data within a dataset is encapsulated in discrete, immutable versions. Every part of the dataset except for the name and access settings is versioned. All tables in a dataset are versioned together.

After releasing the first version of the dataset, you can choose to create a new version at any time by clicking the "Create next version" button in the top right. This version will be created as vNext, and you may toggle between this and historic versions at any time.

Subsequent versions always build on the previous version of the dataset, and changes made in the next version will have no effect on previous versions. Alongside modifications to the dataset's metadata, you may create, update, or delete any of the previous version's tables.

Replacing vs appending data

When uploading data to an existing table, you can choose whether you want to append these new uploads to your existing data, or replace the entire table with the new data.
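The difference between the two options can be sketched in a few lines of Python. The `apply_upload` function and the strategy names `"append"` and `"replace"` are illustrative only; you make this choice in the Redivis interface when uploading.

```python
def apply_upload(existing_rows, uploaded_rows, strategy):
    """Return the table contents for the next version.

    "append" keeps the existing rows and adds the uploaded rows;
    "replace" discards the existing rows and keeps only the uploaded rows.
    """
    if strategy == "append":
        return existing_rows + uploaded_rows
    if strategy == "replace":
        return list(uploaded_rows)
    raise ValueError(f"unknown strategy: {strategy!r}")


# Hypothetical table contents in the previous version, and a new upload.
previous = [("2023-01", 10), ("2023-02", 12)]
upload = [("2023-03", 9)]

print(apply_upload(previous, upload, "append"))
# [('2023-01', 10), ('2023-02', 12), ('2023-03', 9)]
print(apply_upload(previous, upload, "replace"))
# [('2023-03', 9)]
```

Appending suits data that arrives incrementally, such as a new month of records. Replacing suits a corrected or fully refreshed extract.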

Version storage costs

Redivis computes row-level diffs for each version, efficiently storing the complete version history in one master table. This allows you to regularly release new versions and maintain a robust version history without ballooning storage costs.
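To give a sense of how row-level diffs can keep a full history in a single table, here is a simplified Python sketch. It assumes each distinct row is stored once, tagged with the version in which it was added and the version in which it was removed. The function names and storage layout are hypothetical and are not a description of Redivis's actual implementation.

```python
def add_version(master, version, rows):
    """Record `rows` as the contents of `version`, storing only the differences.

    Each master entry holds a row, the version it was added in, and the version
    it was removed in (None while still present). Rows are assumed distinct.
    """
    incoming = set(rows)
    live = set()
    for entry in master:
        if entry["removed_in"] is None:
            if entry["row"] in incoming:
                live.add(entry["row"])          # unchanged row: nothing new stored
            else:
                entry["removed_in"] = version   # row deleted in this version
    for row in rows:
        if row not in live:
            master.append({"row": row, "added_in": version, "removed_in": None})
            live.add(row)


def rows_at(master, version):
    """Reconstruct the full contents of any version from the master table."""
    return [
        e["row"]
        for e in master
        if e["added_in"] <= version
        and (e["removed_in"] is None or e["removed_in"] > version)
    ]


v1 = [("a", 1), ("b", 2), ("c", 3)]
v2 = [("a", 1), ("b", 20), ("c", 3), ("d", 4)]  # one row changed, one row added

master = []
add_version(master, 1, v1)
add_version(master, 2, v2)

print(len(master))  # 5 stored rows, versus 7 if both versions were stored in full
print(sorted(rows_at(master, 1)) == sorted(v1))  # True
```

Because unchanged rows are stored only once, the cost of a new version in this model grows with the number of rows that changed rather than with the size of the table.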

Learn more in the Usage and limits for users, and Billing for organizations reference sections.

Next steps

Start working with your data

Once your dataset is released, bring it into a workflow to transform and analyze it using fast tools directly from your browser.

Learn more in the Analyze data in a workflow guide.
