Creating a dataset

Overview

Datasets are a core component of Redivis. Consisting of documentation, metadata, and tables, datasets allow you to store, version, and distribute a wide variety of data.

Users can create their own dataset in their workspace, and organization administrators can upload one on behalf of an organization via the administrator panel. Of course, Redivis is already home to thousands of datasets – you can apply for access, create projects, and export and visualize data without ever uploading your own.

1. Create the dataset

You can create a new dataset by navigating to the Datasets tab of your workspace or administrator panel and clicking the New dataset button.

All datasets must have a name that is unique among that user's or organization's datasets. You can also provide a description, additional documentation, and tags that will show up on the dataset's overview page. All of this information can be edited at any later point.

Create a new, blank dataset from your workspace or organization
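
If you prefer to script this step, the Redivis Python client can also create a blank dataset. The snippet below is a minimal sketch; it assumes a client version exposing dataset().create() with a public_access_level parameter, and the username and dataset name are placeholders.

```python
import redivis

# Reference a dataset in your own workspace (names here are placeholders).
dataset = redivis.user("my_username").dataset("Example dataset")

# Create the blank dataset; documentation and metadata can be added later.
dataset.create(public_access_level="overview")
```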

2. Create tables

All data is associated with a table, and each dataset can have one or more tables. While you may release a dataset without any tables, it will be of limited use to other researchers, since Redivis's numerous tools for understanding, querying, and generally working with tabular data are built around tables.

To get started, create a table (on the "Tables" tab of the dataset) and give it a name.

Uploading data

On a table, click the "Upload data" button to begin adding your data. You can then upload data to your table from a wide variety of data sources and formats.

If the data that you're uploading is split across multiple files with the same general variable structure (e.g., one file per year), you can upload them all together into one table and they will automatically be appended based on common variable (column) names. If a variable is missing from some of the files, that's OK; it will simply be recorded as null for all records in that file.

If your dataset has multiple files with different variable structures, you will want to create a separate table for each structure and upload the corresponding files to it.
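
As a sketch of the multi-file case above, the Redivis Python client can create a table and push several files with the same variable structure into it, where they are appended just as they would be through the upload interface. Treat the method and parameter names below (table().create(), upload().create(), type, wait_for_finish, raise_on_fail) as assumptions to check against the client documentation; file and table names are placeholders.

```python
import redivis

dataset = redivis.user("my_username").dataset("Example dataset")

# One table for all yearly files, since they share a variable structure.
table = dataset.table("observations").create(
    description="Annual observation files, appended into a single table"
)

# Each file is appended based on common variable (column) names; variables
# missing from a given file are recorded as null for that file's records.
for year in (2019, 2020, 2021):
    with open(f"observations_{year}.csv", "rb") as f:
        table.upload(f"observations_{year}.csv").create(
            f,
            type="delimited",       # parse as a delimited text file
            wait_for_finish=True,   # block until the upload has been processed
            raise_on_fail=True,     # surface any import errors immediately
        )
```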

Take the time to think about your dataset and table structure before you start!

A dataset should contain either a single table or a group of related tables that a researcher would combine, likely on common identifiers that exist across the tables. A dataset is also a singular point of access, so all the tables within a dataset should fall under the same access paradigm.

When uploading files to tables, remember that every row in a table should represent the same "thing", or entity; we wouldn't want to combine county-level and state-level observations in one table.

It's also helpful to think about how researchers will want to work with your dataset. In general, fewer tables are better — for example, it will likely be much easier for researchers if one table contains multiple years of data, rather than having a separate table for each year.

As you upload files, you will see an overview of each file's progress, and you can double-click a file to view its data and additional information. Once all uploads have completed, you can inspect the table — representing the concatenation of all of your uploads — including summary statistics and other analytical information to help you validate that the data are as you expected.

See the data upload documentation for more information.

Creating samples

If your dataset is particularly large, or if you want to control access to a sample of the data separately from the whole dataset, you should configure sampling on your dataset. This will allow researchers to work with a 1% sample of the data during initial exploration, and allow you to grant access to the sample independently of the full dataset.

To update the dataset's sample configuration, click on any table, and then click "Configure sample". When configuring the sample, you can generate a random sample for each table, or sample on a particular variable that is common across tables. If researchers will be joining tables across your dataset, it is highly recommended that you sample on that common join variable so that researchers obtain a consistent 1% sample as they work with your data.

See the dataset sampling documentation for more information.

Create sample tables for your dataset to speed up exploration and customize data access
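
To see why sampling on a common join variable matters, consider the toy rule below. It only illustrates the principle (a deterministic rule keyed on a shared identifier keeps sampled tables joinable) and is not Redivis's actual sampling implementation; the person_id variable and the tables are hypothetical.

```python
import hashlib

def in_sample(person_id: str, rate: float = 0.01) -> bool:
    """Deterministically place roughly `rate` of identifiers into the sample."""
    digest = hashlib.sha256(person_id.encode()).hexdigest()
    return int(digest, 16) % 10_000 < rate * 10_000

# Two related (toy) tables that share a person_id identifier.
visits = [{"person_id": f"p{i}", "visit_year": 2019 + i % 3} for i in range(10_000)]
charges = [{"person_id": f"p{i}", "amount": i * 1.5} for i in range(10_000)]

# Because the rule depends only on person_id, a person selected in one table
# is selected in every table, so the sampled tables still join cleanly.
visits_sample = [r for r in visits if in_sample(r["person_id"])]
charges_sample = [r for r in charges if in_sample(r["person_id"])]
assert {r["person_id"] for r in visits_sample} == {r["person_id"] for r in charges_sample}
```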

3. Edit metadata

It's easy to feel "done" after ingesting your data, but documentation and metadata are essential to the usability of your dataset. Moreover, rich metadata will improve the discoverability of your dataset by providing more information and terms to the Redivis search engine.

Metadata exist at three levels — on the dataset itself, on the dataset's tables, and on the tables' variables.

Metadata can always be updated after your dataset has been released. While good metadata are essential, producing them can be a time-consuming and iterative process — you might prefer to provide some basic content initially, and then improve it over time.

Need help from others? You can add users as dataset editors to help populate your dataset.

Dataset metadata

On the overview tab of the dataset, you can provide a description, detailed documentation, and content tags for the dataset. The description should be a brief overview of the dataset, while the documentation supports a rich text editor complete with embedded images and accompanying files (such as PDF documentation). Most of this information will be visible to anyone with overview access, though you can create documentation sections that require a higher level of access.

Detailed documentation is key for others to discover and understand your dataset

Table metadata

To help users understand what each table represents, you should update the description, entity, and temporal range for each table in the dataset. The entity should define what each row in a table represents — is it a person? an event? a charge? The temporal range can be tied to a specific variable (using the min/max of that variable), or defined explicitly.

Click on a table to edit its description, entity, and temporal range

Variable metadata

The tables in your dataset are made up of named variables, though rarely is the name alone enough to understand what a variable measures. On any table, click "Edit variable metadata" to populate the variable metadata.

On each variable, Redivis supports a label, description, and value labels. The label is the most essential item; think of it as a more human-readable variable name. The description should contain more detailed information, everything from caveats and notes to collection methodology. Value labels are only applicable when the variable is encoded with keys (often integers or short strings) that map to the actual value — for example, a survey response might be encoded as 0: "No", 1: "Yes", 2: "Don't know", 3: "Declined to answer".

Editing variable metadata can be a tedious process, but Redivis supports importing metadata from a file, and will also automatically extract metadata if it is present in the uploaded data files (e.g., Stata or SAS upload types).

Update variable metadata to help researchers understand the data content
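
Variable metadata can also be populated programmatically. The sketch below assumes the Redivis Python client exposes variable().update() with label, description, and value_labels arguments shaped as shown; verify the exact argument names and value-label format against the client documentation. Variable names and content are placeholders.

```python
import redivis

table = redivis.user("my_username").dataset("Example dataset").table("observations")

# A more human-readable label, detailed notes, and decoded value labels.
table.variable("smk_status").update(
    label="Smoking status",
    description="Self-reported smoking status at time of survey; "
                "collected by phone interview. Missing for proxy respondents.",
    value_labels=[
        {"value": "0", "label": "No"},
        {"value": "1", "label": "Yes"},
        {"value": "2", "label": "Don't know"},
        {"value": "3", "label": "Declined to answer"},
    ],
)
```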

4. Configure access

Before releasing your dataset, it is important to define who can access the dataset and what the procedures are for applying for and gaining access. Click the "Configure access" button on the right side of the dataset editor page to set up your dataset's access configuration.

Click "Configure access" to define your access rules and limitations for your dataset

Access levels

Dataset access can be controlled on five levels:

  1. Overview: the ability to see a dataset and its documentation.

  2. Metadata: the ability to view variable names and univariate summary statistics.

  3. Sample: the ability to view and query a dataset's 1% sample. This will only be available for datasets that have a sample configured.

  4. Data: the ability to view and query a dataset's tables.

  5. Edit: the ability to edit the dataset and release new versions.

For a detailed breakdown of the content available at each tier, see the access level documentation.

Access levels are cumulative. For example, in order to gain data access a user will need to have metadata access as well.

Usage rules

Even with data access, you may want to limit what other users can do with your dataset. Currently, you can configure export restrictions that:

  • Limit the download location (e.g., to prevent researchers from downloading to their personal computer)

  • Limit the download size, in bytes and/or rows

  • Require administrator approval before any export

Learn more in the dataset usage rules documentation.

Editors

You can also add dataset editors to help upload data and provide metadata content. These editors will be able to create and release new versions, and will have full access to the underlying data, though they cannot add other users, modify the access configuration, or bypass the dataset usage rules.

If the dataset is hosted by an organization, all administrators of the organization will be able to edit the dataset as well as its access configuration.

Permission groups

If the dataset is hosted by an organization, you will have additional options for configuring access to the dataset. The dataset can be assigned to a permission group to help standardize access procedures, and this permission group can contain requirements that help data managers fulfill contractual requirements and gather relevant information about the research being done on the dataset.

Learn more in the setting up an organization guide and permission group documentation.

5. Release the dataset

Congratulations! Your dataset is ready to be released and utilized by the research community. But first — it is highly recommended that you validate and audit your dataset beforehand. Take a look at the number of rows, variables, and uploads in each table. Validate some of the variable summary statistics against what you expect. And to be truly thorough, add the dataset to a project and run some queries as if you were a researcher. Catching a mistake now will prevent headaches down the line, when researchers might otherwise uncover unexpected discrepancies in the data.

Once a version has been released, the data can no longer be edited — you'll need to release a new version to modify the data. However, you can continue to edit metadata and documentation after a version has been released.

When you're confident that you're ready to go, click the "Release" button on the top right of the dataset editor. If the button is disabled, hover over it to understand what issues are currently preventing you from releasing.

After clicking the button, you'll be presented with a final checklist of tasks. When you click the "Release version" button, the dataset will be immediately released and available to all users with access.

Review the final release checklist by clicking the "Release" button
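
If you manage datasets through the Redivis Python client, a quick spot check followed by a release might look like the sketch below. It assumes list_tables(), to_pandas_dataframe(), and release() behave as named here; the dataset name is a placeholder.

```python
import redivis

dataset = redivis.user("my_username").dataset("Example dataset")

# Spot-check each table before releasing: pull a small slice and eyeball it.
for table in dataset.list_tables():
    df = table.to_pandas_dataframe(max_results=1000)
    print(table.name, df.shape)

# Once everything looks right, release the version to users with access.
dataset.release()
```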

6. Make updates as new versions

Once a dataset is released, you can return to it to make changes at any time. Changes to datasets are tracked in Redivis as versions. Anyone with access to a dataset can view and work with any of its versions.

How to work with versions when updating a dataset:

  • Any edits to the data content in tables will need to be released as a new version.

  • Edits to the dataset information, table information, or variable metadata can be made on the current version (or historic versions) and will be live as soon as they're saved.

  • Edits to the dataset name and access configuration will always affect all versions.

Creating the next version

All data within a dataset is encapsulated in discrete, immutable versions. Every part of the dataset except for the name and access settings is versioned. All tables in a dataset are versioned together.

After releasing the first version of the dataset, you can create a new version at any time by clicking the "Create next version" button in the top right. This version will be created as vNext, and you may toggle between it and historic versions at any time.

Subsequent versions always build on the previous version of the dataset, and changes made in the next version will have no effect on previous versions. Alongside modifications to the dataset's metadata, you may create, update, or delete any of the previous version's tables.
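
Scripted, the versioning workflow might look like the following sketch, assuming the Python client's create_next_version() and release() methods behave as named here; table and file names are placeholders.

```python
import redivis

dataset = redivis.user("my_username").dataset("Example dataset")

# Open the unreleased next version (vNext), building on the current release.
dataset = dataset.create_next_version(ignore_if_exists=True)

# Append another year of data to an existing table within vNext.
with open("observations_2022.csv", "rb") as f:
    dataset.table("observations").upload("observations_2022.csv").create(f, type="delimited")

# Previous versions are untouched until this new version is released.
dataset.release()
```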

Replacing vs appending data

When uploading data to an existing table in a new version, you can choose whether to append the new uploads to the table's existing data or replace the entire table with the new data.
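
When scripting uploads, the same choice is typically expressed as a table-level merge strategy. The sketch below assumes an upload_merge_strategy setting accepting "append" and "replace", mirroring the choice in the upload interface; verify the exact parameter name against the client documentation.

```python
import redivis

# Working within an unreleased next version (vNext) of the dataset:
table = redivis.user("my_username").dataset("Example dataset").table("observations")

# Replace: the new upload supersedes all rows carried over from the previous version.
table.update(upload_merge_strategy="replace")
# Append (the default): new uploads are added on top of the existing rows.
# table.update(upload_merge_strategy="append")

with open("observations_corrected.csv", "rb") as f:
    table.upload("observations_corrected.csv").create(f, type="delimited")
```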

There is a maximum of 4000 versions allowed per dataset. This limit may be lower if you regularly replace (rather than append) data, depending on how the replaced data overlaps with existing data. In such cases, it is reasonable to expect that the maximum version count will be at least 1000.

Version storage costs

Redivis computes row-level diffs for each version, efficiently storing the complete version history in one master table. This allows you to regularly release new versions and maintain a robust version history without ballooning storage costs. Learn more about storage limits for users and storage pricing for organizations in the reference material.
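
As a rough intuition for row-level diffing (an illustration only, not Redivis's storage format), only the rows added or removed between versions need to be stored, so appending one year of data costs roughly the storage of that year rather than of the whole table:

```python
# Toy illustration: store each released version as a diff against the previous one.
v1 = {("p1", 2019, 41.0), ("p2", 2019, 37.5)}
v2 = {("p1", 2019, 41.0), ("p2", 2019, 37.5), ("p1", 2020, 42.0)}

diff = {"added": v2 - v1, "removed": v1 - v2}
print(diff)  # only the single appended row needs to be stored for version 2

# Any version can be reconstructed by replaying diffs from the start.
reconstructed_v2 = (v1 - diff["removed"]) | diff["added"]
assert reconstructed_v2 == v2
```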