Datasets are a core component of Redivis. Consisting of documentation, metadata, and tables, datasets allow you to store, version, and distribute a wide variety of data.
Users can create their own dataset in their workspace, and organization administrators can upload one on behalf of an organization via the administrator panel. Of course, Redivis is already home to thousands of datasets – you can apply for access, create projects, and export and visualize data without ever uploading your own.
All datasets must have a name (unique among the datasets belonging to that user or organization). You can also provide a description, additional documentation, and tags that will show up on the dataset's overview page. This information can also be edited at any later point.
All data is associated with a table and each dataset can have one or more tables. While you may release a dataset without any tables, this will be of limited use to other researchers, as Redivis provides numerous tools for understanding, querying, and generally working with tabular data.
To get started, create a table (on the "Tables" tab of the dataset) and give it a name.
If the data that you're uploading is split across multiple files with the same general variable structure (e.g., one file per year), you can upload them all together into one table, and they will automatically be appended based on common variable (column) names. If a variable is missing from some of the files, that's fine: it will simply be recorded as null for all records in those files.
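The appending behavior described above can be illustrated with a minimal Python sketch. This is not Redivis's implementation; the function name `append_uploads` and the sample records are hypothetical, and the sketch only shows the general idea of aligning records on variable names and filling absent variables with null (`None`).

```python
from typing import Any


def append_uploads(files: list[list[dict[str, Any]]]) -> list[dict[str, Any]]:
    """Concatenate records from several files, aligning on variable names."""
    # Collect every variable name seen in any file, preserving first-seen order.
    variables: list[str] = []
    for records in files:
        for record in records:
            for name in record:
                if name not in variables:
                    variables.append(name)

    # Emit every record with the full variable set; absent variables become None.
    return [
        {name: record.get(name) for name in variables}
        for records in files
        for record in records
    ]


# Hypothetical example: one file per year, where the 2020 file lacks "income".
file_2019 = [{"id": 1, "year": 2019, "income": 52000}]
file_2020 = [{"id": 1, "year": 2020}]
table = append_uploads([file_2019, file_2020])
print(table)
```

In this sketch the record from the 2020 file ends up with `income` set to `None`, mirroring the null handling described in the text.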
If your dataset has multiple files with different variable structures, create a separate table for each structure and upload the corresponding files to it.
As you upload files, you will see an overview of each file's progress and can double-click to view each file's data and additional information. Once all uploads have completed, you can inspect the table (representing the concatenation of all of your uploads), including summary statistics and other analytical information, to help you validate that the data are as you expected.
See the data upload documentation for more information.
If your dataset is particularly large, or if you want to control access to a sample of the data separately from the whole dataset, you should configure sampling on your dataset. This will allow researchers to work with a 1% sample of the data during initial exploration, and allow you to grant access to the sample independently of the full dataset.
To update the dataset's sample configuration, click on any table, and then click "Configure sample". When configuring the sample, you can generate a random sample for each table, or sample on a particular variable that is common across tables. If researchers will be joining tables across your dataset, it is highly recommended that you sample on that common join variable so that researchers can obtain a consistent 1% sample as they work with your data.
See the dataset sampling documentation for more information.
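To see why sampling on a common join variable keeps the sample consistent across tables, consider the following sketch. It assumes a hash-based selection rule, which is an illustrative choice and not necessarily how Redivis selects its sample; the function names, the `patient_id` variable, and the example tables are all hypothetical.

```python
import hashlib


def in_sample(join_value: object, percent: int = 1) -> bool:
    """Deterministically decide whether a join value falls in the sample."""
    digest = hashlib.sha256(str(join_value).encode("utf-8")).digest()
    # Map the value to one of 100 buckets; keep the first `percent` buckets.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def sample_table(rows: list[dict], variable: str, percent: int = 1) -> list[dict]:
    """Keep only the rows whose join variable falls in the sample."""
    return [row for row in rows if in_sample(row[variable], percent)]


# Hypothetical tables sharing the join variable "patient_id".
patients = [{"patient_id": i} for i in range(10_000)]
visits = [{"patient_id": n % 10_000, "visit": n} for n in range(30_000)]

sampled_patients = sample_table(patients, "patient_id")
sampled_visits = sample_table(visits, "patient_id")
```

Because the decision depends only on the value of the join variable, every sampled visit belongs to a sampled patient, so a join between the two sampled tables loses no rows. Independent random samples of each table would not have this property.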
It's easy to feel "done" after ingesting your data, but documentation and metadata are essential to the usability of your dataset. Moreover, rich metadata will improve the discoverability of your dataset by providing more information and terms to the Redivis search engine.
Metadata exist at three levels — on the dataset itself, on the dataset's tables, and on the tables' variables.
On the overview tab of the dataset, you can provide a description, detailed documentation, and content tags for the dataset. The description should be a brief overview of the dataset, while the documentation supports a rich text editor complete with embedded images and accompanying files (such as PDF documentation). Most of this information will be visible to anyone with overview access, though you can create documentation sections that require a higher level of access.
To help users understand what each table represents, you should update the description, entity, and temporal range for each table in the dataset. The entity should define what each row in a table represents — is it a person? an event? a charge? The temporal range can be tied to a specific variable (using the min/max of that variable), or defined explicitly.
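As a small illustration of a temporal range tied to a variable, the sketch below derives the range from the minimum and maximum of a date variable. The function name `temporal_range`, the `event_date` variable, and the example rows are hypothetical, and skipping null values is an assumption made for the sketch.

```python
from datetime import date
from typing import Optional


def temporal_range(rows: list[dict], variable: str) -> Optional[tuple[date, date]]:
    """Return the (min, max) of a date variable, ignoring null values."""
    values = [row[variable] for row in rows if row.get(variable) is not None]
    if not values:
        return None  # no non-null values, so no range can be derived
    return min(values), max(values)


# Hypothetical table in which each row (the entity) is an event.
events = [
    {"event_id": 1, "event_date": date(2015, 3, 2)},
    {"event_id": 2, "event_date": None},
    {"event_id": 3, "event_date": date(2019, 11, 30)},
]
print(temporal_range(events, "event_date"))
```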
The tables in your dataset are made up of named variables, though rarely is this name enough to understand what the variable measures. On any table, click "Edit variable metadata" to populate the variable metadata.
On each variable, Redivis supports a label, description, and value labels. The label is the most essential item; think of it as a more human-readable variable name. The description should contain more detailed information, everything from caveats and notes to collection methodology. Value labels are only applicable when the variable is encoded with keys (often integers or short strings) that map to the actual value. For example, a survey response might be encoded as
2: "Don't know"
3: "Declined to answer".
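Conceptually, value labels are a lookup from the stored key to its human-readable meaning. The sketch below uses only the two codes shown above; the names `VALUE_LABELS` and `decode` are hypothetical, and passing unlabeled keys through unchanged is an assumption made for illustration.

```python
# Value labels for a hypothetical survey variable (only the codes shown above).
VALUE_LABELS = {
    2: "Don't know",
    3: "Declined to answer",
}


def decode(value, value_labels):
    """Return the label for an encoded value, or the value itself if unlabeled."""
    return value_labels.get(value, value)


responses = [2, 3, 7]
decoded = [decode(value, VALUE_LABELS) for value in responses]
print(decoded)
```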
Editing variable metadata can be a tedious process, but Redivis does support the ability to import metadata from a file, and will also automatically extract metadata if it's present in the uploaded data files (e.g., Stata or SAS upload types).
Before releasing your dataset, it is important to define who can access the dataset and what the procedures are for applying and gaining access. Click the "Configure access" button on the right side of the dataset editor page to set up your dataset's access configuration.
Dataset access can be controlled on five levels:
Overview: the ability to see a dataset and its documentation.
Metadata: the ability to view variable names and univariate summary statistics.
Sample: the ability to view and query a dataset's 1% sample. This will only be available for datasets that have a sample configured.
Data: the ability to view and query a dataset's tables.
Edit: the ability to edit the dataset and release new versions.
For a detailed breakdown of the content available at each tier, see the access level documentation.
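One way to reason about the five levels is as an ordered scale. The sketch below assumes the levels are cumulative, so that each level includes the abilities of the levels listed before it; that ordering is an assumption made for illustration, and the names `AccessLevel` and `can` are hypothetical rather than part of any Redivis API.

```python
from enum import IntEnum


class AccessLevel(IntEnum):
    """The five access levels, ordered from least to most permissive (assumed)."""
    OVERVIEW = 1
    METADATA = 2
    SAMPLE = 3
    DATA = 4
    EDIT = 5


def can(user_level: AccessLevel, required: AccessLevel) -> bool:
    """Check whether a user's level meets the level an action requires."""
    return user_level >= required


# A user with data access can query the sample but cannot release versions.
print(can(AccessLevel.DATA, AccessLevel.SAMPLE))
print(can(AccessLevel.DATA, AccessLevel.EDIT))
```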
Even with data access, you may want to limit what other users can do with your dataset. Currently, you can configure export restrictions that:
Limit the download location (e.g., to prevent researchers from downloading to their personal computer)
Limit the download size, in bytes and/or rows
Require admin approval before any export
Learn more in the dataset usage rules documentation.
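The three kinds of export restriction can be pictured as a simple rule check. The sketch below is purely illustrative: the `ExportRestrictions` structure, its field names, the location strings, and the `export_violations` function are hypothetical and do not reflect how Redivis stores or enforces usage rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExportRestrictions:
    """Hypothetical container for a dataset's export restrictions."""
    allowed_locations: Optional[set] = None  # None means any location is allowed
    max_bytes: Optional[int] = None          # None means no byte limit
    max_rows: Optional[int] = None           # None means no row limit
    require_admin_approval: bool = False


def export_violations(rules, location, size_bytes, row_count, admin_approved=False):
    """Return the reasons an export would be blocked (empty if it is allowed)."""
    violations = []
    if rules.allowed_locations is not None and location not in rules.allowed_locations:
        violations.append(f"location '{location}' is not permitted")
    if rules.max_bytes is not None and size_bytes > rules.max_bytes:
        violations.append("export exceeds the byte limit")
    if rules.max_rows is not None and row_count > rules.max_rows:
        violations.append("export exceeds the row limit")
    if rules.require_admin_approval and not admin_approved:
        violations.append("admin approval is required")
    return violations


rules = ExportRestrictions(
    allowed_locations={"institutional_cluster"},
    max_rows=1000,
    require_admin_approval=True,
)
print(export_violations(rules, "personal_computer", size_bytes=10, row_count=5000))
```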
You may also add additional dataset editors to help upload data and provide metadata content. These editors will be able to create and release new versions, and will have full access to the underlying data, though they cannot add other users, modify the access configuration, or bypass the dataset usage rules.
If the dataset is hosted by an organization, all administrators of the organization will be able to edit the dataset as well as its access configuration.
If the dataset is hosted by an organization, you will have additional options for configuring access to the dataset. The dataset can be assigned to a permission group to help standardize access procedures, and this permission group can contain requirements that help data managers fulfill contractual requirements and gather relevant information about the research being done on the dataset.
Congratulations! Your dataset is ready to be released and utilized by the research community. Before you release it, however, it is highly recommended that you validate and audit your dataset. Take a look at the number of rows, variables, and uploads in each table. Validate some of the variable summary statistics against what you expect. And to be truly thorough, add the dataset to a project and run some queries as if you were a researcher. Catching a mistake now will prevent headaches down the line if researchers uncover unexpected discrepancies in the data.
When you're confident that you're ready to go, click the "Release" button on the top right of the dataset editor. If the button is disabled, hover over it to understand what issues are currently preventing you from releasing.
After clicking the button, you'll be presented with a final checklist of tasks. When you click the Release version button, the dataset will be immediately released and available to all users with access.
Once a dataset is released, you can return to it to make changes at any time. Changes to datasets are tracked in Redivis as versions. Anyone with access to a dataset can view and work with any of its versions.
How to work with versions when updating a dataset:
Any edits to the data content in tables will need to be released as a new version.
Edits to the dataset information, table information, or variable metadata can be made on the current version (or historic versions) and will be live as soon as they are saved.
Edits to the dataset name and access configuration will always affect all versions.
All data within a dataset is encapsulated in discrete, immutable versions. Every part of the dataset except for the name and access settings is versioned. All tables in a dataset are versioned together.
After releasing the first version of the dataset, you can choose to create a new version at any time by clicking the "Create next version" button in the top right. This version will be created as vNext, and you may toggle between this and historic versions at any time.
Subsequent versions always build on the previous version of the dataset, and changes made in the next version will have no effect on previous versions. Alongside modifications to the dataset's metadata, you may create, update, or delete any of the previous version's tables.
When uploading data to a previous table, you can choose whether you want to append these new uploads to your existing data, or replace the entire table with the new data.
Redivis computes row-level diffs for each version, efficiently storing the complete version history in one master table. This allows you to regularly release new versions and maintain a robust version history without ballooning storage costs. Learn more about storage limits for users and storage pricing for organizations in the reference material.
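The idea of keeping a full version history in a single master table can be sketched as follows. This is an illustrative model only, not Redivis's actual storage format: the `VersionedTable` class and its methods are hypothetical, and the sketch assumes that rows are unique within a version. Each row is stored once, tagged with the version in which it was added and, if applicable, the version in which it was removed.

```python
class VersionedTable:
    """Store every version of a table as row-level diffs in one master list."""

    def __init__(self) -> None:
        # Each entry is [row, added_in, removed_in]; removed_in is None while live.
        self.master: list[list] = []
        self.latest = 0

    def release(self, rows: list[tuple]) -> int:
        """Release a new version containing exactly `rows`; return its number."""
        self.latest += 1
        version = self.latest
        incoming = set(rows)
        live = set()
        for entry in self.master:
            row, _, removed_in = entry
            if removed_in is None:
                if row in incoming:
                    live.add(row)       # unchanged row: nothing new is stored
                else:
                    entry[2] = version  # row no longer present as of this version
        for row in rows:
            if row not in live:
                self.master.append([row, version, None])  # newly added row
                live.add(row)
        return version

    def at(self, version: int) -> list[tuple]:
        """Reconstruct the table as it existed at the given version."""
        return [
            row
            for row, added_in, removed_in in self.master
            if added_in <= version and (removed_in is None or removed_in > version)
        ]


table = VersionedTable()
v1 = table.release([("a", 1), ("b", 2)])
v2 = table.release([("a", 1), ("b", 3), ("c", 4)])
print(table.at(v1))
print(table.at(v2))
```

In this example the two versions contain five rows in total, but the master list holds only four entries because the unchanged row is stored once. This is the sense in which diff-based storage avoids growing in proportion to the number of versions.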