Editing metadata

Each table has its own set of characteristics along with metadata associated with the data it contains. You can edit all this information on the Tables tab.

Table characteristics

Every table needs a name, and can optionally be described further to help users better understand it. These characteristics are displayed on the right side of the table, and you can click any of them to edit it.

Entity

This documents the concept that one record in this table represents. For example, the table's entity might be a unique patient, a specific hospitalization, or a prescription. Entities are filterable on your organization home page, so the more you standardize them across tables, the easier it will be for users to explore your data.

Temporal range

This required field represents the time range covered by the records in this dataset, and it is displayed directly on the table. The temporal range can be an integer (year), a date, or a dateTime.

If there is a variable in the dataset which represents the temporal range in the data, select that variable from the dropdown menu and the range will be automatically calculated.

If there is no representative variable, you can enter the time range manually by selecting "Set range manually."

If this dataset does not cover any range of time, you can select "Undefined."
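As a rough illustration of what calculating the range from a variable amounts to, the sketch below takes the minimum and maximum of a variable's values. The function name `temporal_range` and the choice to skip missing values are assumptions made for this example, not a description of how Redivis computes the range internally.

```python
from datetime import date


def temporal_range(values):
    """Return (earliest, latest) for a variable's values, or None if empty.

    Missing values (None) are skipped; that handling is an assumption.
    """
    present = [v for v in values if v is not None]
    if not present:
        return None
    return min(present), max(present)


# Hypothetical admission dates from a hospitalization table
admission_dates = [date(2019, 3, 4), None, date(2017, 11, 30), date(2021, 1, 15)]
date_range = temporal_range(admission_dates)

# The same approach works for integer years
year_range = temporal_range([2015, 2012, 2018])
```

A variable with no usable values yields `None` here, which loosely corresponds to the "Undefined" case.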

Sampling

If you think it's appropriate, you can generate a 1% sample for this version. If a sample exists for at least one table in a dataset, the dataset will automatically default to the sampled state when users add it to a project, which increases querying speed.

No sample

For smaller datasets, it likely won't make sense to create a 1% sample, and you can select No sample.

Random sample

Randomly chooses records to be in the sample table (each record has a 1% chance of being sampled). Note that this is non-deterministic; the 1% sample of two identical tables won't be the same. Your table must have at least 1,000 records in order to be sampled.
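The behavior described above can be sketched as an independent 1% draw per record. This is only an illustration: the function `random_sample` and its `rng` parameter are hypothetical names for this example, and the actual sampling implementation is not documented here.

```python
import random


def random_sample(records, rate=0.01, rng=None):
    """Keep each record independently with probability `rate`."""
    if rng is None:
        rng = random.Random()  # unseeded, so results differ between runs
    return [record for record in records if rng.random() < rate]


records = list(range(100_000))

# Two unseeded runs over identical input will generally pick different records
sample_a = random_sample(records)
sample_b = random_sample(records)

# Seeding is used here only to make the sketch reproducible
seeded_1 = random_sample(records, rng=random.Random(0))
seeded_2 = random_sample(records, rng=random.Random(0))
```

Because every record is drawn independently, the sample size is approximately, not exactly, 1% of the table.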

Sample on a variable

You can select any variable in the dataset with at least 158,000 unique values to sample on. Every value for this variable will have a 1% chance of being in the output set; importantly, this sampling is deterministic. This guarantees that the same values that fall in the 1% sample for one table will also occur in the 1% sample for another table in the same dataset.

For example, given two tables that contain a variable patient_id, if this variable is sampled upon, it is guaranteed that the same patient_id values will be included in both tables' samples. If a patient_id is in the sample table, all records with that patient_id from the table will be in the sample. This ensures that joins on patient_id across 1% samples will reflect a consistent 1% sample of the data.

Note that the sample is computed on the string representation of the variable. For example, if the value '1234' falls in the 1% sample, then the integer value 1234 is guaranteed to fall within the sample as well. However, if this value is stored as a float (1234.0), it is unlikely to also fall in the sample, because the string representation of this float is '1234.0', which for the purposes of sampling is entirely different from the string '1234'.
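One way to picture deterministic sampling on a variable is to hash each value's string representation and keep the values that land in a fixed 1% bucket. The sketch below assumes SHA-256 and a modulo-100 bucket purely for illustration; the hash that Redivis actually uses is not stated here, and `in_sample` is a hypothetical helper.

```python
import hashlib


def in_sample(value, rate_percent=1):
    """Decide membership from the string form of `value` (assumed scheme)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rate_percent


# Two hypothetical tables that share the variable patient_id
visits = [{"patient_id": i % 500, "visit": i} for i in range(2000)]
labs = [{"patient_id": i % 500, "lab": i} for i in range(1500)]

sampled_visits = [row for row in visits if in_sample(row["patient_id"])]
sampled_labs = [row for row in labs if in_sample(row["patient_id"])]

# The same patient_id values are selected in both tables
visit_ids = {row["patient_id"] for row in sampled_visits}
lab_ids = {row["patient_id"] for row in sampled_labs}

# The integer 1234 and the string '1234' share a string form, so they agree;
# the float 1234.0 becomes '1234.0' and is hashed as a different string.
same_decision = in_sample(1234) == in_sample("1234")
float_as_string = str(1234.0)
```

Since the decision depends only on the value itself, every record carrying a sampled patient_id is retained, which is what keeps joins across sampled tables consistent.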

Metadata

For many datasets, providing strong metadata is just as essential as uploading and cleaning the raw data. We strongly encourage you to provide metadata whenever possible.

Redivis supports variable-level metadata in the form of variable labels, descriptions, and value labels.

Editing metadata

To edit a table's variable metadata, click the Edit metadata button on the right of any table. You can edit just this table's metadata, or update the metadata for all tables in bulk on the "All files" tab.

In the metadata editor, you can modify a variable's metadata: its label, description, and value labels. Edit the fields as you would in any spreadsheet, and save changes when you're done.

Uploading a metadata file

To apply metadata in bulk, you can upload a file containing metadata information directly from your computer. This file can be either a CSV or a JSON file.

CSV metadata format

The CSV should be formatted, without any header, as:

variable_name,variable_label,variable_description,value_1,value_label_1,...value_n,value_label_n
variable2_name,variable2_label,etc...

For example:

sex,patient sex,patient's recorded sex,1,male,2,female
id,patient_identifier,unique patient identifier
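If your metadata already lives in a script or a data dictionary, a file in this layout can be generated rather than typed by hand. The sketch below uses Python's standard csv module; the `variables` structure and the `metadata_to_csv` helper are hypothetical and exist only for this example.

```python
import csv
import io

# Hypothetical in-memory data dictionary
variables = [
    {
        "name": "sex",
        "label": "patient sex",
        "description": "patient's recorded sex",
        "value_labels": [("1", "male"), ("2", "female")],
    },
    {
        "name": "id",
        "label": "patient_identifier",
        "description": "unique patient identifier",
        "value_labels": [],
    },
]


def metadata_to_csv(variables):
    """Write one headerless row per variable, value/label pairs appended."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for var in variables:
        row = [var["name"], var["label"], var["description"]]
        for value, label in var["value_labels"]:
            row.extend([value, label])
        writer.writerow(row)
    return buffer.getvalue()


csv_text = metadata_to_csv(variables)
```

Using csv.writer instead of joining strings means that a label or description containing a comma is quoted automatically.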

JSON metadata format

// JSON format is an array of objects, with each object representing a variable
[
  {
    "name": str,
    "label": str,
    "description": str,
    "valueLabels": [
      {
        "value": str,
        "label": str
      }
    ]
  }
]
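A metadata file in this shape can also be produced programmatically. The following sketch builds the array with Python's standard json module; the input structure and the `to_metadata_json` helper are hypothetical, and emitting an empty valueLabels list for variables without value labels is an assumption rather than a documented requirement.

```python
import json


def to_metadata_json(variables):
    """Serialize variables into the array-of-objects layout shown above."""
    payload = []
    for var in variables:
        payload.append({
            "name": var["name"],
            "label": var["label"],
            "description": var["description"],
            # Values are converted to strings, matching the str fields above
            "valueLabels": [
                {"value": str(value), "label": str(label)}
                for value, label in var.get("value_labels", [])
            ],
        })
    return json.dumps(payload, indent=2)


metadata_json = to_metadata_json([
    {
        "name": "sex",
        "label": "patient sex",
        "description": "patient's recorded sex",
        "value_labels": [(1, "male"), (2, "female")],
    },
    {
        "name": "id",
        "label": "patient_identifier",
        "description": "unique patient identifier",
    },
])
```

The resulting string can be saved to a .json file and uploaded in the same way as the CSV.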
