Sampling

Overview

For datasets with large tables, it is often a good idea to include a 1% sample of the data, supporting faster exploratory queries as new researchers work to understand your data. Moreover, if a sample is configured, you will have the ability to control access to that sample separately from the full dataset.

Sampling is applied independently to each version of a dataset. You may modify the sampling methodology on a version at any time — even after it's been released — though keep in mind that this may affect researchers that are currently working with the dataset sample. As a best practice, it's good to configure and validate your sample before releasing a new version.

To configure sampling on your dataset, click the Configure sample button on the Tables tab of a dataset page.

Random sample

The simplest form of sampling, this will create a corresponding sample for every table in the dataset (including file index tables). Every record / file will have a 1% chance of occurring in the sample.

As a general rule, you should only use random samples if:

You have one table in your dataset, or
Researchers won't be joining multiple tables in your dataset together

If this isn't the case, consider sampling on a specific variable. Otherwise, as researchers join different tables together, they will start getting samples of a sample, since there is no consistent cohort of records between tables.

Sampling on a variable

For situations when you want researchers to be able to join tables within your dataset, consider generating a sample on a variable that exists in at least some of the tables in your dataset. Every value for this variable will have a 1% chance of being in the output set.

Importantly, this sampling is deterministic. This guarantees that the same values that fall in the 1% sample for one table will also occur in the 1% sample for another table in the same dataset. In fact, these sampled values will be consistent across Redivis, allowing researchers to even merge samples across datasets.

Note that the sample will be computed on the string representation of the variable. For example, if the value '1234' falls in the 1% sample, then we are guaranteed that the integer value 1234 will also fall within the sample. However, if this value is stored as a float (1234.0), it is unlikely to also fall in the sample, as the string representation of this float is '1234.0', which for the purposes of sampling is entirely different than the string '1234'.

When sampling on a variable, only tables with that variable will be sampled. This is useful for the case when some tables contain supplementary information to your primary cohort. For example, consider the case when your dataset has a "Patients" table, a "Hospitalizations" table, and a "Hospitals" table. We'd likely want to create a sample on the patient_id variable, which would create a 1% subset of patients and the corresponding hospitalizations for those patients. However, this wouldn't create a sample on the "Hospitals" table — which is what we want, given that the sample of patients is still distributed across a large number of hospitals.

If your dataset contains unstructured data files, you probably want to sample on either the file_name or file_id variables.

If only some of the dataset's tables are sampled, users with sample access to the dataset will have data access to the sampled tables and data access to the unsampled tables. While this is likely necessary for researchers to meaningfully work with the dataset sample (see paragraph above), it may have ramifications for how you configure your access rules.

Learn more about controlling sample access in the data access reference.

Last updated 8 months ago

Was this helpful?