For larger datasets, it is often a good idea to include a 1% sample of the data, supporting faster exploratory queries as new researchers work to understand your data. Moreover, if a sample is configured, you will have the ability to control access to that sample separately from the full dataset.
Sampling is applied independently to each version of a dataset. You may modify the sampling methodology on a version at any time — even after it's been released — though keep in mind that this may affect researchers that are currently working with the dataset sample. As a best practice, it's good to configure and validate your sample before releasing a new version.
To configure sampling on your dataset, click the Configure sample link on any of the dataset's tables.
The simplest form of sampling, this will create a corresponding sample for every table in the dataset. Every record will have a 1% chance of occurring in the sample.
As a general rule, you should only use random samples if:
You have one table in your dataset, or
Researchers won't be joining multiple tables in your dataset together
If this isn't the case, consider sampling on a specific variable. Otherwise, as researchers join different tables together, they will start getting samples of a sample, since there is no consistent cohort of records between tables.
For situations when you want researchers to be able to join tables within your dataset, consider generating a sample on a variable that exists in at least some of the tables in your dataset. Every value for this variable will have a 1% chance of being in the output set.
Importantly, this sampling is deterministic. This guarantees that the same values that fall in the 1% sample for one table will also occur in the 1% sample for another table in the same dataset. In fact, these sampled values will be consistent across Redivis, allowing researchers to even merge samples across datasets.
When sampling on a variable, only tables with that variable will be sampled. This is useful for the case when some tables contain supplementary information to your primary cohort. For example, consider the case when your dataset has a "Patients" table, a "Hospitalizations" table, and a "Hospitals" table. We'd likely want to create a sample on the
patient_id variable, which would create a 1% subset of patients and the corresponding hospitalizations for those patients. However, this wouldn't create a sample on the "Hospitals" table — which is what we want, given that the sample of patients is still distributed across a large number of hospitals.