Redivis Documentation

Sampling

Last updated 5 months ago

Overview

For datasets with large tables, it is often a good idea to include a 1% sample of the data, supporting faster exploratory queries as new researchers work to understand your data. Moreover, if a sample is configured, you will have the ability to control access to that sample separately from the full dataset.

Sampling is applied independently to each version of a dataset. You may modify the sampling methodology on a version at any time, even after it's been released, though keep in mind that this may affect researchers who are currently working with the dataset sample. As a best practice, configure and validate your sample before releasing a new version.

To configure sampling on your dataset, click the Configure sample button on the Tables tab of a dataset page.

Random sample

As a general rule, you should only use random samples if:

  • You have one table in your dataset, or

  • Researchers won't be joining multiple tables in your dataset together

If this isn't the case, consider sampling on a specific variable. Otherwise, as researchers join different tables together, they will start getting samples of a sample, since there is no consistent cohort of records between tables.
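The "sample of a sample" effect can be illustrated with a small simulation. This is a minimal sketch, not Redivis's implementation: the `random_sample` helper, the seed, and the table sizes are all hypothetical, chosen only to show what happens when two tables that share the same IDs are each sampled independently at 1%.

```python
import random

def random_sample(ids, rate, rng):
    """Keep each record independently with probability `rate` (hypothetical helper)."""
    return {i for i in ids if rng.random() < rate}

rng = random.Random(0)      # fixed seed so the run is repeatable
ids = range(1_000_000)      # both tables contain the same one million IDs

table_a = random_sample(ids, 0.01, rng)   # roughly 1% of the IDs
table_b = random_sample(ids, 0.01, rng)   # a different, independent 1%

# Joining the two samples keeps only IDs that were picked in both draws,
# which is roughly 1% of 1% of the original IDs.
joined = table_a & table_b

print(len(table_a), len(table_b), len(joined))
```

Each sample holds about 1% of the IDs, but the join retains only about 0.01% of them, because the two draws share no consistent cohort.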

Sampling on a variable

For situations when you want researchers to be able to join tables within your dataset, consider generating a sample on a variable that exists in at least some of the tables in your dataset. Every value for this variable will have a 1% chance of being in the output set.

Importantly, this sampling is deterministic. This guarantees that the same values that fall in the 1% sample for one table will also occur in the 1% sample for another table in the same dataset. In fact, these sampled values will be consistent across Redivis, allowing researchers to even merge samples across datasets.

Note that the sample will be computed on the string representation of the variable. For example, if the value '1234' falls in the 1% sample, then we are guaranteed that the integer value 1234 will also fall within the sample. However, if this value is stored as a float (1234.0), it is unlikely to also fall in the sample, as the string representation of this float is '1234.0', which for the purposes of sampling is entirely different from the string '1234'.
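One way to picture deterministic sampling on a string representation is a hash-based rule. The sketch below is an assumption for illustration only: the documentation does not state which hash function or bucketing scheme Redivis uses, and the `in_sample` helper, SHA-256, and the 100-bucket scheme are hypothetical.

```python
import hashlib

def in_sample(value, percent=1):
    """Hypothetical rule: hash the string form of `value` into one of 100
    buckets and keep the value when its bucket number is below `percent`."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent

# The integer 1234 and the string '1234' share the string form '1234',
# so they always receive the same decision.
same_decision = in_sample(1234) == in_sample("1234")

# The float 1234.0 has the string form '1234.0', which hashes independently
# of '1234', so its decision is unrelated to the one above.
float_text = str(1234.0)

print(same_decision, float_text)
```

Because the decision depends only on the string form of the value, the same value is treated identically in every table, which is the property that keeps samples joinable.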

When sampling on a variable, only tables with that variable will be sampled. This is useful for the case when some tables contain supplementary information to your primary cohort. For example, consider the case when your dataset has a "Patients" table, a "Hospitalizations" table, and a "Hospitals" table. We'd likely want to create a sample on the patient_id variable, which would create a 1% subset of patients and the corresponding hospitalizations for those patients. However, this wouldn't create a sample on the "Hospitals" table — which is what we want, given that the sample of patients is still distributed across a large number of hospitals.
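The Patients / Hospitalizations / Hospitals case can be sketched as follows. This is an illustration under stated assumptions, not Redivis's implementation: `in_sample` and `sample_dataset` are hypothetical helpers, the hash-based rule is assumed, and the table contents are made up.

```python
import hashlib

def in_sample(value, percent=1):
    """Hypothetical deterministic rule based on the value's string form."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100 < percent

def sample_dataset(tables, variable, percent=1):
    """Sample only the tables that contain `variable`; leave the rest whole."""
    sampled = {}
    for name, rows in tables.items():
        if rows and variable in rows[0]:
            sampled[name] = [r for r in rows if in_sample(r[variable], percent)]
        else:
            sampled[name] = rows   # no such variable: table is not sampled
    return sampled

# Made-up data: 10,000 patients, 3 hospitalizations each, 50 hospitals.
tables = {
    "Patients": [{"patient_id": i} for i in range(10_000)],
    "Hospitalizations": [
        {"patient_id": i % 10_000, "hospital_id": i % 50} for i in range(30_000)
    ],
    "Hospitals": [{"hospital_id": h} for h in range(50)],
}

sample = sample_dataset(tables, "patient_id")
print({name: len(rows) for name, rows in sample.items()})
```

In this sketch every sampled hospitalization belongs to a sampled patient, while the "Hospitals" table is returned in full because it has no patient_id variable.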

If only some of the dataset's tables are sampled, users with sample access to the dataset will have data access to the 1% sample of the sampled tables and data access to the unsampled tables in full. While this is likely necessary for researchers to meaningfully work with the dataset sample (see paragraph above), it may have ramifications for how you configure your access rules.

By contrast, a random sample is the simplest form of sampling: it creates a corresponding sample for every table in the dataset (including file index tables). Every record / file has a 1% chance of occurring in the sample.

If your dataset contains unstructured data files, you probably want to sample on either the file_name or file_id variables.

Learn more about controlling sample access in the data access reference.
