Redivis Documentation
Notebook concepts

Last updated 3 months ago


Overview

Notebooks provide a highly flexible compute environment for working with data on Redivis. In a notebook, you can reference any table in your workflow, install dependencies, perform analyses in Python, R, Stata, or SAS, store and download files, and generate an output table for downstream analysis.

Transforms vs. notebooks?

There are two mechanisms for working with data in workflows: transforms and notebooks. Understanding when to use each tool is key to taking full advantage of the capabilities of Redivis, particularly when working with big datasets.

Transforms are better for:

  • Reshaping + combining tabular and geospatial data

  • Working with large tables, especially at the many GB to TB scale

  • Preference for a no-code interface, or preference for programming in SQL

  • Declarative, easily documented data operations

Notebooks are better for:

  • Interactive exploration of any data type, including unstructured data files

  • Working with smaller tables (though working with bigger data is possible)

  • Preference for Python, R, Stata, or SAS

  • Interactive visualizations and figure generation

Working with data

Loading data

Python:

import redivis

# Reference the source table with the special "_source_" identifier:
df = redivis.table("_source_").to_pandas_dataframe()

# Reference any other table via its name.
# The last 4 characters are the reference id. This is optional,
#     but recommended to ensure the notebook works as tables get renamed.
df2 = redivis.table("daily_observations:vdwn").to_pandas_dataframe()

# If our table is a file index table, we can load and process those files
for f in redivis.table("_source_").list_files():
    data = f.read()

R:

# Reference the source table with the special "_source_" identifier:
df <- redivis$table("_source_")$to_tibble()

# Reference any other table via its name.
# The last 4 characters are the reference id. This is optional,
#     but recommended to ensure the notebook works as tables get renamed.
df2 <- redivis$table("daily_observations:vdwn")$to_tibble()

# If our table is a file index table, we can load and process those files
for (f in redivis$table("_source_")$list_files()) {
    data <- f$read()
}

Stata:

# We first load the table via Python, and then pass the dataframe into Stata
df = redivis.table("_source_").to_pandas_dataframe(dtype_backend="numpy")

%%stata -d df -force
/* Run Stata code! All Stata cells must be prefixed with %%stata */
describe

SAS:

import saspy
import redivis

sas_session = saspy.SASsession()

# We first load the table via Python, and then pass it into SAS
df = redivis.table("_source_").to_pandas_dataframe(dtype_backend="numpy")

# Load the table into SAS, giving it the name "df"
sas_data = sas_session.df2sd(df, table="df")

%%SAS sas_session
/*
    Run SAS code! All SAS cells must be prefixed with %%SAS,
    and reference the sas_session variable
*/
proc print data=df(obs=5);
run;
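When the source is a file index table, a common pattern is to materialize the files onto the notebook's disk before processing them. The sketch below assumes each file object exposes a name attribute and a read() method returning bytes, as the objects returned by list_files() do in the examples above; save_files itself is a hypothetical helper, not part of the Redivis API.

```python
from pathlib import Path

def save_files(files, dest="/scratch/files"):
    """Write each file-like object (with .name and .read()) into dest,
    returning the list of paths written."""
    out_dir = Path(dest)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for f in files:
        path = out_dir / f.name
        path.write_bytes(f.read())
        written.append(path)
    return written

# In a notebook, assuming a file index source table, you might call:
# paths = save_files(redivis.table("_source_").list_files())
```

Writing to /scratch keeps these copies out of persistent storage; point dest at a path under /out instead if you want the files to survive across sessions.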

Analyzing data

Redivis notebooks support the following kernels (programming languages). For more details and examples on how to use notebooks in each language, consult the language-specific documentation:

  • Python notebooks
  • R notebooks
  • Stata notebooks
  • SAS notebooks

Outputting tables

A notebook can generate an output table as the result of its execution. This output table is created programmatically, e.g.:

Python:

# Multiple types for "df" are supported
# Consult the language-specific docs for more info
df = get_dataframe_somehow()

redivis.current_notebook().create_output_table(df)

R:

# Multiple types for "df" are supported
# Consult the language-specific docs for more info
df <- get_dataframe_somehow()

redivis$current_notebook()$create_output_table(df)

Stata:

%%stata -doutd df2
/*
  Once this cell executes, the current dataset will be pushed
  to the Python variable df2
*/
rename v* newv*

# Via Python, pass this dataframe to the output table
redivis.current_notebook().create_output_table(df2)

SAS:

# Reference the table named "some_table" in SAS
sas_table = sas_session.sasdata("some_table")

# Convert the sas_table to a pandas dataframe
df = sas_table.to_df()

redivis.current_notebook().create_output_table(df)

Storing files

As you perform your analysis, you may generate files that are stored on the notebook's hard disk. There are two locations that you should write files to: /out for persistent storage, and /scratch for temporary storage.

Any files written to persistent storage will be available when the notebook is stopped, and will be restored to the same state when the notebook is run again. By contrast, files written to temporary storage exist only for the duration of the current notebook session.

To write files to these directories, use your programming language's standard file-writing tools. For example:

Python:

df = get_dataframe_somehow()

# Write temporary files to /scratch, and files you want to persist to /out:
df.to_csv("/out/data.csv")

R:

df <- get_dataframe_somehow()

write.csv(df, "/out/data.csv", na="")

Stata:

%%stata
save "/out/my_dataset.dta"

SAS:

%%SAS sas_session
proc export data=datasetname
  outfile='/out/filename.csv'
  dbms=csv
  replace;
run;

Notebook management

Creation

Starting and stopping

Compute configuration

Persistence

All notebooks are automatically saved as you go. Every time a notebook is stopped, all cell inputs are saved to the notebook version history, giving you a historical record of all code that was run. Additionally, all cell outputs from the last notebook session will be preserved, as will any files written to the /out directory.

Clearing outputs

When starting a notebook, you'll be presented with the option to "Clear all outputs and start". This can be helpful because it resets all access rules associated with the notebook, since no data from previous sessions remains in it.

Choosing this option will clear all output cells in your notebook, any files saved in the /out directory, and any output tables from the notebook.

Logs

You can click the three-dot More menu to open the logs for this notebook. Opening the logs when a notebook is stopped will show the logs from the notebook's previous run.

Lifecycle

Activity is determined by the Jupyter kernel: if you have a long-running computation, the notebook is considered active for the entire time.

Collaboration

All Redivis notebooks support real-time collaboration, allowing multiple editors to edit and run cells in a running notebook. When another editor is active in a notebook, you will see a colored cursor associated with them. Workflow viewers will see a read-only version of the notebook.

Changing the source table

To change a notebook's primary source table, either right-click on the notebook or click the three-dot (⋮) icon and select the "change source table" option.

Limitations

Dependencies

All notebooks come with a number of common packages pre-installed. You can install additional packages by clicking the Edit dependencies button in the notebook start modal or toolbar.

For more detailed information about the default dependencies and adding new packages, consult the documentation for your notebook type:

  • Python base image & dependencies
  • R base image & dependencies
  • Stata base image & dependencies
  • SAS base image & dependencies

For notebooks that reference restricted data, internet access will be disabled while the notebook is running. This means that the dependencies interface is the only place from which you can install dependencies; for example, running pip install for Python or devtools::install() for R within your notebook will fail.

Moreover, it is strongly recommended to always install your dependencies through the dependencies interface (regardless of whether your notebook has internet access), as this provides better reproducibility and documentation for future use.

Files

Files written to the /out directory are always available and persist across notebook sessions. This allows you to cache results between sessions, avoiding the need to rerun time-intensive computations.

The files in the /scratch directory are only available when the notebook is running, and will be cleared once it is stopped. The default "working directory" of all notebooks is /scratch – this is where files will be written if you do not specify another location.

You can view the files in either directory by pressing the Files button at the top right of the notebook.
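Because /out persists across sessions while /scratch does not, expensive intermediate results can be cached to /out and reused on later runs. The following is a minimal sketch of that pattern; cached_json and the /out/summary.json path are illustrative, not part of the Redivis API.

```python
import json
from pathlib import Path

def cached_json(path, compute):
    """Return JSON-serializable results cached at `path`;
    compute and write them to the cache on the first call."""
    p = Path(path)
    if p.exists():
        return json.loads(p.read_text())
    result = compute()
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(json.dumps(result))
    return result

# The first session computes and writes; later sessions read the cache:
# summary = cached_json("/out/summary.json", run_expensive_summary)
```

Use a path under /scratch instead if the cached result should be discarded when the notebook stops.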

Version history

Every time you stop your notebook, all cell inputs (your code and markdown) will be saved and associated with that notebook session. You can view the code from all previous sessions by pressing the History button at the top right of your notebook, allowing you to view and share your code as it was at any previous point in time.

Access rules

Determining notebook access

Access levels

In order to view a notebook, you must first have view access to the corresponding workflow, and in order to run and edit the notebook, you must also have edit access to that workflow.

External internet access

If a notebook contains data with export restrictions, access to the external internet will be disabled while the notebook is running.

When the internet is disabled in a notebook, you can still specify packages and other startup scripts in the Dependencies modal that will be installed on notebook start. Additionally, if any of your packages require internet access to run, you'll need to "preload" any content using a post-install script. For example, if you're using the tidycensus package in R, you could preload content as follows:

R -e '
  library(tidycensus)
  library(tidyverse)

  census_api_key("YOUR API KEY GOES HERE")
  get_decennial(geography = "state", 
                 variables = "P13_001N", 
                 year = 2020,
                 sumfile = "dhc")
'

Downloading files

Exporting notebooks

Notebooks can be downloaded as PDF, HTML, and .ipynb files by clicking the three-dot More button at the top right of the notebook.

You will be given the option of whether to include cell outputs in your export — it is important that you ensure the outputs displayed in your notebook do not contain sensitive data, and that your subsequent distribution is in compliance with any data use agreements.

From within your notebook, you can load any data available in your workflow. You can reference the notebook's primary source table via the special _source_ identifier, or reference any other table in the workflow by its name. To ensure that your notebook doesn't break when tables get renamed, use the qualified reference (table name plus reference id) for non-primary tables, as shown in the Loading data examples above.

You can inspect and download these files at any time.

Create a notebook by clicking on a table node in a workflow and selecting + Notebook. This table will become the default source table for your new notebook, which will have pre-generated code that references the table's data.

Notebook nodes need to be started in order to edit or execute cells. Click the purple Start notebook button in the top right to start the notebook and provision compute resources. You can also elect to "Clear outputs and start", which will remove all outputs and reset any referenced tables in the notebook.

By default, notebooks are provisioned with 32GB of memory and 2 CPU cores, with compute power comparable to most personal computers. You can view and alter the notebook's compute resources in the More menu.

Default notebooks have a maximum lifetime of 6 hours, and any running notebook will automatically be stopped after 30 minutes of inactivity. If you use a notebook with paid custom compute, these values can be modified.

Notebooks are subject to certain concurrency and duration limits.

Notebooks offer special capabilities for files written to specific directories on the notebook's hard disk. Any files you've stored in a notebook's /out and /scratch directories will be available in the files modal, where you can preview and download specific file outputs from your notebook.

You can list files in either directory by pressing the corresponding tab, and click on any file to view it. Redivis supports interactive previews for many file types in the file inspector, and you can also download the file for further inspection and analysis. To download all files in a directory, click the Download all button in the files modal.

Your access to a notebook is determined by your corresponding access to all tables (and their antecedent datasets) referenced by the notebook. These linkages persist across notebook sessions, since a future session could reference data from a previous session. To reset the tables referenced by your notebook, which will also clear all outputs in the notebook, you can choose Clear outputs and start when starting the notebook.

Additionally, your access to a notebook is governed by your access to its source tables. In order to run a notebook and see its outputs, you must have data access to all source tables. If you have metadata access, you will be able to see cell inputs in a notebook (that is, the code), but not outputs. If you only have overview (or no) access to the source tables, you will not be able to see notebook contents.

Typically, you will be able to download any files written to the notebook's /out or /scratch directories. However, if a notebook references data with export restrictions, you will not be able to download these files unless the file size is smaller than the relevant size-based export restrictions specified on the source datasets.
