Redivis Documentation
Notebook concepts

Last updated 3 months ago


Overview

Notebooks provide a highly flexible compute environment for working with data on Redivis. In a notebook, you can reference any table in your workflow, install dependencies, perform analyses in Python, R, Stata, or SAS, store and download files, and generate an output table for downstream analysis.

Transforms vs. notebooks?

There are two mechanisms for working with data in workflows: transforms and notebooks. Understanding when to use each tool is key to taking full advantage of the capabilities of Redivis, particularly when working with big datasets.

Transforms are better for:

  • Reshaping + combining tabular and geospatial data

  • Working with large tables, especially at the many GB to TB scale

  • Preference for a no-code interface, or preference for programming in SQL

  • Declarative, easily documented data operations

Notebooks are better for:

  • Interactive exploration of any data type, including unstructured data files

  • Working with smaller tables (though working with bigger data is possible)

  • Preference for Python, R, Stata, or SAS

  • Interactive visualizations and figure generation

Working with data

Loading data

Python:

import redivis

# Reference the source table with the special "_source_" identifier:
df = redivis.table("_source_").to_pandas_dataframe()

# Reference any other table via its name.
# The last 4 characters are the reference id. This is optional,
#     but recommended to ensure the notebook works as tables get renamed.
df2 = redivis.table("daily_observations:vdwn").to_pandas_dataframe()

# If our table is a file index table, we can load and process those files
for f in redivis.table("_source_").list_files():
    data = f.read()

R:

# Reference the source table with the special "_source_" identifier:
df <- redivis$table("_source_")$to_tibble()

# Reference any other table via its name.
# The last 4 characters are the reference id. This is optional,
#     but recommended to ensure the notebook works as tables get renamed.
df2 <- redivis$table("daily_observations:vdwn")$to_tibble()

# If our table is a file index table, we can load and process those files
for (f in redivis$table("_source_")$list_files()) {
    data <- f$read()
}

Stata:

# We first load the table via Python, and then pass the dataframe into Stata
df = redivis.table("_source_").to_pandas_dataframe(dtype_backend="numpy")

%%stata -d df -force
/* Run Stata code! All Stata cells must be prefixed with %%stata */
describe

SAS:

import saspy
import redivis

sas_session = saspy.SASsession()

# We first load the table via Python, and then pass it into SAS
df = redivis.table("_source_").to_pandas_dataframe(dtype_backend="numpy")

# Load the table into SAS, giving it the name "df"
sas_data = sas_session.df2sd(df, table="df")

%%SAS sas_session
/*
    Run SAS code! All SAS cells must be prefixed with %%SAS,
    and reference the sas_session variable
*/
proc print data=df(obs=5);
run;
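When the source is a file index table, a common pattern is to materialize the files onto the notebook's disk before processing them. The sketch below assumes each file object exposes a name attribute and a read() method returning bytes, as the objects returned by list_files() do in the examples above; save_files itself is a hypothetical helper, not part of the Redivis API.

```python
from pathlib import Path

def save_files(files, dest="/scratch/files"):
    """Write each file-like object (with .name and .read()) into dest,
    returning the list of paths written."""
    out_dir = Path(dest)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for f in files:
        path = out_dir / f.name
        path.write_bytes(f.read())
        written.append(path)
    return written

# In a notebook, assuming a file index source table, you might call:
# paths = save_files(redivis.table("_source_").list_files())
```

Writing to /scratch keeps these copies out of persistent storage; point dest at a path under /out instead if you want the files to survive across sessions.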

Analyzing data

Redivis notebooks support the following kernels (programming languages). For more details and examples on how to use notebooks in each language, consult the language-specific documentation:

  • Python notebooks
  • R notebooks
  • Stata notebooks
  • SAS notebooks

Outputting tables

A notebook can generate an output table as the result of its execution. This output table is created programmatically, e.g.:

Python:

# Multiple types for "df" are supported
# Consult the language-specific docs for more info
df = get_dataframe_somehow()

redivis.current_notebook().create_output_table(df)

R:

# Multiple types for "df" are supported
# Consult the language-specific docs for more info
df <- get_dataframe_somehow()

redivis$current_notebook()$create_output_table(df)

Stata:

%%stata -doutd df2
/*
  Once this cell executes, the current dataset will be pushed
  to the Python variable df2
*/
rename v* newv*

# Via Python, pass this dataframe to the output table
redivis.current_notebook().create_output_table(df2)

SAS:

# Reference the table named "some_table" in SAS
sas_table = sas_session.sasdata("some_table")

# Convert the sas_table to a pandas dataframe
df = sas_table.to_df()

redivis.current_notebook().create_output_table(df)

Storing files

As you perform your analysis, you may generate files that are stored on the notebook's hard disk. There are two locations that you should write files to: /out for persistent storage, and /scratch for temporary storage.

Any files written to persistent storage will be available when the notebook is stopped, and will be restored to the same state when the notebook is run again. By contrast, files written to temporary storage exist only for the duration of the current notebook session.

To write files to these directories, use your programming language's standard file-writing tools. For example:

Python:

df = get_dataframe_somehow()

# Write temporary files to /scratch, and files you want to persist to /out:
df.to_csv("/out/data.csv")

R:

df <- get_dataframe_somehow()

write.csv(df, "/out/data.csv", na="")

Stata:

%%stata
save "/out/my_dataset.dta"

SAS:

%%SAS sas_session
proc export data=datasetname
  outfile='/out/filename.csv'
  dbms=csv
  replace;
run;

Notebook management

Creation

Starting and stopping

Compute configuration

Persistence

All notebooks are automatically saved as you go. Every time a notebook is stopped, all cell inputs are saved to the notebook version history, giving you a historical record of all code that was run. Additionally, all cell outputs from the last notebook session will be preserved, as will any files written to the /out directory.

Clearing outputs

When starting a notebook, you'll be presented with the option to "Clear all outputs and start". This can be helpful because it resets all access rules associated with the notebook, since no data from previous sessions remains in it.

Choosing this option will clear all output cells in your notebook, any files saved in the /out directory, and any output tables from the notebook.

Logs

You can click the three-dot More menu to open the logs for this notebook. Opening the logs when a notebook is stopped will show the logs from the notebook's previous run.

Lifecycle

Activity is determined by the Jupyter kernel: if you have a long-running computation, the notebook is considered active for the entire time.

Collaboration

All Redivis notebooks support real-time collaboration, allowing multiple editors to edit and run cells in a running notebook. When another editor is active in a notebook, you will see a colored cursor associated with them. Workflow viewers will see a read-only version of the notebook.

Changing the source table

To change a notebook's primary source table, either right-click on the notebook or click the three-dot (⋮) icon and select the "change source table" option.

Limitations

Dependencies

All notebooks come with a number of common packages pre-installed. You can install additional packages by clicking the Edit dependencies button in the notebook start modal or toolbar.

For more detailed information about the default dependencies and adding new packages, consult the documentation for your notebook type:

  • Python base image & dependencies
  • R base image & dependencies
  • Stata base image & dependencies
  • SAS base image & dependencies

For notebooks that reference restricted data, internet access will be disabled while the notebook is running. This means that the dependencies interface is the only place from which you can install dependencies; for example, running pip install for Python or devtools::install() for R within your notebook will fail.

Moreover, it is strongly recommended to always install your dependencies through the dependencies interface (regardless of whether your notebook has internet access), as this provides better reproducibility and documentation for future use.

Files

Files written to the /out directory are always available and persist across notebook sessions. This allows you to cache results between sessions, avoiding the need to rerun time-intensive computations.

The files in the /scratch directory are only available when the notebook is running, and will be cleared once it is stopped. The default "working directory" of all notebooks is /scratch – this is where files will be written if you do not specify another location.

You can view the files in either directory by pressing the Files button at the top right of the notebook.
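Because /out persists across sessions while /scratch does not, expensive intermediate results can be cached to /out and reused on later runs. The following is a minimal sketch of that pattern; cached_json and the /out/summary.json path are illustrative, not part of the Redivis API.

```python
import json
from pathlib import Path

def cached_json(path, compute):
    """Return JSON-serializable results cached at `path`;
    compute and write them to the cache on the first call."""
    p = Path(path)
    if p.exists():
        return json.loads(p.read_text())
    result = compute()
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(json.dumps(result))
    return result

# The first session computes and writes; later sessions read the cache:
# summary = cached_json("/out/summary.json", run_expensive_summary)
```

Use a path under /scratch instead if the cached result should be discarded when the notebook stops.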

Version history

Every time you stop your notebook, all cell inputs (your code and markdown) will be saved and associated with that notebook session. You can view the code from all previous sessions by pressing the History button at the top right of your notebook, allowing you to view and share your code as it was at any previous point in time.

Access rules

Determining notebook access

Access levels

In order to view a notebook, you must first have view access to the corresponding workflow, and in order to run and edit the notebook, you must also have edit access to that workflow.

External internet access

If a notebook contains data with export restrictions, access to the external internet will be disabled while the notebook is running.

When the internet is disabled in a notebook, you can still specify packages and other startup scripts in the Dependencies modal that will be installed on notebook start. Additionally, if any of your packages require internet access to run, you'll need to "preload" any content using a post-install script. For example, if you're using the tidycensus package in R, you could preload content as follows:

R -e '
  library(tidycensus)
  library(tidyverse)

  census_api_key("YOUR API KEY GOES HERE")
  get_decennial(geography = "state", 
                 variables = "P13_001N", 
                 year = 2020,
                 sumfile = "dhc")
'

Downloading files

Exporting notebooks

Notebooks can be downloaded as PDF, HTML, and .ipynb files by clicking the three-dot More button at the top right of the notebook.

You will be given the option of whether to include cell outputs in your export — it is important that you ensure the outputs displayed in your notebook do not contain sensitive data, and that your subsequent distribution is in compliance with any data use agreements.

From within your notebook, you can load any data available in your workflow. You can reference the notebook's primary source table via the special _source_ identifier, or reference any other table in the workflow by its name. To ensure that your notebook doesn't break when tables get renamed, use the qualified reference (table name plus reference id) for non-primary tables, as shown in the Loading data examples above.

You can inspect and download these files at any time.

Create a notebook by clicking on a table node in a workflow and selecting + Notebook. This table will become the default source table for your new notebook, which will have pre-generated code that references the table's data.

Notebook nodes need to be started in order to edit or execute cells. Click the purple Start notebook button in the top right to start the notebook and provision compute resources. You can also elect to "Clear outputs and start", which will remove all outputs and reset any referenced tables in the notebook.

By default, notebooks are provisioned with 32GB of memory and 2 CPU cores, with compute power comparable to most personal computers. You can view and alter the notebook's compute resources in the More menu.

Default notebooks have a maximum lifetime of 6 hours, and any running notebook will automatically be stopped after 30 minutes of inactivity. If you use a notebook with paid custom compute, these values can be modified.

Notebooks are subject to certain concurrency and duration limits.

Notebooks offer special capabilities for files written to specific directories on the notebook's hard disk. Any files you've stored in a notebook's /out and /scratch directories will be available in the files modal, where you can preview and download specific file outputs from your notebook.

You can list files in either directory by pressing the corresponding tab, and click on any file to view it. Redivis supports interactive previews for many file types in the file inspector, and you can also download the file for further inspection and analysis. To download all files in a directory, click the Download all button in the files modal.

Your access to a notebook is determined by your corresponding access to all tables (and their antecedent datasets) referenced by the notebook. These linkages persist across notebook sessions, since a future session could reference data from a previous session. To reset the tables referenced by your notebook, which will also clear all outputs in the notebook, you can choose Clear outputs and start when starting the notebook.

Additionally, your access to a notebook is governed by your access to its source tables. In order to run a notebook and see its outputs, you must have data access to all source tables. If you have metadata access, you will be able to see cell inputs in a notebook (that is, the code), but not outputs. If you only have overview (or no) access to the source tables, you will not be able to see notebook contents.

Typically, you will be able to download any files written to the notebook's /out or /scratch directories. However, if a notebook references data with export restrictions, you will not be able to download these files unless the file size is smaller than the relevant size-based export restrictions specified on the source datasets.
