Notebook concepts

Overview

Notebooks provide a highly flexible compute environment for working with data on Redivis. In a notebook, you can reference any table in your project, install dependencies, perform analyses in Python, R, Stata, or SAS, store and download files, and generate an output table for downstream analysis.

Transforms vs. notebooks?

There are two mechanisms for working with data in projects: transforms and notebooks. Understanding when to use each tool is key to taking full advantage of the capabilities of Redivis, particularly when working with big datasets.

Transforms are better for:

  • Reshaping + combining tabular and geospatial data

  • Working with large tables, especially at the many GB to TB scale

  • Preference for a no-code interface, or preference for programming in SQL

  • Declarative, easily documented data operations

Notebooks are better for:

  • Interactive exploration of any data type, including unstructured data files

  • Working with smaller tables (though working with bigger data is possible)

  • Preference for Python, R, Stata, or SAS

  • Interactive visualizations and figure generation

Working with data

Loading data

From within your notebook, you can load any data available in your project. You can reference the primary source table of the notebook via the special _source_ identifier, or reference any other table in the project by its name. To ensure that your notebook doesn't break when tables get renamed, make sure to use the qualified reference for non-primary tables. For example:

import redivis

# Reference the source table with the special "_source_" identifier:
df = redivis.table("_source_").to_pandas_dataframe()

# Reference any other table via its name:
# The last 4 characters are the reference id. This is optional, 
#     but recommended to ensure the notebook works as tables get renamed.
df2 = redivis.table("daily_observations:vdwn").to_pandas_dataframe()

# If our table is a file index table, we can load and process those files
for f in redivis.table("_source_").list_files():
    data = f.read()

Analyzing data

Redivis notebooks support the following kernels (programming languages). For more details and examples on how to use notebooks in each language, consult the language-specific documentation:

  • Python notebooks

  • R notebooks

  • Stata notebooks

  • SAS notebooks

Outputting tables

A notebook can generate an output table as the result of its execution. This output table is created programmatically, e.g.:

# Multiple types for "df" are supported
# Consult the language-specific docs for more info
df = get_dataframe_somehow()

redivis.current_notebook().create_output_table(df)

Storing files

As you perform your analysis, you may generate files that are stored on the notebook's hard disk. There are two locations that you should write files to: /out for persistent storage, and /scratch for temporary storage.

Any files written to persistent storage will be available when the notebook is stopped, and will be restored to the same state when the notebook is run again. In contrast, files written to temporary storage exist only for the duration of the current notebook session.

To write files to these directories, use the standard tools of your programming language for writing files. For example:

df = get_dataframe_somehow()

# Write temporary files to /scratch, and files you want to persist to /out:
df.to_csv("/out/data.csv")

You can inspect and download these files anytime.

Notebook management

Creation

Create a notebook by clicking on a table node in a project and selecting + Notebook. This table becomes the default source table for your new notebook, and the notebook will include pre-generated code that references the table's data.

Starting and stopping

Notebook nodes need to be started in order to edit or execute cells. Click the purple Start notebook button in the top right to start the notebook and provision compute resources. You can also elect to "Clear outputs and start", which will remove all outputs and reset any referenced tables in the notebook.

Compute configuration

By default, notebooks are provisioned with 32 GB of memory and 2 CPU cores, with compute power comparable to most personal computers. You can view and alter the notebook's compute resources in the More menu.

Persistence

All notebooks are automatically saved as you go. Every time a notebook is stopped, all cell inputs are saved to the notebook version history, giving you a historical record of all code that was run. Additionally, all cell outputs from the last notebook session will be preserved, as will any files written to the /out directory.

Logs

You can click the three-dot More menu to open the logs for this notebook. Opening the logs when a notebook is stopped will show the logs from the notebook's previous run.

Lifecycle

Notebooks have a maximum lifetime of 6 hours, and after 30 minutes of inactivity, any running notebook will automatically be stopped. Activity is determined based on the Jupyter kernel — if you have a long-running computation, the notebook will be considered as active for the entire time.
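The interaction between the two limits above can be sketched as simple date arithmetic. This is only an illustration of the rule as described, not Redivis's implementation: the function name is hypothetical, and it assumes the 6-hour lifetime is measured from when the notebook was started and the 30 minutes from the last kernel activity.

```python
from datetime import datetime, timedelta

# Limits taken from the text above; everything else here is a hypothetical illustration.
MAX_LIFETIME = timedelta(hours=6)
IDLE_TIMEOUT = timedelta(minutes=30)


def expected_stop_time(started_at: datetime, last_activity_at: datetime) -> datetime:
    """Return whichever limit is reached first: maximum lifetime or inactivity."""
    lifetime_deadline = started_at + MAX_LIFETIME
    idle_deadline = last_activity_at + IDLE_TIMEOUT
    return min(lifetime_deadline, idle_deadline)


start = datetime(2024, 1, 1, 9, 0)

# The kernel was last busy at 10:15, so the inactivity limit applies first.
idle_case = expected_stop_time(start, datetime(2024, 1, 1, 10, 15))

# A computation still running at 14:50 reaches the lifetime limit first.
busy_case = expected_stop_time(start, datetime(2024, 1, 1, 14, 50))

print(idle_case, busy_case)
```

In the second case, a long-running computation keeps the notebook active, but it would still be stopped once the maximum lifetime is reached.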

Collaboration

All Redivis notebooks support real-time collaboration, allowing multiple editors to edit and run cells in a running notebook. When another editor is active in a notebook, you will see a colored cursor associated with them. Project viewers will see a read-only version of the notebook.

Changing the source table

To change a notebook's primary source table, either right-click on the notebook or click the three-dot More menu and select the "change source table" option.

Limitations

Notebooks are subject to certain concurrency and duration limits.

Dependencies

All notebooks come with a number of common packages pre-installed. You can install additional packages by clicking the Edit dependencies button in the notebook start modal or toolbar.

If you are working with restricted data, internet access will be disabled in your notebook while it's running, so you will need to specify any dependencies here for them to be installed during startup. While the notebook is running, you can still configure additional packages here; they will be saved and applied the next time you start the notebook.

Packages

Packages from PyPI (Python) or Posit (R) can be referenced here by name. The latest version will automatically be filled in, but if you would like to use a different version, you can edit it here.
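In a Python notebook, one way to confirm which version of a package was actually installed at startup is to query the package metadata from a cell. The following is a minimal sketch using only the Python standard library; the helper name is hypothetical, and "pandas" is just an example package name.

```python
from importlib import metadata
from typing import Optional


def installed_version(package_name: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "pandas" is only an example; substitute any package you listed as a dependency.
for name in ["pandas", "a-package-that-is-not-installed"]:
    print(name, installed_version(name))
```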

Stata dependencies are installed via the pre_install.sh or post_install.sh scripts (see below). E.g.:

stata -e -q 'ssc install outreg'

Pre-install and post-install scripts

To support full flexibility, you can also define pre- and post-install shell scripts for your notebook (executed on either side of the packages install).

For example:

# Install base linux packages
apt-get install -y libgdal-dev

# Install R packages from github
R -e 'library(devtools)
install_github("DeveloperName/PackageName")'

# Install python packages from github
pip install git+https://github.com/user/repo.git@branch

# Install Stata packages
stata -e -q 'ssc install outreg'

If your notebook uses data with export restrictions, the dependency scripts are the only point at which the notebook will be able to connect to the outside internet.

Files

Notebooks offer special capabilities for files written to specific directories on the notebook's hard disk. Any files you've stored in a notebook's /out and /scratch directories will be available in the files modal. This modal allows you to preview and download specific file outputs from your notebook.

Moreover, files written to the /out directory are always available, and will persist across notebook sessions. This allows for workflows where you can cache certain results between notebook sessions, avoiding the need to rerun time-intensive computations.
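The caching workflow described above can be sketched as follows. This is a minimal illustration, not a Redivis API: the helper names and the /out/cache/summary.json path are hypothetical, and it assumes the result can be serialized as JSON.

```python
import json
from pathlib import Path
from typing import Callable


def cached_json(cache_path: Path, compute: Callable[[], dict]) -> dict:
    """Load a result from cache_path if it exists; otherwise compute and store it."""
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    result = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(result), encoding="utf-8")
    return result


def slow_summary() -> dict:
    # Stand-in for a time-intensive computation.
    return {"row_count": sum(1 for _ in range(1_000_000))}


# In a notebook, point the cache at persistent storage (hypothetical path):
# summary = cached_json(Path("/out/cache/summary.json"), slow_summary)
```

On the first session the computation runs and its result is written under /out; in later sessions the stored file is read back instead. Delete the cached file whenever the underlying data or code changes, since this sketch does not detect stale results.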

The files in the /scratch directory are only available when the notebook is running, and will be cleared once it is stopped. The default "working directory" of all notebooks is /scratch – this is where files will be written if you do not specify another location.
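Because the working directory is /scratch, a relative path and an absolute path can end up in different storage locations. The sketch below only illustrates standard path resolution; the helper name is hypothetical and nothing is written to disk.

```python
from pathlib import PurePosixPath


def resolved_location(path: str, working_dir: str = "/scratch") -> str:
    """Show where a path resolves when the working directory is working_dir."""
    # Joining with an absolute path discards working_dir, as the OS would.
    return str(PurePosixPath(working_dir) / path)


relative_target = resolved_location("results.csv")       # temporary storage
absolute_target = resolved_location("/out/results.csv")  # persistent storage
print(relative_target, absolute_target)
```

In other words, a call such as df.to_csv("results.csv") writes to temporary storage, so use an explicit /out path for anything you want to keep.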

You can view the files in either directory by pressing the Files button at the top right of the notebook.

You can list files in either directory by pressing the corresponding tab, and click on any file to view it. Redivis supports interactive previews for many file types in the file inspector, and you can also download the file for further inspection and analysis. To download all files in a directory, click the Download all button in the files modal.

Version history

Every time you stop your notebook, all cell inputs (your code and markdown) will be saved and associated with that notebook session. You can view the code from all previous sessions by pressing the History button at the top right of your notebook, allowing you to view and share your code as it was at any previous point in time.

Access rules

Determining notebook access

Your access to a notebook is determined by your corresponding access to all tables (and their antecedent datasets) referenced by the notebook. These linkages persist across notebook sessions, as a future session could reference data from a previous session. In order to reset the tables referenced by your notebook, which will also clear all outputs in the notebook, you can choose to Clear outputs and start when starting the notebook.

Access levels

In order to view a notebook, you must first have view access to the corresponding project, and in order to run and edit the notebook, you must also have edit access to that project.

Additionally, your access to a notebook is governed by your access to its source tables. In order to run a notebook and see its outputs, you must have data access to all source tables. If you have metadata access, you will be able to see cell inputs in a notebook (that is, the code), but not outputs. If you only have overview (or no) access to the source tables, you will not be able to see notebook contents.
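The rules above can be summarized as a small decision function. This is a sketch of the rules as described in this section, not Redivis's implementation: the function name and string labels are hypothetical, only the access levels named in the text are modeled, and it assumes the weakest access across all source tables governs the notebook.

```python
from typing import Dict, List

# Table access levels named in the text, ordered from least to most access.
TABLE_ACCESS_ORDER = ["none", "overview", "metadata", "data"]


def notebook_permissions(project_access: str, table_access_levels: List[str]) -> Dict[str, bool]:
    """Summarize what a user can do with a notebook.

    project_access: "none", "view", or "edit" (hypothetical labels).
    table_access_levels: one entry per source table, each from TABLE_ACCESS_ORDER.
    """
    # Assumption: the weakest access across all source tables governs the notebook.
    weakest = min(table_access_levels, key=TABLE_ACCESS_ORDER.index)
    rank = TABLE_ACCESS_ORDER.index(weakest)
    can_view_project = project_access in ("view", "edit")

    see_inputs = can_view_project and rank >= TABLE_ACCESS_ORDER.index("metadata")
    see_outputs = can_view_project and weakest == "data"
    can_run = project_access == "edit" and weakest == "data"
    return {"see_inputs": see_inputs, "see_outputs": see_outputs, "run": can_run}


print(notebook_permissions("view", ["data", "metadata"]))
```

For instance, a project viewer with data access to one source table but only metadata access to another would see the code but not the outputs.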

External internet access

If a notebook contains data with export restrictions, internet access within the running notebook will be disabled. You can still specify packages and other startup scripts that will be installed when the notebook first starts, but once the notebook is running, internet will be disabled.

Downloading files

Typically, you will be able to download any files written to the notebook's /out or /scratch directories. However, if a notebook references data with export restrictions, you will not be able to download these files, unless the file size is smaller than the relevant size-based export restrictions specified on source datasets.

Exporting notebooks

Notebooks can be downloaded as PDF, HTML, and .ipynb files by clicking the three-dot More button at the top right of the notebook.

You will be given the option of whether to include cell outputs in your export — it is important that you ensure the outputs displayed in your notebook do not contain sensitive data, and that your subsequent distribution is in compliance with any data use agreements.
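If you have already exported an .ipynb file with outputs included and later want to share it without them, the outputs can be cleared from the file itself. The following is a minimal sketch that assumes the standard Jupyter nbformat 4 JSON layout (a top-level "cells" list, with "outputs" and "execution_count" on code cells); the function names are hypothetical. It does not inspect cell inputs or markdown, which may also contain sensitive values.

```python
import json
from pathlib import Path


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook dict with code cell outputs removed."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy via a JSON round-trip
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def strip_outputs_file(src: Path, dest: Path) -> None:
    """Read an exported .ipynb file and write a copy without cell outputs."""
    notebook = json.loads(src.read_text(encoding="utf-8"))
    dest.write_text(json.dumps(strip_outputs(notebook), indent=1), encoding="utf-8")


# Example with hypothetical file names:
# strip_outputs_file(Path("analysis.ipynb"), Path("analysis_no_outputs.ipynb"))
```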
