Notebook concepts
Overview
Notebooks provide a highly flexible compute environment for working with data on Redivis. In a notebook, you can reference any table in your workflow, install dependencies, perform analyses in Python, R, Stata, or SAS, store and download files, and generate an output table for downstream analysis.
Transforms vs. notebooks?
There are two mechanisms for working with data in workflows: transforms and notebooks. Understanding when to use each tool is key to taking full advantage of the capabilities of Redivis, particularly when working with big datasets.
Transforms are better for:
Reshaping + combining tabular and geospatial data
Working with large tables, especially at the many GB to TB scale
Preference for a no-code interface, or preference for programming in SQL
Declarative, easily documented data operations
Notebooks are better for:
Interactive exploration of any data type, including unstructured data files
Working with smaller tables (though working with bigger data is possible)
Preference for Python, R, Stata, or SAS
Interactive visualizations and figure generation
Working with data
Loading data
From within your notebook, you can load any data available in your workflow. You can reference the primary source table of the notebook via the special _source_ identifier, or reference any other table in the workflow by its name. To ensure that your notebook doesn't break when tables get renamed, make sure to use the qualified reference for non-primary tables.
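As a minimal sketch of loading tables in a Python notebook, assuming the `redivis` client library is available (as it is expected to be in Redivis Python notebooks): the reference `my_other_table:4fab` is a hypothetical placeholder for a qualified reference, so substitute the reference of a table in your own workflow. The import sits inside the function so that the definition itself runs anywhere; calling it only works inside a Redivis notebook.

```python
def load_workflow_tables():
    """Load the notebook's source table and one other workflow table as pandas DataFrames."""
    import redivis  # assumed to be pre-installed in Redivis Python notebooks

    # The primary source table is addressed with the special "_source_" identifier.
    source_df = redivis.table("_source_").to_pandas_dataframe()

    # Hypothetical qualified reference to a non-primary table; a qualified
    # reference is meant to keep working even if the table is later renamed.
    other_df = redivis.table("my_other_table:4fab").to_pandas_dataframe()

    return source_df, other_df
```

Inside a running notebook, `source_df, other_df = load_workflow_tables()` would then give you both tables as data frames.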
Analyzing data
Redivis notebooks support Python, R, Stata, and SAS kernels (programming languages). For more details and examples on how to use notebooks in each language, consult the language-specific documentation.
Outputting tables
A notebook can generate an output table as the result of its execution. This output table is created programmatically from within the notebook's code.
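A minimal sketch of creating the output table from Python, assuming the `redivis` client library and a pandas DataFrame `df` that you have already built in the notebook. The wrapper function name is hypothetical; as with any `redivis` call, it only works inside a running Redivis notebook.

```python
def write_output_table(df):
    """Publish a DataFrame as this notebook's output table."""
    import redivis  # assumed to be pre-installed in Redivis Python notebooks

    # current_notebook() refers to the notebook this code is running in;
    # the resulting table becomes available to downstream nodes in the workflow.
    redivis.current_notebook().create_output_table(df)
```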
Storing files
As you perform your analysis, you may generate files that are stored on the notebook's hard disk. There are two locations that you should write files to: /out for persistent storage, and /scratch for temporary storage.
Any files written to persistent storage will be available when the notebook is stopped, and will be restored to the same state when the notebook is run again. In contrast, files written to temporary storage exist only for the duration of the current notebook session.
To write files to these directories, use the standard file-writing tools of your programming language.
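For instance, a Python notebook could write to both locations with `pathlib`. This is a sketch: the file names and contents are made up, and the `resolve_dir` helper is a hypothetical convenience that falls back to a temporary directory so the snippet also runs outside a Redivis notebook, where /out and /scratch may not exist.

```python
import os
import tempfile
from pathlib import Path


def resolve_dir(preferred: str) -> Path:
    """Return `preferred` if it is a writable directory, else a temporary stand-in."""
    path = Path(preferred)
    if path.is_dir() and os.access(path, os.W_OK):
        return path
    # Fallback for running this sketch outside a notebook (an assumption, not Redivis behavior).
    return Path(tempfile.mkdtemp(prefix=path.name + "_"))


out_dir = resolve_dir("/out")          # persistent across notebook sessions
scratch_dir = resolve_dir("/scratch")  # cleared when the notebook stops

# Results worth keeping go to persistent storage.
(out_dir / "summary.csv").write_text("group,mean\na,1.5\nb,2.5\n")

# Intermediate artifacts can go to temporary storage.
(scratch_dir / "intermediate.txt").write_text("temporary working data\n")
```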
You can inspect and download these files anytime.
Notebook management
Creation
Create a notebook by clicking on a table node in a workflow and selecting + Notebook. This table will become the default source table for your new notebook and will have pre-generated code that references the table's data.
Starting and stopping
Notebook nodes need to be started in order to edit or execute cells. Click the purple Start notebook button in the top right to start the notebook and provision compute resources. You can also elect to "Clear outputs and start", which will remove all outputs and reset any referenced tables in the notebook.
Compute configuration
By default, notebooks are provisioned with 32GB memory and 2 CPU cores, with compute power comparable to most personal computers. You can view and alter the notebook's compute resources in the More menu.
Persistence
All notebooks are automatically saved as you go. Every time a notebook is stopped, all cell inputs are saved to the notebook version history, giving you a historical record of all code that was run. Additionally, all cell outputs from the last notebook session will be preserved, as will any files written to the /out directory.
Logs
You can click the three-dot More menu to open the logs for this notebook. Opening the logs when a notebook is stopped will show the logs from the notebook's previous run.
Lifecycle
Notebooks have a maximum lifetime of 6 hours, and after 30 minutes of inactivity, any running notebook will automatically be stopped. Activity is determined based on the Jupyter kernel — if you have a long-running computation, the notebook will be considered as active for the entire time.
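To illustrate how the two limits interact, here is a small arithmetic sketch. It assumes the 6-hour lifetime is counted from when the notebook starts and that the idle timer restarts at the last kernel activity; it is an illustration of the stated numbers, not Redivis's actual scheduling logic.

```python
from datetime import datetime, timedelta

MAX_LIFETIME = timedelta(hours=6)      # from the text above
IDLE_TIMEOUT = timedelta(minutes=30)   # from the text above


def expected_stop_time(started_at: datetime, last_activity_at: datetime) -> datetime:
    """Return whichever limit is reached first: max lifetime or idle timeout."""
    return min(started_at + MAX_LIFETIME, last_activity_at + IDLE_TIMEOUT)


started = datetime(2024, 1, 1, 9, 0)

# Last activity at 10:15, so the idle limit applies first.
print(expected_stop_time(started, datetime(2024, 1, 1, 10, 15)))  # 2024-01-01 10:45:00

# Still active at 14:50, so the 6-hour lifetime applies first.
print(expected_stop_time(started, datetime(2024, 1, 1, 14, 50)))  # 2024-01-01 15:00:00
```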
Collaboration
All Redivis notebooks support real-time collaboration, allowing multiple editors to edit and run cells in a running notebook. When another editor is active in a notebook, you will see a colored cursor associated with them. Workflow viewers will see a read-only version of the notebook.
Changing the source table
To change a notebook's primary source table, either right-click on the notebook or click the three-dot (⋮) icon and select the "change source table" option.
Limitations
Notebooks are subject to certain concurrency and duration limits.
Dependencies
All notebooks come with a number of common packages pre-installed. You can install additional packages by clicking the Edit dependencies button in the notebook start modal or toolbar.
For more detailed information about the default dependencies and adding new packages, consult the documentation for your notebook type.
For notebooks that reference restricted data, internet will be disabled while the notebook is running. This means that the dependencies interface is the only place from which you can install dependencies; for example, running pip install for Python or devtools::install() for R within your notebook will fail.
Moreover, it is strongly recommended to always install your dependencies through the dependencies interface (regardless of whether your notebook has internet access), as this provides better reproducibility and documentation for future use.
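Because installing packages at runtime may not be possible, one option is to check at the top of a Python notebook that everything you need is already installed, and fail early with a pointer to the dependencies interface. The sketch below uses only the standard library; the `REQUIRED` list is a hypothetical example and should hold top-level package names.

```python
import importlib.util


def missing_packages(required):
    """Return the top-level package names in `required` that cannot be imported here."""
    return [name for name in required if importlib.util.find_spec(name) is None]


# Hypothetical list; replace with the packages your notebook imports.
REQUIRED = ["json", "csv"]

missing = missing_packages(REQUIRED)
if missing:
    raise ImportError(
        "Missing packages: " + ", ".join(missing)
        + ". Add them through the notebook's dependencies interface, then restart."
    )
```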
Files
Notebooks offer special capabilities for files written to specific directories on the notebook's hard disk. Any files you've stored in a notebook's /out and /scratch directories will be available in the files modal. This modal allows you to preview and download specific file outputs from your notebook.
Moreover, files written to the /out directory are always available, and will persist across notebook sessions. This allows for workflows where you can cache certain results between notebook sessions, avoiding the need to rerun time-intensive computations.
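The caching pattern described above could look like the following in Python. This is a sketch under stated assumptions: `load_or_compute` and `slow_sum` are hypothetical names, results are assumed to be JSON-serializable, and `persistent_dir` falls back to a temporary directory so the snippet also runs outside a Redivis notebook.

```python
import json
import os
import tempfile
from pathlib import Path


def persistent_dir() -> Path:
    """Return /out when it is writable, else a temporary stand-in for local testing."""
    out = Path("/out")
    if out.is_dir() and os.access(out, os.W_OK):
        return out
    return Path(tempfile.mkdtemp(prefix="out_"))


def load_or_compute(cache_file: Path, compute):
    """Reuse a cached JSON result if present; otherwise compute and cache it."""
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = compute()
    cache_file.write_text(json.dumps(result))
    return result


def slow_sum():
    # Stand-in for a time-intensive computation.
    return {"total": sum(range(1_000_000))}


cache_file = persistent_dir() / "slow_sum.json"
first = load_or_compute(cache_file, slow_sum)   # computes and writes the cache if absent
second = load_or_compute(cache_file, slow_sum)  # reads the cached file
```

If the underlying data changes, delete the cached file (or include a version in its name) so that stale results are not reused.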
The files in the /scratch directory are only available when the notebook is running, and will be cleared once it is stopped. The default "working directory" of all notebooks is /scratch, which is where files will be written if you do not specify another location.
You can view the files in either directory by pressing the Files button at the top right of the notebook.
You can list files in either directory by pressing the corresponding tab, and click on any file to view it. Redivis supports interactive previews for many file types in the file inspector, and you can also download the file for further inspection and analysis. To download all files in a directory, click the Download all button in the files modal.
Version history
Every time you stop your notebook, all cell inputs (your code and markdown) will be saved and associated with that notebook session. You can view the code from all previous sessions by pressing the History button at the top right of your notebook, allowing you to view and share your code as it was at any previous point in time.
Access rules
Determining notebook access
Your access to a notebook is determined by your corresponding access to all tables (and their antecedent datasets) referenced by the notebook. These linkages persist across notebook sessions, as a future session could reference data from a previous session. In order to reset the tables referenced by your notebook, which will also clear all outputs in the notebook, you can choose to Clear outputs and start when starting the notebook.
Access levels
In order to view a notebook, you must first have view access to the corresponding workflow, and in order to run and edit the notebook, you must also have edit access to that workflow.
Additionally, your access to a notebook is governed by your access to its source tables. In order to run a notebook and see its outputs, you must have data access to all source tables. If you have metadata access, you will be able to see cell inputs in a notebook (that is, the code), but not outputs. If you only have overview (or no) access to the source tables, you will not be able to see notebook contents.
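The rules above can be restated as a small decision function. This is a plain-Python paraphrase for illustration, not Redivis code: the enum and function names are hypothetical, only the access levels mentioned in this section are modeled, and it assumes the lowest access level across all source tables is the one that governs.

```python
from enum import IntEnum


class SourceAccess(IntEnum):
    """Access levels to a source table, ordered from least to most permissive."""
    NONE = 0
    OVERVIEW = 1
    METADATA = 2
    DATA = 3


def notebook_permissions(can_view_workflow, can_edit_workflow, source_access):
    """Summarize what a user can do, given workflow access and per-table access."""
    if not source_access:
        raise ValueError("a notebook references at least one source table")
    lowest = min(source_access)  # assumption: the most restricted table governs

    can_see_inputs = can_view_workflow and lowest >= SourceAccess.METADATA
    can_see_outputs = can_view_workflow and lowest >= SourceAccess.DATA
    can_run_and_edit = can_edit_workflow and can_see_outputs
    return {
        "see_inputs": can_see_inputs,
        "see_outputs": can_see_outputs,
        "run_and_edit": can_run_and_edit,
    }


# A workflow viewer with only metadata access to one table sees code but no outputs.
print(notebook_permissions(True, False, [SourceAccess.DATA, SourceAccess.METADATA]))
```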
External internet access
If a notebook contains data with export restrictions, internet access within the running notebook will be disabled. You can still specify packages and other startup scripts that will be installed when the notebook first starts, but once the notebook is running, internet will be disabled.
Downloading files
Typically, you will be able to download any files written to the notebook's /out or /scratch directories. However, if a notebook references data with export restrictions, you will not be able to download these files, unless the file size is smaller than the relevant size-based export restrictions specified on source datasets.
Exporting notebooks
Notebooks can be downloaded as PDF, HTML, and .ipynb files by clicking the three-dot More button at the top right of the notebook.
You will be given the option of whether to include cell outputs in your export — it is important that you ensure the outputs displayed in your notebook do not contain sensitive data, and that your subsequent distribution is in compliance with any data use agreements.
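If you have already exported an .ipynb file with outputs and want to double-check it before sharing, the outputs can also be removed locally. The sketch below assumes the standard Jupyter notebook JSON layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`); the sample notebook is made up for illustration.

```python
import copy


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an .ipynb document with all code-cell outputs removed."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Made-up example document in the Jupyter notebook JSON layout.
sample = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Analysis"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["df.head()"],
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["sensitive rows"]}],
        },
    ],
}

cleaned = strip_outputs(sample)
```

To apply this to a file, read it with `json.load`, pass the result through `strip_outputs`, and write it back with `json.dump`. Code cells can still embed sensitive values in their source, so review the inputs as well.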