# Python notebooks

## Overview

Python notebooks provide a mechanism to interface between the Python scientific stack and data on Redivis.

As a general workflow, you'll use the [redivis-python library](https://app.gitbook.com/s/GCgH8jTSmY8Vgwceiri5/client-libraries/redivis-python) to load data from the table(s) in your workflow, and then leverage python and its ecosystem to perform your analyses. You can optionally [create an output table](#creating-output-tables) from your notebook, which can then be used like any other table in your workflow.

The specific approaches to working with data in a notebook will be informed in part by the size and types of data that you are working with. Some common approaches are outlined below, and you can consult the full [redivis-python docs](https://app.gitbook.com/s/GCgH8jTSmY8Vgwceiri5/client-libraries/redivis-python) for comprehensive information.

## Base image and dependencies

Python notebooks on Redivis are based on the [jupyter/pytorch-notebook base image](https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html#jupyter-pytorch-notebook) (version [cuda12-python-3.12](https://quay.io/repository/jupyter/pytorch-notebook?tab=tags)), which contains a variety of common scientific packages for Python running on Ubuntu 24.04. The latest version of the [redivis-python library](https://app.gitbook.com/s/GCgH8jTSmY8Vgwceiri5/client-libraries/redivis-python) is also installed. To view all installed Python packages, run `pip list` from within a running notebook.

To further customize your compute environment, you can specify various dependencies by clicking the **Dependencies** button at the top-right of your notebook. Here you will see three tabs: **Packages**, **pre\_install.sh**, and **post\_install.sh**.

Use **Packages** to list the Python packages that you would like to install via pip. A newly added package is pinned to its latest version by default, but you can specify a different version if preferred.

For more complex dependency management, you can also specify shell scripts under `pre/post_install.sh`. These scripts are executed on either side of the package installation, and are used to execute arbitrary code in the shell. Common use cases might include using `apt` to install system packages (`apt-get update && apt-get install -y <package>`), or using `mamba` to install from conda, which can be helpful for certain libraries (`mamba install <package>`).

{% hint style="info" %}
For notebooks that reference restricted data, internet will be disabled while the notebook is running. This means that the dependencies interface is the *only* place from which you can install dependencies; running `pip install ...` within your notebook will fail.

Moreover, it is strongly recommended to always install your dependencies through the dependencies interface (regardless of whether your notebook has internet access), as this provides better reproducibility and documentation for future use.
{% endhint %}
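
Because pinned versions are what make an environment reproducible, it can be useful to confirm at the top of a notebook that the installed versions match the pins you expect. The sketch below uses only the standard library's `importlib.metadata`; the `check_versions` helper and the example pin are hypothetical and not part of Redivis.

```python
from importlib import metadata


def check_versions(expected):
    """Compare installed package versions against expected pins.

    `expected` maps a distribution name to the version string you pinned.
    Returns a dict of mismatches: name -> (expected, installed or None).
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


# Hypothetical pin; replace with the packages listed in your Packages tab
print(check_versions({"pandas": "2.2.2"}))
```

An empty dict means every listed package is installed at the expected version.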

## Working with tabular data

When loading tabular data into your notebook, you'll typically bring it in as some sort of data frame. Specifically, you can load your data as:

* [A pandas.DataFrame](https://app.gitbook.com/s/GCgH8jTSmY8Vgwceiri5/client-libraries/redivis-python/reference/table/table.to_pandas_dataframe)
* [A dask.DataFrame](https://app.gitbook.com/s/GCgH8jTSmY8Vgwceiri5/client-libraries/redivis-python/reference/table/table.to_dask_dataframe)
* [A polars.LazyFrame](https://app.gitbook.com/s/GCgH8jTSmY8Vgwceiri5/client-libraries/redivis-python/reference/table/table.to_polars_lazyframe)
* [A pyarrow.Table](https://app.gitbook.com/s/GCgH8jTSmY8Vgwceiri5/client-libraries/redivis-python/reference/table/table.to_arrow_table)
* [A pyarrow.Dataset](https://app.gitbook.com/s/GCgH8jTSmY8Vgwceiri5/client-libraries/redivis-python/reference/table/table.to_arrow_dataset)

The specific type of data frame is up to your preference, though there may be performance and memory implications that will matter for larger tables.

```python
import redivis

table = redivis.table("_source_")

pandas_df = table.to_pandas_dataframe(
  # max_results,      -> optional, max records to load
  # variables=list(), -> optional, a list of variables
  # ... consult the redivis-python docs for additional args
)

# other methods accept the same arguments, other than dtype_backend
dask_df = table.to_dask_dataframe()
polars_lf = table.to_polars_lazyframe()
arrow_table = table.to_arrow_table()
arrow_dataset = table.to_arrow_dataset()

# print first 10 rows
pandas_df.head(10)
```

{% hint style="info" %}
**Which data frame should I pick?**

Each library has its own interface for analyzing data, and some may be better suited to your analytical needs. It is also easy to interchange between different data frame types, so you need not pick just one. But to offer some guidance:

* Keep it standard: **pandas**
* Parallel processing: **dask**
* Fast new kid on the block: **polars**
* Data doesn't fit in memory: **pyarrow\.Dataset**, **dask**, **polars**
  {% endhint %}

## Working with geospatial data

If your table contains geospatial variable(s), you can take advantage of geopandas to utilize GIS functions and visualization. Calling [`to_geopandas_dataframe()`](https://apidocs.redivis.com/client-libraries/redivis-python/reference/table/table.to_geopandas_dataframe) on a Redivis table with a variable of the geography type will return an instance of a [geopandas.GeoDataFrame](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.html), with that variable specified as the data frame's geometry variable.

If your table contains more than one geography variable, the first variable will be chosen as the geometry. You can explicitly specify the geography variable via the `geography_variable` parameter.

If you'd prefer to work with your geospatial data as a string, you can use any of the other table.to\_\* methods. In these cases, the geography variable will be represented as a WKT-encoded string.

```python
table = redivis.table("_source_") # a table with a geography variable

geo_df = table.to_geopandas_dataframe(
  # geography_variable -> optional, str. If not specified, will be first geo var in the table
)
geo_df.explore() # visualize it!
```

## Working with larger tables

Typically, tabular data is loaded into memory for analysis. This is often the most performant option, but if your data exceeds available memory, you'll need to consider other approaches for working with data at this scale.

{% hint style="info" %}
"Too big for memory" will vary *significantly* based on the types of analyses you'll be doing, but as a very rough rule of thumb, you should consider these options once your table(s) exceed 1/10th of the total available memory.
{% endhint %}
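
The rule of thumb above can be turned into a quick check. This is a minimal sketch with hypothetical helper names: `total_memory_bytes()` reads physical memory through POSIX `sysconf` (Linux only, and inside a container it may report the host's memory rather than the notebook's limit), and the table size is assumed to be a number you supply yourself.

```python
import os


def total_memory_bytes():
    """Total physical memory on Linux, via POSIX sysconf."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def should_limit_data(table_bytes, memory_bytes, fraction=0.1):
    """Return True when a table exceeds `fraction` of total memory."""
    return table_bytes > memory_bytes * fraction


# Hypothetical figures: a 4 GB table in a notebook with 32 GB of memory
GB = 1024 ** 3
print(should_limit_data(4 * GB, 32 * GB))  # True
```

In a running notebook, you could pass `total_memory_bytes()` as the second argument instead of a fixed figure.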

Often, the best solution is to limit the amount of data that is coming into your notebook. To do so, you can:

* Leverage [transforms](https://docs.redivis.com/reference/workflows/transforms) to first filter / aggregate your data
* Select only specific variables from a table by passing the `variables=list(str)` argument.
* Pre-filter data via a SQL query from within your notebook, via the [redivis.query() method](https://app.gitbook.com/s/GCgH8jTSmY8Vgwceiri5/client-libraries/redivis-python/reference/query).
* Pre-process data as it is loaded into your notebook, via the `batch_preprocessor` argument.

If your data is still pushing memory limits, there are two primary options. You can either store data on disk, or process data as a stream:

#### Storing data on disk

Hard disks are often much larger than available memory, and by loading data first to disk, you can significantly increase the amount of data available in the notebook. Moreover, modern columnar data formats support partitioning and predicate pushdown, allowing us to perform highly performant analyses on these disk-backed dataframes.

The general approach for these disk-backed dataframes is to *lazily* evaluate our computation, only pulling content into memory after all computations have been applied, and ideally after the data has been reduced. The `redivis.Table` methods [`to_dask_dataframe()`](https://app.gitbook.com/s/GCgH8jTSmY8Vgwceiri5/client-libraries/redivis-python/reference/table/table.to_dask_dataframe), [`to_polars_lazyframe()`](https://app.gitbook.com/s/GCgH8jTSmY8Vgwceiri5/client-libraries/redivis-python/reference/table/table.to_polars_lazyframe), and [`to_arrow_dataset()`](https://app.gitbook.com/s/GCgH8jTSmY8Vgwceiri5/client-libraries/redivis-python/reference/table/table.to_arrow_dataset) all return a disk-backed dataframe:

{% tabs %}
{% tab title="dask" %}

```python
dask_df = redivis.table("test_scores").to_dask_dataframe()

dask_df = dask_df[dask_df.grade == 9]               # Select a subsection
result = dask_df.groupby("teacher").score.mean()    # Reduce to a smaller size
result = result.compute()                           # Evaluate; returns a pandas Series
```

[dask groupby documentation](https://docs.dask.org/en/stable/dataframe-groupby.html)
{% endtab %}

{% tab title="polars" %}

```python
import polars as pl

polars_lf = redivis.table("test_scores").to_polars_lazyframe()

result = (
    polars_lf.filter(pl.col("grade") == 10)   # Select a subsection
    .group_by("teacher")                      # Reduce to a smaller size
    .mean()
    .collect()                                # Convert to polars.DataFrame
)
```

[polars groupby documentation](https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/group_by.html)
{% endtab %}

{% tab title="pyarrow" %}

```python
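import pyarrow.compute as pc

arrow_ds = redivis.table("test_scores").to_arrow_dataset()
filtered = arrow_ds.filter(pc.field("grade") == 10)   # Lazily select a subsection
result = filtered.to_table()                          # Load the reduced data into memory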
```

[pyarrow.Dataset.filter documentation](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.filter)
{% endtab %}
{% endtabs %}

All three of these libraries also support various forms of batched processing, which allows you to process your data similarly to the streaming methodology outlined below. While it will generally be faster to just process the stream directly, it can be helpful to first load a table to disk as you experiment with a streaming approach:

{% tabs %}
{% tab title="dask" %}

```python
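dask_df = redivis.table("_source_").to_dask_dataframe()
# process_record is your own per-row function; dask is lazy, so call .compute() to run it
result = dask_df.apply(process_record, axis=1).compute()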
```

[dask.DataFrame.apply documentation](https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.apply.html#dask.dataframe.DataFrame.apply)
{% endtab %}

{% tab title="polars" %}

```python
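polars_lf = redivis.table("_source_").to_polars_lazyframe()

# process_record_batch is your own function; it receives and returns a polars.DataFrame
result = polars_lf.map_batches(process_record_batch).collect()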
```

[polars.LazyFrame.map\_batches documentation](https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/api/polars.LazyFrame.map_batches.html#polars.LazyFrame.map_batches)
{% endtab %}

{% tab title="pyarrow" %}

```python
arrow_ds = redivis.table("_source_").to_arrow_dataset()

for batch in arrow_ds.to_batches():
    process_record_batch(batch)
```

[pyarrow.Dataset.to\_batches documentation](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.to_batches)
{% endtab %}
{% endtabs %}

#### Streaming data

By streaming data into your notebook, you can process data in batches of rows, avoiding the need to load more than a small chunk of data into memory at a time. This approach is the most scalable, since it won't be limited by available memory or disk. For this, we can use the [`Table.to_arrow_batch_iterator()`](https://app.gitbook.com/s/GCgH8jTSmY8Vgwceiri5/client-libraries/redivis-python/reference/table/table.to_arrow_batch_iterator) method:

```python
import pyarrow.compute as pc

batch_iterator = redivis.table("test_scores").to_arrow_batch_iterator()

count = 0
total = 0
for batch in batch_iterator:
    # batch is an instance of pyarrow.RecordBatch -> https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html
    # Call batch.to_pandas() to convert to a pandas dataframe
    scores = batch.column("score")
    count += len(scores) - scores.null_count   # count only non-null scores
    total += pc.sum(scores).as_py() or 0       # pc.sum returns null for an all-null batch

print(f"The average of all test scores was {total/count}")
```

## Working with unstructured data files

Unstructured data files on Redivis are represented by [file index tables](https://docs.redivis.com/datasets/data#folders-and-index-tables), or specifically, tables that contain a `file_id` variable. If you have file index tables in your workflow, you can analyze the files represented in those tables within your notebook. Similarly to working with tabular data, we can either download all files, or iteratively process them:

* [Table.file()](https://app.gitbook.com/s/GCgH8jTSmY8Vgwceiri5/client-libraries/redivis-r/reference/file)
* [Table.to\_directory()](https://app.gitbook.com/s/GCgH8jTSmY8Vgwceiri5/client-libraries/redivis-python/reference/table/table.to_directory)
* [File methods](https://app.gitbook.com/s/GCgH8jTSmY8Vgwceiri5/client-libraries/redivis-r/reference/file)

```python
import pystac   # only needed for the fsspec example at the end

# e.g., assume we have a source table representing thousands of .png files
images = redivis.table("_source_")

# get/list files
images.list_files(max_results=10)
images.file("image_name.png")

# Load all files as a directory
directory = images.to_directory()
directory.download("/download/path")
directory.get("path/to/file")
directory.list(recursive=True)
directory.mount("/mount/path") # mounts a virtual directory, with data being lazily downloaded

# Read a single file, either all at once or as a stream
file = images.file("image_name.png")
content = file.read()
with file.open() as f:
    ...  # process the file

# Tools that integrate with fsspec can open Redivis URIs:
pystac.Catalog.from_file("redivis://table_ref/stac/catalog.json")
```

## Creating output tables

Redivis notebooks offer the ability to materialize notebook outputs as a new [table node](https://docs.redivis.com/reference/workflows/tables) in your workflow. This table can then be processed by transforms, read into other notebooks, exported, or even [re-imported into a dataset](https://docs.redivis.com/guides/create-and-manage-datasets/cleaning-tabular-data).

To create an output table, use the `redivis.current_notebook().create_output_table()` method, passing in any of the following as the first argument:

* [A pandas.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)
* [A dask.DataFrame](https://docs.dask.org/en/stable/dataframe-api.html)
* [A polars.LazyFrame](https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/index.html)
* [A polars.DataFrame](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/index.html)
* [A pyarrow.Table](https://arrow.apache.org/docs/python/api/tables.html)
* [A pyarrow.Dataset](https://arrow.apache.org/docs/python/api/dataset.html)
* A string file path to any parquet file

Redivis will automatically handle any type inference in generating the output table, mapping your data type to the appropriate Redivis type.

If an output table for the notebook already exists, it will be overwritten by default. You can pass `append=True` to append to the table rather than overwrite it. For the append to succeed, every variable in the appended data that is also present in the existing table must have the same type.

```python
# Read table into a pandas dataframe
df = redivis.table('_source_').to_pandas_dataframe()

# Perform various data manipulation actions
df2 = df.apply(some_processing_fn)

# Create an output table with the contents of this dataframe
redivis.current_notebook().create_output_table(df2)

# We can also append content to the output table, to process in batches
df3 = df.apply(some_other_fn)
redivis.current_notebook().create_output_table(df3, append=True)
```
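
Since an append fails when shared variables differ in type, it can help to compare batches locally before appending. The sketch below is a hypothetical pre-check that compares pandas dtypes of shared columns. Redivis performs its own type inference, so matching pandas dtypes is only a proxy for matching Redivis types.

```python
import pandas as pd


def mismatched_columns(existing, new):
    """Return shared columns whose pandas dtypes differ between two frames."""
    shared = [col for col in new.columns if col in existing.columns]
    return {
        col: (str(existing[col].dtype), str(new[col].dtype))
        for col in shared
        if existing[col].dtype != new[col].dtype
    }


first = pd.DataFrame({"id": [1, 2], "score": [0.5, 0.7]})
second = pd.DataFrame({"id": ["3", "4"], "score": [0.9, 0.1], "note": ["x", "y"]})

# "id" is numeric in the first frame but text in the second, so it is reported
print(mismatched_columns(first, second))
```

Columns that appear in only one of the frames, such as `note` here, are ignored by this check, mirroring the rule that only shared variables need matching types.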

## Storing files

As you perform your analysis, you may generate files that are stored on the notebook's hard disk. You should write files to one of two locations: `/out` for persistent storage, and `/scratch` for temporary storage.

Any files written to persistent storage will be available when the notebook is stopped, and will be restored to the same state when the notebook is run again. In contrast, files written to temporary storage only exist for the duration of the current notebook session.

```python
# Persist files in /out
df.to_csv("/out/data.csv")

# Store temporary files in /scratch
df.to_csv("/scratch/temp_data.csv")
```
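
One way to take advantage of persistent storage is to cache the result of an expensive step, so that it is reused when the notebook is run again. This is a minimal sketch using only the standard library; `load_or_compute` and `expensive_summary` are hypothetical names, and the result is assumed to be JSON-serializable.

```python
import json
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return the cached JSON result if present; otherwise compute and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    result = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(result))
    return result


# In a notebook this might be used as:
# summary = load_or_compute("/out/summary.json", expensive_summary)
```

Pointing the cache at `/scratch` instead would keep the result only for the current session.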
