R notebooks
Overview
R notebooks are based on the jupyter/r-notebook base image (version r-4.1.1), which contains a variety of common scientific packages for R. The latest version of the redivis-r package is also pre-installed.
As a general workflow, you'll use the redivis-r library to load data from the table(s) in your project, and then leverage R and its ecosystem to perform your analyses. You can optionally create an output table from your notebook, which can then be used like any other table in your project.
The specific approaches to working with data in a notebook will be informed in part by the size and types of data that you are working with. Some common approaches are outlined below, and you can consult the full redivis-r docs for comprehensive information.
Working with tabular data
When loading tabular data into your notebook, you'll typically bring it in as some sort of data frame. Specifically, you can load your data as a tibble, data.frame, data.table, arrow.Table, or arrow.Dataset.
The specific type of data frame is up to your preference, though there may be performance and memory implications that will matter for larger tables.
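As a minimal sketch of the loading step, the snippet below reads a table into a tibble. The table name my_table is hypothetical, and the redivis::table() reference and the to_tibble(), to_data_frame(), to_data_table(), and to_arrow_table() method names are assumptions that follow the table$to_* pattern used on this page; consult the redivis-r docs for the exact signatures.

```r
# "my_table" is a hypothetical table name; replace it with a table in your project.
redivis_table <- redivis::table("my_table")

# Load the full table into memory as a tibble
df <- redivis_table$to_tibble()

# Alternatives (method names assumed to follow the table$to_* pattern):
# df <- redivis_table$to_data_frame()
# dt <- redivis_table$to_data_table()
# arrow_table <- redivis_table$to_arrow_table()
# arrow_dataset <- redivis_table$to_arrow_dataset()

# Inspect the first few rows
head(df)
```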
Which data frame should I pick?
Each library has its own interface for analyzing data, and some may be better suited to your analytical needs. It is also easy to convert between different data frame types, so you need not pick just one. But to offer some guidance:
Keep it standard: tibble, data.frame, data.table
Maximum performance: arrow.Table
Data doesn't fit in memory: arrow.Dataset
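To illustrate how cheaply the in-memory types convert into one another, here is a small sketch that uses a hand-built tibble as a stand-in for data loaded from a Redivis table. It relies only on the tibble, data.table, and arrow packages, and the column names are made up for the example.

```r
library(tibble)
library(data.table)
library(arrow)

# Small in-memory example standing in for data loaded from a Redivis table
df <- tibble(id = 1:3, value = c(0.5, 1.5, 2.5))

# tibble -> data.table
dt <- as.data.table(df)

# tibble -> Arrow table
arrow_tbl <- Table$create(df)

# Arrow table -> data frame -> tibble
df_roundtrip <- as_tibble(as.data.frame(arrow_tbl))
```

Because conversion is this direct, it is reasonable to load data in whichever form is most efficient and convert only the reduced result into the type your analysis code expects.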
Working with geospatial data
If your table contains geospatial variable(s), you can take advantage of the sf (simple features) package to utilize GIS functions and visualization. By default, calling Table$to_sf_tibble() on a Redivis table with a variable of the geography type will return an sf tibble, with that variable specified as the corresponding geometry column.
If your table contains more than one geography variable, the first variable will be chosen as the geometry column. You can explicitly specify the geography variable via the geography_variable parameter.
If you'd prefer to work with your geospatial data as a string, you can use any of the other table$to_* methods. In these cases, the geography variable will be represented as a WKT-encoded string.
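The geospatial options above might look like the following sketch. The table name my_geo_table and the variable name boundary are hypothetical, and to_tibble() is assumed to be one of the other table$to_* methods.

```r
library(sf)

# "my_geo_table" and "boundary" are hypothetical names
geo_table <- redivis::table("my_geo_table")

# Load as an sf tibble, explicitly choosing which geography variable
# becomes the geometry column
geo_df <- geo_table$to_sf_tibble(geography_variable = "boundary")

# Standard sf functions now apply to the result
st_crs(geo_df)
plot(st_geometry(geo_df))

# Load as a regular tibble instead (to_tibble() is an assumed method name);
# the geography variable is then a WKT-encoded string
wkt_df <- geo_table$to_tibble()
```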
Working with larger tables
Typically, tabular data is loaded into memory for analysis. This is often the most performant option, but if your data exceeds available memory, you'll need to consider other approaches for working with data at this scale.
"Too big for memory" will vary significantly based on the types of analyses you'll be doing, but as a very rough rule of thumb, you should consider these options once your table(s) exceed 1/10th of the total available memory.
Often, the best solution is to limit the amount of data that is coming into your notebook. To do so, you can:
Leverage transforms to first filter / aggregate your data.
Select only specific variables from a table by passing the variables=list(str) argument.
Pre-filter data via a SQL query from within your notebook, via the redivis::query() method.
Pre-process data as it is loaded into your notebook, via the batch_preprocessor argument.
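The three in-notebook options above can be sketched as follows (transforms are configured in the project rather than in the notebook, so they are not shown). All table and variable names are hypothetical. The sketch also assumes that to_tibble() accepts the variables and batch_preprocessor arguments, that variables can be given as a character vector, and that the preprocessor receives an Arrow record batch and returns the batch to keep; check the redivis-r docs before relying on any of these.

```r
library(arrow)

# All table and variable names here are hypothetical.
redivis_table <- redivis::table("my_table")

# 1. Load only the variables you need
df_subset <- redivis_table$to_tibble(variables = c("id", "year", "value"))

# 2. Pre-filter with SQL, so that only matching rows reach the notebook
df_filtered <- redivis::query("
  SELECT id, year, value
  FROM `my_table`
  WHERE year >= 2000
")$to_tibble()

# 3. Reduce each batch as it is loaded. The callback is assumed to receive
#    an Arrow record batch and to return the (reduced) batch.
keep_recent <- function(batch) {
  batch[batch$year >= 2000, ]
}
df_preprocessed <- redivis_table$to_tibble(batch_preprocessor = keep_recent)
```

Selecting variables and filtering in SQL reduce the data before it is transferred, so they are usually worth trying before a batch preprocessor.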
If your data is still pushing memory limits, there are two primary options. You can either store data on disk, or process data as a stream:
Storing data on disk
Hard disks are often much larger than available memory, and by loading data first to disk, you can significantly increase the amount of data available in the notebook. Moreover, modern columnar data formats support partitioning and predicate pushdown, allowing us to perform highly performant analyses on these disk-backed dataframes.
The general approach for these disk-backed dataframes is to lazily evaluate our computation, only pulling content into memory after all computations have been applied, and ideally the data has been reduced. The to_arrow_dataset() method returns a disk-backed dataframe that supports most dplyr methods.
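A minimal sketch of this lazy pattern, assuming a hypothetical table my_large_table with variables year, state, and value:

```r
library(arrow)
library(dplyr)

# "my_large_table", "year", "state", and "value" are hypothetical names
dataset <- redivis::table("my_large_table")$to_arrow_dataset()

# These dplyr verbs are recorded lazily; nothing is read into memory yet
reduced <- dataset %>%
  filter(year >= 2000) %>%
  select(state, value)

# collect() runs the computation and returns the reduced data as a tibble
reduced_df <- collect(reduced)

# Continue with ordinary in-memory dplyr on the smaller result
reduced_df %>%
  group_by(state) %>%
  summarize(mean_value = mean(value, na.rm = TRUE))
```

The key design choice is to place collect() as late as possible, so that filtering and column selection happen against the on-disk data rather than in memory.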
Arrow datasets also support batched processing, which allows you to process your data in a manner similar to the streaming methodology outlined below. While it will generally be faster to just process the stream directly, it can be helpful to first load a table to disk as you experiment with a streaming approach.
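One way to iterate over an Arrow dataset in batches is through the arrow package's Scanner, sketched below. The table and variable names are hypothetical, and this is only one possible approach rather than a method prescribed by redivis-r.

```r
library(arrow)

# "my_large_table" and "value" are hypothetical names
dataset <- redivis::table("my_large_table")$to_arrow_dataset()

# Scan the on-disk dataset one record batch at a time
reader <- Scanner$create(dataset)$ToRecordBatchReader()

non_missing <- 0
value_sum <- 0
while (TRUE) {
  batch <- reader$read_next_batch()
  if (is.null(batch)) break  # NULL signals that no batches remain

  # Only this batch's values are held in memory
  values <- as.vector(batch$value)
  non_missing <- non_missing + sum(!is.na(values))
  value_sum <- value_sum + sum(values, na.rm = TRUE)
}

# Mean over all non-missing values, computed without loading the full table
value_sum / non_missing
```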
See the arrow.RecordBatch documentation for more information.
Streaming data
By streaming data into your notebook, you can process data in batches of rows, avoiding the need to load more than a small chunk of data into memory at a time. This approach is the most scalable, since it won't be limited by available memory or disk. To do so, we can use the Table$to_arrow_batch_reader() method.
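A streaming loop might be sketched as follows. It assumes that to_arrow_batch_reader() returns an Arrow RecordBatchReader, whose read_next_batch() method returns NULL once the stream is exhausted; the table name my_large_table and the variable state are hypothetical.

```r
library(arrow)

# "my_large_table" and "state" are hypothetical names
reader <- redivis::table("my_large_table")$to_arrow_batch_reader()

# Running frequency count of a categorical variable
counts <- numeric(0)
while (TRUE) {
  batch <- reader$read_next_batch()
  if (is.null(batch)) break  # no more batches

  # Tabulate this batch, then fold its counts into the running totals
  batch_counts <- table(as.vector(batch$state))
  for (key in names(batch_counts)) {
    previous <- if (key %in% names(counts)) counts[[key]] else 0
    counts[key] <- previous + batch_counts[[key]]
  }
}

counts
```

Only the running totals and the current batch are kept in memory, which is what lets this pattern scale beyond both memory and disk limits.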
Working with unstructured data files
Unstructured data files on Redivis are represented by file index tables, or specifically, tables that contain a file_id variable. If you have file index tables in your project, you can analyze the files represented in those tables within your notebook. As with tabular data, we can either download all files or process them iteratively.
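The two approaches could look roughly like the sketch below. Note that the download_files(), list_files(), and read() method names are assumptions about the redivis-r API that do not appear on this page, and my_file_index_table is a hypothetical table name; confirm the actual methods in the redivis-r docs.

```r
# "my_file_index_table" is a hypothetical name. The download_files(),
# list_files(), and read() calls are assumed, not confirmed by this page.
files_table <- redivis::table("my_file_index_table")

# Option 1: download every file to temporary storage, then work from disk
files_table$download_files("/scratch/files")
local_paths <- list.files("/scratch/files", full.names = TRUE)

# Option 2: iterate over the files and process them one at a time,
# so that only one file's contents is in memory at once
for (file in files_table$list_files()) {
  contents <- file$read()
  # ... process contents here ...
}
```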
Creating output tables
Redivis notebooks offer the ability to materialize notebook outputs as a new table node in your project. This table can then be processed by transforms, read into other notebooks, exported, or even re-imported into a dataset.
To create an output table, use the redivis::current_notebook()$create_output_table() method, passing in any of the following as the first argument:
A string file path to any parquet file
Redivis will automatically handle any type inference in generating the output table, mapping your data type to the appropriate Redivis type.
If an output table for the notebook already exists, by default it will be overwritten. You can pass append=TRUE to append to the table rather than overwrite it. For the append to succeed, every variable in the appended table that is also present in the existing table must have the same type.
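As a sketch of the parquet-path option, the example below writes a summary to a parquet file and registers it as the notebook's output table, then appends a second file. The table name, variable names, and file paths are hypothetical, and to_tibble() is an assumed loading method.

```r
library(arrow)
library(dplyr)

# "my_table" and "year" are hypothetical names
df <- redivis::table("my_table")$to_tibble()
summary_df <- df %>% count(year, name = "n_records")

# Write the result to a parquet file in temporary storage
write_parquet(summary_df, "/scratch/summary.parquet")

# Create (or overwrite) the notebook's output table from that file
notebook <- redivis::current_notebook()
notebook$create_output_table("/scratch/summary.parquet")

# Append additional rows; variables shared with the existing table
# must have matching types
extra_df <- summary_df %>% mutate(year = year + 100L)
write_parquet(extra_df, "/scratch/summary_extra.parquet")
notebook$create_output_table("/scratch/summary_extra.parquet", append = TRUE)
```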
Storing files
As you perform your analysis, you may generate files that are stored on the notebook's hard disk. There are two locations that you should write files to: /out for persistent storage, and /scratch for temporary storage.
Any files written to persistent storage will be available when the notebook is stopped, and will be restored to the same state when the notebook is run again. By contrast, any files written to temporary storage will only exist for the duration of the current notebook session.
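For example, using only base R, a result you want to keep could be written to /out while an intermediate artifact goes to /scratch. The subdirectory and file names here are made up for illustration.

```r
# Illustrative data; in practice this would be a result of your analysis
results <- data.frame(id = 1:3, score = c(0.2, 0.5, 0.9))

# Persistent storage: restored the next time the notebook is run
dir.create("/out/results", recursive = TRUE, showWarnings = FALSE)
write.csv(results, "/out/results/scores.csv", row.names = FALSE)

# Temporary storage: exists only for the current notebook session
dir.create("/scratch/tmp", recursive = TRUE, showWarnings = FALSE)
saveRDS(results, "/scratch/tmp/scores.rds")
```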