Work with data in notebooks

Overview

Redivis notebooks provide a performant, flexible environment for analysis that allows you to analyze and visualize your project's data in Python, R, Stata, or SAS. With the notebook computation happening on Redivis, you don't need to configure an environment on a local machine or server, or export data from Redivis. This makes iteration and collaboration easy, while also ensuring better security and data throughput.

Before working with a notebook, you'll first need to create a project and add data to it. You can then create a notebook from any table in your project.

If you are working with very large tables (>10GB is a good rule of thumb), it's always a good idea to first reshape and reduce the data via transforms, since they can be significantly more performant for large data operations than running code in Python, R, Stata, or SAS.

1. Create a notebook

Once you have a table that you're ready to analyze, you can create a notebook by clicking the + Notebook button at any time. You'll need to name it and choose a kernel (Python, R, Stata, or SAS).

Notebooks can only reference tables within their project, so we recommend keeping all related work together in the same project.

Python

Python notebooks come pre-installed with a variety of common scientific packages for Python. Learn more about working with Python notebooks.

R

R notebooks come pre-installed with a variety of common scientific packages for R. Learn more about working with R notebooks.

Stata

Stata notebooks are based on Python notebooks, but offer affordances for moving data between Python and Stata. Learn more about working with Stata notebooks.

SAS

SAS notebooks are based on Python notebooks, but offer affordances for moving data between Python and SAS. Learn more about working with SAS notebooks.

2. Define dependencies

All notebooks come with a number of common packages pre-installed, depending on the notebook type. If there is something specific you'd like to include, you can add versioned packages or write pre-/post-install scripts by clicking the Edit dependencies button in the start modal or the toolbar.
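For example, in a Python notebook you can quickly confirm that the packages you depend on were installed at the expected versions once the notebook has started. This is a minimal sketch using only the standard library; pandas and numpy are just illustrative package names:

from importlib.metadata import version

# Print the installed version of each package we depend on
for package in ["pandas", "numpy"]:
  print(package, version(package))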

Learn more in the Notebooks reference section.

3. Compute resources

The default notebook configuration is free, and provides access to 2 CPUs and 32GB of working memory, alongside a 60GB (SSD) disk and gigabit network. The computational power of these default notebooks is comparable to most personal computers, and will be more than enough for many analyses.
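If you want to double-check the resources available to a running Python notebook, a quick sketch using only the standard library is shown below; note that in containerized environments these values may reflect the underlying host rather than your notebook's allocated share.

import os

# Number of CPUs visible to the notebook process
print("CPUs:", os.cpu_count())

# Total memory (in GB) reported by the operating system (Linux)
total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
print("Memory (GB):", round(total_bytes / 1024**3, 1))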

If you're working with larger tables, creating an ML model, or performing other particularly intensive tasks, you may choose to configure additional compute resources for the notebook. Running these environments costs an hourly rate based on your chosen configuration, and requires you to purchase compute credits on your account.

Clicking the Edit compute configuration button in the start modal or the toolbar will allow you to choose from different preconfigured machine types. The notebook will then default to this compute configuration each time it starts up.

Learn more in the Compute resources reference section.

4. Start the notebook

Notebook nodes need to be started before you can edit or execute cells. When first clicking on a notebook node, you will see a read-only view of its contents (including cell outputs). Click the Start notebook button in the toolbar to connect this notebook to compute resources.

When you create a notebook for the first time it will start automatically.

5. Load data

To do meaningful work in your notebook, you'll want to bring the tabular and/or unstructured data in your project into your notebook.

Referencing tables

Notebooks come pre-populated with templated code that pulls in data from the notebook's source table. You will need to run this cell to pull the data into the notebook; it prints a preview of the loaded data so you can confirm it worked.

You can reference any other tables in this project by replicating this script and executing it with a different table reference. As a rule of thumb, notebooks will easily support interactive analysis of tables up to ~1GB; if your table is larger, try reducing it first by creating a transform, or make sure to familiarize yourself with the tools for working with larger tables in the notebook's programming language.

import redivis

# The source table of this notebook can always be referenced as "_source_"
table = redivis.table("_source_")

# Load table as a pandas dataframe. 
# Consult the documentation for more load options.
df = table.to_pandas_dataframe()

# We can also reference any other table in this project by name.
df2 = redivis.table("my_other_table").to_pandas_dataframe()

print(df)
print(df2)

See more examples in the Python notebooks reference.

Referencing files

Any files with unstructured data stored in Redivis tables can be referenced by their globally unique file_id. You can also reference these file_ids in any derivative tables, allowing you to query and download specific subsets of files.

When working with large files, you'll want to consider saving the files to disk and/or working with the streaming interfaces to reduce memory overhead and improve performance.

import redivis
from io import TextIOWrapper

redivis_file = redivis.file("rnwk-acs3famee.pVr4Gzq54L3S9pblMZTs5Q")

# Download the file to disk
download_location = redivis_file.download("./my-downloads")
f = open(download_location, "r")

# Read the file into a variable
file_content = redivis_file.read(as_text=True)
print(file_content)

# Stream the file as bytes or text
with redivis_file.stream() as f:
  f.read(100) # read 100 bytes

with TextIOWrapper(redivis_file.stream()) as f:
  f.readline() # read first line

# We can also iterate over all files in a table
for redivis_file in redivis.table("_source_").list_files():
  # Do stuff with each file
  pass
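Building on the calls above, you can combine table and file references to download just a subset of files. The sketch below assumes a hypothetical derivative table named "my_filtered_files_table" in this project that contains a file_id variable:

import redivis

# Hypothetical derivative table that contains a file_id variable
subset = redivis.table("my_filtered_files_table").to_pandas_dataframe()

# Download each file referenced in the subset
for file_id in subset["file_id"]:
  redivis.file(file_id).download("./my-downloads")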

See more examples in the Python notebooks reference.

6. Analyze data

At this point, you have all the tools you need to work with your data in your chosen language. The Python, R, Stata, and SAS ecosystems contain myriad tools and libraries for performing sophisticated data analysis and visualization.
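For example, once a table has been loaded as a pandas dataframe (as in step 5), a Python notebook can use standard libraries to summarize and plot the data. This is a minimal sketch assuming that pandas and matplotlib are available and that the table contains a hypothetical numeric variable named "value":

import redivis
import matplotlib.pyplot as plt

# Load the source table as a pandas dataframe (see step 5)
df = redivis.table("_source_").to_pandas_dataframe()

# Summary statistics for all numeric variables
print(df.describe())

# Histogram of a hypothetical numeric variable named "value"
df["value"].plot(kind="hist", bins=30, title="Distribution of value")
plt.show()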

The notebook interface is based on Jupyter notebooks, and has similar capabilities. You can also export a read-only copy of your notebook as an .ipynb, PDF, or HTML file.

Learn more in the Notebooks reference section.

7. Create an output table

Notebooks can produce an output table, which you can sanity check and further analyze in your project, include in other notebooks, or export to other systems.

import redivis

# Read the source table into a pandas dataframe
df = redivis.table('_source_').to_pandas_dataframe()

# Perform various data manipulation actions
# (some_processing_fn is a placeholder for your own processing logic)
df2 = df.apply(some_processing_fn)

# Create an output table with the contents of this dataframe
redivis.current_notebook().create_output_table(df2)

See more examples in the Python notebooks reference.

Next steps

Share and collaborate

All Redivis notebooks support real-time collaboration, allowing multiple editors to edit and run cells in a running notebook. When another editor is active in a notebook, you will see a colored cursor associated with them (much like a Google Doc).

Share your project to work with collaborators in real time, and make it public so that others can fork off of and build upon your work.

Cite datasets in your publications

If the work you're doing leads to a publication, make sure to consult the dataset pages of the datasets you've used for information from the data administrators on how to cite them correctly.
