Work with data in notebooks

Overview

Redivis notebooks provide a performant, flexible environment for analysis that allows you to analyze and visualize your project's data in Python, R, Stata, or SAS. With the notebook computation happening on Redivis, you don't need to configure an environment on a local machine or server, or export data from Redivis. This makes iteration and collaboration easy, while also ensuring better security and data throughput.

Before working with a notebook, you'll first need to create a project and add data to it. You can then create a notebook from any table in your project.

If you are working with very large tables (>10GB is a good rule of thumb), it's always a good idea to first reshape and reduce the data via transforms, since they can be significantly more performant for large data operations than running code in Python, R, Stata, or SAS.

1. Create a notebook

Once you have a table that you're ready to analyze, you can create a notebook by clicking the + Notebook button at any time. You'll need to name it and choose a kernel (Python, R, Stata, or SAS).

Notebooks can only reference tables within their project, so we recommend keeping all related work together in the same project.

Python

Python notebooks come pre-installed with a variety of common scientific packages for Python. Learn more about working with Python notebooks.

R

R notebooks come pre-installed with a variety of common scientific packages for R. Learn more about working with R notebooks.

Stata

Stata notebooks are based on Python notebooks, but offer affordances for moving data between Python and Stata. Learn more about working with Stata notebooks.

SAS

SAS notebooks are based on Python notebooks, but offer affordances for moving data between Python and SAS. Learn more about working with SAS notebooks.

2. Define dependencies

All notebooks come with a number of common packages pre-installed, depending on the notebook type. If there is something specific you'd like to include, you can add versioned packages or write pre-/post-install scripts by clicking the Edit dependencies button in the start modal or the toolbar.
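For example, in a Python notebook you can quickly confirm that the packages you depend on were installed at the expected versions once the notebook has started. This is a minimal sketch using only the standard library; pandas and numpy are just illustrative package names:

from importlib.metadata import version

# Print the installed version of each package we depend on
for package in ["pandas", "numpy"]:
  print(package, version(package))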

Learn more in the Notebooks reference section.

3. Compute resources

The default notebook configuration is free, and provides access to 2 CPUs and 32GB of working memory, alongside a 60GB (SSD) disk and gigabit network. The computational power of these default notebooks is comparable to most personal computers, and will be more than enough for many analyses.
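If you want to double-check the resources available to a running Python notebook, a quick sketch using only the standard library is shown below; note that in containerized environments these values may reflect the underlying host rather than your notebook's allocated share.

import os

# Number of CPUs visible to the notebook process
print("CPUs:", os.cpu_count())

# Total memory (in GB) reported by the operating system (Linux)
total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
print("Memory (GB):", round(total_bytes / 1024**3, 1))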

If you're working with larger tables, creating an ML model, or performing other particularly intensive tasks, you may choose to configure additional compute resources for the notebook. Running these environments costs an hourly rate based on your chosen configuration, and requires you to purchase compute credits on your account.

Clicking the Edit compute configuration button in the start modal or the toolbar will allow you to choose from different preconfigured machine types. The notebook will then default to this compute configuration each time it starts up.

Learn more in the Compute resources reference section.

4. Start the notebook

Notebook nodes need to be started before you can edit or execute cells. When first clicking on a notebook node, you will see a read-only view of its contents (including cell outputs). Click the Start notebook button in the toolbar to connect this notebook to compute resources.

When you create a notebook for the first time it will start automatically.

5. Load data

To do meaningful work in your notebook, you'll want to bring the tabular and/or unstructured data in your project into your notebook.

Referencing tables

Notebooks come pre-populated with templated code that pulls in data from the notebook's source table. You will need to run this cell to pull the data into the notebook; it prints a preview of the loaded data so you can confirm it worked.

You can reference any other tables in this project by replicating this script and executing it with a different table reference. As a rule of thumb, notebooks will easily support interactive analysis of tables up to ~1GB; if your table is larger, try reducing it first by creating a transform, or make sure to familiarize yourself with the tools for working with larger tables in the notebook's programming language.

import redivis

# The source table of this notebook can always be referenced as "_source_"
table = redivis.table("_source_")

# Load table as a pandas dataframe. 
# Consult the documentation for more load options.
df = table.to_pandas_dataframe()

# We can also reference any other table in this project by name.
df2 = redivis.table("my_other_table").to_pandas_dataframe()

print(df)
print(df2)

See more examples in the Python notebooks reference.

Referencing files

Any files with unstructured data stored in Redivis tables can be referenced by their globally unique file_id. You can also reference these file_ids in any derivative tables, allowing you to query and download specific subsets of files.

When working with large files, you'll want to consider saving the files to disk and/or working with the streaming interfaces to reduce memory overhead and improve performance.

import redivis
from io import TextIOWrapper

redivis_file = redivis.file("rnwk-acs3famee.pVr4Gzq54L3S9pblMZTs5Q")

# Download the file to disk
download_location = redivis_file.download("./my-downloads")
f = open(download_location, "r")

# Read the file into a variable
file_content = redivis_file.read(as_text=True)
print(file_content)

# Stream the file as bytes or text
with redivis_file.stream() as f:
  f.read(100) # read 100 bytes

with TextIOWrapper(redivis_file.stream()) as f:
  f.readline() # read first line

# We can also iterate over all files in a table
for redivis_file in redivis.table("_source_").list_files():
  # Do stuff with each file
  pass
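Building on the calls above, you can combine table and file references to download just a subset of files. The sketch below assumes a hypothetical derivative table named "my_filtered_files_table" in this project that contains a file_id variable:

import redivis

# Hypothetical derivative table that contains a file_id variable
subset = redivis.table("my_filtered_files_table").to_pandas_dataframe()

# Download each file referenced in the subset
for file_id in subset["file_id"]:
  redivis.file(file_id).download("./my-downloads")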

See more examples in the Python notebooks reference.

6. Analyze data

At this point, you have all the tools you need to work with your data in your chosen language. The Python, R, Stata, and SAS ecosystems contain myriad tools and libraries for performing sophisticated data analysis and visualization.
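For example, once a table has been loaded as a pandas dataframe (as in step 5), a Python notebook can use standard libraries to summarize and plot the data. This is a minimal sketch assuming that pandas and matplotlib are available and that the table contains a hypothetical numeric variable named "value":

import redivis
import matplotlib.pyplot as plt

# Load the source table as a pandas dataframe (see step 5)
df = redivis.table("_source_").to_pandas_dataframe()

# Summary statistics for all numeric variables
print(df.describe())

# Histogram of a hypothetical numeric variable named "value"
df["value"].plot(kind="hist", bins=30, title="Distribution of value")
plt.show()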

The notebook interface is based on Jupyter notebooks, and has similar capabilities. You can also export a read-only copy of your notebook as an .ipynb, PDF, or HTML file.

Learn more in the Notebooks reference section.

7. Create an output table

Notebooks can produce an output table, which you can sanity check and further analyze in your project, include in other notebooks, or export to other systems.

import redivis

# Read the source table into a pandas dataframe
df = redivis.table('_source_').to_pandas_dataframe()

# Perform various data manipulation actions
# (some_processing_fn is a placeholder for your own processing logic)
df2 = df.apply(some_processing_fn)

# Create an output table with the contents of this dataframe
redivis.current_notebook().create_output_table(df2)

See more examples in the Python notebooks reference.

Next steps

Share and collaborate

All Redivis notebooks support real-time collaboration, allowing multiple editors to edit and run cells in a running notebook. When another editor is active in a notebook, you will see a colored cursor associated with them (much like a Google Doc).

Share your project to work with collaborators in real time, and make it public so that others can fork off of and build upon your work.

Cite datasets in your publications

If the work you're doing leads to a publication, make sure to consult the dataset pages of the datasets you've used for information from the data administrators on how to cite them correctly.
