Stata notebooks

Overview

Stata notebooks are available for researchers who are more comfortable working in Stata and its ecosystem. They are built off the same base image as python notebooks, but include the official pystata library to allow for the execution of Stata in a notebook environment.

Working with Stata in a notebook environment is slightly different from the Stata desktop application, in that we need to use python to pass data into Stata. This step is quite simple and doesn't require any expertise in python – see working with tabular data below.

While Stata is fully supported on Redivis, certain Redivis concepts, such as unstructured data files, don't have a counterpart in Stata. Moreover, Stata doesn't support the sort of parallelized stream processing available in Python and R.

Enabling Stata notebooks

Because Stata is proprietary software, you will need to provide a license for Stata 16 or later in order to enable Stata notebooks on Redivis. Organizations can specify license information in their settings, which will make Stata notebooks available to all members of the organization. Alternatively, you can provide your own Stata license in your workspace.

In the Jupyter-Stata documentation, you may see references to configuring Stata via the stata_setup command. There is no need to run this command in Stata notebooks on Redivis, as everything has been pre-configured.
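For reference, a standard pystata setup outside Redivis typically looks like the sketch below; the installation path and edition shown are placeholders, and none of this is required in a Redivis Stata notebook.

# NOT needed on Redivis – shown only for comparison with a self-managed Jupyter setup
# The installation path ("/usr/local/stata17") and edition ("mp") are placeholders
import stata_setup
stata_setup.config("/usr/local/stata17", "mp")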

Base image and dependencies

Stata notebooks are based off the python notebook base image, and can combine both Stata and Python dependencies to create novel workflows.

To further customize your compute environment, you can specify various dependencies by clicking the Dependencies button at the top-right of your notebook. Here you will see three tabs: Packages, pre_install.sh, and post_install.sh.

Use the Packages tab to specify the python packages you would like to install. When you add a new package, it will be pinned to its latest version, but you can specify another version if preferred.

To install Stata packages via ssc, use the pre- and post-install shell scripts. These scripts are executed before and after the python package installation, respectively, and can run arbitrary shell code. Here you can invoke Stata to run ssc install, use apt to install system packages (apt-get update && apt-get install -y <package>), or use mamba to install packages from conda. E.g.:

# A shell script. You can run Stata code here, 
# or install other system dependencies via apt-get

stata -e -q 'ssc install outreg2'

For notebooks that reference restricted data, internet access will be disabled while the notebook is running. This means that the dependencies interface is the only place from which you can install dependencies; running ssc install within your notebook will fail.

Moreover, it is strongly recommended to always install your dependencies through the dependencies interface (regardless of whether your notebook has internet access), as this provides better reproducibility and documentation for future use.

Working with tabular data

To load data into Stata, we first pull it into a dataframe in python, and then pass that variable into Stata. If you're unfamiliar with python, you can simply copy and paste the code below into the first cell of your notebook to load the data in python.

# We first load the table via python, and then pass the dataframe into stata
import redivis
df = redivis.table("_source_").to_pandas_dataframe(dtype_backend="numpy")

Next, in a separate cell, we use the %%stata "magic" at the start of the cell to specify that it contains Stata code. We include the -d df argument to pass the df variable from python into Stata, and the -force flag to tell Stata to overwrite any dataset currently in memory.

%%stata -d df -force
/* Run stata code! All stata cells must be prefixed with %%stata */
describe

Any subsequent cells that execute Stata code should be prefixed by %%stata if they span more than one line, or by %stata if the code to be executed is all on one line:

%stata scatter mpg price

You can also use the %%mata command to execute Mata code:

%%mata
/* 
 Create the matrix X in Mata and then obtain its inverse, Xi. 
 Then, multiply Xi by the original matrix, X 
*/

X = (76, 53, 48 \ 53, 88, 46 \ 48, 46, 63)
Xi = invsym(X)
Xi
Xi*X

Working with geospatial data

Through various packages, Stata offers some support for geospatial datatypes. However, we can't pass geospatial data from python to Stata natively; instead, we first need to create a shapefile that can then be loaded into Stata.

# This python code loads a geospatial table and then writes it to a shapefile
import redivis
geopandas_df = redivis.table("_source_").to_geopandas_dataframe(dtype_backend="numpy")
geopandas_df.to_file("out.shp")

%%stata
spshape2dta out.shp
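Once converted, you can continue in Stata as usual. The following is a minimal sketch that relies on spshape2dta's documented behavior of writing an attribute dataset (out.dta) and a coordinate dataset (out_shp.dta) alongside the shapefile:

%%stata
/* Load the attribute dataset produced by spshape2dta and inspect it */
use out, clear
describe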

Working with larger tables

If your data is too big to fit into memory, you may need to first download the data as a CSV, and then read that file into Stata:

import redivis
redivis.table("_source_").download("/scratch/table.csv", format="csv")

%%stata
import delimited "/scratch/table.csv"

Creating output tables

To create an output table, we first need to pass our Stata data back to python, using the -doutd flag. We can then use the redivis.current_notebook().create_output_table() method in python to output our data.

If an output table for the notebook already exists, it will be overwritten by default. You can pass append=True to append to, rather than overwrite, the table. For the append to succeed, any variables in the appended table that are also present in the existing table must have the same type.

%%stata -doutd df2
/*
  Once this cell executes, the current dataset will be pushed 
  to the python variable df2
*/
rename v* newv*

# Via python, pass this dataframe to the output table
# If append=True, subsequent calls will add to the existing table, 
#   rather than replacing it
redivis.current_notebook().create_output_table(df2, append=False)
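If you are building the output table incrementally (for example, pushing one subset of the Stata dataset back to python at a time), a minimal sketch using the append behavior described above might look like this; it assumes df2 holds the next batch of rows, with variable types matching those already in the table:

# Append df2 to the existing output table rather than replacing it
redivis.current_notebook().create_output_table(df2, append=True)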

Storing files

As you perform your analysis, you may generate files and figures that are stored on the notebook's hard disk. There are two locations that you should write files to: /out for persistent storage, and /scratch for temporary storage. By default, the output location is set to /scratch.

Any files written to persistent storage will be available when the notebook is stopped, and will be restored to the same state when the notebook is run again. In contrast, any files written to temporary storage will only exist for the duration of the current notebook session.

%%stata
save "/out/my_dataset.dta"
outreg2 using /out/table.xls, replace
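As a further illustration of the two storage locations, the sketch below writes an intermediate dataset to temporary storage and a figure to persistent storage; it assumes a dataset with mpg and price variables (as in the earlier scatter example) is currently loaded in Stata's memory:

%%stata
/* Intermediate file: only needed this session, so write it to /scratch */
save "/scratch/intermediate.dta", replace
/* Figure: write it to /out so it persists after the notebook stops */
scatter mpg price
graph export "/out/scatter.png", replace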

View the Table.to_pandas_dataframe() python documentation.

View the full documentation for the %%stata magic, including other helpful flags for moving data between python and Stata.

View the Table.to_geopandas_dataframe() python documentation.

Redivis notebooks offer the ability to materialize notebook outputs as a new table node in your workflow. This table can then be processed by transforms, read into other notebooks, exported, or even re-imported into a dataset.