# Stata notebooks

## Overview

Stata notebooks are available for those researchers who are more comfortable using Stata and its ecosystem. These are built off the same base image as [python notebooks](https://docs.redivis.com/reference/workflows/notebooks/python-notebooks), but include the [official pystata library](https://www.stata.com/features/overview/jupyter-notebooks/) to allow for the execution of Stata in a notebook environment.&#x20;

Working with Stata in a notebook environment is slightly different than the Stata desktop application, in that we need to utilize python to pass data into Stata. This step is quite simple, and doesn't require any expertise in python – see [working with tabular data](#working-with-tabular-data) below.

{% hint style="info" %}
While Stata is fully supported on Redivis, certain Redivis concepts, such as [unstructured data files](https://docs.redivis.com/reference/datasets/create-and-edit-datasets/import-files), don't have a corollary in Stata. Moreover, Stata doesn't support the sorts of parallelized stream processing available in Python and R.
{% endhint %}

## Enabling Stata notebooks&#x20;

Because Stata is proprietary software, you will need to provide a license for Stata 16 or later in order to enable Stata notebooks on Redivis. Organizations can specify license information in [their settings](https://docs.redivis.com/organizations/settings#stata-and-sas-licenses), which will make Stata notebooks available to all members of their organization. Alternatively, you can provide your own stata license in [your workspace](https://docs.redivis.com/your-account/workspace#advanced).

{% hint style="info" %}
In the Jupyter-Stata documentation, you may see references to configuring stata via the `stata_setup` command. There is no need to run this command in Stata notebooks on Redivis, as everything has been pre-configured.
{% endhint %}

## Base image and dependencies

Stata notebooks are based off the [python notebook base image](https://docs.redivis.com/reference/workflows/python-notebooks#base-image-and-dependencies), and can combine both Stata and Python dependencies to create novel workflows.

To further customize your compute environment, you can specify various dependencies by clicking the **Dependencies** button at the top-right of your notebook. Here you will see three tabs: **Packages, pre\_install.sh**, and **post\_install.sh**.

Use packages to specify the *python* packages that you would like to install. When adding a new package, it will be pinned to the latest version of that package, but you can specify another version if preferred.

In order to install Stata packages via `ssc`, you should use the pre- and post- install shell scripts. These scripts are executed on either side of the python package installation, and are used to execute arbitrary code in the shell. Here you can execute stata code to run `ssc install`, and you can also use `apt` to install system packages (`apt-get update && apt-get install -y <package>`), or `mamba` to install from conda. E.g.

<pre class="language-bash"><code class="lang-bash"># A shell script. You can run Stata code here, 
# or install other system dependencies via apt-get
<strong>
</strong>stata -e -q 'ssc install outreg2'
</code></pre>

{% hint style="info" %}
For notebooks that reference restricted data, internet will be disabled while the notebook is running. This means that the dependencies interface is the *only* place from which you can install dependencies; running `ssc install` within your notebook will fail.

Moreover, it is strongly recommended to always install your dependencies through the dependencies interface (regardless of whether your notebook has internet access), as this provides better reproducibility and documentation for future use.
{% endhint %}

## Working with tabular data

In order to load data into Stata, we use a one-liner to pull any Redivis table, or SQL query result, into Stata. All variable metadata, including value labels, will be loaded with the data. This will overwrite any working dataset that we currently have in our Stata session.

```python
import redivis
redivis.table("_source_").to_stata()

# We can also reference any table in the workflow by name
redivis.table("transform_1_output").to_stata()

# Or run a SQL query and load its results into Stata
redivis.query("SELECT * FROM _source_ LIMIT 10").to_stata()
```

Next, in a separate cell, we use the `%%stata` "magic" at the start of our cell to specify that this is stata code.&#x20;

<pre class="language-stata"><code class="lang-stata">%%stata
/* Run stata code! All stata cells must be prefixed with %%stata */
<strong>describe()
</strong></code></pre>

Any subsequent cells that execute stata code should be prefixed by `%%stata` if they are more than one line, or by `%stata` if the code to be executed is all on one line:

```stata
%stata scatter mpg price
```

{% hint style="info" %}
View full documentation for the `%%stata` magic, including other helpful flags for moving data between python and Stata, [here >](https://www.stata.com/python/pystata17/notebook/Magic%20Commands1.html#Arguments)
{% endhint %}

You can also use the `%%mata` command to execute Mata code:

```mata
%%mata
/* 
 Create the matrix X in Mata and then obtain its inverse, Xi. 
 Then, multiply Xi by the original matrix, X 
*/

X = (76, 53, 48 \ 53, 88, 46 \ 48, 46, 63)
Xi = invsym(X)
Xi
Xi*X
```

## Working with geospatial data

Loading geospatial data looks much the same as other tabular data, in that we can just call the `.to_stata()` method on any table or query in python. If that table or query result contains a geography variable, it will automatically be loaded as geospatial data to Stata via the `spshape2dta` package.

In the uncommon case where your table has multiple variables of type `geography`, you'll need to explicitly specify which variable to use as the geography.

If you want to bypass the default behavior and load a table with a geography variable as a standard Stata table, you can explicitly set `.to_stata(geography_variable = None)`.

```python
import redivis

redivis.table("some_geo_table").to_stata()

# If our table has more than one variable of type 'geography', 
#   we must specify the `geography_variable` param. 
redivis.table("some_geo_table").to_stata(geography_variable="geo_var")
```

```stata
%%stata
// Data has been loaded via the spshape2dta package
describe()
```

## Creating output tables

Redivis notebooks offer the ability to materialize notebook outputs as a new [table node](https://docs.redivis.com/reference/workflows/tables) in your workflow. This table can then be processed by transforms, read into other notebooks, exported, or even [re-imported into a dataset](https://docs.redivis.com/guides/create-and-manage-datasets/cleaning-tabular-data).

To create an output table, we must first save our Stata data set to a `.dta` file. We can then use the `redivis.current_notebook().create_output_table()` method in python to output our data.

If an output table for the notebook already exists, by default it will be overwritten. You can pass `append=True` to append, rather than overwrite, the table. In order for the append to succeed, all variables in the appended table, which are also present in the existing table, must have the same type.

```stata
%stata save "/scratch/out.dta", replace
```

```python
# Run in a separate cell, this is python code
redivis.current_notebook().create_output_table("/scratch/out.dta", append=False)
```

## Storing files

As you perform your analysis, you may generate files and figures that are stored on the notebook's hard disk. There are two locations that you should write files to: `/out` for persistent storage, and `/scratch` for temporary storage. By default, the output location is set to `/scratch`.

Any files written to persistent storage will be available when the notebook is stopped, and will be restored to the same state when the notebook is run again. Alternatively, any files written to temporary storage will only exist for the duration of the current notebook session.

```stata
%%stata
save "/out/my_dataset.dta"
outreg2 using /out/table.xls, replace
```
