# Analyzing large tabular data

This guide demonstrates using a Redivis workflow to gather key variables from different tables, clean them, and consolidate them into a single table for analysis in a notebook.

## Workflow objective <a href="#starting-your-project" id="starting-your-project"></a>

We want to take weather data collected around the world and use it to understand how precipitation trends in the US have changed over time.

{% hint style="success" %}
[This workflow is on Redivis](https://redivis.com/projects/x7kh-5pvd4mbf1)! We suggest recreating it as you go to best learn the process.
{% endhint %}

## 1. Explore data <a href="#starting-your-project" id="starting-your-project"></a>

All the weather data we need is contained in the Demo organization dataset [GHCN Daily Weather Data](https://redivis.com/datasets/7br5-41440fjzk).&#x20;

<figure><img src="/files/6KN9yl2IX87YYd3Y4dP9" alt=""><figcaption></figcaption></figure>

To get started we want to understand this dataset and what information is in each table. We can look at the dataset page to learn more about it, including its overview information, metadata, and variable summary statistics. Since this dataset is public we can also look directly at the data to confirm it has the information we need. Some tables jump out as ones we will want to work with:

#### Daily observations

This table has nearly 3 billion records and appears to contain the most information of any table in the dataset, so we will start here. One variable looks especially helpful for our goal: `element`.

If we click on the variable name, we can see a searchable frequency table with all of this variable's values. One value that jumps out is `PRCP`, which the value label tells us represents precipitation. Paired with this is the variable `value`, which contains a numeric value recording the magnitude of the element. We can learn even more by clicking on the data dictionary in the dataset's overview page, which shows that this value is measured in either mm or inches.

We can see that this table doesn't contain any information about location though, so we'll need to find that somewhere else.&#x20;

{% embed url="https://redivis.com/embed/tables/6fff-7hqb797cb" %}

#### Stations

This table contains `latitude` and `longitude` variables as well as a linking variable `id` which will allow us to link in precipitation information from the Daily observations table, since it has the same `id` variable.&#x20;

Since there are two tables we want information from we know that we will need to do a join as part of our data reshaping.

{% embed url="https://redivis.com/embed/tables/g2q3-fs847tjwe" %}

## 2. Create a workflow

At the top of this dataset page we can click the **Analyze in workflow** button to get started working with this data.&#x20;

You can add this dataset to an existing workflow you already have access to, or create a new workflow to start from scratch.&#x20;

## 3. Clean and reshape data

We will use transforms to clean the data, as they are best suited for reshaping data and will quickly output new tables we can continue to work with. Even though we might be more comfortable with Python or R, the table we want to work with is 132GB and we would not be able to work with it in a notebook without specialized equipment.

**Filter for precipitation**

If we click on the `element` variable we can see that there are many types of elements recorded in this table. Since we only want information about precipitation (`PRCP`), we will start by filtering out any records with a different value in the `element` variable.&#x20;

Start by creating a new transform on the **Daily observations** table, and add a [**Filter**](/reference/workflows/transforms/step-filter.md) step. Configure the filter to keep only records where `element` `=` `PRCP`.

<figure><img src="/files/PXoG9BJRjXuJuOoYkjWX" alt=""><figcaption></figcaption></figure>
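Conceptually, this Filter step behaves like a boolean filter in pandas. Here is a minimal sketch on a tiny hypothetical DataFrame (the station ids and values are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for a few Daily observations records
obs = pd.DataFrame({
    "id": ["US1", "US1", "US2"],
    "element": ["PRCP", "TMAX", "PRCP"],
    "value": [12, 301, 0],
})

# Keep only precipitation records, mirroring the Filter step
prcp = obs[obs.element == "PRCP"]
```

In the workflow itself the transform does this at scale against the full table, so no data needs to be loaded into memory.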

#### **Compute annual** precipitation

Looking at the data, it's clear there are many observations of precipitation at each location over time. We know we want to look at annual trends so we can start by aggregating this data to only have one value per year per location.&#x20;

Conceptually, we will need to aggregate on the year of each record, but the `date` variable in this table also contains the month and day. For aggregation we will need a field containing only the year, which we can collapse on.

In this same transform we will add a [**Create variables**](/reference/workflows/transforms/step-create-variables.md) step and make a new variable named `year` using the **date extract** method, which will pull the year out of the date value.&#x20;

<figure><img src="/files/7D9RJmbmJ6EtLOtM1UGQ" alt=""><figcaption></figcaption></figure>
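The **date extract** method is analogous to pulling the year component out of a datetime in pandas; a small sketch with made-up dates:

```python
import pandas as pd

# Hypothetical observation dates
obs = pd.DataFrame({"date": pd.to_datetime(["2001-03-15", "2001-07-04", "2002-01-01"])})

# Extract the year, like the transform's "date extract" method
obs["year"] = obs.date.dt.year
```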

Then we will add an [**Aggregate**](/reference/workflows/transforms/variable-creation-methods/aggregate.md) step to aggregate values based on our new `year` variable. First we select the variables to aggregate on, in other words the variables whose duplicate value combinations we want records grouped by. Since we want information on precipitation per *station* per *year*, we should choose to aggregate on the `id` variable (station) and the `year` variable. When executed, this will group all records with the same combination of values in `id` and `year`, drop all other variables from the table, and then collapse duplicates down to one record per unique combination of those two variables.

The most important part of this step is gathering information about the duplicate records that were dropped! If there were 400 records of precipitation in a year for a particular station, we want to know what those records add up to. To do this we will **Create a new aggregate variable** within this aggregation step named `annual_precip`. We want to aggregate the information in the `value` variable, since it contains the numeric amount of the corresponding `element`. Since we want the total amount across all dropped records, we will use the `Sum` method.

<figure><img src="/files/maXYALlXYdtN79Y3ZDYi" alt=""><figcaption></figcaption></figure>
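In pandas terms, this Aggregate step is a group-by with a sum. A minimal sketch using hypothetical values, where two records for station `US1` collapse into one:

```python
import pandas as pd

# Hypothetical precipitation records: two for US1, one for US2
obs = pd.DataFrame({
    "id": ["US1", "US1", "US2"],
    "year": [2001, 2001, 2001],
    "value": [10, 5, 7],
})

# Group on id + year and sum the collapsed records' values
annual = (
    obs.groupby(["id", "year"], as_index=False)["value"]
    .sum()
    .rename(columns={"value": "annual_precip"})
)
```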

#### Select variables and run

The final step in a transform is selecting which variables will populate the resulting output table. Since we did an aggregation step, this list is limited to the variables we aggregated on and the aggregate variables we created. Since we want all of these in our output, make sure they are selected and visible in the right-side list of the footer.

<figure><img src="/files/1l9dGmwjWpBsHDSGVUZT" alt=""><figcaption></figcaption></figure>

With everything in place, press the **Run** button in the top right corner to run this transform and create a new table!

## 4. Sanity check the output table

Now that we have created a new table, we can inspect it to make sure our steps accomplished what we expected them to.&#x20;

#### Table characteristics

Click on the output table below the transform to view it. We can see that it contains the two variables we expected it to, based on our variable selection. The table has almost 3 million records, down from the almost 3 billion in our original table, which makes sense given our aggregation step.

#### Variable characteristics

We can also inspect each variable further. If we click on the `year` variable we can see that all values are four digit years, and the min and max values in the summary statistics show that the years range from 1781 to 2021, which makes sense. Looking at the frequency table we can see a high frequency for each year represented in the records, which is what we expect since there are multiple `id`s, or locations, per year.

If we click on the `annual_precip` variable we can see a max value of 792,402 (mm or in), which seems very high but possible for a given year. The minimum value is -148,917, which doesn't seem right. The data dictionary doesn't indicate that there should be negative values.

We can investigate further by clicking on the **Query** tab of the table, choosing "Sort on variable" from the templates menu, and sorting on our `annual_precip` variable ascending (ASC). It looks like a relatively small number of stations have negative values.
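The same inspection could be sketched in pandas (with hypothetical values) by sorting ascending so the smallest records surface first:

```python
import pandas as pd

# Hypothetical aggregated output with one suspicious negative value
annual = pd.DataFrame({
    "id": ["US1", "US2", "US3"],
    "annual_precip": [15, -3, 7],
})

# Sort ascending so any negative values appear at the top
suspect = annual.sort_values("annual_precip")
```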

If we were doing a rigorous analysis of this data we might dig deeper into why these values exist in this dataset, or decide to exclude these outliers. However, since this is an example, we will leave them in and see how they affect our outputs. We can easily go back and exclude them from the workflow at a later point.

<figure><img src="/files/Ay6KQF6e2c9GlZUDwHxc" alt=""><figcaption></figcaption></figure>

## 5. Join geographic information

Now that we have a table we are satisfied with that contains annual precipitation information, we want to join in the latitude and longitude information about each station so we can make a geographical visual as part of our analysis.&#x20;

This geographic information is included in the **Stations** table, so we will need to do a **join** which connects records from two tables.&#x20;

While we could add this step to our previous transform and rerun it, let's create a new transform for this join so that our previous output table stays available for comparison.

In the new transform add a [**Join**](/reference/workflows/transforms/step-join.md) step. Set the joined table to the **Stations** table from our original dataset. The join type depends on which records we want to keep based on how the matching process goes. Since we only want to keep records that have both information from our source table (annual precipitation) AND information from our joined table (station latitude and longitude), we set the join type to **Inner** join.

Since our identifier in both tables is the variable `id`, set `id` from the source table (t0) as the left side of the join condition and `id` from the joined table (t1) as the right side. This means that for every record in our source table, the transform will look for a corresponding value in the **Stations** table and join in the values from that table's additional columns.

<figure><img src="/files/wuAtkkQe8BekVgEDnDVA" alt=""><figcaption></figcaption></figure>
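An inner join like this maps to a pandas `merge` with `how="inner"`. A sketch with hypothetical records, where station `US3` has no match in the stations table and station `US2` has no precipitation record, so neither survives the join:

```python
import pandas as pd

# Hypothetical annual precipitation output (source table, t0)
annual = pd.DataFrame({
    "id": ["US1", "US3"],
    "year": [2001, 2001],
    "annual_precip": [15, 7],
})

# Hypothetical Stations table (joined table, t1)
stations = pd.DataFrame({
    "id": ["US1", "US2"],
    "latitude": [40.0, 35.0],
    "longitude": [-105.0, -90.0],
})

# Inner join: keep only records with a match in both tables
joined = annual.merge(stations, on="id", how="inner")
```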

{% hint style="info" %}
When you set two variables equal to each other in a Join, they must be of the same type (string, integer, etc). In this example both `id` variables are string type so it works, but if one had been an integer we would have needed to retype one of them first to match.&#x20;
{% endhint %}

There are a number of additional variables in the **Stations** table, but we are only interested in `latitude` and `longitude`. In our variable selector we keep all variables from our source table (`year`, `id`, `annual_precip`) as well as `latitude` and `longitude` from the joined table.

With everything set up, run this transform and sanity check the output table! We started with a 132GB table and have narrowed it down to 134MB containing only the data we want in the format we will need.

## 6. Analyze outputs in a notebook <a href="#finishing-your-project" id="finishing-your-project"></a>

Next we want to generate an interactive visual to understand annual precipitation events in different locations over time. To do this we will create a [**Notebook**](/reference/workflows/notebooks.md) node on our output table.&#x20;

For this workflow we will use Python but you can also use R, Stata, or SAS if you'd like. When you create the notebook for the first time it will start up. Notebooks must be running to execute code.&#x20;

{% hint style="info" %}
Redivis notebooks come with many common packages preinstalled and we will use those in this example. If you'd like you can install additional packages by clicking the **Dependencies** button.&#x20;

Since this notebook contains only public data we can install packages at any time, but for restricted data notebooks do not have internet access and packages can only be installed when they are stopped.
{% endhint %}

#### Reference data

Newly created notebooks come with standard code to import the Redivis library and reference the source table in a pandas dataframe within the notebook.&#x20;

You can use this pandas code or replace it to use the [dataframe](/reference/workflows/notebooks/notebook-concepts.md#loading-data) of your choice. To use the standard code, click inside this cell and press the run button, or the `Shift` + `Enter` keys.

#### Select data from the United States <a href="#visualizing-the-data" id="visualizing-the-data"></a>

Now we will create a new cell to organize our work by pressing the `+` button in the top bar.&#x20;

In this cell we want to limit our records to only those whose latitude and longitude values fall within a bounding box around the continental United States.&#x20;

```python
# Bounding box (approximate) for the continental United States
us_df = df[
    (df.latitude >= 24.396308)
    & (df.latitude <= 49.384358)
    & (df.longitude >= -124.848974)
    & (df.longitude <= -66.885444)
]
```

#### Compute average change <a href="#visualizing-the-data" id="visualizing-the-data"></a>

We want to see how our annual precipitation variable has changed at stations in the United States, so for each station we will group by the `id` variable and compute an average of `annual_precip`.

```python
# Compute avg precip for each station between 1980-2010
df_1980_2010_average = us_df[(us_df.year > 1980) & (us_df.year < 2010)].drop(columns=['year'])
df_1980_2010_average = df_1980_2010_average.groupby(['id']).mean()
df_1980_2010_average.rename(columns={"annual_precip": "_1980_2010_avg"}, inplace=True)

# Compute avg precip for each station between 1990-2020
df_1990_2020_average = us_df[(us_df.year > 1990) & (us_df.year < 2020)].drop(columns=['year'])
df_1990_2020_average = df_1990_2020_average.groupby(['id']).mean()
df_1990_2020_average.rename(columns={"annual_precip": "_1990_2020_avg"}, inplace=True)

diff_df = df_1980_2010_average.join(df_1990_2020_average["_1990_2020_avg"])
diff_df['deviation'] = (diff_df._1990_2020_avg - diff_df._1980_2010_avg) / diff_df._1980_2010_avg * 100
diff_df = diff_df[(diff_df.deviation < 25) & (diff_df.deviation > -25)]

diff_df = diff_df.dropna()
diff_df
```

#### Plot the results

Now that we have the information calculated, we can plot it! We'll use [Plotly](https://plotly.com/python/) to do the heavy lifting, and point it at our relevant variables.

```python
import numpy as np
import plotly.express as px
import plotly.figure_factory as ff

px.set_mapbox_access_token("pk.eyJ1IjoiaW1hdGhld3MiLCJhIjoiY2thdnl2cGVsMGtldTJ6cGl3c2tvM2NweSJ9.TXtG4gARAf4bUbnPVxk6uA")

fig = ff.create_hexbin_mapbox(
    data_frame=diff_df, lat="latitude", lon="longitude",
    color="deviation",
    agg_func=np.mean,
    title="% Change precipitation, 1981-2010 vs 1991-2020",
    range_color=[-15,15],
    nx_hexagon=50, opacity=0.4, labels={"color": "Percent change"}, color_continuous_scale="Icefire_r",
)

fig.update_layout(margin=dict(b=0, t=0, l=0, r=0))
fig.show("nteract")
```

Now we've created an interactive figure showing change in precipitation by location in the United States.&#x20;

<figure><img src="/files/PrPcjkdsRep8rwbR7AnM" alt=""><figcaption></figcaption></figure>

## Next steps

Perhaps we see something in this plot we want to investigate further, or we want to go back and change some of our assumptions. Workflows are iterative: at any point you can go back and change the source data, transform configurations, or notebooks and [rerun](/reference/workflows/overview.md#run-all) them.

Notebooks can also [create output tables](/reference/workflows/notebooks/notebook-concepts.md#outputting-tables), which allow you to sanity check work done in the notebook, or to create a table to use in another notebook or transform. You can also [fork](/reference/workflows/overview.md#fork-the-project) this workflow to start a similar analysis, or [export](/reference/tables/exporting-tables.md) any table in this workflow for work elsewhere.

