Analyzing large tabular data

This guide demonstrates using a Redivis project to gather key variables from different tables, clean them, and consolidate them into a single table for analysis in a notebook.

Project objective

We want to take weather data collected around the world and use it to understand how precipitation trends in the US have changed over time.

This project is available on Redivis! We suggest recreating it as you follow along to best learn the process.

1. Explore data

All the weather data we need is contained in the Demo organization dataset GHCN Daily Weather Data.

To get started we want to understand this dataset and what information is in each table. We can look at the dataset page to learn more about it, including its overview information, metadata, and variable summary statistics. Since this dataset is public we can also look directly at the data to confirm it has the information we need. Some tables jump out as ones we will want to work with:

Daily observations

This table has nearly 3 billion records and seems to be the table in the dataset with the most information, so we will start here. There is a variable that looks like it will be very helpful for our goal: element.

If we click on the variable name, we can see a searchable frequency table with all of this variable's values. One value that jumps out is PRCP, whose value label shows that it represents precipitation. Paired with this is the variable value, which contains a numeric value recording the magnitude of the element. We can learn even more by opening the data dictionary on the dataset's overview page, which shows that this value is measured in either mm or inches.

We can see that this table doesn't contain any information about location though, so we'll need to find that somewhere else.

Stations

This table contains latitude and longitude variables as well as a linking variable id which will allow us to link in precipitation information from the Daily observations table, since it has the same id variable.

Since there are two tables we want information from, we know we will need to do a join as part of our data reshaping.

2. Create a project

At the top of this dataset page we can click the Analyze in project button to get started working with this data.

You can add this dataset to an existing project you already have access to, or create a new project to start from scratch.

3. Clean and reshape data

We will use transforms to clean the data, as they are best suited for reshaping data and will quickly output new tables we can continue to work with. Even though we might be more comfortable with Python or R, the table we want to work with is 132GB, and we would not be able to work with it in a notebook without substantial compute resources.

Filter for precipitation

If we click on the element variable we can see that there are many types of elements recorded in this table. Since we only want information about precipitation (PRCP), we will start by filtering out any records with a different value in the element variable.

Start by creating a new transform on the Daily observations table, and add a Filter step. Configure the filter to select for records where element = PRCP.
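
Conceptually, this Filter step just keeps the rows whose element is PRCP. On a small, hypothetical sample of the Daily observations table loaded as a pandas DataFrame (sample_df here is purely illustrative; the real table is far too large for this), the equivalent logic would be:

# Keep only precipitation records (element == "PRCP")
prcp_df = sample_df[sample_df.element == "PRCP"]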

Compute annual precipitation

Looking at the data, it's clear there are many observations of precipitation at each location over time. We know we want to look at annual trends so we can start by aggregating this data to only have one value per year per location.

Conceptually, we will need to aggregate on the year of each record, but if we look at the date variable in this table it also contains the month and day. For aggregation we will need a field that contains only the year, which we can collapse on.

In this same transform we will add a Create variables step and make a new variable named year using the date extract method, which will pull the year out of the date value.

Then we will add an Aggregate step to aggregate the values based on our new year variable. The first thing we will do is select the variables to aggregate on, that is, the variables whose duplicate value combinations we want grouped together. Since we want information on precipitation per station per year, we should choose to aggregate on the id variable (station) and the year variable. When executed, this will group all records with the same combination of values in id and year, drop all other variables from the table, and then collapse duplicates down to just one record per unique combination of those two variables.

But the most important step here is that we want to gather information about the duplicate records that were dropped! If there were 400 records of precipitation in a year for a particular station, we want to know what those records all add up to. To do this we will Create a new aggregate variable within this aggregation step named annual_precip. We want to aggregate the information in the value column, since that contains the numeric amount of the corresponding element variable. Since we want the total amount across all dropped records we will use the Sum method.
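
For intuition, here is a rough pandas sketch of what the Create variables and Aggregate steps do, continuing with the hypothetical prcp_df sample from above (the actual transform runs this logic against the full table on Redivis infrastructure):

import pandas as pd

# Extract the year from the date, then sum precipitation per station per year
prcp_df["year"] = pd.to_datetime(prcp_df.date).dt.year
annual_df = (
    prcp_df.groupby(["id", "year"], as_index=False)["value"]
    .sum()
    .rename(columns={"value": "annual_precip"})
)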

Select variables and run

The final step in a transform is selecting which variables will populate the resulting output table. Since we did an aggregation step, this list is limited to the variables we aggregated on. Since we want both of these variables in our output, we make sure both are selected and visible in the right-side list of the footer.

With everything in place, we will run this transform to create a new table by pressing the Run button in the top right corner!

4. Sanity check the output table

Now that we have created a new table, we can inspect it to make sure our steps accomplished what we expected them to.

Table characteristics

Click on the output table below the transform to view it. We can see that it contains the two variables we expected it to, based on our variable selection. The table has almost 3 million records, down from the almost 3 billion in our original table, which makes sense given our aggregation step.

Variable characteristics

We can also inspect each variable further. If we click on the year variable, we can see that all values are four-digit years, and the min and max values in the summary statistics show that the years range from 1781 to 2021, which makes sense. Looking at the frequency table, we can see that each year appears with high frequency in the records, which is what we expect since there are multiple ids (locations) per year.

If we click on the annual_precip variable, we can see a max value of 792,402 (mm or in), which seems very high but possible for a given year. The minimum value is -148,917, which doesn't seem right. When we look at the data dictionary, it doesn't indicate that there should be negative values.

We can investigate our data further by clicking on the Query tab of the table, choosing "Sort on variable" from the templates menu, and sorting on our annual_precip variable ascending (ASC). It looks like relatively few stations have negative values.
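
If you prefer to check this in code, an equivalent inspection can be run once the output table is loaded into a notebook as a DataFrame (step 6 below shows how tables are loaded); this sketch assumes that DataFrame is named annual_df:

# Inspect records with negative annual precipitation
negative_df = annual_df[annual_df.annual_precip < 0]
print(len(negative_df), "records with negative annual_precip")
print(negative_df.sort_values("annual_precip").head(10))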

If we were doing a rigorous analysis of this data we might dig deeper into why these values exist in this dataset, or decide to exclude these outliers. However, since this is an example, we can leave them in and see how they affect our outputs. We can easily go back and exclude them from the project at a later point.

5. Join geographic information

Now that we have a table we are satisfied with that contains annual precipitation information, we want to join in the latitude and longitude information about each station so we can make a geographical visual as part of our analysis.

This geographic information is included in the Stations table, so we will need to do a join which connects records from two tables.

While we could do this transformation step in our previous transform and rerun it, let's create a new transform to make this join so that we will have our previous output table saved to compare the output to.

In the new transform add a Join step. Set the joined table to the Stations table from our original dataset. The join type determines which records are kept, depending on how they match across the two tables. Since we only want to keep records that have both information from our source table (annual precipitation) AND information from our joined table (station latitude and longitude), we set the join type to Inner join.

Since our identifier in both tables is the variable id, set id from the source table (t0) as the left side of the join condition and set id from the joined table (t1) as the right side of the join condition. This means that for every record in our source table, we look for a matching id in the Stations table and join in the values from that table's additional columns.

When you set two variables equal to each other in a Join, they must be of the same type (string, integer, etc). In this example both id variables are string type so it works, but if one had been an integer we would have needed to retype one of them first to match.

There are a number of additional variables in the Stations table, but we are only interested in the latitude and longitude variables. In our variable selector we keep all variables from our source table (year, id, annual_precip) and add latitude and longitude from the joined table.
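
As a conceptual sketch, the inner join and variable selection correspond to the following pandas logic, assuming hypothetical DataFrames annual_df (our aggregated output) and stations_df (the Stations table):

# Inner join keeps only records whose id appears in both tables,
# carrying over just the latitude and longitude columns we need
joined_df = annual_df.merge(
    stations_df[["id", "latitude", "longitude"]],
    on="id",
    how="inner",
)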

With everything set up, run this transform and sanity check the output table! We started with a 132GB table and have narrowed it down to 134MB containing only the data we want in the format we will need.

6. Analyze outputs in a notebook

Next we want to generate an interactive visual to understand annual precipitation events in different locations over time. To do this we will create a Notebook node on our output table.

For this project we will use Python but you can also use R, Stata, or SAS if you'd like. When you create the notebook for the first time it will start up. Notebooks must be running to execute code.

Redivis notebooks come with many common packages preinstalled and we will use those in this example. If you'd like you can install additional packages by clicking the Dependencies button.

Since this notebook contains only public data, we can install packages at any time. Notebooks working with restricted data do not have internet access, so their packages can only be installed while the notebook is stopped.

Reference data

Newly created notebooks come with standard code to import the Redivis library and load the source table into a pandas DataFrame within the notebook.

You can use this pandas code or replace it with the data frame library of your choice. To run the standard code, click inside the cell and press the run button, or press Shift + Enter.
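
In a Python notebook, that default cell is typically just the redivis import plus a call that loads the source table into pandas. It will look roughly like the snippet below; the table reference is prefilled for your project, so the name used here is only illustrative:

import redivis

# Load the notebook's source table into a pandas DataFrame
# (the table name is generated for your project; this one is illustrative)
df = redivis.table("annual_precip_with_stations").to_pandas_dataframe()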

Select data from the United States

Now we will create a new cell to organize our work by pressing the + button in the top bar.

In this cell we want to limit our records to only ones whose latitude and longitude values fall in a specific range of the continental United States.

# Keep only stations within a bounding box around the continental United States
us_df = df[
    (df.latitude >= 24.396308)
    & (df.latitude <= 49.384358)
    & (df.longitude >= -124.848974)
    & (df.longitude <= -66.885444)
]

Compute average change

We want to see how our annual precipitation variable has changed at stations in the United States, so for each station we will group by the id variable and compute an average of annual_precip.

# Compute average annual precip for each station, 1981-2010
df_1980_2010_average = us_df[(us_df.year >= 1981) & (us_df.year <= 2010)].drop(columns=['year'])
df_1980_2010_average = df_1980_2010_average.groupby(['id']).mean()
df_1980_2010_average.rename(columns={"annual_precip": "_1980_2010_avg"}, inplace=True)

# Compute average annual precip for each station, 1991-2020
df_1990_2020_average = us_df[(us_df.year >= 1991) & (us_df.year <= 2020)].drop(columns=['year'])
df_1990_2020_average = df_1990_2020_average.groupby(['id']).mean()
df_1990_2020_average.rename(columns={"annual_precip": "_1990_2020_avg"}, inplace=True)

# Join the two period averages by station id and compute the percent change
diff_df = df_1980_2010_average.join(df_1990_2020_average["_1990_2020_avg"])
diff_df['deviation'] = (diff_df._1990_2020_avg - diff_df._1980_2010_avg) / diff_df._1980_2010_avg * 100

# Drop extreme outliers and stations missing either period
diff_df = diff_df[(diff_df.deviation < 25) & (diff_df.deviation > -25)]
diff_df = diff_df.dropna()
diff_df

Plot the results

Now that we have the information calculated, we can plot it! We'll use Plotly to do the heavy lifting, and point it at our relevant variables.

import numpy as np
import plotly.express as px
import plotly.figure_factory as ff

# A Mapbox access token is required to render the hexbin map tiles
px.set_mapbox_access_token("pk.eyJ1IjoiaW1hdGhld3MiLCJhIjoiY2thdnl2cGVsMGtldTJ6cGl3c2tvM2NweSJ9.TXtG4gARAf4bUbnPVxk6uA")

fig = ff.create_hexbin_mapbox(
    data_frame=diff_df, lat="latitude", lon="longitude",
    color="deviation",
    agg_func=np.mean,
    title="% Change precipitation, 1981-2010 vs 1991-2020",
    range_color=[-15,15],
    nx_hexagon=50, opacity=0.4, labels={"color": "Percent change"}, color_continuous_scale="Icefire_r",
)

fig.update_layout(margin=dict(b=0, t=0, l=0, r=0))
fig.show("nteract")

Now we've created an interactive figure showing the change in precipitation by location in the United States.

Next steps

Perhaps we see something in this plot we want to investigate further, or we want to go back and change some of our assumptions. Projects are iterative, and at any point you can go back and change the source data, transform configurations, or notebooks and rerun them.

Notebooks can also create output tables, which let you sanity check the work done in the notebook or feed a table into another notebook or transform. You can also fork this project to work on a similar analysis, or export any table in this project to work with elsewhere.
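
For example, the redivis library can write a DataFrame back out as a notebook output table; a minimal sketch using the diff_df from above (reset_index keeps the station id as a column) looks roughly like:

# Persist the notebook's result as an output table on Redivis
redivis.current_notebook().create_output_table(diff_df.reset_index())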
