Analyzing large tabular data
This guide demonstrates using a Redivis project to gather key variables from different tables, clean them, and consolidate them into a single table for analysis in a notebook.
We want to take weather data collected around the world and use it to understand how precipitation trends in the US have changed over time.
This project is on Redivis! We also suggest recreating it as you go, to best learn the process.
All the weather data we need is contained in the Demo organization dataset GHCN Daily Weather Data.
To get started we want to understand this dataset and what information is in each table. We can look at the dataset page to learn more about it, including its overview information, metadata, and variable summary statistics. Since this dataset is public we can also look directly at the data to confirm it has the information we need. Some tables jump out as ones we will want to work with:
The Daily observations table has nearly 3 billion records and seems to be the table in the dataset with the most information, so we will start here. There is a variable that looks like it will be very helpful for our goal: element. If we click on the variable name, we can see a searchable frequency table with all of this variable's values. One value that jumps out is PRCP, which we can see from its value label represents precipitation. Paired with this is the variable value, which contains a numeric value recording the magnitude of the element. We can learn even more by clicking on the data dictionary on the dataset's overview page, which shows that this value is measured in either mm or inches.
We can see that this table doesn't contain any information about location though, so we'll need to find that somewhere else.
The Stations table contains latitude and longitude variables, as well as a linking variable, id, which will allow us to link in precipitation information from the Daily observations table, since it has the same id variable.
Since there are two tables we want information from we know that we will need to do a join as part of our data reshaping.
At the top of this dataset page we can click the Analyze in project button to get started working with this data.
You can add this dataset to an existing project you already have access to, or create a new project to start from scratch.
We will use transforms to clean the data, as they are best suited for reshaping data and will quickly output new tables we can continue to work with. Even though we might be more comfortable with Python or R, the table we want to work with is 132GB and we would not be able to work with it in a notebook without specialized equipment.
Filter for precipitation
If we click on the element variable we can see that there are many types of elements recorded in this table. Since we only want information about precipitation (PRCP), we will start by filtering out any records with a different value in the element variable.

Start by creating a new transform on the Daily observations table, and add a Filter step. Configure the filter to select for records where element = PRCP.
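For intuition, the filter is conceptually equivalent to the following pandas operation. This is only an illustrative sketch on a tiny, made-up dataframe; the real filtering happens in the transform because the full table is far too large for pandas.

```python
import pandas as pd

# Tiny hypothetical sample of daily observations (the real table has ~3 billion rows)
df = pd.DataFrame({
    "id":      ["US1FLAL0004", "US1FLAL0004", "ASN00010020"],
    "date":    ["2020-01-01", "2020-01-01", "2020-01-01"],
    "element": ["PRCP", "TMAX", "PRCP"],
    "value":   [25, 310, 0],
})

# Keep only precipitation records, mirroring the Filter step (element = PRCP)
prcp = df[df["element"] == "PRCP"]
```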
Looking at the data, it's clear there are many observations of precipitation at each location over time. We know we want to look at annual trends so we can start by aggregating this data to only have one value per year per location.
Conceptually, we will need to aggregate on the year of each record, but if we look at the date variable in this table it also contains the month and day. For aggregation we will need a field which contains only the year, which we can collapse on.

In this same transform we will add a Create variables step and make a new variable named year using the date extract method, which will pull the year out of the date value.
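In pandas terms, the date extract method corresponds roughly to the following (again just a sketch on made-up rows, not the actual transform):

```python
import pandas as pd

prcp = pd.DataFrame({
    "id":    ["US1FLAL0004", "ASN00010020"],
    "date":  ["2020-01-01", "2020-06-15"],
    "value": [25, 0],
})

# Pull the year component out of the date, mirroring the Create variables step
prcp["year"] = pd.to_datetime(prcp["date"]).dt.year
```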
Then we will add an Aggregate step to aggregate the values based on our new year variable. The first thing to do is select the variables to aggregate on, that is, the variables whose duplicate value combinations should be grouped together. Since we want information on precipitation per station per year, we should choose to aggregate on the id variable (the station) and the year variable. When executed, this will group all records with the same combination of values in id and year, drop all other variables from the table, and then collapse duplicate records down to just one record per unique combination of those two variables.
But the most important step here is that we want to gather information about the duplicate records that were dropped! If there were 400 records of precipitation in a year for a particular station, we want to know what those records all add up to. To do this we will Create a new aggregate variable within this aggregation step, named annual_precip. We want to aggregate the information in the value column, since that contains the numeric amount of the corresponding element variable. Since we want the total amount across all dropped records, we will use the Sum method.
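Conceptually, the Aggregate step behaves like the following pandas group-by (an illustrative sketch with made-up rows; the real aggregation runs inside the transform):

```python
import pandas as pd

# Hypothetical precipitation records: multiple observations per station per year
prcp = pd.DataFrame({
    "id":    ["US1FLAL0004", "US1FLAL0004", "ASN00010020"],
    "year":  [2020, 2020, 2020],
    "value": [25, 12, 0],
})

# Group on id and year, summing value into a new annual_precip variable
annual = (
    prcp.groupby(["id", "year"], as_index=False)["value"]
        .sum()
        .rename(columns={"value": "annual_precip"})
)
```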
The final step in a transform is selecting which variables we would like to populate the resulting output table. Since we did an aggregation step, this list becomes limited to only the variables that we aggregated on. Since we want both of these variables in our output we make sure both are selected and visible in the right side list of the footer.
With everything in place we will run this transform to create a new table, by pressing the Run button in the top right corner!
Now that we have created a new table, we can inspect it to make sure our steps accomplished what we expected them to.
Click on the output table below the transform to view it. We can see that it contains the two variables we expected it to, based on our variable selection. The table has almost 3 million records, down from the almost 3 billion in our original table, which makes sense given our aggregation step.
We can also inspect each variable further. If we click on the year variable we can see that all values are four-digit years, and the min and max values in the summary statistics show that the years range from 1781 to 2021, which makes sense. Looking at the frequency table we can see that each year appears with a high frequency in the records, which is what we expect since there are multiple ids, or locations, per year.
If we click on the annual precipitation variable we can see a max value of 792,402 (mm or in) which seems very high but possible for a given year. The minimum value is -148,917 which doesn't seem right. When we look at the data dictionary it doesn't indicate that there should be negative values.
We can investigate our data further by clicking on the Query tab of the table, choosing "Sort on variable" from the templates menu, and sorting on our annual_precip variable ascending (ASC). It looks like relatively few stations have negative values.
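If you later pull a smaller extract into a notebook, the same sanity check could look like this in pandas (illustrative only, with made-up rows):

```python
import pandas as pd

# Hypothetical aggregated output containing one suspicious negative total
annual = pd.DataFrame({
    "id":            ["US1FLAL0004", "ASN00010020", "USW00099999"],
    "year":          [2020, 2020, 2020],
    "annual_precip": [4370, 980, -148917],
})

# Sort ascending so the most negative annual_precip values surface first
print(annual.sort_values("annual_precip").head())
```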
If we were doing a rigorous analysis of this data we might dig deeper into why these values exist in this dataset, or decide to exclude these outliers. However, since this is an example we can decide to leave them in and see how they affect our outputs. We can easily go back and exclude them from the project at a later point.
Now that we have a table we are satisfied with that contains annual precipitation information, we want to join in the latitude and longitude information about each station so we can make a geographical visual as part of our analysis.
This geographic information is included in the Stations table, so we will need to do a join which connects records from two tables.
While we could do this transformation step in our previous transform and rerun it, let's create a new transform to make this join so that we will have our previous output table saved to compare the output to.
In the new transform add a Join step. Set the joined table to the Stations table from our original dataset. The join type will depend on which records we want to keep based on how the matching process goes. Since we only want to keep records that have both information from our source table (annual precipitation) AND information from our joined table (station latitude and longitude), we set the join type to Inner join.
Since our identifier in both tables is the variable id, set id from the source table (t0) as the left side of the join condition and id from the joined table (t1) as the right side. This means that for every record in our source table, we look for a corresponding value in the Stations table and join in the values from that table's additional columns.
When you set two variables equal to each other in a Join, they must be of the same type (string, integer, etc.). In this example both id variables are string type so it works, but if one had been an integer we would have needed to retype one of them first to match.
There are a number of additional variables in the Stations table, but we are only interested in the latitude and longitude variables. In our variable selector we move all variables from our source table (year, id, and annual_precip), as well as latitude and longitude, into the selected list.
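As a rough pandas analogy for what this inner join produces (a sketch with small, hypothetical dataframes; the actual join runs in the transform):

```python
import pandas as pd

# Hypothetical aggregated precipitation table (source table, t0)
annual = pd.DataFrame({
    "id":            ["US1FLAL0004", "ASN00010020"],
    "year":          [2020, 2020],
    "annual_precip": [4370, 980],
})

# Hypothetical stations table (joined table, t1); we keep only latitude and longitude from it
stations = pd.DataFrame({
    "id":        ["US1FLAL0004", "ASN00010020"],
    "latitude":  [29.65, -20.78],
    "longitude": [-82.32, 117.18],
})

# Inner join on id: only records with a match in both tables are kept
joined = annual.merge(stations[["id", "latitude", "longitude"]], on="id", how="inner")
```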
With everything set up, run this transform and sanity check the output table! We started with a 132GB table and have narrowed it down to 134MB containing only the data we want in the format we will need.
Next we want to generate an interactive visual to understand annual precipitation events in different locations over time. To do this we will create a Notebook node on our output table.
For this project we will use Python but you can also use R, Stata, or SAS if you'd like. When you create the notebook for the first time it will start up. Notebooks must be running to execute code.
Redivis notebooks come with many common packages preinstalled and we will use those in this example. If you'd like you can install additional packages by clicking the Dependencies button.
Since this notebook contains only public data we can install packages at any time, but notebooks working with restricted data do not have internet access, and their packages can only be installed while the notebook is stopped.
Newly created notebooks come with standard code to import the Redivis library and reference the source table as a pandas dataframe within the notebook. You can use this pandas code or replace it to use the dataframe library of your choice. To run the standard code, click inside this cell and press the run button, or press the Shift + Enter keys.
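The auto-generated cell follows a pattern roughly like the one below. Treat this as an approximation: the exact table reference is generated for your notebook, and the method names come from the redivis-python client as we understand it.

```python
import redivis

# Load the notebook's source table into a pandas dataframe
# (the table reference string is generated automatically and will differ in your notebook)
df = redivis.table("output_table").to_pandas_dataframe()
df.head()
```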
Now we will create a new cell to organize our work by pressing the + button in the top bar.
In this cell we want to limit our records to only ones whose latitude and longitude values fall in a specific range of the continental United States.
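For example, using an approximate bounding box for the continental US (the exact cutoffs here are our own illustrative choice, and df is the dataframe loaded above):

```python
# Approximate continental-US bounding box (illustrative values)
lat_min, lat_max = 24.5, 49.5
lon_min, lon_max = -125.0, -66.5

# Keep only stations whose coordinates fall inside the box
us = df[
    df["latitude"].between(lat_min, lat_max)
    & df["longitude"].between(lon_min, lon_max)
]
```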
We want to see how our annual precipitation variable has changed at stations in the United States, so for each station we will group by the id variable and compute an average for annual_precip.
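A minimal sketch of that computation, assuming the filtered dataframe us from the previous cell:

```python
# Average annual precipitation per station, keeping coordinates for mapping
station_avg = (
    us.groupby(["id", "latitude", "longitude"], as_index=False)["annual_precip"]
      .mean()
      .rename(columns={"annual_precip": "avg_annual_precip"})
)
```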
Now that we have the information calculated, we can plot it! We'll use Plotly to do the heavy lifting, and point it at our relevant variables.
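One way to build such a figure with Plotly Express (a sketch assuming the station_avg dataframe above; the column names and styling are our own choices):

```python
import plotly.express as px

# Map of stations across the US, colored by average annual precipitation
fig = px.scatter_geo(
    station_avg,
    lat="latitude",
    lon="longitude",
    color="avg_annual_precip",
    scope="usa",  # focus the basemap on the United States
    title="Average annual precipitation by station",
)
fig.show()
```

To emphasize change over time rather than a single per-station average, one option is to plot the un-averaged yearly values instead and pass animation_frame="year" to scatter_geo.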
Now we've created an interactive figure showing change in precipitation by location in the United States.
Perhaps we see something in this plot we want to investigate further, or we want to go back and change some of our assumptions. Projects are iterative, and at any point you can go back and change the source data, transform configurations, or notebooks, and rerun them.
Notebooks can also create output tables, which let you sanity check the work done in the notebook or produce a table to use in another notebook or transform. You can also fork this project to work on a similar analysis, or export any table in this project for work elsewhere.