Full project walkthrough

Overview

This guide uses the Redivis project tool to gather key variables from different tables, clean them, and and consolidate them into a single table for use in any analytical tool or visualization builder.

We'll use Gini coefficient data (a common measure of income distribution) from the American Community Survey (ACS) to build a map that shows the range of income inequality by US county in 2018.

This project is on Redivis! Copy the project to follow along during this walkthrough

1. Starting your project

You can follow along with this project here. We also suggest you recreate this project as we go to best learn the process. To get started, navigate to your workspace and create a new project.

We have already uploaded all relevant data for this project into a single dataset called Example Project Dataset which contains contains three tables:

  • ACS Gini coefficient table

  • ACS table that measures population by county

  • National Weather Service dataset that maps a US county to lat/long coordinates.

Add this dataset to your project and you can click on the node to see all of the available tables within it. We will select the table we want to work with first--Gini Index by County--and can look at the data format by clicking Cells.

We want to map the Gini estimate and population by US county in our final data visualization. To map data points by county, we will need a variable that represents the latitude, longitude coordinates of that county. We also want the state, county name, and five-digit county code.

Therefore, we want our final dataset to include six distinct variables:‌

  • Gini Estimate

  • State

  • County Name

  • County Code

  • Latitude, Longitude

  • Population

We will get Lat/Long and Population by merging our table with other tables later on in the project. For now, we can select the transform button and make the following changes to finalize and clean the first four variables.‌

  • Gini Estimate: We already have this variable in our dataset: B19083_001E. In Build, we can right-click on B19083_001E, select Edit Variable, and rename it to Gini_estimate. If we wanted to, we could also recast the variable to a different type here. See Casting.

  • County and State Names: We want to break up our NAME variable into two separate columns, county and state using the Regexp extract feature under String. This feature allows users to create a new variable with characters that match a specific pattern of the old variable. For example, creating a new variable for county, we can select all characters in NAME preceding the comma.

  • County Code: The last five digits in the GEO_ID variable represent the county code. We create a new variable titled County_code, select Substring under String, and select all values starting at index 10 up to a max length of five.

Finally, we can choose which variables to keep or discard in our transform. We’ll keep all variables except for GEO_ID, NAME (county and state combined), and B19083_001M (the margin of error for the Gini estimate).

Transform #1: Cleaning the data‌

Click the run button in the upper right-hand corner to run this transform. Below is the resulting table.

Transform #1: Output‌

2. Joining tables

Matching county to latitude, longitude data‌

We now have a table with the Gini estimate, state, county name, and county code, but we need the latitude and longitude information in order to map each county and we also want to account for the size of each county with a population variable. To do this, we'll perform two joins.

If the table we want to join with is not already in the project, we must select Add Dataset in the upper left-hand corner to add our desired table to the project. In this example, all tables are contained within the starting Example Project Dataset. First, we will match each county code to its lat/long coordinates by joining with the "County to Lat/Long Coordinates" table.

A Left Join returns the complete source table (Gini Index by County) and all matching records from the right table (County to Lat/Long). We set the County_code variable in the source table equal to the County_code variable in the right table, which matches a set of lat/long coordinates to each county code (if there is no coordinate, it will return Null).

When you set two variables equal to each other in a Join, they must be of the same type. If, for example, you set a string type variable equal to an integer type variable, you will have to cast the string as an int (or vice versa).

Transform #2: Matching lat/long coordinates to county‌

Some county codes, however, map to more than one set of latitude, longitude coordinates so we must create two new Partitioned variables in this transform: Unique_lat and Unique_long. For Unique lat, we will partition on all kept variables except LAT and take the average of the LAT variable. This says that for that each county, compute the average of all possible latitudes and store that average in the variable Unique_lat.

We will keep all variables from the source table as well as Unique_lat and Unique_long. Finally, we select "Drop rows which are exact duplicates across all kept variables."

Transform #2: Creating Unique_lat and Unique_long variables‌

Running this transform outputs the following table, where each county now corresponds to a unique set of latitude and longitude variables.

Transform #2: Output‌

Matching county to population data

We now have all the desired variables in our dataset except for population per county, which we will need when we create the final visualization.‌

Using the same steps from the first join, we will perform a second join to match each county to its total population by incorporating the Population by County table (also in the original dataset). We select a Left Join and set the County_code variable in the source table (Gini Index by County) equal to the County_code variable in the right table (Population by County).

Transform #3: Matching population data to county‌

Running this transform outputs the following table.

Transform #3: Output‌

3. Finishing your project

We will make one change in our final transform to organize the data, which is to create a new variable: Latitude_longitude. When creating maps in Google Data Studio, the specified dimension is "Latitude, Longitude" so we will combine our two coordinate variables into one.‌

In Build, we cast the Unique_lat and Unique_long variables from float types to string types. In New variables, we select the Concat method under String and concatenate the Unique_lat string, a string with a comma and space, and the Unique_long string. We can now discard Unique_lat and Unique_long and keep the combined variable, Latitude_longitude.

Transform #4: Final cleaninig of the data‌

Running this transform yields the final table. From here, we can edit variable metadata and/or download our new dataset in the format that we'd like.

Transform #4: Output

All six variables—Gini Estimate, State, County, County Code, Lat/Long, and Population—are in our table!‌

4. Visualizing the data

We can utilize datasets from Redivis to create powerful visualizations on platforms such as Observable, JupyterLab, and Google Data Studio. Specifically, the Redivis DataStudio connector allows users to sync up datasets and projects on Redivis with their Google Data Studio account to streamline the data visualization process.‌

In this example, we will simply export the final dataset created in the project as a .csv file and import it into Data Studio under Add data, File Upload. From here, we can create a variety of visualizations, such as scatter plots, Google maps, and stacked bar charts. Refer to the Data Studio Help pages for more specific guidance.‌

The data from this project allows us to design the following interactive map that illustrates the spectrum of income inequality by US county in 2018. Users can hover over each bubble and view the county name, population, and Gini coefficient. The bubble size is determined by the county population and the bubble color is determined by the Gini value according to the key shown below.

Gini Index by US County, 2018