This guide uses the Redivis project tool to gather key variables from different tables, clean them, and and consolidate them into a single table for use in any analytical tool or visualization builder.
We'll use Gini coefficient data (a common measure of income distribution) from the American Community Survey (ACS) to build a map that shows the range of income inequality by US county in 2018.
We have already uploaded all relevant data for this project into a single dataset called Example Project Dataset which contains contains three tables:
ACS Gini coefficient table
ACS table that measures population by county
National Weather Service dataset that maps a US county to lat/long coordinates.
Add this dataset to your project and you can click on the node to see all of the available tables within it. We will select the table we want to work with first--Gini Index by County--and can look at the data format by clicking Cells.
We want to map the Gini estimate and population by US county in our final data visualization. To map data points by county, we will need a variable that represents the latitude, longitude coordinates of that county. We also want the state, county name, and five-digit county code.
Therefore, we want our final dataset to include six distinct variables:
We will get Lat/Long and Population by merging our table with other tables later on in the project. For now, we can select the transform button and make the following changes to finalize and clean the first four variables.
Gini Estimate: We already have this variable in our dataset: B19083_001E. In Build, we can right-click on B19083_001E, select Edit Variable, and rename it to Gini_estimate. If we wanted to, we could also recast the variable to a different type here. See Casting.
County and State Names: We want to break up our NAME variable into two separate columns, county and state using the Regexp extract feature under String. This feature allows users to create a new variable with characters that match a specific pattern of the old variable. For example, creating a new variable for county, we can select all characters in NAME preceding the comma.
County Code: The last five digits in the GEO_ID variable represent the county code. We create a new variable titled County_code, select Substring under String, and select all values starting at index 10 up to a max length of five.
Finally, we can choose which variables to keep or discard in our transform. We’ll keep all variables except for GEO_ID, NAME (county and state combined), and B19083_001M (the margin of error for the Gini estimate).
Click the run button in the upper right-hand corner to run this transform. Below is the resulting table.
We now have a table with the Gini estimate, state, county name, and county code, but we need the latitude and longitude information in order to map each county and we also want to account for the size of each county with a population variable. To do this, we'll perform two joins.
If the table we want to join with is not already in the project, we must select Add Dataset in the upper left-hand corner to add our desired table to the project. In this example, all tables are contained within the starting Example Project Dataset. First, we will match each county code to its lat/long coordinates by joining with the "County to Lat/Long Coordinates" table.
A Left Join returns the complete source table (Gini Index by County) and all matching records from the right table (County to Lat/Long). We set the County_code variable in the source table equal to the County_code variable in the right table, which matches a set of lat/long coordinates to each county code (if there is no coordinate, it will return Null).
Some county codes, however, map to more than one set of latitude, longitude coordinates so we must create two new Partitioned variables in this transform: Unique_lat and Unique_long. For Unique lat, we will partition on all kept variables except LAT and take the average of the LAT variable. This says that for that each county, compute the average of all possible latitudes and store that average in the variable Unique_lat.
We will keep all variables from the source table as well as Unique_lat and Unique_long. Finally, we select "Drop rows which are exact duplicates across all kept variables."
Running this transform outputs the following table, where each county now corresponds to a unique set of latitude and longitude variables.
We now have all the desired variables in our dataset except for population per county, which we will need when we create the final visualization.
Using the same steps from the first join, we will perform a second join to match each county to its total population by incorporating the Population by County table (also in the original dataset). We select a Left Join and set the County_code variable in the source table (Gini Index by County) equal to the County_code variable in the right table (Population by County).
Running this transform outputs the following table.
We will make one change in our final transform to organize the data, which is to create a new variable: Latitude_longitude. When creating maps in Google Data Studio, the specified dimension is "Latitude, Longitude" so we will combine our two coordinate variables into one.
In Build, we cast the Unique_lat and Unique_long variables from float types to string types. In New variables, we select the Concat method under String and concatenate the Unique_lat string, a string with a comma and space, and the Unique_long string. We can now discard Unique_lat and Unique_long and keep the combined variable, Latitude_longitude.
Running this transform yields the final table. From here, we can edit variable metadata and/or download our new dataset in the format that we'd like.
We can utilize datasets from Redivis to create powerful visualizations on platforms such as Observable, JupyterLab, and Google Data Studio. Specifically, the Redivis DataStudio connector allows users to sync up datasets and projects on Redivis with their Google Data Studio account to streamline the data visualization process.
In this example, we will simply export the final dataset created in the project as a .csv file and import it into Data Studio under Add data, File Upload. From here, we can create a variety of visualizations, such as scatter plots, Google maps, and stacked bar charts. Refer to the Data Studio Help pages for more specific guidance.
The data from this project allows us to design the following interactive map that illustrates the spectrum of income inequality by US county in 2018. Users can hover over each bubble and view the county name, population, and Gini coefficient. The bubble size is determined by the county population and the bubble color is determined by the Gini value according to the key shown below.