Comment on page
This guide demonstrates using a Redivis project to gather key variables from different tables, clean them, and and consolidate them into a single table for analysis in Google Looker, all using point and click interfaces.
We'll use Gini coefficient data (a common measure of income distribution) from the American Community Survey (ACS) to build a map that shows the range of income inequality by US county in 2018.
- ACS Gini coefficient table
- ACS table that measures population by county
- National Weather Service dataset that maps a US county to lat/long coordinates.
Add this dataset to your project and click on the dataset node to see all of the available tables. Select Gini Index by County and can look at the data format by clicking Cells.
We want to map the Gini estimate and population by US county in our final data visualization. To map data points by county, we will need a variable that represents the latitude, longitude coordinates of that county. We also want the state, county name, and five-digit county code.
Therefore, we want our final dataset to include six distinct variables:
- Gini Estimate
- County Name
- County Code
- Latitude, Longitude
We will get Lat/Long and Population by joining our table with other tables later on in the project. For now, we can create a transform to start reshaping our data.
This method allows users to create a new variable with characters that match a specific pattern of the old variable. For example, creating a new variable for county, we can select all characters in NAME preceding the comma. We can add a new block and do a similar process on the NAME variable to get the State.
The last five digits in the GEO_ID variable represent the county code. We create another new variable block and name our new variable County_code. Select the Substring method and select all values starting at index 10 up to a max length of five.
Finally, we choose which variables to keep or discard in our transform. We’ll keep all variables except for GEO_ID, NAME (county and state combined), and B19083_001M (the margin of error for the Gini estimate).
Transform #1: Cleaning the data
Click the run button in the upper right-hand corner to run this transform.
Click on the newly generated output table to investigate the output of the transform. We can see that this table has the same number of records as our initial table and has the four variables we selected. If we click on the County and State variables we can see in the frequency tables that they look like we wanted them to.
Since this table looks like we expect we can move on to the next step! Otherwise we'd need to go back to the initial transform to change our inputs.
Transform #1: Output
We now have a table with the Gini estimate, state, county name, and county code, but we need the latitude and longitude information in order to map each county and we also want to account for the size of each county with a population variable. To do this, we'll perform two joins.
We could continue to work in our initial transform but we are choosing to create new transforms for each join to keep steps separated conceptually and provide output tables to reference along the way.
First, we will match each county code to its lat/long coordinates by joining with the "County to Lat/Long Coordinates" table.
A Left Join returns the complete source table (Gini Index by County) and all matching records from the right table (County to Lat/Long). We set the County_code variable in the source table equal to the County_code variable in the right table, which matches a set of lat/long coordinates to each county code (if there is no coordinate, it will return Null).
Transform #2: Matching lat/long coordinates to county
When you set two variables equal to each other in a Join, they must be of the same type. If, for example, you set a string type variable equal to an integer type variable, you will have to retype the string as an int (or vice versa).
Some county codes, however, map to more than one set of latitude, longitude coordinates so we must create two new variables in this transform:
unique_lat, we will partition on all kept variables except LAT and take the average of the LAT variable. This says that for that each county, compute the average of all possible latitudes and store that average in the variable
We will select all variables from the source table as well as
unique_longin our output table. Finally, we select "distinct" in our variable selector so that we will drop all records that are unique.
Transform #2: Creating Unique_lat and Unique_long variables
Running this transform outputs the following table, where each county now corresponds to a unique set of latitude and longitude variables.
Transform #2: Output
We now have all the desired variables in our dataset except for population per county, which we will need when we create the final visualization.
Using the same steps from the previous join, we will perform a second join to match each county to its total population by incorporating the Population by County table (also in the original dataset). We select a Left Join and set the County_code variable in the source table (Gini Index by County) equal to the County_code variable in the right table (Population by County).
Selecting all variables to remain in our output table and running this transform outputs the following table.
We will be using Google Looker Studio to visualize this data so we will need to update our data to make it work smoothly with their specified format for geographic data. Using their reference sheet, we can see that we will need our latitude and longitude information in the format:
Comma separated latitude and longitude decimal values (e.g., "51.5074,-0.1278" specifies to London, England)
We will need to combine our two separate values and add a comma between them.
We can do that using the Concat method for creating a new variable, but that method only accepts string inputs so we'll first need to retyping the
unique_longvariables from float to string using the Retype step.
Then we use the Create new variables step to make a new variable named
lattitude_longitudeand use the Concat method to concatenate the
unique_latstring, a string with a comma, and the
We can now discard
unique_longand keep the combined variable,
Running this transform yields the final table. From here, we can edit variable metadata and/or download our new dataset in the format that we'd like.
The next step would normally be to create a notebook in this project and use Python, R, or Stata to analyze this table. However if we want to create a quick visualization and aren't familiar with any of those coding languages we have easy options to export data to other systems such as Google Looker Studio.
For this example we will link this table directly to Google Looker Studio by clicking Export table on this table.
We could download our final table and the re-upload it into Looker Studio, but we choose to link the table through the Redivis connector so that if we come back to this original project and make changes they will be reflected in the visual that we're about to make there.
Follow the prompts of the connector to log in and authorize access, then we'll need to indicate the table we'd like to use. For that we'll need:
- The owner of the project's ID (shown in the top menu bar of the project)
- The name of the project (shown in the middle of the black project toolbar)
- The name of the table (shown on the title of the table node)
Note that the connector won't recognize some symbols (such as :) so you might need to update one of them if there is an issue connecting.
When the table is being imported we will have the option to change the type for variables. We'll need to change the type of our
lattitude_longitudevariable from string to the Lattitude, Longitude geography type.
Then click Create report to get started!
For this example we are going to build a bubble map that shows the size of population against the intensity of gini disparity in certain regions.
Get started by adding a chart and selecting the bubble map type. We will need to select Redivis as the data source for this map.
Then we will need to define which of our variables map to which parts of the visual. We will also want to change how they are aggregated, from SUM to AVG.
- Size: (AVG)
- Color metric: (SUM)
The resulting map is automatically interactive. Users can hover over each bubble and view the county name, population, and Gini coefficient. The bubble size is determined by the county population and the bubble color is determined by the Gini value. View the interactive report:
You can continue to create a variety of visualizations in this report, including scatter plots, Google maps, and stacked bar charts.
Perhaps we see something in this project we want to tweak, or we want to go back and change some of our data decisions. Projects are iterative and at any point you can go back and change our source data, our transform configuration or notebooks and rerun them.
Last modified 2d ago