Step: Join

Overview

A Join step will combine data from two tables based on a join condition so it can be queried together.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      *---------+-------*/

Example output data:

Inner join where student = student

/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 *---------+----------*-------*/

Step structure

  • There will be at least one join block where you will define a table to join and a join condition for that table.

  • When multiple blocks exist, the tables will be joined in sequence.

Field descriptions

Field
Description

Join table

The table containing the data you wish to combine with this transform's data. Once selected, this table's variable will be marked t1 (or t2 etc. if other joins exist).

Join type

The way that the tables will be combined. (More information below.)

Source variable

The variable in your source table that you want to join on. All records in this variable will be matched against all records in the Joined table variable.

Joined table variable

The variable in your selected Join table that you want to join on. All records in this variable will be matched against all records in the Source variable.

Join types

Inner join

If a row in either the source or join table doesn’t have a match, it will be dropped. If a row is matched multiple times, it will be multiplied in the output.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      *---------+-------*/

INNER JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 *---------+----------*-------*/

Left join

If a row in the source doesn’t have a match, all joined variables will be null for that row. If a row in the join table doesn’t have a match, it will be dropped. If a row is matched multiple times, it will be multiplied in the output.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
 | zay     | 2        |       | toni    | 30    |
*----------+----------*/      *---------+-------*/

LEFT JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 | zay     | 2        | NULL  |
 *---------+----------*-------*/

Right join

If a row in the source doesn’t have a match, it will be dropped. If a row in the join table doesn’t have a match, all source variables will be null for that row. If a row is matched multiple times, it will be multiplied in the output. This is the same as a left join, but reversed.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
 | zay     | 2        |       | toni    | 30    |
*----------+----------*/      *---------+-------*/

RIGHT JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 | toni    | NULL     | 30    |
 *---------+----------*-------*/

Full join

If a row in either the source or join table doesn’t have a match, all source variables will be null for that row. If a row is matched multiple times, it will be multiplied in the output.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
 | zay     | 2        |       | toni    | 30    |
*----------+----------*/      *---------+-------*/

FULL JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 | zay     | 2        | NULL  |
 | toni    | NULL     | 30    |
 *---------+----------*-------*/

Cross join

Every row in the source table will be combined with every row in the joined table. This might be used to perform a join on a new variable that will be created downstream (such as in a geospatial join). You will almost always need to use a filter after this join for the query to successfully execute.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
*----------+----------*/      *---------+-------*/

CROSS JOIN

Output:
/*---------+----------+---------+---------*
 | student | absences | student | score   |
 +---------+----------+---------+---------+
 | jane    | 0        | jane    | 85      |
 | jane    | 0        | sam     | 64      |
 | sam     | 6        | jane    | 85      |
 | sam     | 6        | sam     | 64      |
 *---------+----------*---------+---------*/

It is strongly recommended to use a Filter step with a cross join to avoid a massively expanded table.

Join structures

Depending on the structure of your data, joins might multiply in rows.

1-to-1 joins

If the variable you are joining on has no duplicate values in either the source table or the joined table, then your output table will always have the same or fewer rows than the sum of both table's rows added together.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      *---------+-------*/

INNER JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 *---------+----------*-------*/

1-many joins

If the variable you are joining on has duplicates in either table, any time there is is a duplicate row it will be matched for each duplicate.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      | pat     | 86    |
                              *---------+-------*/

INNER JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 | pat     | 1        | 86    |
 *---------+----------*-------*/

Many-to-many joins

If the variable you are joining on has duplicates in both tables, any time there is is a duplicate row it will be matched for each duplicate.

Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
 | pat     | 3        |       | pat     | 68    |
*----------+----------*/      *---------+-------*/

INNER JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 | pat     | 1        | 86    |
 | pat     | 3        |     |
 | pat     | 3        | 86    |
 *---------+----------*-------*/

Examples

Example 1: Simple join condition

Let's say our information about student absenses and test scores are in separate tables and we want to join them together.

Starting data:

Source table (t0)             Table 1 (t1)    
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      *---------+-------*/

Input fields:

  • Join table: The data we want to join is in Table 1 so we select it here.

  • Join type: Since we only care about students that have both absence and score information we choose inner join which will drop any rows without a match in both tables.

  • Source variable / Joined table variable: We want to match on the variable student which is present in both tables. So we select our source table (t0)'s student on the left, and our joined table (t1)'s student variable on the right.

    • Since this variable has the same name in both tables, they will be combined into one variable in the output table.

Output data:

/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 *---------+----------*-------*/

Example 2: More specific join condition

Let's say instead of a table with aggregated absences we have a daily timeliness chart. We want to join the scores for the corresponding attendance information for both the student and date in question.

Starting data:

Source table (t0)                        Table 1 (t1)    
/*---------+------------+---------*      /*---------+------------+-------*
 | student | date       | on_time |       | student | date       | score |
 +---------+------------+---------+       +---------+------------+-------+
 | jane    | 2020-01-01 | TRUE    |       | jane    | 2020-01-01 | 85    |
 | sam     | 2020-01-01 | FALSE   |       | sam     | 2020-01-01 | 65    |
 | jane    | 2020-02-01 | FALSE   |       *---------+------------+-------*/
*----------+------------+--------*/     

Input fields:

Join type: Since we want to keep our absence data whether a test was taken or not, we do a Left join.

Join condition: We join on fields where BOTH the values in the student and date variables are a match by selecting both in our join condition.

Output data:

/*---------+------------+---------+-------*
 | student | date       | on_time | score |
 +---------+------------+---------+-------+
 | jane    | 2020-01-01 | TRUE    | 85    |
 | sam     | 2020-01-01 | FALSE   | 65    |
 | jane    | 2020-02-01 | FALSE   | NULL  |
*----------+------------+---------+-------*/     

You can use a SQL query step to build more complex join conditions

Example 3: Complex join conditions with cross joins

In some cases you might need to do a more complex join, such as using an operator other than =, or joining on a condition that uses a variable from your joined table that needs to be retyped or created first. This is common in geospatial joins but might come up in any situation.

To use a more complex join condition you can use a cross join, followed by filter. The cross join will make all possible joins between these two tables and then the filter will limit it to rows matching your condition. Effectively this will execute the same as if you had done your initial join with a more complex condition.

Let's say in the example below we want to do an inner join where T0 student_id = T1 student_id but in the source table student_id is an integer type variable and in Table 1 it is a string. We want to retype the variable in Table 1 before we can join on it, but we can't retype it until we have joined in Table 1. So we will do a cross join, a retype, then a filter to narrow down to only records meeting our condition.

Starting data:

Source table (t0)                Table 1 (t1)    
/*------------+----------*      /*------------+-------*
 | student_id | absences |       | student_id | score |
 +------------+----------+       +------------+-------+
 | 01         | 0        |       | 01         | 85    |
 | 02         | 6        |       | 02         | 64    |
 | 03         | 1        |       | 03         | 88    |
*-------------+----------*/      *------------+-------*/

Input fields:

Output data:

/*------------+----------+-------*
 | student_id | absences | score |
 +------------+----------+-------+
 | jane       | 0        | 85    |
 | sam        | 6        | 64    |
 | pat        | 1        | 88    |
 *------------+----------*-------*/

Last updated