Comment on page
Step: Join
A Join step will combine data from two tables based on a join condition so it can be queried together.
Table A: Table B:
/*---------+----------* /*---------+-------*
| student | absences | | student | score |
+---------+----------+ +---------+-------+
| jane | 0 | | jane | 85 |
| sam | 6 | | sam | 64 |
| pat | 1 | | pat | 88 |
*----------+----------*/ *---------+-------*/
Inner join where
student
= student
/*---------+----------+-------*
| student | absences | score |
+---------+----------+-------+
| jane | 0 | 85 |
| sam | 6 | 64 |
| pat | 1 | 88 |
*---------+----------*-------*/

- There will be at least one join block where you will define a table to join and a join condition for that table.
- When multiple blocks exist, the tables will be joined in sequence.
Field | Description |
---|---|
Join table | The table containing the data you wish to combine with this transform's data.
Once selected, this table's variable will be marked t1 (or t2 etc. if other joins exist). |
Join type | The way that the tables will be combined. (More information below.) |
Source variable | The variable in your source table that you want to join on. All records in this variable will be matched against all records in the Joined table variable. |
Joined table variable | The variable in your selected Join table that you want to join on. All records in this variable will be matched against all records in the Source variable. |
If a row in either the source or join table doesn’t have a match, it will be dropped. If a row is matched multiple times, it will be multiplied in the output.
Table A: Table B:
/*---------+----------* /*---------+-------*
| student | absences | | student | score |
+---------+----------+ +---------+-------+
| jane | 0 | | jane | 85 |
| sam | 6 | | sam | 64 |
| pat | 1 | | pat | 88 |
*----------+----------*/ *---------+-------*/
INNER JOIN on student
Output:
/*---------+----------+-------*
| student | absences | score |
+---------+----------+-------+
| jane | 0 | 85 |
| sam | 6 | 64 |
| pat | 1 | 88 |
*---------+----------*-------*/
If a row in the source doesn’t have a match, all joined variables will be null for that row. If a row in the join table doesn’t have a match, it will be dropped. If a row is matched multiple times, it will be multiplied in the output.
Table A: Table B:
/*---------+----------* /*---------+-------*
| student | absences | | student | score |
+---------+----------+ +---------+-------+
| jane | 0 | | jane | 85 |
| sam | 6 | | sam | 64 |
| pat | 1 | | pat | 88 |
| zay | 2 | | toni | 30 |
*----------+----------*/ *---------+-------*/
LEFT JOIN on student
Output:
/*---------+----------+-------*
| student | absences | score |
+---------+----------+-------+
| jane | 0 | 85 |
| sam | 6 | 64 |
| pat | 1 | 88 |
| zay | 2 | NULL |
*---------+----------*-------*/
If a row in the source doesn’t have a match, it will be dropped. If a row in the join table doesn’t have a match, all source variables will be null for that row. If a row is matched multiple times, it will be multiplied in the output. This is the same as a left join, but reversed.
Table A: Table B:
/*---------+----------* /*---------+-------*
| student | absences | | student | score |
+---------+----------+ +---------+-------+
| jane | 0 | | jane | 85 |
| sam | 6 | | sam | 64 |
| pat | 1 | | pat | 88 |
| zay | 2 | | toni | 30 |
*----------+----------*/ *---------+-------*/
RIGHT JOIN on student
Output:
/*---------+----------+-------*
| student | absences | score |
+---------+----------+-------+
| jane | 0 | 85 |
| sam | 6 | 64 |
| pat | 1 | 88 |
| toni | NULL | 30 |
*---------+----------*-------*/
If a row in either the source or join table doesn’t have a match, all source variables will be null for that row. If a row is matched multiple times, it will be multiplied in the output.
Table A: Table B:
/*---------+----------* /*---------+-------*
| student | absences | | student | score |
+---------+----------+ +---------+-------+
| jane | 0 | | jane | 85 |
| sam | 6 | | sam | 64 |
| pat | 1 | | pat | 88 |
| zay | 2 | | toni | 30 |
*----------+----------*/ *---------+-------*/
FULL JOIN on student
Output:
/*---------+----------+-------*
| student | absences | score |
+---------+----------+-------+
| jane | 0 | 85 |
| sam | 6 | 64 |
| pat | 1 | 88 |
| zay | 2 | NULL |
| toni | NULL | 30 |
*---------+----------*-------*/
Every row in the source table will be combined with every row in the joined table. This might be used to perform a join on a new variable that will be created downstream (such as in a geospatial join). You will almost always need to use a filter after this join for the query to successfully execute.
Table A: Table B:
/*---------+----------* /*---------+-------*
| student | absences | | student | score |
+---------+----------+ +---------+-------+
| jane | 0 | | jane | 85 |
| sam | 6 | | sam | 64 |
*----------+----------*/ *---------+-------*/
CROSS JOIN
Output:
/*---------+----------+---------+---------*
| student | absences | student | score |
+---------+----------+---------+---------+
| jane | 0 | jane | 85 |
| jane | 0 | sam | 64 |
| sam | 6 | jane | 85 |
| sam | 6 | sam | 64 |
*---------+----------*---------+---------*/
It is strongly recommended to use a Filter step with a cross join to avoid a massively expanded table.
Depending on the structure of your data, joins might multiply in rows.
If the variable you are joining on has no duplicate values in either the source table or the joined table, then your output table will always have the same or fewer rows than the sum of both table's rows added together.
Table A: Table B:
/*---------+----------* /*---------+-------*
| student | absences | | student | score |
+---------+----------+ +---------+-------+
| jane | 0 | | jane | 85 |
| sam | 6 | | sam | 64 |
| pat | 1 | | pat | 88 |
*----------+----------*/ *---------+-------*/
INNER JOIN on student
Output:
/*---------+----------+-------*
| student | absences | score |
+---------+----------+-------+
| jane | 0 | 85 |
| sam | 6 | 64 |
| pat | 1 | 88 |
*---------+----------*-------*/
If the variable you are joining on has duplicates in either table, any time there is is a duplicate row it will be matched for each duplicate.
Table A: Table B:
/*---------+----------* /*---------+-------*
| student | absences | | student | score |
+---------+----------+ +---------+-------+
| jane | 0 | | jane | 85 |
| sam | 6 | | sam | 64 |
| pat | 1 | | pat | 88 |
*----------+----------*/ | pat | 86 |
*---------+-------*/
INNER JOIN on student
Output:
/*---------+----------+-------*
| student | absences | score |
+---------+----------+-------+
| jane | 0 | 85 |
| sam | 6 | 64 |
| pat | 1 | 88 |
| pat | 1 | 86 |
*---------+----------*-------*/
If the variable you are joining on has duplicates in both tables, any time there is is a duplicate row it will be matched for each duplicate.
Table A: Table B:
/*---------+----------* /*---------+-------*
| student | absences | | student | score |
+---------+----------+ +---------+-------+
| jane | 0 | | jane | 85 |
| sam | 6 | | sam | 64 |
| pat | 1 | | pat | 88 |
| pat | 3 | | pat | 68 |
*----------+----------*/ *---------+-------*/
INNER JOIN on student
Output:
/*---------+----------+-------*
| student | absences | score |
+---------+----------+-------+
| jane | 0 | 85 |
| sam | 6 | 64 |
| pat | 1 | 88 |
| pat | 1 | 86 |
| pat | 3 | 88 |
| pat | 3 | 86 |
*---------+----------*-------*/
Let's say our information about student absenses and test scores are in separate tables and we want to join them together.
Source table (t0) Table 1 (t1)
/*---------+----------* /*---------+-------*
| student | absences | | student | score |
+---------+----------+ +---------+-------+
| jane | 0 | | jane | 85 |
| sam | 6 | | sam | 64 |
| pat | 1 | | pat | 88 |
*----------+----------*/ *---------+-------*/

- Join table: The data we want to join is in
Table 1
so we select it here. - Join type: Since we only care about students that have both absence and score information we choose
inner join
which will drop any rows without a match in both tables. - Source variable / Joined table variable: We want to match on the variable
student
which is present in both tables. So we select our source table (t0)'sstudent
on the left, and our joined table (t1)'sstudent
variable on the right.- Since this variable has the same name in both tables, they will be combined into one variable in the output table.
/*---------+----------+-------*
| student | absences | score |
+---------+----------+-------+
| jane | 0 | 85 |
| sam | 6 | 64 |
| pat | 1 | 88 |
*---------+----------*-------*/
Let's say instead of a table with aggregated absences we have a daily timeliness chart. We want to join the scores for the corresponding attendance information for both the student and date in question.
Source table (t0) Table 1 (t1)
/*---------+------------+---------* /*---------+------------+-------*
| student | date | on_time | | student | date | score |
+---------+------------+---------+ +---------+------------+-------+
| jane | 2020-01-01 | TRUE | | jane | 2020-01-01 | 85 |
| sam | 2020-01-01 | FALSE | | sam | 2020-01-01 | 65 |
| jane | 2020-02-01 | FALSE | *---------+------------+-------*/
*----------+------------+--------*/

Join type: Since we want to keep our absence data whether a test was taken or not, we do a
Left
join.Join condition: We join on fields where BOTH the values in the
student
and date
variables are a match by selecting both in our join condition. /*---------+------------+---------+-------*
| student | date | on_time | score |
+---------+------------+---------+-------+
| jane | 2020-01-01 | TRUE | 85 |
| sam | 2020-01-01 | FALSE | 65 |
| jane | 2020-02-01 | FALSE | NULL |
*----------+------------+---------+-------*/
To use a more complex join condition you can use a cross join, followed by filter. The cross join will make all possible joins between these two tables and then the filter will limit it to rows matching your condition. Effectively this will execute the same as if you had done your initial join with a more complex condition.
Let's say in the example below we want to do an inner join where T0
student_id
= T1 student_id
but in the source table student_id
is an integer type variable and in Table 1 it is a string. We want to retype the variable in Table 1 before we can join on it, but we can't retype it until we have joined in Table 1. So we will do a cross join, a retype, then a filter to narrow down to only records meeting our condition.Source table (t0) Table 1 (t1)
/*------------+----------* /*------------+-------*
| student_id | absences | | student_id | score |
+------------+----------+ +------------+-------+
| 01 | 0 | | 01 | 85 |
| 02 | 6 | | 02 | 64 |
| 03 | 1 | | 03 | 88 |
*-------------+----------*/ *------------+-------*/



/*------------+----------+-------*
| student_id | absences | score |
+------------+----------+-------+
| jane | 0 | 85 |
| sam | 6 | 64 |
| pat | 1 | 88 |
*------------+----------*-------*/
Last modified 5mo ago