# Step: Join

## Overview

A **Join** step will combine data from two tables based on a join condition so it can be queried together.

<pre><code><strong>Table A:                     Table B:
</strong><strong>/*---------+----------*      /*---------+-------*
</strong> | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      *---------+-------*/
</code></pre>

#### Example output data:

Inner join where `student` = `student`

```
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 *---------+----------*-------*/
```

## Step structure

<figure><img src="https://1672950126-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LVodLwUXgJUGcm5Cvso%2Fuploads%2Fmsr6arcOPHRvHnHk0740%2FScreenshot%202023-05-04%20at%2012.27.31%20PM.png?alt=media&#x26;token=ceebeb6f-58f4-42ac-a3d5-6cf6b52224f8" alt=""><figcaption></figcaption></figure>

* There will be at least one join block where you will define a table to join and a join condition for that table.
* When multiple blocks exist, the tables will be joined in sequence.

## Field descriptions

<table><thead><tr><th width="226">Field</th><th>Description</th></tr></thead><tbody><tr><td><strong>Join table</strong></td><td>The table containing the data you wish to combine with this transform's data. <br><br>Once selected, this table's variable will be marked t1 (or t2 etc. if other joins exist).</td></tr><tr><td><strong>Join type</strong></td><td>The way that the tables will be combined. (More information below.)</td></tr><tr><td><strong>Source variable</strong> </td><td>The variable in your source table that you want to join on. All records in this variable will be matched against all records in the Joined table variable.</td></tr><tr><td><strong>Joined table variable</strong></td><td>The variable in your selected Join table that you want to join on. All records in this variable will be matched against all records in the Source variable.</td></tr></tbody></table>

## Join types

### Inner join

If a row in either the source or join table doesn’t have a match, it will be dropped. If a row is matched multiple times, it will be multiplied in the output.

<pre><code>Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      *---------+-------*/
<strong>
</strong><strong>INNER JOIN on student
</strong>
Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 *---------+----------*-------*/
</code></pre>

### Left join

If a row in the source doesn’t have a match, all joined variables will be null for that row. If a row in the join table doesn’t have a match, it will be dropped. If a row is matched multiple times, it will be multiplied in the output.

```
Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
 | zay     | 2        |       | toni    | 30    |
*----------+----------*/      *---------+-------*/

LEFT JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 | zay     | 2        | NULL  |
 *---------+----------*-------*/
```

### Right join

If a row in the source doesn’t have a match, it will be dropped. If a row in the join table doesn’t have a match, all source variables will be null for that row. If a row is matched multiple times, it will be multiplied in the output. This is the same as a left join, but reversed.

```
Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
 | zay     | 2        |       | toni    | 30    |
*----------+----------*/      *---------+-------*/

RIGHT JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 | toni    | NULL     | 30    |
 *---------+----------*-------*/
```

### Full join

If a row in either the source or join table doesn’t have a match, all source variables will be null for that row, including for the column being joined upon. If a row is matched multiple times, it will be multiplied in the output.

```
Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
 | zay     | 2        |       | toni    | 30    |
*----------+----------*/      *---------+-------*/

FULL JOIN on student

Output:
/*-----------+-----------+----------+--------*
 | student_A | student_B | absences | score  |
 +-----------+-----------+----------+--------+
 | jane      | jane      | 0        | 85     |
 | sam       | sam       | 6        | 64     |
 | pat       | pat       | 1        | 88     |
 | zay       | NULL      | 2        | NULL   |
 | NULL      | toni      | NULL     | 30     |
 *-----------+-----------+----------+--------*/
```

### Cross join

Every row in the source table will be combined with every row in the joined table. This might be used to perform a join on a new variable that will be created downstream (such as in a geospatial join). You will almost always need to use a filter after this join for the query to successfully execute.&#x20;

```
Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
*----------+----------*/      *---------+-------*/

CROSS JOIN

Output:
/*---------+----------+---------+---------*
 | student | absences | student | score   |
 +---------+----------+---------+---------+
 | jane    | 0        | jane    | 85      |
 | jane    | 0        | sam     | 64      |
 | sam     | 6        | jane    | 85      |
 | sam     | 6        | sam     | 64      |
 *---------+----------*---------+---------*/
```

{% hint style="warning" %}
It is **strongly** recommended to use a [Filter](https://docs.redivis.com/reference/workflows/transforms/step-filter) step with a cross join to avoid a massively expanded table.
{% endhint %}

## Geospatial joins

When querying geospatial data, you'll often want to match records where one polygon is contained in another, or otherwise overlaps. To perform a geospatial join, you'll typically perform the following steps:

1. First, create an [inner join](#inner-join) with the table you'd like to join on.
2. Next, create variable(s) using a geography method that will represent your join condition. For example, you might use the [`contains`](https://docs.redivis.com/reference/workflows/variable-creation-methods/geography#contains)  method to join all geometries in one table that exist within another.&#x20;
3. Finally, implement a [row filter](https://docs.redivis.com/reference/workflows/transforms/step-filter) that tests against the newly created variable. In our example above, if we created an `is_contained_by` variable, we would filter on the condition `is_contained_by = TRUE`&#x20;

{% hint style="warning" %}
There are a few performance pitfalls when performing geospatial joins. Most notably, combining other equality comparisons with the geospatial condition in a filter can prevent the query planner from leveraging geospatial indexes, leading to a massive performance de-optimization.

For example, executing a filter of the form `t0.state = t1.state AND is_contained_by=TRUE` would actually be significantly less performant than if the state equality comparison is removed, even though it seems that this would reduce the number of times the geospatial condition needs to be evaluated.

For more discussion on geospatial performance in BigQuery, Redivis's underlying querying engine, see [here](https://mentin.medium.com/bigquery-gis-performance-tips-4a98e07ca4cb) and [here](https://cloud.google.com/bigquery/docs/best-practices-spatial-analysis).  &#x20;
{% endhint %}

## Join structures

Depending on the structure of your data, joins might multiply in rows.&#x20;

### 1-to-1 joins

If the variable you are joining on has no duplicate values in either the source table or the joined table, then your output table will always have the same or fewer rows than the sum of both table's rows added together.&#x20;

```
Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      *---------+-------*/

INNER JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 *---------+----------*-------*/
```

### 1-many joins

If the variable you are joining on has duplicates in either table, any time there is is a duplicate row it will be matched for each duplicate.

```
Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      | pat     | 86    |
                              *---------+-------*/

INNER JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 | pat     | 1        | 86    |
 *---------+----------*-------*/
```

### Many-to-many joins

If the variable you are joining on has duplicates in both tables, any time there is is a duplicate row it will be matched for each duplicate.

<pre><code>Table A:                     Table B:
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
 | pat     | 3        |       | pat     | 68    |
*----------+----------*/      *---------+-------*/

INNER JOIN on student

Output:
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 | pat     | 1        | 86    |
 | pat     | 3        | <a data-footnote-ref href="#user-content-fn-1">88</a>    |
 | pat     | 3        | 86    |
 *---------+----------*-------*/
</code></pre>

## Examples

### Example 1: Simple join condition

Let's say our information about student absenses and test scores are in separate tables and we want to join them together.&#x20;

#### Starting data:

```
Source table (t0)             Table 1 (t1)    
/*---------+----------*      /*---------+-------*
 | student | absences |       | student | score |
 +---------+----------+       +---------+-------+
 | jane    | 0        |       | jane    | 85    |
 | sam     | 6        |       | sam     | 64    |
 | pat     | 1        |       | pat     | 88    |
*----------+----------*/      *---------+-------*/
```

#### Input fields:

<figure><img src="https://1672950126-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LVodLwUXgJUGcm5Cvso%2Fuploads%2Fr6IJ8PiPBMCbUcZ11lc4%2FScreenshot%202023-05-04%20at%2012.43.27%20PM.png?alt=media&#x26;token=29dd1bcd-7b71-438c-9ff9-a66bccdb5313" alt=""><figcaption></figcaption></figure>

* **Join table:** The data we want to join is in `Table 1` so we select it here.
* **Join type:** Since we only care about students that have both absence and score information we choose `inner join` which will drop any rows without a match in both tables.
* **Source variable / Joined table variable:** We want to match on the variable `student` which is present in both tables. So we select our source table (t0)'s `student` on the left, and our joined table (t1)'s `student` variable on the right.&#x20;
  * Since this variable has the same name in both tables, they will be combined into one variable in the output table.

#### Output data:

```
/*---------+----------+-------*
 | student | absences | score |
 +---------+----------+-------+
 | jane    | 0        | 85    |
 | sam     | 6        | 64    |
 | pat     | 1        | 88    |
 *---------+----------*-------*/
```

### Example 2: More specific join condition

Let's say instead of a table with aggregated absences we have a daily timeliness chart. We want to join the scores for the corresponding attendance information for both the student and date in question.

#### Starting data:

```
Source table (t0)                        Table 1 (t1)    
/*---------+------------+---------*      /*---------+------------+-------*
 | student | date       | on_time |       | student | date       | score |
 +---------+------------+---------+       +---------+------------+-------+
 | jane    | 2020-01-01 | TRUE    |       | jane    | 2020-01-01 | 85    |
 | sam     | 2020-01-01 | FALSE   |       | sam     | 2020-01-01 | 65    |
 | jane    | 2020-02-01 | FALSE   |       *---------+------------+-------*/
*----------+------------+--------*/     
```

#### Input fields:

<figure><img src="https://1672950126-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LVodLwUXgJUGcm5Cvso%2Fuploads%2F5BLcKHhOwqUhSNcLz59k%2FScreenshot%202023-05-04%20at%2012.53.20%20PM.png?alt=media&#x26;token=7fd27dd3-5311-4aeb-b0eb-059d9b1d97a9" alt=""><figcaption></figcaption></figure>

**Join type:** Since we want to keep our absence data whether a test was taken or not, we do a `Left` join.

**Join condition:** We join on fields where BOTH the values in the `student` and `date` variables are a match by selecting both in our join condition.&#x20;

#### Output data:

```
/*---------+------------+---------+-------*
 | student | date       | on_time | score |
 +---------+------------+---------+-------+
 | jane    | 2020-01-01 | TRUE    | 85    |
 | sam     | 2020-01-01 | FALSE   | 65    |
 | jane    | 2020-02-01 | FALSE   | NULL  |
*----------+------------+---------+-------*/     
```

{% hint style="info" %}
You can use a [SQL query](https://docs.redivis.com/reference/workflows/transforms/step-sql-query) step to build more complex join conditions
{% endhint %}

### Example 3: Complex join conditions with cross joins

In some cases you might need to do a more complex join, such as using an operator other than `=`, or joining on a condition that uses a variable from your joined table that needs to be [retyped](https://docs.redivis.com/reference/workflows/transforms/step-retype) or [created](https://docs.redivis.com/reference/workflows/transforms/step-create-variables) first. This is common in geospatial joins but might come up in any situation.&#x20;

To use a more complex join condition you can use a cross join, followed by [filter](https://docs.redivis.com/reference/workflows/transforms/step-filter). The cross join will make all possible joins between these two tables and then the filter will limit it to rows matching your condition. Effectively this will execute the same as if you had done your initial join with a more complex condition.

Let's say in the example below we want to do an inner join where T0 `student_id` = T1 `student_id` but in the source table `student_id` is an integer [type](https://docs.redivis.com/tables/variables#characteristics) variable and in Table 1 it is a string. We want to retype the variable in Table 1 before we can join on it, but we can't retype it until we have joined in Table 1. So we will do a cross join, a retype, then a filter to narrow down to only records meeting our condition.

#### Starting data:

```
Source table (t0)                Table 1 (t1)    
/*------------+----------*      /*------------+-------*
 | student_id | absences |       | student_id | score |
 +------------+----------+       +------------+-------+
 | 01         | 0        |       | 01         | 85    |
 | 02         | 6        |       | 02         | 64    |
 | 03         | 1        |       | 03         | 88    |
*-------------+----------*/      *------------+-------*/
```

#### Input fields:

<figure><img src="https://1672950126-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LVodLwUXgJUGcm5Cvso%2Fuploads%2FeR4B6w8XnHIdSD16KBhO%2FScreenshot%202023-06-23%20at%2010.41.43%20PM.png?alt=media&#x26;token=337f0ed6-9eb9-4549-81c8-92b91ca71d9f" alt=""><figcaption></figcaption></figure>

<figure><img src="https://1672950126-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LVodLwUXgJUGcm5Cvso%2Fuploads%2FxvwbZvAwv06QEllC2xH9%2FScreenshot%202023-06-23%20at%2010.41.57%20PM.png?alt=media&#x26;token=a2e7010b-c7de-4acc-85e1-564a3152a725" alt=""><figcaption></figcaption></figure>

<figure><img src="https://1672950126-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LVodLwUXgJUGcm5Cvso%2Fuploads%2FbXqQdwpw6CRWthS2lTVv%2FScreenshot%202023-06-23%20at%2010.54.06%20PM.png?alt=media&#x26;token=a3f7ae0b-34ec-473a-afe5-eb330e6d12ef" alt=""><figcaption></figcaption></figure>

#### Output data:

```
/*------------+----------+-------*
 | student_id | absences | score |
 +------------+----------+-------+
 | jane       | 0        | 85    |
 | sam        | 6        | 64    |
 | pat        | 1        | 88    |
 *------------+----------*-------*/
```

[^1]:
