Continuous enrollment
The following example is illustrated on Redivis in the MarketScan Continuous Enrollment project – you'll need access to MarketScan data to view the details.
Many insurance claims datasets on Redivis contain information about enrollment, detailing the periods of time when an individual represented in a dataset was covered by an insurance plan. If you intend to characterize patients based on their insurance coverage (or lack thereof) during certain key events (procedures, diagnoses, etc), it's often important to identify periods of continuous enrollment for each individual – and capture each continuous enrollment period for each patient in a single row.
These claims datasets describe enrollment information in multiple discrete rows per patient, each corresponding to patient per month. However, an overall continuous enrollment period may be broken up across rows into months or other non-uniform chunks, so we'll employ the following process to combine multiple sequential rows into a single row with one start date and one end date, describing one continuous period.
In this example, we will process the MarketScan Enrollment Detail table to create a table in which each row describes a single period of continuous enrollment. We show an artificial snapshot of the dataset below, where patient 1 has multiple periods of continuous enrollment due to some gaps in coverage, and patient 2 has a single period of continuous enrollment.
patient_id | enrollment_start_date | enrollment_end_date |
1 | 2012-01-01 | 2012-01-31 |
1 | 2012-02-01 | 2012-02-28 |
1 | 2012-04-01 | 2012-04-30 |
1 | 2012-06-01 | 2012-06-30 |
1 | 2012-07-01 | 2012-07-31 |
1 | 2012-08-01 | 2012-08-31 |
2 | 2012-01-01 | 2012-01-31 |
We want to create a final table with 3 rows for patient 1 to account for gaps in enrollment in March and May of 2012, and 1 row for patient 2. Our desired output has a row for each continuous period per patient, shown below:
patient_id | enrollment_start_date_continuous | enrollment_end_date_continuous |
1 | 2012-01-01 | 2012-02-28 |
1 | 2012-04-01 | 2012-04-30 |
1 | 2012-06-01 | 2012-08-31 |
2 | 2012-01-01 | 2012-01-31 |
The variable names in this example are not actual MarketScan variable names. But, with appropriate data access, you can see the real variables used in the first transform of the Redivis example project by hovering over the (renamed) variables patient_id
, enrollment_start_date
, and enrollment_end_date
in the Keep section.
(1) Add start and end of adjacent periods to each row
Throughout this example, we'll create variables that partition the dataset by patient identifier (here, patient_id
) to ensure that each patient is processed individually. But, to account for the fact that a single patient may have many periods of continuous enrollment, each spanning many months, we need to identify the correct start (from enrollment_start_date
) and end (from enrollment_end_date
) of a continuous period out of multiple rows and capture them in a single row.
First, we create a partition variable lag_end_date
using the lag method, which will order each row by enrollment_start_date
in a given patient_id
partition and copy the previous row's enrollment_end_date
value into the each row. We also create lead_start_date
using lead, to copy the following row's enrollment_start_date
value into each row.
These methods generate values which tell us how close the preceding and following enrollment periods are with respect the each row's enrollment period.
patient_id | enrollment_start_date | enrollment_end_date | lag_end_date | lead_start_date |
1 | 2012-01-01 | 2012-01-31 | NULL | 2012-02-01 |
1 | 2012-02-01 | 2012-02-28 | 2012-01-31 | 2012-04-01 |
1 | 2012-04-01 | 2012-04-30 | 2012-02-28 | 2012-06-01 |
1 | 2012-06-01 | 2012-06-30 | 2012-04-30 | 2012-07-01 |
1 | 2012-07-01 | 2012-07-31 | 2012-06-30 | 2012-08-01 |
1 | 2012-08-01 | 2012-08-31 | 2012-07-31 | NULL |
2 | 2012-01-01 | 2012-01-31 | NULL | NULL |
(2) Select rows with start and end of continuous periods
In a second, downstream transform we'll create new variables, which will use the above lead and lag values to identify which rows correspond to the beginning and end of a continuous enrollment period.
First, we'll compare enrollment_start_date
and lag_end_date
to find the difference (in days) of the start of each period and the end of the previous period. and compare enrollment_end_date
and lead_start_date
to find the end of the period.
We see via diff_lag_end_enrollment_start
(created using the date diff method) which rows describe an enrollment period directly following the previous period, and which rows describe an enrollment period with a larger gap since the previous period. We also create diff_enrollment_end_lead_start
to identify gaps between an enrollment period and the next period.
patient_id | enrollment_start_date | enrollment_end_date | lag_end_date | lead_start_date | diff_lag_end_enrollment_start | diff_enrollment_end_lag_start |
1 | 2012-01-01 | 2012-01-31 | NULL | 2012-02-01 | NULL | 1 |
1 | 2012-02-01 | 2012-02-28 | 2012-01-31 | 2012-04-01 | 1 | 31 |
1 | 2012-04-01 | 2012-04-30 | 2012-02-28 | 2012-06-01 | 31 | 31 |
1 | 2012-06-01 | 2012-06-30 | 2012-04-30 | 2012-07-01 | 31 | 1 |
1 | 2012-07-01 | 2012-07-31 | 2012-06-30 | 2012-08-01 | 1 | 1 |
1 | 2012-08-01 | 2012-08-31 | 2012-07-31 | NULL | 1 | NULL |
2 | 2012-01-01 | 2012-01-31 | NULL | NULL | NULL | NULL |
Next, we'll encode booleans from our difference variables to simplify further filtering. We'll identify rows corresponding to the start of a continuous period as those with a diff_lag_end_enrollment_start
value of either NULL (the row is the first period in the partition) or greater than 1 (the row comes after a gap in enrollment). And we identify rows corresponding to the end of a continuous period as those with a diff_enrollment_end_lead_start
value of either NULL (the row is the last period in the partition) or greater than 1 (the row comes before a gap in enrollment).
We see via boolean variables is_start_continuous_period
and is_end_continuous_period
(created the case method) if a given row corresponds to the start of a continuous period, the end of a continuous period, or both.
patient_id | enrollment_start_date | enrollment_end_date | lag_end_date | lead_start_date | diff_lag_end_enrollment_start | diff_enrollment_end_lag_start | is_start_continuous_period | is_end_continuous_period |
1 | 2012-01-01 | 2012-01-31 | NULL | 2012-02-01 | NULL | 1 | true | false |
1 | 2012-02-01 | 2012-02-28 | 2012-01-31 | 2012-04-01 | 1 | 31 | false | true |
1 | 2012-04-01 | 2012-04-30 | 2012-02-28 | 2012-06-01 | 31 | 31 | true | true |
1 | 2012-06-01 | 2012-06-30 | 2012-04-30 | 2012-07-01 | 31 | 1 | true | false |
1 | 2012-07-01 | 2012-07-31 | 2012-06-30 | 2012-08-01 | 1 | 1 | false | false |
1 | 2012-08-01 | 2012-08-31 | 2012-07-31 | NULL | 1 | NULL | false | true |
2 | 2012-01-01 | 2012-01-31 | NULL | NULL | NULL | NULL | true | true |
Then, to capture only the start and end of a continuous period, we'll use a filter to keep only rows which are true
for either is_start_continuous_period
or is_end_continuous_period
.
This leaves us with either 1 or 2 rows corresponding to a continuous enrollment period. If a continuous period spans multiple rows (months, in this case), we'll have 2 rows (a start row and an end row). But if a period only spans one row, we'll have both start and end captured by that 1 row. In our example, the row containing the middle (neither start nor end) of the 2012-06-01 to 2012-08-31 enrollment period for patient 1 was dropped. We can also ignore our intermediate lead...
, lag...
, and diff...
variables, since our final processing step will only consider a row's is_start_continuous_period
and is_end_continuous_period
values.
patient_id | enrollment_start_date | enrollment_end_date | is_start_continuous_period | is_end_continuous_period |
1 | 2012-01-01 | 2012-01-31 | true | false |
1 | 2012-02-01 | 2012-02-28 | false | true |
1 | 2012-04-01 | 2012-04-30 | true | true |
1 | 2012-06-01 | 2012-06-30 | true | false |
1 | 2012-08-01 | 2012-08-31 | false | true |
2 | 2012-01-01 | 2012-01-31 | true | true |
(3) Add end date of following enrollment periods to each row and collapse
Finally, we want to collapse our table to ensure 1 row per continuous enrollment period per patient. Since we have only rows corresponding to start and end of continuous periods, we create another partition variable lead_end_date
(in the same transform is fine, since this step will happen after the previous filter) which copies the enrollment_end_date
value of the following row on to each row.
We can also use the partition row filter to keep only rows with is_start_continuous_period
as true
, since our lead_end_date
has copied over the end date of the continuous enrollment period, contained in each row's following row.
We end up with a table where each and every row contains both the start date of the continuous enrollment period and the end date of that continuous enrollment period.
patient_id | enrollment_start_date | enrollment_end_date | lead_end_date | is_start_continuous_period | is_end_continuous_period |
1 | 2012-01-01 | 2012-01-31 | 2012-02-28 | true | false |
1 | 2012-04-01 | 2012-04-30 | 2012-06-30 | true | true |
1 | 2012-06-01 | 2012-06-30 | 2012-08-31 | true | false |
2 | 2012-01-01 | 2012-01-31 | NULL | true | true |
A final processing step captures the correct end date of a continuous period. We now have only rows whose enrollment_start_date
value contains the start of a continuous period, but these rows fall into two categories:
First, we have the rows that do not also correspond to the end of an enrollment period (where the
is_end_continuous_period
value isfalse
). For these, we want to look atlead_end_date
, the end date of the next row, which represents the final date of the continuous period, since we filtered out all the intermediate rows above.Second, we have which also correspond to the end of an enrollment period (where the
is_end_continuous_period
value istrue
) – in this example, if the continuous period was only 1 month. For these, the row defines a period (in this example, 1 month) that also contains the end date, so we just get theenrollment_end_date
from that same row. Note that thelead_end_date
value is incorrect in this case, since the next row contains the start of the next continuous period, or NULL if the period falls at the end of a partition.
We capture the above logic in a new variable enrollment_end_date_continuous
, created in an additional transform, since our previous operation involved a partition.
We end up with a final table below, containing the patient identifier, and the start date (renamed to enrollment_start_date_continuous
for consistency, and end date of each continuous enrollment period.
patient_id | enrollment_start_date_continuous | enrollment_end_date_continuous |
1 | 2012-01-01 | 2012-02-28 |
1 | 2012-04-01 | 2012-04-30 |
1 | 2012-06-01 | 2012-08-31 |
2 | 2012-01-01 | 2012-01-31 |
Last updated