patient_id
, enrollment_start_date
, and enrollment_end_date
in the Keep section.‌patient_id
) to ensure that each patient is processed individually. But, to account for the fact that a single patient may have many periods of continuous enrollment, each spanning many months, we need to identify the correct start (from enrollment_start_date
) and end (from enrollment_end_date
) of a continuous period out of multiple rows and capture them in a single row.‌lag_end_date
using the lag method, which will order each row by enrollment_start_date
in a given patient_id
partition and copy the previous row's enrollment_end_date
value into the each row. We also create lead_start_date
using lead, to copy the following row's enrollment_start_date
value into each row.‌enrollment_start_date
and lag_end_date
to find the difference (in days) of the start of each period and the end of the previous period. and compare enrollment_end_date
and lead_start_date
to find the end of the period.‌diff_lag_end_enrollment_start
(created using date diff) which rows describe an enrollment period directly following the previous period, and which rows describe an enrollment period with a larger gap since the previous period. We also create diff_enrollment_end_lead_start
to identify gaps between an enrollment period and the next period.diff_lag_end_enrollment_start
value of either NULL (the row is the first period in the partition) or greater than 1 (the row comes after a gap in enrollment). And we identify rows corresponding to the end of a continuous period as those with a diff_enrollment_end_lead_start
value of either NULL (the row is the last period in the partition) or greater than 1 (the row comes before a gap in enrollment).‌is_start_continuous_period
and is_end_continuous_period
(created the case method) if a given row corresponds to the start of a continuous period, the end of a continuous period, or both.true
for either is_start_continuous_period
or is_end_continuous_period
.‌lead...
, lag...
, and diff...
variables, since our final processing step will only consider a row's is_start_continuous_period
and is_end_continuous_period
values.lead_end_date
(in the same transform is fine, since this step will happen after the previous filter) which copies the enrollment_end_date
value of the following row on to each row.‌is_start_continuous_period
as true
, since our lead_end_date
has copied over the end date of the continuous enrollment period, contained in each row's following row.enrollment_start_date
value contains the start of a continuous period, but these rows fall into two categories:‌is_end_continuous_period
value is false
). For these, we want to look at lead_end_date
, the end date of the next row, which represents the final date of the continuous period, since we filtered out all the intermediate rows above.is_end_continuous_period
value is true
) – in this example, if the continuous period was only 1 month. For these, the row defines a period (in this example, 1 month) that also contains the end date, so we just get the enrollment_end_date
from that same row. Note that the lead_end_date
value is incorrect in this case, since the next row contains the start of the next continuous period, or NULL if the period falls at the end of a partition.enrollment_end_date_continuous
, created in an additional transform, since our previous operation involved a partition.enrollment_start_date_continuous
for consistency, and end date of each continuous enrollment period.