Cleaning data

Overview

All versions contain a series of edit blocks which are all applied in sequential order to create a table, viewed on the Edits tab of the version interface. The first version will always begin with an "Initial upload" block. Edit blocks are created by clicking the + Add new edit button.

Valid edit blocks will be automatically applied as you work, allowing you to audit your version as you go.

Multiple version edit blocks are supported for cleaning data

Edit block types

Initial upload

This block is only present on the first version, and cannot be removed. Click the "upload files" button to import data from various sources.

Replace files

Similar to the initial upload block, but only available on versions subsequent to v1.0. Replaces all content in the previous table with newly imported data. The replace block must always come first if it exists.

Append files

Similar to the "initial upload" block, with the same interface for importing data. However, the append block appends any uploaded data to the current table, matching on (case-insensitive) variable names.

The append block is the only edit to which other edits may be applied. You can add Recode, Delete, Retype, and Drop edit blocks to an append block, allowing you to clean only the records that are uploaded through that particular append block. This allows you to import data with different schema and clean them independently before they are all appended and combined into one coherent table.

Recode records

Allows you to change instances of a value within a variable to another value. For example, you can convert all instances of 'None' to NULL .

Delete records

Allows you to delete all records where a logical condition is true. For example, you could drop all records where PatientId IS NULL .

Drop variables

This edit block allows you to drop variables from the your upload and/or the previous version.

Retype variables

This edit block allows you to convert a variable to a different type, optionally replacing any invalid values with NULL when the "Set incompatible values to null" option is selected. Variable retyping follows standard variable retyping rules (note that converting floats to integers will automatically round them), and any invalid type transformations will prevent the version from being released.

When converting to date, date time, or time types, you will often need to specify a format string to read in data encoded in various display formats. Please consult the format string documentation for further information and examples.

Computing deltas

Each edit block can display the number of variables and records that were added, updated, and/or removed by that block. In some cases this can be computed immediately, but in other circumstances you will need to click the "Compute" text to view the modifications triggered by the particular edit. Note that any non-computed edits will automatically be computed when the version is released.