Importing data

Overview

In order to import data, you must first create an "append" or "replace" block within the edit tab of the version interface. The first version of a dataset will always have an initial upload block.

If your data is split across multiple files (e.g., one file per year), you can upload all of them here; they will automatically be appended together based on common variable (column) names.

Supported file types

  • .csv, .tsv, .psv, .dsv, .txt, .* (any text-delimited file): Redivis will auto-infer the delimiter, or you can specify it manually. This is also the default format for files with unknown or missing file extensions. See "Working with text-delimited files" below.

  • .avro (Avro format): Compressed data blocks using the DEFLATE and Snappy codecs are supported. Nested and repeated fields are not supported.

  • .parquet (Parquet format): Nested and repeated fields are not supported.

  • .orc (ORC format): Nested and repeated fields are not supported.

  • .ndjson (newline-delimited JSON): Nested and repeated fields are not supported.

  • .xls, .xlsx (Excel file): Only the first sheet will be ingested.

  • .sas7bdat (SAS data file): Default formats will be interpreted as the corresponding variable type, and variable labels will automatically be imported. User-defined formats (.sas7bcat) are not supported.

  • .dta (Stata data file): Variable labels and value labels will automatically be imported.

  • .sav (SPSS data file): Variable labels and value labels will automatically be imported.

  • Google Sheets (Sheets file stored in Google Drive): Only the first tab of data will be ingested.

Uploading compressed (gzipped) files:

Generally, you should upload uncompressed data files to Redivis, as uncompressed files can be read in parallel and thus upload substantially faster. If you prefer to store your source data in a compressed format, Avro, Parquet, and ORC are the preferred data formats, as these support parallelized compressed data ingestion at the row level.

Redivis does support loading compressed text-delimited and NDJSON data files up to 4GB (compressed size). This functionality is only supported when loading the file from a URL or other integration endpoint; local files cannot be uploaded in a compressed format. If your data is compressed, it must be served with the header Content-Encoding: gzip.
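
For example, a minimal sketch in Python (using the requests library, with a hypothetical URL) of checking whether a remote file is served with the required header before importing it from a URL:

    import requests

    # Hypothetical URL of a gzipped CSV you intend to import from a URL.
    url = "https://example.com/data/visits_2020.csv.gz"

    # A HEAD request fetches only the response headers, not the file body.
    response = requests.head(url, allow_redirects=True)

    encoding = response.headers.get("Content-Encoding", "")
    if encoding.lower() == "gzip":
        print("Served with Content-Encoding: gzip; the file can be imported compressed.")
    else:
        print(f"Content-Encoding is {encoding!r}; serve the file gzipped or upload it uncompressed.")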

Quotas & limits

  • Max dataset size: 15TB

  • Max row size: 100MB

  • Max file sizes

    • Avro: 5TB

    • Parquet: 5TB

    • ORC: 5TB

    • CSV: 5TB (4GB compressed)

    • NDJSON: 5TB (4GB compressed)

    • SAS (.sas7bdat): 500GB

    • Stata (.dta), SPSS (.sav): 10GB

    • Excel (.xls, .xlsx): 5GB

  • Max variables: 5,000

  • Max file uploads per version: 500

Working with text-delimited files

A text-delimited file uses a specific character (the delimiter) to separate columns, with newlines separating rows.

Delimited file requirements

  • Must be UTF-8 encoded (ASCII is a valid subset of UTF-8)

  • Quote characters in cells must be properly escaped. For example, if a cell contains the content: Jane said, "Why hasn't this been figured out by now?" it must be encoded as: "Jane said, ""Why hasn't this been figured out by now?""" (see the sketch after this list)

  • The quote character must be used to escape the quote character. For example, the sequence \" is not valid for an escaped quote; it must be ""

  • Temporal data types must be properly formatted in order to be parsed as the proper type; otherwise they will be ingested as strings. See variable names and types for more info.

  • Empty strings will be converted to null values
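
A minimal sketch of producing a file that meets these requirements, using Python's built-in csv module (the file name and contents are hypothetical):

    import csv

    rows = [
        ["id", "quote"],
        # The csv module escapes embedded quote characters by doubling them
        # and wraps the cell in quotes, which is the escaping described above.
        ["1", 'Jane said, "Why hasn\'t this been figured out by now?"'],
        # An empty string will be ingested as a null value.
        ["2", ""],
    ]

    # newline="" stops the csv module from translating line endings, and
    # encoding="utf-8" satisfies the UTF-8 requirement.
    with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=",", quotechar='"', doublequote=True)
        writer.writerows(rows)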

Delimited file options

Delimiter

The delimiter will be auto-inferred based on an analysis of the file being uploaded. In rare cases this inference may fail; you can specify the delimiter manually to override it.

Quote character

Specify the character used to escape delimiters. Generally ", though some files may not have a quote character (in which case, they must not include the delimiter within any cells).

Has header row

Specifies whether the first row is a header containing the variable names; if so, data will be read beginning on the second row. If your file doesn't have a header row, variables will be automatically named var1, var2, var3, etc.

Allow quoted newlines

This option is necessary if there are newline characters within particular cells in your data (e.g., multiple paragraphs in one cell). Checking this box unnecessarily will not cause any errors, though it will substantially slow down data ingest, and may cause inaccuracies in other error reporting.

Skip corrupted records

By default, an upload will fail if a corrupted record is encountered, i.e., a record with a mismatched number of columns or one that is otherwise not parsable. If this box is checked, corrupted records will be skipped instead, and the number of skipped records will be displayed next to each file once it has been imported.
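
If you would rather find corrupted records before uploading, here is a rough sketch in Python (the file name is hypothetical) that flags rows whose column count does not match the header:

    import csv

    path = "survey_2020.csv"  # hypothetical file

    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        expected = len(header)
        # Report any row whose number of columns differs from the header's.
        for line_number, row in enumerate(reader, start=2):
            if len(row) != expected:
                print(f"Row {line_number}: expected {expected} columns, found {len(row)}")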

Variable names and types

Renaming variables

Variable names are automatically inferred from the source data, with invalid characters replaced with an underscore (_). If the same variable is found more than once in any given file, it will automatically have a counter appended to it (e.g., "varName2"). You may wish to rename the variables to make them more human-readable; alternatively, you may wish to rename some files' variables in order to properly append the files together.
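
The sketch below (plain Python, not Redivis's actual implementation, so the exact rules may differ) illustrates this kind of normalization: invalid characters become underscores, and repeated names get a counter appended.

    import re

    def normalize_names(names):
        """Replace invalid characters with '_' and de-duplicate repeated names."""
        seen = {}
        result = []
        for name in names:
            # Keep letters, digits, and underscores; replace everything else with '_'.
            clean = re.sub(r"[^A-Za-z0-9_]", "_", name)
            count = seen.get(clean, 0) + 1
            seen[clean] = count
            # Later occurrences of the same name get a counter appended.
            result.append(clean if count == 1 else f"{clean}{count}")
        return result

    print(normalize_names(["var name", "var-name", "income"]))
    # ['var_name', 'var_name2', 'income']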

To rename a variable on an individual file upload, click that upload in the left panel of the file manager, then click the variable name in question to begin editing.

You can click the Revert button to remove any changes you've made.

Variable type inference

All values of a variable must be compatible with its type. Redivis will automatically choose the most specific, valid type for a variable, with string being the default type. You can subsequently retype variables when cleaning your uploaded data, with the ability to recognize various date / time formats.

Please note the following rules:

  • If all values of a variable are null, its type will be string

  • Numeric values with leading zeros will be stored as string in order to preserve the leading zeros (e.g., 000583)

  • Data stored with decimal values will be stored as a float, even if that value is a valid integer (e.g., 1.0).

  • Temporal data types must be formatted as follows (see the examples after this list):

    • Date: YYYY-[M]M-[D]D

    • DateTime: YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.DDDDDD]]

    • Time: [H]H:[M]M:[S]S[.DDDDDD]
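
As a quick reference, the Python snippet below (a sketch; any tool that emits the same patterns will work) produces values in each of these formats:

    from datetime import datetime

    moment = datetime(2023, 4, 7, 9, 30, 15, 250000)

    print(moment.strftime("%Y-%m-%d"))           # 2023-04-07          -> Date
    print(moment.strftime("%Y-%m-%d %H:%M:%S"))  # 2023-04-07 09:30:15 -> DateTime
    print(moment.strftime("%H:%M:%S.%f"))        # 09:30:15.250000     -> Time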

Working with multiple files

You can upload up to 500 files in the File uploader. Files will automatically be appended to each other based on their variable names (case insensitive), with the goal of creating one continuous table with a consistent schema.
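
To illustrate the appending behavior, here is a sketch using pandas rather than Redivis itself (file and column names are hypothetical): two files with overlapping column names are stacked into one table containing the union of variables.

    import pandas as pd

    # Two hypothetical yearly files with overlapping column names.
    df_2019 = pd.read_csv("visits_2019.csv")  # columns: id, visit_date, cost
    df_2020 = pd.read_csv("visits_2020.csv")  # columns: ID, visit_date, provider

    # Lower-case the names so they line up the way Redivis matches them (case-insensitively).
    df_2019.columns = df_2019.columns.str.lower()
    df_2020.columns = df_2020.columns.str.lower()

    # Rows are stacked; variables missing from a file are filled with nulls.
    combined = pd.concat([df_2019, df_2020], ignore_index=True)
    print(combined.columns.tolist())  # ['id', 'visit_date', 'cost', 'provider']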

To view the concatenation of all of your files, click the All files item in the left bar of the file manager. This will display the union of variable names and types, as well as each variable's presence across your files.

Bulk renaming variables

To edit the name of a variable across multiple files at once, click on the All files item in the left bar of the file manager. You may then rename a variable as before; the change will be applied to that variable across all files in which it appears.

Conflicting variable types

If files have conflicting types across a given variable, the lowest-denominator type for that variable is chosen when the files are merged.
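
The pandas sketch below shows an analogous promotion when columns of different types are stacked; Redivis's exact promotion rules may differ.

    import pandas as pd

    ints = pd.DataFrame({"score": [1, 2, 3]})        # integer in one file
    floats = pd.DataFrame({"score": [4.5, 5.0]})     # float in another
    strings = pd.DataFrame({"score": ["unknown"]})   # string in a third

    # Integer and float columns combine to float; adding a string column yields strings.
    print(pd.concat([ints, floats])["score"].dtype)           # float64
    print(pd.concat([ints, floats, strings])["score"].dtype)  # object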

Import sources

By default, you may upload data from your local computer or a public URL. However, Redivis supports numerous integrations for data ingest across common sources. Please note that you will have to enable the corresponding integration on your workspace before being able to use it.

Google Cloud Storage (GCS)

You may import any object that you have read access to in GCS by specifying a bucket name and path to that object. You may import multiple objects at once by providing a prefix followed by a wildcard character, e.g.: /my-bucket/my-folder/* or /my-bucket/my-folder/prefix*.
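
To preview which objects a wildcard would match, you can list objects by prefix with the google-cloud-storage client, as in this sketch (bucket and prefix are hypothetical, and it assumes you have authenticated with Google Cloud):

    from google.cloud import storage

    client = storage.Client()

    # Objects matching /my-bucket/my-folder/prefix* share the prefix "my-folder/prefix".
    for blob in client.list_blobs("my-bucket", prefix="my-folder/prefix"):
        print(blob.name)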

Amazon S3

You may import any object that you have read access to in S3 by specifying a bucket name and path to that object. You may import multiple objects at once by providing a prefix followed by a wildcard character, e.g.: /my-bucket/my-folder/* or /my-bucket/my-folder/prefix*.
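
The equivalent preview for S3 using boto3, as a sketch (bucket and prefix are hypothetical, and it assumes your AWS credentials are configured):

    import boto3

    s3 = boto3.client("s3")

    # Objects matching /my-bucket/my-folder/prefix* share the prefix "my-folder/prefix".
    response = s3.list_objects_v2(Bucket="my-bucket", Prefix="my-folder/prefix")
    for obj in response.get("Contents", []):
        print(obj["Key"])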

Google Drive

You may import any file of valid format that you have stored within your Drive, including Google Sheets.

Google BigQuery

You may import any table that you have read access to in BigQuery. You must specify the table in the form project_name.dataset_id.table_id. To import multiple tables within a dataset, you may use wildcards, e.g., project_name.dataset_id.* or project_name.dataset_id.prefix*.

Please note that table views are not currently supported.
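
To check which tables a wildcard such as project_name.dataset_id.prefix* would cover, you can list the dataset's tables with the google-cloud-bigquery client, as in this sketch (project and dataset names are hypothetical, and it assumes you have authenticated with Google Cloud):

    from google.cloud import bigquery

    client = bigquery.Client()

    # List every table in the dataset and keep those whose name matches the prefix.
    for table in client.list_tables("project_name.dataset_id"):
        if table.table_id.startswith("prefix"):
            print(table.full_table_id)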

Box

You may import any file of valid format that you have stored within Box.

OneDrive

You may import any file of valid format that you have stored within OneDrive.

Error handling

A file may fail to import for several reasons; in each case, Redivis endeavors to provide a clear error message so that you can fix the problem.

Network issues

When transferring a file from your computer (or more rarely, from other import sources), there may be an interruption to the internet connection that prevents the file from being fully uploaded. In these cases, you should simply try uploading the file again.

Invalid or corrupted source data

Data invalidity is most common when uploading text-delimited files, though it can happen with any file format. While some data invalidity errors may require further investigation outside of Redivis, others may be due to incorrect options provided during the file upload process. When possible, Redivis will display roughly 1,000 characters of the source file surrounding the error, allowing you to identify the likely source of failure.

For example, the following screenshot highlights content near the error, where we can see that a single cell contains multiple line break characters. Re-importing the file with the "Allow quoted newlines" option enabled will resolve this problem.
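
If you suspect embedded newlines, a quick check like the Python sketch below (the file name is hypothetical) counts cells containing newline characters, confirming whether "Allow quoted newlines" should be enabled before re-importing:

    import csv

    path = "notes.csv"  # hypothetical file

    with open(path, newline="", encoding="utf-8") as f:
        # csv.reader honors quoted newlines, so cells spanning lines are read intact.
        cells_with_newlines = sum(
            1 for row in csv.reader(f) for cell in row if "\n" in cell
        )

    print(f"{cells_with_newlines} cell(s) contain embedded newlines; "
          "enable 'Allow quoted newlines' if this is greater than zero.")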