Data uploads

Overview

To create new tables (or release new versions), you will need to upload your data from existing sources. Redivis supports data ingest from numerous sources across many data formats, and lets you combine multiple uploads into a single table.

For a guided walkthrough of uploading data to a dataset, please see the Creating a dataset guide.

Supported file types

| Type | Description | Notes |
| --- | --- | --- |
| .csv, .tsv, .psv, .dsv, .txt, .* | Any text-delimited file. Redivis will auto-infer the delimiter, or you can specify it manually. This is also the default format for files with unknown or missing file extensions. | See working with text-delimited files |
| .avro | Avro format | Compressed data blocks using the DEFLATE and Snappy codecs are supported. Nested and repeated fields are not supported. |
| .parquet | Parquet format | Nested and repeated fields are not supported. |
| .orc | ORC format | Nested and repeated fields are not supported. |
| .ndjson | Newline-delimited JSON | Nested and repeated fields are not supported. |
| .xls, .xlsx | Excel file | Only the first sheet will be ingested. |
| .sas7bdat | SAS data file | Default formats will be interpreted to the corresponding variable type, and variable labels will automatically be imported. User-defined formats (.sas7bcat) are not supported. |
| .dta | Stata data file | Variable labels and value labels will automatically be imported. |
| .sav | SPSS data file | Variable labels and value labels will automatically be imported. |
| Google Sheets | Sheets file stored in Google Drive | Only the first tab of data will be ingested. |

Uploading compressed (gzipped) files:

Generally, you should upload uncompressed data files to Redivis, as uncompressed files can be read in parallel and thus upload substantially faster. If you prefer to store your source data in a compressed format, Avro, Parquet, and ORC are the preferred data formats, as these support parallelized compressed data ingestion at the row level.

Redivis does support loading compressed text-delimited and NDJSON data files up to 4GB (compressed size). This functionality is only supported when loading the file from a URL or other integration endpoint; local files cannot be uploaded in a compressed format. If your data is compressed, it must be served with the header Content-Encoding: gzip.
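The three conditions above (remote source, gzip header, 4GB compressed limit) can be checked before starting an import. The following is a minimal sketch, not part of Redivis: the function name and arguments are hypothetical, and "4GB" is assumed to mean decimal gigabytes.

```python
# Assumption: "4GB" is interpreted as 4 * 10^9 bytes.
GZIP_LIMIT_BYTES = 4 * 10**9


def check_gzipped_import(is_remote, headers, compressed_bytes):
    """Return (ok, reason) for a gzipped text-delimited or NDJSON file.

    is_remote: True when the file comes from a URL or integration endpoint.
    headers: response headers from the server hosting the file.
    compressed_bytes: size of the compressed file in bytes.
    """
    if not is_remote:
        return False, "local files cannot be uploaded in a compressed format"
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {key.lower(): value for key, value in headers.items()}
    if normalized.get("content-encoding", "").strip().lower() != "gzip":
        return False, "file must be served with Content-Encoding: gzip"
    if compressed_bytes > GZIP_LIMIT_BYTES:
        return False, "compressed size exceeds 4GB"
    return True, "ok"


print(check_gzipped_import(True, {"Content-Encoding": "gzip"}, 10**9))
```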

Quotas & limits

  • Max dataset size: 15TB

  • Max row size: 100MB

  • Max file sizes

    • Avro: 5TB

    • Parquet: 5TB

    • ORC: 5TB

    • CSV: 5TB (4GB compressed)

    • NDJSON: 5TB (4GB compressed)

    • SAS(.sas7bdat): 500GB

    • Stata(.dta), SPSS(.sav): 10GB

    • Excel(xls, xlsx): 5GB

  • Max variables (columns): 9,990 * †

  • Max uploads per table, per version: 500 †

* The variable maximum applies across all versions of a table. If a variable exists in a previous version of the table, but is subsequently deleted, it will still count towards this variable maximum.

† Depending on the length of your variable names, as well as the typing of your variables, the actual limits for max variables and uploads may be lower. The query generated to select variables from each upload cannot exceed 256K characters, and the query generated to select across all uploads cannot exceed 12M characters. If either of these limits is reached, the upload will fail with an accompanying error message.
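The per-format file size limits listed above can be checked locally before an upload. This is a minimal sketch with hypothetical helper names. It assumes decimal units (1GB = 10^9 bytes) and assumes that any extension not listed falls back to the text-delimited (CSV) limit, since unknown extensions are treated as text-delimited files.

```python
import os

GB = 10**9   # assumption: decimal units
TB = 10**12

# Uncompressed file size limits, keyed by lowercase extension.
MAX_FILE_BYTES = {
    ".avro": 5 * TB,
    ".parquet": 5 * TB,
    ".orc": 5 * TB,
    ".csv": 5 * TB,
    ".ndjson": 5 * TB,
    ".sas7bdat": 500 * GB,
    ".dta": 10 * GB,
    ".sav": 10 * GB,
    ".xls": 5 * GB,
    ".xlsx": 5 * GB,
}


def max_bytes_for(filename):
    """Return the size limit that applies to a file, based on its extension."""
    extension = os.path.splitext(filename)[1].lower()
    # Assumption: unlisted extensions are handled as text-delimited files.
    return MAX_FILE_BYTES.get(extension, MAX_FILE_BYTES[".csv"])


def within_limit(filename, size_bytes):
    """True when the file is small enough to upload in its format."""
    return size_bytes <= max_bytes_for(filename)


print(within_limit("survey.dta", 11 * GB))
```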

Working with text-delimited files

A text-delimited file is a file that uses a specific character (the delimiter) to separate columns, with newlines separating rows.

Delimited file requirements

  • Must be UTF-8 encoded (ASCII is a valid subset of UTF-8)

  • Quote characters in cells must be properly escaped. For example, if a cell contains the content: Jane said, "Why hasn't this been figured out by now?" it must be encoded as: "Jane said, ""Why hasn't this been figured out by now?"""

  • The quote character must be used to escape the quote character. For example, the sequence \" is not valid for an escaped quote; it must be ""

  • Empty strings will be converted to null values
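As an illustration of the quoting rules above, Python's standard csv module doubles embedded quote characters when doublequote=True (its default). This is a minimal sketch of producing a compliant line; the helper name to_csv_line is hypothetical.

```python
import csv
import io


def to_csv_line(cells, delimiter=","):
    """Serialize one row, escaping embedded quotes by doubling them."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        quotechar='"',
        doublequote=True,      # write "" for an embedded quote, never \"
        lineterminator="\n",
    )
    writer.writerow(cells)
    return buffer.getvalue().rstrip("\n")


line = to_csv_line(['Jane said, "Why hasn\'t this been figured out by now?"', "42"])
print(line)
```

When writing the result to disk, open the file with encoding="utf-8" so that it also meets the encoding requirement.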

Delimited file options

Delimiter

The delimiter will be auto-inferred based upon an analysis of the file being uploaded. In rare cases, this inference may fail; you can specify the delimiter to override this inference.
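To preview which delimiter a file is likely to use before uploading, you can run a similar inference locally. The sketch below uses the heuristic in Python's csv.Sniffer, which is not the inference that Redivis performs; the function name and the candidate delimiter set are assumptions.

```python
import csv


def guess_delimiter(sample, candidates=",\t|;"):
    """Guess the delimiter from the first few lines of a file.

    Raises csv.Error when no candidate is used consistently.
    """
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


sample = "id|name|city\n1|Ana|Lima\n2|Ben|Oslo\n"
print(repr(guess_delimiter(sample)))
```

If the local guess and the uploaded result disagree, specifying the delimiter manually removes the ambiguity.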

Quote character

Specify the character used to escape delimiters. This is generally the double quote ("), though some files may not have a quote character (in which case, they must not include the delimiter within any cells).

Has header row

Specifies whether the first row is a header containing the variable names. If it is, data will be read beginning on the second row. If you don't provide a header in your file, variables will be automatically created as var1, var2, var3, etc.

Allow quoted newlines

This option is necessary if there are newline characters within particular cells in your data (e.g., multiple paragraphs in one cell). Checking this box unnecessarily will not cause any errors, though it will substantially slow down data ingest, and may cause inaccuracies in other error reporting.
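Since enabling this option slows down ingest, it can help to check whether a file actually contains quoted newlines. The following is a minimal local sketch using Python's csv module; the function name is hypothetical, and for large files you would pass a sample rather than the whole content.

```python
import csv
import io


def has_quoted_newlines(text, delimiter=",", quotechar='"'):
    """True when any parsed cell contains a line break."""
    # newline="" keeps \r and \n untouched so the csv parser sees them.
    stream = io.StringIO(text, newline="")
    reader = csv.reader(stream, delimiter=delimiter, quotechar=quotechar)
    return any("\n" in cell or "\r" in cell for row in reader for cell in row)


multi_paragraph = 'id,note\n1,"first paragraph\nsecond paragraph"\n'
print(has_quoted_newlines(multi_paragraph))
```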

Skip corrupted records

By default, an upload will fail if a corrupted record is encountered. This includes a record that has a mismatched number of columns, or is otherwise not parsable. If this box is checked, corrupted records will be skipped instead, and the number of skipped records will be displayed next to each file once it has been imported.
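To estimate how many records would be skipped, you can count rows whose column count differs from the header. This is a minimal local sketch covering only the mismatched-column case; it is not the Redivis parser, and the function name is hypothetical.

```python
import csv
import io


def read_skipping_bad_rows(text, delimiter=","):
    """Return (header, rows, skipped), dropping rows with the wrong width."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = next(reader)
    rows, skipped = [], 0
    for row in reader:
        if len(row) != len(header):
            skipped += 1  # mismatched number of columns
            continue
        rows.append(row)
    return header, rows, skipped


header, rows, skipped = read_skipping_bad_rows("id,name\n1,Ana\n2,Ben,extra\n3,Cy\n")
print(skipped)
```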

Variable names and types

Renaming variables

Variable names are automatically inferred from the source data, with invalid characters replaced with an underscore (_). If the same variable is found more than once in any given file, it will automatically have a counter appended to it (e.g., "varName2").
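The renaming behavior described above can be sketched as follows. Two details are assumptions, since this section does not spell them out: valid characters are taken to be ASCII letters, digits, and underscores, and the counter is taken to start at 2 for the second occurrence.

```python
import re


def clean_variable_names(raw_names):
    """Replace invalid characters with "_" and number repeated names."""
    seen = {}
    cleaned = []
    for raw in raw_names:
        # Assumption: anything other than letters, digits, or "_" is invalid.
        name = re.sub(r"[^A-Za-z0-9_]", "_", raw)
        count = seen.get(name, 0) + 1
        seen[name] = count
        # The first occurrence keeps its name; later ones get a counter.
        cleaned.append(name if count == 1 else f"{name}{count}")
    return cleaned


print(clean_variable_names(["zip code", "birth-date", "zip code"]))
```

This sketch does not handle the case where a numbered name (such as zip_code2) already exists in the source file.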

Variable type inference

All values of a variable must be compatible with its type. Redivis will automatically choose the most specific, valid type for a variable, with string being the default type.

Please note the following rules:

  • If all values of a variable are null, its type will be string

  • Numeric values with leading zeros will be stored as string in order to preserve the leading zeros (e.g., 000583)

  • Data stored with decimal values will be stored as a float, even if that value is a valid integer (e.g., 1.0).

  • Temporal data types must be formatted as follows:

    • Date: YYYY-[M]M-[D]D

    • DateTime: YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.DDDDDD]]

    • Time: [H]H:[M]M:[S]S[.DDDDDD]
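The rules above can be approximated locally to predict the type a variable will receive. This is a simplified sketch, not the Redivis implementation: it covers only the types named in this section, does not validate ranges (such as month 13), ignores scientific notation, and assumes that mixed integer/float columns become float and mixed date/dateTime columns become dateTime.

```python
import re

DATE = r"\d{4}-\d{1,2}-\d{1,2}"
TIME = r"\d{1,2}:\d{1,2}:\d{1,2}(\.\d{1,6})?"

INTEGER_RE = re.compile(r"[+-]?\d+")
FLOAT_RE = re.compile(r"[+-]?(\d+\.\d*|\.\d+)")
LEADING_ZERO_RE = re.compile(r"[+-]?0\d")
DATE_RE = re.compile(DATE)
DATETIME_RE = re.compile(DATE + r"([ T]" + TIME + r")?")
TIME_RE = re.compile(TIME)


def value_type(value):
    """Most specific type for a single non-null value."""
    is_integer = bool(INTEGER_RE.fullmatch(value))
    if is_integer or FLOAT_RE.fullmatch(value):
        if LEADING_ZERO_RE.match(value):
            return "string"  # preserve leading zeros, e.g. 000583
        return "integer" if is_integer else "float"
    if DATE_RE.fullmatch(value):
        return "date"
    if DATETIME_RE.fullmatch(value):
        return "dateTime"
    if TIME_RE.fullmatch(value):
        return "time"
    return "string"


def infer_type(values):
    """Most specific type that is valid for every value of a variable."""
    # Nulls (None, or empty strings that become null) do not affect the type.
    types = {value_type(v) for v in values if v not in (None, "")}
    if not types:
        return "string"  # all values are null
    if len(types) == 1:
        return types.pop()
    if types == {"integer", "float"}:
        return "float"  # assumption: integers widen to float
    if types == {"date", "dateTime"}:
        return "dateTime"  # assumption: a bare date is a valid dateTime
    return "string"


print(infer_type(["1.0", "2"]))
```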

Working with multiple uploads

You can create up to 500 uploads per table, per version. Files will automatically be appended to each other based on their variable names (case insensitive), with the goal of creating one continuous table with a consistent schema.

Conflicting variable types

If files have conflicting types across a given variable, the lowest-denominator type for that variable is chosen when the files are merged.
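To anticipate the schema that results from combining several uploads, the merge can be sketched as below. The exact type hierarchy is not specified in this section, so the rule used here is an assumption: integer and float resolve to float, and any other conflict resolves to string. Lowercasing the merged names is for illustration only.

```python
def common_type(first, second):
    """Resolve two variable types to one that can hold both."""
    if first == second:
        return first
    if {first, second} == {"integer", "float"}:
        return "float"  # assumption: numeric types widen to float
    return "string"     # assumption: string is the fallback for conflicts


def merge_schemas(schemas):
    """Combine per-upload schemas, matching names case-insensitively."""
    merged = {}
    for schema in schemas:
        for name, var_type in schema.items():
            key = name.lower()
            if key in merged:
                merged[key] = common_type(merged[key], var_type)
            else:
                merged[key] = var_type
    return merged


uploads = [
    {"ID": "integer", "Score": "integer"},
    {"id": "integer", "score": "float", "note": "string"},
]
print(merge_schemas(uploads))
```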

Import sources

By default, you may upload data from your local computer or a public URL. However, Redivis supports numerous integrations for data ingest across common sources.

Google Cloud Storage (GCS)

You may import any object that you have read access to in GCS by specifying a bucket name and path to that object. You may import multiple objects at once by providing a prefix followed by a wildcard character, e.g.: /my-bucket/my-folder/* or /my-bucket/my-folder/prefix*.
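To preview which objects a wildcard path would select, the path can be split into a bucket and a prefix and compared against a list of object names. This is a sketch with hypothetical function names. It assumes the wildcard is only valid as the final character and matches any remaining characters, including "/"; the actual matching rules may differ.

```python
def split_wildcard_path(path):
    """Split "/bucket/some/prefix*" into ("bucket", "some/prefix")."""
    bucket, _, pattern = path.strip().lstrip("/").partition("/")
    if not bucket or not pattern.endswith("*") or "*" in pattern[:-1]:
        raise ValueError("expected /bucket/prefix* with one trailing wildcard")
    return bucket, pattern[:-1]


def select_objects(path, object_names):
    """Return the object names (relative to the bucket) that share the prefix."""
    _, prefix = split_wildcard_path(path)
    return [name for name in object_names if name.startswith(prefix)]


names = ["my-folder/a.csv", "my-folder/prefix-1.csv", "other/b.csv"]
print(select_objects("/my-bucket/my-folder/prefix*", names))
```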

Amazon S3

You may import any object that you have read access to in S3 by specifying a bucket name and path to that object. You may import multiple objects at once by providing a prefix followed by a wildcard character, e.g.: /my-bucket/my-folder/* or /my-bucket/my-folder/prefix*.

Google Drive

You may import any file of valid format that you have stored within your Drive, including Google Sheets.

Google BigQuery

You may import any table that you have read access to in BigQuery. You must specify the table in the form project_name.dataset_id.table_id. To import multiple tables within a dataset, you may use wildcards, e.g., project_name.dataset_id.* or project_name.dataset_id.prefix*.

Please note that importing from table views is not currently supported.
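A table reference in the form described above can be validated before it is submitted. The sketch below uses hypothetical names and assumes that a wildcard may only appear as the last character of the table part, as in the examples given.

```python
from typing import NamedTuple


class TableReference(NamedTuple):
    project: str
    dataset: str
    table: str
    is_wildcard: bool


def parse_table_reference(reference):
    """Parse project_name.dataset_id.table_id, allowing a trailing wildcard."""
    parts = reference.strip().split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected project_name.dataset_id.table_id")
    project, dataset, table = parts
    # Assumption: "*" is only valid as the final character of the table part.
    if "*" in project or "*" in dataset or "*" in table[:-1]:
        raise ValueError("wildcard must be the last character of the table part")
    return TableReference(project, dataset, table, table.endswith("*"))


print(parse_table_reference("project_name.dataset_id.prefix*"))
```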

Box

Coming soon.

OneDrive

Coming soon.

Error handling

A file may fail to import for several reasons; in each case, Redivis endeavors to provide a clear error message to help you fix the problem.

To view full error information, including a snapshot of where the error occurred in your source file (when applicable), double-click on the failed upload in the upload manager.

Network issues

When transferring a file from your computer (or more rarely, from other import sources), there may be an interruption to the internet connection that prevents the file from being fully uploaded. In these cases, you should simply try uploading the file again.

Invalid or corrupted source data

Data invalidity is most common when uploading text-delimited files, though it can happen with any file format. While some data invalidity errors may require further investigation off of Redivis, others may be due to incorrect options provided in the file upload process. When possible, Redivis will display ~1000 characters that are near the error in the source file, allowing you to identify the potential source of failure.

For example, the following screenshot highlights content near the error, where we can see that a single cell contains multiple line break characters. Reimporting the file with the "Allow quoted newlines" option enabled will resolve this problem.