
Uploading tabular data

Overview

To create new tables or update existing ones with new data, you will need to upload your data from an existing source. Redivis supports data ingest from numerous sources and a wide range of file formats, and multiple uploads can be combined into a single table.
For a guided walkthrough of uploading data to a dataset, please see the Creating a dataset guide.

Supported tabular file types

Type
Description
Notes
.csv, .tsv, .psv, .dsv, .txt, .tab, *
Any text-delimited file. Redivis will auto-infer the delimiter, or you can specify it manually. This is also the default format for files with missing file extensions.
See working with text-delimited files
.avro
Avro format
Compressed data blocks using the DEFLATE and Snappy codecs are supported.
Nested and repeated fields are not supported.
.parquet
Parquet format
Nested and repeated fields are not supported.
.orc
ORC format
Nested and repeated fields are not supported.
.json
JSON
Assumes an array of objects, where the objects' keys represent variable names.
The value for each key must be a literal; nested and repeated fields are not supported.
.ndjson, .jsonl
Newline-delimited JSON
Same as the .json specification outlined above, except each object (corresponding to a row of key:value pairs) is given its own line. Importing ndjson (as opposed to json) will be significantly faster. A short example of both layouts appears after this table.
.geojson
GeoJSON
Assumes an object with a "Features" property, containing an array of valid geojson features. Each feature will be imported as one row, with additional properties mapped to columns in the table. Nested properties will be flattened using the . separator. Note that Redivis only supports 2-dimensional, unprojected (WGS84) geometries. Other projections might cause the import to fail, and any extra dimensions will be stripped during ingest. See working with geospatial data for more information.
.geojsonl, .ndgeojson, .geojsons
Newline-delimited GeoJSON
Same as the .geojson specification outlined above, except each feature is given its own line. Importing .geojsonl (as opposed to .geojson) will be significantly faster.
.kml
Keyhole Markup Language
Will be internally converted to .geojson (via ogr2ogr), and then imported as specified above.
.shp
Shapefile
Will be internally converted to .geojson (via ogr2ogr), and then imported as specified above. Note that the shapefile must use the WGS84 (aka EPSG:4326) projection. If you have additional files associated with your shapefile (e.g., .shx, .prj, .dbf), create a .zip of this folder and import according to the .shp.zip specification below.
.shp.zip
Zipped ESRI shapefile directory
Many shapefiles will be collocated with additional files containing metadata and projection information. These files are often essential to parsing the shapefile correctly, and should be uploaded together. To do so, create a .zip directory of the folder containing your .shp file and supplemental files. The zip file must end in .shp.zip. These will then be converted to .geojson (via ogr2ogr), and imported as specified for the .geojson format.
If projection information is available, the source geometries will be reprojected into WGS84. If no projection information is available, your data must already be projected as WGS84, or the import will fail. Note that only one layer can be imported at a time. If you have a directory containing multiple shapefiles, create a separate .shp.zip for each layer.
.xls, .xlsx
Excel file
Only the first sheet will be ingested.
.sas7bdat
SAS data file
Default formats will be interpreted as the corresponding variable type, and variable labels will automatically be imported.
User-defined formats (.sas7bcat) are not supported.
.dta
Stata data file
Variable labels and value labels will automatically be imported.
.sav
SPSS data file
Variable labels and value labels will automatically be imported.
Google Sheets
Sheets file stored in Google Drive
Only the first tab of data will be ingested.
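
As a concrete illustration of the .json and .ndjson layouts described in the table above, the following Python sketch writes the same two records in both formats (the file and field names are placeholders):

import json

records = [
    {"id": 1, "name": "Alice", "score": 9.5},
    {"id": 2, "name": "Bob", "score": 7.0},
]

# .json: a single array of flat objects; the objects' keys become variable names.
with open("records.json", "w") as f:
    json.dump(records, f)

# .ndjson / .jsonl: one object per line, which Redivis can ingest much faster.
with open("records.ndjson", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
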
Uploading compressed (gzipped) files:
Generally, you should upload uncompressed data files to Redivis, as uncompressed files can be read in parallel and thus upload substantially faster.
If you prefer to store your source data in a compressed format, Avro, Parquet, and ORC are the preferred data formats, as these support parallelized compressed data ingestion at the row level.
Redivis will decompress gzipped text-delimited files, though the data ingest process may be substantially slower. If your file is compressed, it must have the .gz file extension if you're uploading locally (e.g., my_data.csv.gz), or have its header set to Content-Encoding: gzip if served from a URL or cloud storage location.
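
If you do need to gzip a text-delimited file before a local upload, a minimal Python sketch (with a placeholder filename) looks like this; note that the output keeps the required .gz extension:

import gzip
import shutil

# Compress an existing CSV for upload. The .gz extension (e.g., my_data.csv.gz)
# is required so that Redivis decompresses the file during ingest.
with open("my_data.csv", "rb") as src, gzip.open("my_data.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)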

Quotas & limits

  • Max dataset size: Unlimited
  • Max table size: 100TB
  • Max row size: 100MB
  • Max file sizes
    • Delimited (csv, tsv, etc.): 5TB
    • ndjson, geojson, geojsonl: 5TB
    • Avro, Parquet, ORC: 5TB
    • SAS (.sas7bdat): 500GB
    • json, kml, shp, shp.zip: 25GB
    • Stata (.dta), SPSS (.sav): 25GB
    • Excel (.xls, .xlsx): 25GB
  • Max tables, per dataset: 10,000
  • Max variables, per table: 9,990 * †
  • Max uploads per table, per version: 500 †
* The variable maximum applies across all versions of a table. If a variable exists in a previous version of the table, but is subsequently deleted, it will still count towards this variable maximum. While Redivis supports up to 9,990 variables per table, we strongly recommend restructuring your data to have fewer variables if possible. Such "wide" tables will generally be less performant and harder for researchers to navigate and query compared to "tall" tables with a few variables and many records.
† Depending on the length of variable names, the actual limits for max variables and uploads may be lower. In general, the total length of all variable names cannot exceed 185,000 characters, and the total number of variables multiplied by the total number of uploads (for a single version of the table) cannot exceed 400,000. If either of these limits is reached, the upload will fail with an accompanying error message.

Working with text-delimited files

A text-delimited file is a file that uses a specific character (the delimiter) to separate columns, with newlines separating rows.

Delimited file requirements

  • Must be UTF-8 encoded (ASCII is a valid subset of UTF-8)
  • Quote characters in cells must be properly escaped. For example, if a cell contains the content: Jane said, "Why hasn't this been figured out by now?" it must be encoded as: "Jane said, ""Why hasn't this been figured out by now?""" (see the example following this list)
  • The quote character must be used to escape the quote character. For example, the sequence \" is not valid for an escaped quote; it must be ""
  • Empty strings will be converted to null values
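
Most CSV libraries produce this escaping for you. For example, Python's built-in csv module doubles embedded quote characters by default, which satisfies the requirements above (the file and column names here are placeholders):

import csv

rows = [
    ["id", "comment"],
    [1, 'Jane said, "Why hasn\'t this been figured out by now?"'],
]

# csv.writer doubles any embedded quote characters (doublequote=True by default),
# which is exactly the escaping Redivis expects.
with open("comments.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# The second row is written as:
# 1,"Jane said, ""Why hasn't this been figured out by now?"""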

Delimited file options

Delimiter
The delimiter will be auto-inferred based upon an analysis of the file being uploaded. In rare cases, this inference may fail; you can specify the delimiter to override this inference.
Quote character
Specify the character used to enclose cells that contain the delimiter. Generally ", though some files may not have a quote character (in which case, cells must not contain the delimiter).
Has header row
Specifies whether the first row is a header containing the variable names. This will cause data to be read beginning on the 2nd row. If you don't provide a header in your file, variables will be automatically created as var1, var2, var3, etc...
Has quoted newlines
This option is necessary if there are newline characters within particular cells in your data (e.g., multiple paragraphs in one cell). If unchecked, Redivis will still attempt to auto-determine whether your file has quoted newlines, but false negatives may occur if the first quoted newline occurs far within your data file. Choosing this option unnecessarily will not cause any errors, though it will substantially slow down data ingest, and may cause inaccuracies in the error reporting if the file fails to upload for other reasons.
Skip invalid records
By default, an upload will fail if an invalid record is encountered. This includes a record that has a mismatched number of columns, or is otherwise not parsable. If this box is checked, invalid records will be skipped instead, and the number of skipped records will be displayed on each upload once it has been imported.
Allow jagged rows
Whether to allow rows that contain fewer or more columns than the first row of your file. It is recommended to leave this option unchecked, as jagged rows are generally a sign of a parsing error that should be remedied by changing other options or fixing the file.

Working with geospatial data

Geospatial file formats

Redivis supports importing geospatial data from several common GIS formats: geojson, shp, shp.zip, and kml. Internally, Redivis converts all formats to a geojson representation (using the relevant ogr2ogr driver), and then imports the geojson into a table.
Each feature will be imported as one row, with the geometry column containing the WKT representation for that feature. Additional feature properties will be mapped to variables in your table, with any nested properties flattened using the . separator. Note that Redivis only supports 2-dimensional, unprojected (WGS84) geometries. Other projections might cause the import to fail, and any extra dimensions will be stripped during ingest. If you are uploading a .shp.zip that contains projection information, the geometries will automatically be reprojected as part of the import process.

Geography data in text-delimited files

In addition to uploading geospatial data using one of the formats listed above, you can also import geographic data encoded within a text-delimited file (e.g., a csv). In this case, the geographic data should be encoded as strings using the Well-Known Text (WKT) representation. This is also the same format used when exporting geography variables as a CSV.
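
As a brief illustration (the file and column names are placeholders), such a CSV could be produced with Python's csv module, with each geometry written as a WKT string in WGS84 longitude/latitude coordinates:

import csv

rows = [
    ["site_id", "geometry"],
    ["A", "POINT(-122.4194 37.7749)"],
    ["B", "POLYGON((-122.5 37.7, -122.5 37.8, -122.3 37.8, -122.3 37.7, -122.5 37.7))"],
]

with open("sites.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)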

Variable names and types

Naming variables
Variable names are automatically inferred from the source data. They must be alphanumeric or underscore, and must start with a letter or underscore. Any invalid characters will be replaced with an underscore (_).
If the same variable is found more than once in any given file, it will automatically have a counter appended to it (e.g., "variable_2").
Variable names are limited to 60 characters; any longer names will be truncated.
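
For reference, the renaming rules above amount to roughly the following sketch (an illustration of the documented behavior, not Redivis's actual implementation; duplicate-name numbering is omitted):

import re

def normalize_variable_name(name, max_length=60):
    # Replace any character that is not alphanumeric or an underscore.
    name = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # Names must start with a letter or an underscore.
    if not re.match(r"[A-Za-z_]", name):
        name = "_" + name
    # Names longer than 60 characters are truncated.
    return name[:max_length]

print(normalize_variable_name("2021 total (%)"))  # _2021_total____
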
Variable type inference
All values of a variable must be compatible with its type. Redivis will automatically choose the most specific, valid type for a variable, with string being the default type.
Please note the following rules:
  • If all values of a variable are null, its type will be string
  • Numeric values with leading zeros will be stored as string in order to preserve the leading zeros (e.g., 000583 )
  • Data stored with decimal values will be stored as a float , even if that value is a valid integer (e.g., 1.0 ).
  • Temporal data types should be formatted using the canonical types below (see the example following this list). Redivis will attempt to parse other common date(time) formats, though this will only be successful when the format is unambiguous and internally consistent.
    • Date: YYYY-[M]M-[D]D
    • DateTime: YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.DDDDDD]]
    • Time: [H]H:[M]M:[S]S[.DDDDDD]
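
When preparing data in another tool, note that Python's standard datetime objects already serialize to these canonical formats, for example:

from datetime import date, datetime, time

print(date(2023, 4, 9).isoformat())                                 # 2023-04-09
print(datetime(2023, 4, 9, 14, 30, 5, 123456).isoformat(sep=" "))   # 2023-04-09 14:30:05.123456
print(time(14, 30, 5).isoformat())                                  # 14:30:05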

Working with multiple uploads

You can create up to 500 uploads per table, per version. Files will automatically be appended to each other based on their variable names (case insensitive), with the goal of creating one continuous table with a consistent schema.

Missing variables

If a variable is missing from some of the files you uploaded, its values will be set to null for all rows in the uploads that lack it.

Conflicting variable types

If files have conflicting types for a given variable, the lowest-common-denominator type for that variable is chosen when the files are combined.

Import sources

By default, you may upload data from your local computer or a public URL. However, Redivis supports numerous integrations for data ingest across common sources.

Google Cloud Storage (GCS)

You may import any object that you have read access to in GCS by specifying a bucket name and path to that object. You may import multiple objects at once by providing a prefix followed by wildcard characters, e.g.: /my-bucket/my-folder/* .
The following wildcard characters are supported:
  • * : Match any number of characters within the current directory level. For example, /my-bucket/my-folder/d* matches my-folder/data.csv , but not my-folder/data/text.csv
  • ** : Match any number of characters across directory boundaries. For example, my-folder/d** will match both examples provided above
  • ? : Match a single character. For example, /my-bucket/da??.csv matches /my-bucket/data.csv
  • [chars] : Match any of the specified characters once. For example, /my-bucket/[aeiou].csv matches any of the vowel characters followed by .csv
  • [char range] : Match any of the range of characters once. For example, /my-bucket/[0-9].csv matches any number followed by .csv

Amazon S3

You may import any object that you have read access to in S3 by specifying a bucket name and path to that object. You may import multiple objects at once by providing a prefix followed by a wildcard character, e.g.: /my-bucket/my-folder/* or /my-bucket/my-folder/prefix* .

Google Drive

You may import any file of valid format that you have stored within your Drive, including Google Sheets.

Google BigQuery

You may import any table that you have read access to in BigQuery. You must specify the table in the form project_name.dataset_id.table_id . To import multiple tables within a dataset, you may use wildcards. E.g., project_name.dataset_id.* or project_name.dataset_id.prefix* .
Please note that importing from table views is not currently supported.

Redivis

Import a table from a Redivis dataset or project that you have access to. You must be able to export the table in order to import the data to a dataset. Learn more about how this enables ETL workflows here.

Box

You may import any file of valid format that you have stored within Box.

OneDrive

Coming soon. Please contact [email protected] if this integration would be helpful for your use case so that we can prioritize it.

Scripted and streaming imports

In addition to uploading data through the browser interface, you can leverage the redivis-python and redivis-js client libraries to automate data ingest and data release pipelines. These libraries can be used for individual file uploads similar to the interface, as well as for streaming data ingest pipelines.
Consult the complete client library documentation for more details and additional examples.
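
As a rough sketch of a scripted upload (the exact method names and arguments below are assumptions drawn from the client library's documented patterns; confirm them against the redivis-python reference before relying on them):

import redivis

# Assumed API shape; verify each call against the redivis-python documentation.
dataset = redivis.user("my_username").dataset("my_dataset")
dataset = dataset.create_next_version(ignore_if_exists=True)  # version to receive uploads

table = dataset.table("my_table")
with open("my_data.csv", "rb") as f:
    table.upload("my_data.csv").create(f, type="delimited")

dataset.release()  # publish the new version once all uploads have finished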

Error handling

A file may fail to import for several reasons; in each case, Redivis endeavors to provide a clear error message to help you fix the problem.
To view full error information, including a snapshot of where the error occurred in your source file (when applicable), click on the failed upload in the upload manager.

Network issues

When transferring a file from your computer (or more rarely, from other import sources), there may be an interruption to the internet connection that prevents the file from being fully uploaded. In these cases, you should simply try uploading the file again.

Invalid or corrupted source data

Data invalidity is most common when uploading text-delimited files, though it can happen with any file format. While some data invalidity errors may require further investigation outside of Redivis, others may be due to incorrect options provided in the file upload process. When possible, Redivis will display ~1000 characters that are near the error in the source file, allowing you to identify the potential source of failure.

Common import errors

The Redivis data import tool has been built to gracefully handle a wide range of data formats and encodings. However, errors can still occur if the source data is "invalid"; some common problems (and their solutions) are outlined below.
If you're still unable to resolve the issue, please don't hesitate to reach out to [email protected]; we'd be happy to assist!

Bad CSV dump from SQL database

Some SQL databases and tutorials will generate invalid CSV escape sequences by default. Specifically:
Incorrect encoding:
val1,val2,"string with \"quotes\" inside"
Correct encoding:
val1,val2,"string with ""quotes"" inside"
The "proper" escape sequence is a doubling of the quote character. For MySQL, this would look like
SELECT ....
INTO OUTFILE '/.../out.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"', ESCAPED BY '"'
LINES TERMINATED BY '\n';
If you only have access to the invalid file generated by a previous database dump, you can specify a custom Quote Character of \ in the advanced import options, and Redivis will reformat the file as part of the ingest process (Redivis will also auto-detect this custom escape sequence in many scenarios). Using a custom escape sequence may cause data import processing to take a bit longer.
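
Alternatively, if you'd rather repair such a file yourself before uploading, a hedged sketch using Python's csv module (reading the backslash-escaped quotes, then rewriting with standard doubled quotes; the filenames are placeholders) could look like:

import csv

# Read a CSV whose quotes were escaped with backslashes, then rewrite it using
# the standard doubled-quote escaping that Redivis expects.
with open("bad_dump.csv", newline="", encoding="utf-8") as src, \
        open("fixed_dump.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src, doublequote=False, escapechar="\\")
    writer = csv.writer(dst)  # default dialect doubles embedded quotes
    writer.writerows(reader)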

Line breaks within cells

If your data has paragraphs of text within a particular data cell, and the "Has quoted newlines" advanced option isn't set, the data import may fail. Redivis will automatically set this option to true if it identifies a quoted newline in the top ~1000 records of the file, but if quoted newlines don't occur until later in the file, you'll need to set this option manually for the import to succeed.

Connectivity and timeout errors

While rare, it is always possible that data transfers will be interrupted by the vagaries of networking. If this happens, we recommend simply retrying your upload. If the problem persists, please reach out to [email protected].