Tabular data
Overview
In order to create new tables, or update existing tables with new data, you will need to ingest a valid tabular data file from a supported data source. Redivis supports numerous tabular data formats, with robust error handling and type inference. You can also perform scripted imports via the programmatic interface.
Tables on Redivis can be made up of one or more tabular file uploads. The multi-upload use case applies when data with the same general schema is broken up across files. For example, a dataset where there is a separate file for each year. In general, it's best to combine such files into a single table, as it is easier for researchers to query a single "tall" table than multiple tables (and Redivis's high-performance query engine keeps these queries quick, even at terabyte scale).
Supported tabular file types
Redivis will try to detect the file type based on the file extension, though you can manually specify the type as needed.
.csv, .tsv, .psv, .dsv, .txt, .tab, *
Delmited
Redivis will auto-infer the delimiter, or you can specify it manually. This is also the default format for files with missing file extensions.
See working with text-delimited files
.jsonl, .ndjson
JSON-lines (a.k.a. Newline-delimited JSON)
A newline-delimited list of JSON objects, with one object on each line. Each objects' keys correspond to the variable names in the table.
.json
JSON
Must be a JSON array of objects, where each top-level object represents one row in the table. The keys of each top-level object correspond to the variable names in the table.
Importing newline-delimited JSON (see above) will be significantly faster and is recommended for larger files. If your file is formatted as GeoJSON, but has the ending .json, make sure to explicitly choose "GeoJSON" as the file format.
.avro
Avro format
Compressed data blocks using the DEFLATE and Snappy codecs are supported.
Nested and repeated fields are not supported.
.parquet
Parquet format
Nested and repeated fields are not supported.
.orc
Orc format
Nested and repeated fields are not supported.
.sas7bdat
SAS data file
Default formats will be interpreted to the corresponding variable type, and variable labels will automatically be imported.
User defined formats (.sas7bcat) are not support.
.dta
Stata data file
Variable labels and value labels will automatically be imported.
.sav
SPSS data file
Variable labels and value labels will automatically be imported.
.xls, .xlsx
Excel file
Only the first sheet will be ingested. The legacy .xls format will have all dates and times represented as dateTimes. Due to the variability of excel files, and inconsistencies in how excel internally represents dates, this formatted is typically not recommended if other options are available.
Google sheets
Sheets file stored in Google Drive
Only the first tab of data will be ingested.
Uploading compressed (gzipped) files:
Generally, you should upload uncompressed data files to Redivis, as uncompressed files can be read in parallel and thus upload substantially faster.
If you prefer to store your source data in a compressed format, Avro, Parquet, and ORC are the preferred data formats, as these support parallelized compressed data ingestion at the row level.
Redivis will decompress text-delimited files, though the data ingest process may be substantially slower. If your file is compressed, it must have the .gz file extension if you're uploading locally (e.g., my_data.csv.gz
) or have it's header set to Content-Encoding: gzip
if served from a URL or cloud storage location.
Quotas & limits
Most upload types are limited to 5TB per upload. Stata, SPSS, XLS, and Shapefile ZIP directories are limited to 100GB.
Full documentation on the limits for upload file size, max variables, and other parameters are specified here.
Working with delimited files
A text-delimited file is a file that uses a specific character (the delimiter) to separate columns, with newlines separating rows.
Delimited file requirements
Must be UTF-8 encoded (ASCII is a valid subset of UTF-8)
Quote characters in cells must be properly escaped. For example, if a cell contains the content:
Jane said, "Why hasn't this been figured out by now?"
it must be encoded as:"Jane said, ""Why hasn't this been figured out by now?"""
The quote character must be used to escape the quote character. For example, the sequence
\"
is not valid for an escaped quote; it must be""
Empty strings will be converted to
null
values
Advanced options for delimited files
Has header row
Specifies whether the first row is a header containing the variable names. This will cause data to be read beginning on the 2nd row. If you don't provide a header in your file, variables will be automatically created as var1, var2, var3, etc...
Skip invalid records
By default, an upload will fail if an invalid record is encountered. This includes a record that has a mismatched number of columns, or is otherwise not parsable. If this box is checked, the number of skipped records will be displayed on each upload once it has been imported.
Allow jagged rows
Whether to allow rows that contain fewer or more columns than the first row of your file. It is recommended to leave this option unchecked, as jagged rows are generally a sign of a parsing error that should be remedied by changing other options or fixing the file.
Cells contain line breaks
Whether newlines exist within specific data cells (e.g., paragraphs of text). If set to Auto, Redivis will determine the value based on analysis of the beginning of the file.
It is best to only set this value to "Yes" if you know your data contain line breaks, as this will slow down the import and may cause incorrect error reporting.
Delimiter
The delimiter will be auto-inferred based upon an analysis of the file being uploaded. In rare cases, this inference may fail; you can specify the delimiter to override this inference.
Quote character
Specify the character used to escape delimiters. Generally "
, though some files may not have a quote character (in which case, they must not include the delimiter within any cells).
Escape character
Cells containing a quote character must have that character escaped.
Typically, the escape sequence character is the same as the quote character, but some files may use a different value, such as a backslash (\).
Variable names and types
Naming variables
Variable names are automatically inferred from the source data. They can only contain alphanumeric or underscore characters, and must start with a letter or underscore. Any invalid characters will be replaced with an underscore (_
).
Variable names must be unique within the table. If the same variable is found more than once in any given file, it will automatically have a counter appended to it (e.g., "variable_2").
The max number of characters for a variable name is 60. Any names with more characters will be truncated.
Variable type inference
All values of a variable must be compatible with its type. Redivis will automatically choose the most specific, valid type for a variable, with string
being the default type.
Please note the following rules:
If all values of a variable are null, its type will be
string
Numeric values with leading zeros will be stored as
string
in order to preserve the leading zeros (e.g.,000583
)Data stored with decimal values will be stored as a
float
, even if that value is a valid integer (e.g.,1.0
).Temporal data types should be formatted using the canonical types below. Redivis will attempt to parse other common date(time) formats, though this will only be successful when the format is unambiguous and internally consistent.
Date:
YYYY-[M]M-[D]D
DateTime:
YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.DDDDDD]
Time:
[H]H:[M]M:[S]S[.DDDDDD]
Working with multiple uploads
You can create up to 500 uploads per table, per version. Files will automatically be appended to each other based on their variable names (case insensitive), with the goal of creating one continuous table with a consistent schema.
Missing variables
If a variable is missing in some of the files you uploaded, the values for the missing variable will be set to null for all rows in the upload.
Conflicting variable types
If files have conflicting types across a given variable, the lowest-denominator type for that variable is chosen when the files are combined.
Error handling
A file may fail to import due to several reasons; in each case, Redivis endeavors to provide a clear error message for you to fix the error.
In order to view full error information, including a snapshot of where the error occurred in your source file (when applicable), click on the failed upload in the upload manager
Network issues
When transferring a file from your computer (or more rarely, from other import sources), there may be an interruption to the internet connection that prevents the file from being fully uploaded. In these cases, you should simply try uploading the file again.
Invalid or corrupted source data
Data invalidity is most common when uploading text-delimited files, though it can happen with any file format. While some data invalidity errors may require further investigation outside of Redivis, others may be due to incorrect options provided in the file upload process. When possible, Redivis will display ~1000 characters that are near the error in the source file, allowing you to identify the potential source of failure.
Common import errors
The Redivis data import tool has been built to gracefully handle a wide range of data formats and encodings. However, errors can still occur if the source data is "invalid"; some common problems (and their solutions) are outlined below.
If you're still unable to resolve issue, please don't hesitate to reach out to support@redivis.com; we'd be happy to assist!
Bad CSV dump from SQL database
Some SQL databases and tutorials will generate invalid CSV escape sequences by default. Specifically:
The "proper" escape sequence is a doubling of the quote character. For MySQL, this would look like
If you only have access to the invalid file generated by a previous database dump, you can specify a custom Quote Character of \
in the advanced import options, and Redivis will reformat the file as part of the ingest process (Redivis will also auto-detect this custom escape sequence in many scenarios). Using a custom escape sequence may cause data import processing to take a bit longer.
Line breaks within cells
If your data has paragraphs of text within a particular data cell, and the "Has quoted newlines" advanced option isn't set, the data import may fail (see above screenshot for example). Redivis will automatically set this option to true if it identifies a quoted newline in the top ~1000 records of the file, but if quoted newlines don't occur until later, you'll need to make sure to set this option manually for the import to succeed.
Connectivity and timeout errors
While rare, it is always possible that data transfers will be interrupted by the vagaries of networking. If this happens, we recommend simply retrying your upload. If the problem persists, please reach out to support@redivis.com.
Last updated