
Uploading tabular data

Overview

To create new tables or update existing ones with new data, you will need to upload your data from an existing source. Redivis supports data ingest from numerous sources and a wide range of file formats, and multiple uploads can be combined into a single table.
For a guided walkthrough of uploading data to a dataset, please see the Creating a dataset guide.

Supported tabular file types

Type
Description
Notes
.csv, .tsv, .psv, .dsv, .txt, .tab, *
Any text-delimited file. Redivis will auto-infer the delimiter, or you can specify it manually. This is also the default format for files with missing file extensions.
See working with text-delimited files
.avro
Avro format
Compressed data blocks using the DEFLATE and Snappy codecs are supported.
Nested and repeated fields are not supported.
.parquet
Parquet format
Nested and repeated fields are not supported.
.orc
ORC format
Nested and repeated fields are not supported.
.json
JSON
Assumes an array of objects, where the objects' keys represent variable names.
The value for each key must be a literal; nested and repeated fields are not supported.
.ndjson, .jsonl
Newline-delimited JSON
Same as the .json specification outlined above, except each object (corresponding to a row of key:value pairs) is given its own line. Importing ndjson (as opposed to json) will be significantly faster. A short example of both layouts appears after this table.
.geojson
GeoJSON
Assumes an object with a "Features" property, containing an array of valid geojson features. Each feature will be imported as one row, with additional properties mapped to columns in the table. Nested properties will be flattened using the . separator. Note that Redivis only supports 2-dimensional, unprojected (WGS84) geometries. Other projections might cause the import to fail, and any extra dimensions will be stripped during ingest. See working with geospatial data for more information.
.geojsonl, .ndgeojson, .geojsons
Newline-delimited GeoJSON
Same as the .geojson specification outlined above, except each feature is given its own line. Importing .geojsonl (as opposed to .geojson) will be significantly faster.
.kml
Keyhole Markup Language
Will be internally converted to .geojson (via ogr2ogr), and then imported as specified above.
.shp
Shapefile
Will be internally converted to .geojson (via ogr2ogr), and then imported as specified above. Note that the shapefile must use the WGS84 (aka EPSG:4326) projection. If you have additional files associated with your shapefile (e.g., .shx, .prj, .dbf), create a .zip of this folder and import according to the .shp.zip specification below.
.shp.zip
Zipped ESRI shapefile directory
Many shapefiles will be collocated with additional files containing metadata and projection information. These files are often essential to parsing the shapefile correctly, and should be uploaded together. To do so, create a .zip directory of the folder containing your .shp file and supplemental files. The zip file must end in .shp.zip. These will then be converted to .geojson (via ogr2ogr), and imported as specified for the .geojson format.
If projection information is available, the source geometries will be reprojected into WGS84. If no projection information is available, your data must already be projected as WGS84, or the import will fail. Note that only one layer can be imported at a time. If you have a directory containing multiple shapefiles, create a separate .shp.zip for each layer.
.xls, .xlsx
Excel file
Only the first sheet will be ingested.
.sas7bdat
SAS data file
Default formats will be interpreted as the corresponding variable type, and variable labels will automatically be imported.
User-defined formats (.sas7bcat) are not supported.
.dta
Stata data file
Variable labels and value labels will automatically be imported.
.sav
SPSS data file
Variable labels and value labels will automatically be imported.
Google Sheets
Sheets file stored in Google Drive
Only the first tab of data will be ingested.
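
As a concrete illustration of the .json and .ndjson layouts described in the table above, the following Python sketch writes the same two records in both formats (the file and field names are placeholders):

import json

records = [
    {"id": 1, "name": "Alice", "score": 9.5},
    {"id": 2, "name": "Bob", "score": 7.0},
]

# .json: a single array of flat objects; the objects' keys become variable names.
with open("records.json", "w") as f:
    json.dump(records, f)

# .ndjson / .jsonl: one object per line, which Redivis can ingest much faster.
with open("records.ndjson", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
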
Uploading compressed (gzipped) files:
Generally, you should upload uncompressed data files to Redivis, as uncompressed files can be read in parallel and thus upload substantially faster.
If you prefer to store your source data in a compressed format, Avro, Parquet, and ORC are the preferred data formats, as these support parallelized compressed data ingestion at the row level.
Redivis will decompress gzipped text-delimited files, though the data ingest process may be substantially slower. If your file is compressed, it must have the .gz file extension if you're uploading locally (e.g., my_data.csv.gz), or have its header set to Content-Encoding: gzip if served from a URL or cloud storage location.
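
If you do need to gzip a text-delimited file before a local upload, a minimal Python sketch (with a placeholder filename) looks like this; note that the output keeps the required .gz extension:

import gzip
import shutil

# Compress an existing CSV for upload. The .gz extension (e.g., my_data.csv.gz)
# is required so that Redivis decompresses the file during ingest.
with open("my_data.csv", "rb") as src, gzip.open("my_data.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)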

Quotas & limits

  • Max dataset size: Unlimited
  • Max table size: 100TB
  • Max row size: 100MB
  • Max file sizes
    • Delimited (csv, tsv, etc.): 5TB
    • ndjson, geojson, geojsonl: 5TB
    • Avro, Parquet, ORC: 5TB
    • SAS (.sas7bdat): 500GB
    • json, kml, shp, shp.zip: 25GB
    • Stata (.dta), SPSS (.sav): 25GB
    • Excel (.xls, .xlsx): 25GB
  • Max tables, per dataset: 10,000
  • Max variables, per table: 9,990 * †
  • Max uploads per table, per version: 500 †
* The variable maximum applies across all versions of a table. If a variable exists in a previous version of the table, but is subsequently deleted, it will still count towards this variable maximum. While Redivis supports up to 9,990 variables per table, we strongly recommend restructuring your data to have fewer variables if possible. Such "wide" tables will generally be less performant and harder for researchers to navigate and query compared to "tall" tables with a few variables and many records.
† Depending on the length of variable names, the actual limits for max variables and uploads may be lower. In general, the total length of all variable names cannot exceed 185,000 characters, and the total number of variables multiplied by the total number of uploads (for a single version of the table) cannot exceed 400,000. If either of these limits is reached, the upload will fail with an accompanying error message.

Working with text-delimited files

A text-delimited file is a file that uses a specific character (the delimiter) to separate columns, with newlines separating rows.

Delimited file requirements

  • Must be UTF-8 encoded (ASCII is a valid subset of UTF-8)
  • Quote characters in cells must be properly escaped. For example, if a cell contains the content: Jane said, "Why hasn't this been figured out by now?" it must be encoded as: "Jane said, ""Why hasn't this been figured out by now?""" (see the example following this list)
  • The quote character must be used to escape the quote character. For example, the sequence \" is not valid for an escaped quote; it must be ""
  • Empty strings will be converted to null values
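
Most CSV libraries produce this escaping for you. For example, Python's built-in csv module doubles embedded quote characters by default, which satisfies the requirements above (the file and column names here are placeholders):

import csv

rows = [
    ["id", "comment"],
    [1, 'Jane said, "Why hasn\'t this been figured out by now?"'],
]

# csv.writer doubles any embedded quote characters (doublequote=True by default),
# which is exactly the escaping Redivis expects.
with open("comments.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# The second row is written as:
# 1,"Jane said, ""Why hasn't this been figured out by now?"""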

Delimited file options

Delimiter
The delimiter will be auto-inferred based upon an analysis of the file being uploaded. In rare cases, this inference may fail; you can specify the delimiter to override this inference.
Quote character
Specify the character used to enclose cells that contain the delimiter. Generally ", though some files may not have a quote character (in which case, cells must not contain the delimiter).
Has header row
Specifies whether the first row is a header containing the variable names. This will cause data to be read beginning on the 2nd row. If you don't provide a header in your file, variables will be automatically created as var1, var2, var3, etc...
Has quoted newlines
This option is necessary if there are newline characters within particular cells in your data (e.g., multiple paragraphs in one cell). If unchecked, Redivis will still attempt to auto-determine whether your file has quoted newlines, but false negatives may occur if the first quoted newline occurs far within your data file. Choosing this option unnecessarily will not cause any errors, though it will substantially slow down data ingest, and may cause inaccuracies in the error reporting if the file fails to upload for other reasons.
Skip invalid records
By default, an upload will fail if an invalid record is encountered. This includes a record that has a mismatched number of columns, or is otherwise not parsable. If this box is checked, invalid records will be skipped instead, and the number of skipped records will be displayed on each upload once it has been imported.
Allow jagged rows
Whether to allow rows that contain fewer or more columns than the first row of your file. It is recommended to leave this option unchecked, as jagged rows are generally a sign of a parsing error that should be remedied by changing other options or fixing the file.

Working with geospatial data

Geospatial file formats

Redivis supports importing geospatial data from several common GIS formats: geojson, shp, shp.zip, and kml. Internally, Redivis converts all formats to a geojson representation (using the relevant ogr2ogr driver), and then imports the geojson into a table.
Each feature will be imported as one row, with the geometry column containing the WKT representation for that feature. Additional feature properties will be mapped to variables in your table, with any nested properties flattened using the . separator. Note that Redivis only supports 2-dimensional, unprojected (WGS84) geometries. Other projections might cause the import to fail, and any extra dimensions will be stripped during ingest. If you are uploading a .shp.zip that contains projection information, the geometries will automatically be reprojected as part of the import process.

Geography data in text-delimited files

In addition to uploading geospatial data using one of the formats listed above, you can also import geographic data encoded within a text-delimited file (e.g., a csv). In this case, the geographic data should be encoded as strings using the Well-Known Text (WKT) representation. This is also the same format used when exporting geography variables as a CSV.
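
As a brief illustration (the file and column names are placeholders), such a CSV could be produced with Python's csv module, with each geometry written as a WKT string in WGS84 longitude/latitude coordinates:

import csv

rows = [
    ["site_id", "geometry"],
    ["A", "POINT(-122.4194 37.7749)"],
    ["B", "POLYGON((-122.5 37.7, -122.5 37.8, -122.3 37.8, -122.3 37.7, -122.5 37.7))"],
]

with open("sites.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)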

Variable names and types

Naming variables
Variable names are automatically inferred from the source data. They must be alphanumeric or underscore, and must start with a letter or underscore. Any invalid characters will be replaced with an underscore (_).
If the same variable is found more than once in any given file, it will automatically have a counter appended to it (e.g., "variable_2").
Variable names are limited to 60 characters; any longer names will be truncated.
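
For reference, the renaming rules above amount to roughly the following sketch (an illustration of the documented behavior, not Redivis's actual implementation; duplicate-name numbering is omitted):

import re

def normalize_variable_name(name, max_length=60):
    # Replace any character that is not alphanumeric or an underscore.
    name = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # Names must start with a letter or an underscore.
    if not re.match(r"[A-Za-z_]", name):
        name = "_" + name
    # Names longer than 60 characters are truncated.
    return name[:max_length]

print(normalize_variable_name("2021 total (%)"))  # _2021_total____
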
Variable type inference
All values of a variable must be compatible with its type. Redivis will automatically choose the most specific, valid type for a variable, with string being the default type.
Please note the following rules:
  • If all values of a variable are null, its type will be string
  • Numeric values with leading zeros will be stored as string in order to preserve the leading zeros (e.g., 000583 )
  • Data stored with decimal values will be stored as a float , even if that value is a valid integer (e.g., 1.0 ).
  • Temporal data types should be formatted using the canonical types below (see the example following this list). Redivis will attempt to parse other common date(time) formats, though this will only be successful when the format is unambiguous and internally consistent.
    • Date: YYYY-[M]M-[D]D
    • DateTime: YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.DDDDDD]]
    • Time: [H]H:[M]M:[S]S[.DDDDDD]
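
When preparing data in another tool, note that Python's standard datetime objects already serialize to these canonical formats, for example:

from datetime import date, datetime, time

print(date(2023, 4, 9).isoformat())                                 # 2023-04-09
print(datetime(2023, 4, 9, 14, 30, 5, 123456).isoformat(sep=" "))   # 2023-04-09 14:30:05.123456
print(time(14, 30, 5).isoformat())                                  # 14:30:05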

Working with multiple uploads

You can create up to 500 uploads per table, per version. Files will automatically be appended to each other based on their variable names (case insensitive), with the goal of creating one continuous table with a consistent schema.

Missing variables

If a variable is missing from some of the files you uploaded, its values will be set to null for all rows in the uploads that lack it.

Conflicting variable types

If files have conflicting types for a given variable, the lowest-common-denominator type for that variable is chosen when the files are combined.

Import sources

By default, you may upload data from your local computer or a public URL. However, Redivis supports numerous integrations for data ingest across common sources.

Google Cloud Storage (GCS)

You may import any object that you have read access to in GCS by specifying a bucket name and path to that object. You may import multiple objects at once by providing a prefix followed by wildcard characters, e.g.: /my-bucket/my-folder/* .
The following wildcard characters are supported:
  • * : Match any number of characters within the current directory level. For example, /my-bucket/my-folder/d* matches my-folder/data.csv , but not my-folder/data/text.csv
  • ** : Match any number of characters across directory boundaries. For example, my-folder/d** will match both examples provided above
  • ? : Match a single character. For example, /my-bucket/da??.csv matches /my-bucket/data.csv
  • [chars] : Match any of the specified characters once. For example, /my-bucket/[aeiou].csv matches any of the vowel characters followed by .csv
  • [char range] : Match any of the range of characters once. For example, /my-bucket/[0-9].csv matches any number followed by .csv

Amazon S3

You may import any object that you have read access to in S3 by specifying a bucket name and path to that object. You may import multiple objects at once by providing a prefix followed by a wildcard character, e.g.: /my-bucket/my-folder/* or /my-bucket/my-folder/prefix* .

Google Drive

You may import any file of valid format that you have stored within your Drive, including Google Sheets.

Google BigQuery

You may import any table that you have read access to in BigQuery. You must specify the table in the form project_name.dataset_id.table_id . To import multiple tables within a dataset, you may use wildcards. E.g., project_name.dataset_id.* or project_name.dataset_id.prefix* .
Please note that importing from table views is not currently supported.

Redivis

Import a table from a Redivis dataset or project that you have access to. You must be able to export the table in order to import the data to a dataset. Learn more about how this enables ETL workflows here.

Box

You may import any file of valid format that you have stored within Box.

OneDrive

Coming soon. Please contact [email protected] if this integration would be helpful for your use case so that we can prioritize it.

Scripted and streaming imports

In addition to uploading data through the browser interface, you can leverage the redivis-python and redivis-js client libraries to automate data ingest and data release pipelines. These libraries can be used for individual file uploads similar to the interface, as well as for streaming data ingest pipelines.
Consult the complete client library documentation for more details and additional examples.
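
As a rough sketch of a scripted upload (the exact method names and arguments below are assumptions drawn from the client library's documented patterns; confirm them against the redivis-python reference before relying on them):

import redivis

# Assumed API shape; verify each call against the redivis-python documentation.
dataset = redivis.user("my_username").dataset("my_dataset")
dataset = dataset.create_next_version(ignore_if_exists=True)  # version to receive uploads

table = dataset.table("my_table")
with open("my_data.csv", "rb") as f:
    table.upload("my_data.csv").create(f, type="delimited")

dataset.release()  # publish the new version once all uploads have finished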

Error handling

A file may fail to import for several reasons; in each case, Redivis endeavors to provide a clear error message to help you fix the problem.
To view full error information, including a snapshot of where the error occurred in your source file (when applicable), click on the failed upload in the upload manager.

Network issues

When transferring a file from your computer (or more rarely, from other import sources), there may be an interruption to the internet connection that prevents the file from being fully uploaded. In these cases, you should simply try uploading the file again.

Invalid or corrupted source data

Data invalidity is most common when uploading text-delimited files, though it can happen with any file format. While some data invalidity errors may require further investigation outside of Redivis, others may be due to incorrect options provided in the file upload process. When possible, Redivis will display ~1000 characters that are near the error in the source file, allowing you to identify the potential source of failure.

Common import errors

The Redivis data import tool has been built to gracefully handle a wide range of data formats and encodings. However, errors can still occur if the source data is "invalid"; some common problems (and their solutions) are outlined below.
If you're still unable to resolve the issue, please don't hesitate to reach out to [email protected]; we'd be happy to assist!

Bad CSV dump from SQL database

Some SQL databases and tutorials will generate invalid CSV escape sequences by default. Specifically:
Incorrect encoding:
val1,val2,"string with \"quotes\" inside"
Correct encoding:
val1,val2,"string with ""quotes"" inside"
The "proper" escape sequence is a doubling of the quote character. For MySQL, this would look like
SELECT ....
INTO OUTFILE '/.../out.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"', ESCAPED BY '"'
LINES TERMINATED BY '\n';
If you only have access to the invalid file generated by a previous database dump, you can specify a custom Quote Character of \ in the advanced import options, and Redivis will reformat the file as part of the ingest process (Redivis will also auto-detect this custom escape sequence in many scenarios). Using a custom escape sequence may cause data import processing to take a bit longer.
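
Alternatively, if you'd rather repair such a file yourself before uploading, a hedged sketch using Python's csv module (reading the backslash-escaped quotes, then rewriting with standard doubled quotes; the filenames are placeholders) could look like:

import csv

# Read a CSV whose quotes were escaped with backslashes, then rewrite it using
# the standard doubled-quote escaping that Redivis expects.
with open("bad_dump.csv", newline="", encoding="utf-8") as src, \
        open("fixed_dump.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src, doublequote=False, escapechar="\\")
    writer = csv.writer(dst)  # default dialect doubles embedded quotes
    writer.writerows(reader)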

Line breaks within cells

If your data has paragraphs of text within a particular data cell, and the "Has quoted newlines" advanced option isn't set, the data import may fail. Redivis will automatically set this option to true if it identifies a quoted newline in the top ~1000 records of the file, but if quoted newlines don't occur until later in the file, you'll need to set this option manually for the import to succeed.

Connectivity and timeout errors

While rare, it is always possible that data transfers will be interrupted by the vagaries of networking. If this happens, we recommend simply retrying your upload. If the problem persists, please reach out to [email protected].