Optimization and errors

Query optimization

The Redivis transform connects to a highly performant, parallelized data store. Queries on terabytes of data can complete in seconds, often utilizing the resources of thousands of compute nodes. These are some best practices that can help you increase performance.

Limit output table size

Table writes are generally much slower than table reads — if your output table is exceptionally large, it may take the querying engine several minutes to materialize the output. Try restricting the number of rows returned by applying row filters to your transforms when possible, and be cognizant of joins that may substantially increase your row count. Avoid keeping variables that aren't needed.

When you are performing initial exploration, you may consider using a limit step in your transform to reduce the output table size.

Reduce the number of new variables

Each new variable adds to the computational complexity of the query; the new variable must be computed for every row in the table.

A common anti-pattern is to construct numerous boolean CASE new variables, and then use the result of these new variables in the row filter(s). If possible, it is far more efficient to inline the CASE logic within the row filters, or within fewer new variables, as this allows for improved efficiency in logical short-circuiting.

Optimize join patterns

When your query utilizes a join step, consider the order in which you are joining the data. The best practice is to place the largest table first, followed by the smallest, and then by decreasing size.

While the query optimizer can determine which table should be on which side of the join, it is still recommended to order your joined tables appropriately.

If all of your joins are INNER joins, the join order will have no impact on the final output. If your query leverages combinations of left / right / inner joins, the join order may affect your output; be careful in these cases.

Common errors

The Redivis transform will prevent you from initiating a run when any steps are invalid, and errors should be rare. However, some errors can only be detected as the query is run — in these cases the job will fail as soon as it encounters an error, logging the error message to the top of the transform.

If you come across an error message not on this list, please email [email protected] for further assistance.

Resources exceeded during query execution

This error occurs when a query utilizes too much memory, yet it is often easily resolvable and due to unintended behavior within the transform. This error is caused by a certain component of the query not being parallelizable, which often occurs when combining and / or ordering many distinct values. We recommend investigating the following culprits:

Investigate any order clauses in your transform — either at the bottom of the transform or in a partitioned query. Often, attempts to order on hundreds of millions or billions of distinct values will fail. Note that ordering rows at the bottom of the transform does not affect your output, and is only useful in preparing your data for export.
Confirm that none of your aggregation methods are creating massive cells. For example, using the String aggregate method on an exceptionally large partition can collapse and concatenate many values into one record — if the cell becomes too big, this error will be thrown.

Cast / type conversion errors

When converting between variable types, all values must be appropriately formatted for conversion to the new type. For example, the value "1,000" is not a valid integer and will throw an error when being converted from a string to an integer.

There are several options for getting around cast errors:

Choose the "If invalid for type, set to null" option in your retype blocks. Note that this will set all invalid values to NULL, potentially causing unintended side effects. Use with caution.
Filter out all records that have incompatible values.
Create a new variable, using the Case method to convert any invalid values to something that can be appropriately cast.

Maximum table size

If an output table is more than 1TB, it cannot exceed the size of the sum of all source tables, + 10%. Very large output tables that substiantially exceed their inputs are typically the result of a misconfigured join that generates a cross-product between a one-to-many or many-to-many relationhip between multiple tables. If you encounter this error, try to apply filter and aggregation steps first, and also validate that your join conditions are appropriately specific.

Too many tables, views and user-defined functions for query: Max: 1000

This error may occur when running queries on tables belonging to an unreleased version, particularly when these tables are made up of hundreds of independent uploads. Under the hood, these unreleased tables are represented as a logical view that stitches the various uploads together into a single table. If your query references multiple unreleased tables, with each approaching the 500 per-table upload limit, it's possible to exceed the total allowed number of tables referenced by a query.

To work around this issue, you can create a transform that simply selects all variables from the unreleased table, materializing the result in your workflow. This output table will now only count as a single table in your query, avoiding this error.

This error will also no longer be an issue once the version is released, as the table is materialized shortly after a version is released.

Other errors

If you come across any other errors or issues while using the transform please contact us directly at [email protected]

Last updated 7 months ago

Was this helpful?