
3 posts tagged with "functions"


The hidden traps of regex in LIKE and split

· 6 min read
Masha Basmanova
Software Engineer @ Meta

SQL functions sometimes use regular expressions under the hood in ways that surprise users. Two common examples are the LIKE operator and Spark's split function.

In Presto, split takes a literal string delimiter and regexp_split is a separate function for regex-based splitting. Spark's split, however, always treats the delimiter as a regular expression.

Both LIKE and Spark's split can silently produce wrong results and waste CPU when used with column values instead of constants. Understanding why this happens helps write faster, more correct queries — and helps engine developers make better design choices.

LIKE is not contains

A very common query pattern is to check whether one string contains another:

SELECT * FROM t WHERE name LIKE '%' || search_term || '%'

This looks intuitive: wrap search_term in % wildcards and you get a "contains" check. But LIKE is not the same as substring matching. LIKE treats _ as a single-character wildcard and % as a multi-character wildcard. If search_term comes from a column and contains these characters, the results are silently wrong:

SELECT url,
url LIKE '%' || search_term || '%' AS like_result,
strpos(url, search_term) > 0 AS contains_result
FROM (VALUES
('https://site.com/home'),
('https://site.com/user_profile'),
('https://site.com/username')
) AS t(url)
CROSS JOIN (VALUES ('user_')) AS s(search_term);

url | like_result | contains_result
-------------------------------+-------------+----------------
https://site.com/home | false | false
https://site.com/user_profile | true | true
https://site.com/username | true | false

LIKE '%user_%' matches 'https://site.com/username' because _ is a wildcard that matches any single character — in this case, n. But strpos(url, 'user_') > 0 treats _ as a literal underscore and correctly reports that 'https://site.com/username' does not contain the substring 'user_'.

When the pattern is a constant, this distinction is visible and intentional. But when users write x LIKE '%' || y || '%' where y is a column, the values of y may contain _ or % characters — and they will be silently interpreted as wildcards, producing wrong results.
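To make the semantics concrete, here is a minimal Python sketch of what LIKE does under the hood (illustrative only; `like_to_regex` and `sql_like` are hypothetical helpers, not engine code):

```python
import re

def like_to_regex(pattern: str) -> str:
    # % matches any sequence of characters, _ matches exactly one;
    # everything else is matched literally.
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

def sql_like(value: str, pattern: str) -> bool:
    return re.fullmatch(like_to_regex(pattern), value) is not None

# '_' in the search term silently becomes a wildcard:
print(sql_like("https://site.com/username", "%user_%"))  # True
print("user_" in "https://site.com/username")            # False
```

The pattern `%user_%` compiles to the regex `.*user..*`, where the `.` happily matches the `n` in `username` — exactly the silent mismatch shown in the table above.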

Spark's split treats delimiters as regex

In Presto, the split function takes a literal string delimiter, while regexp_split is a separate function for regex-based splitting. This distinction makes the intent clear.

Spark's split function, however, always treats the delimiter as a regular expression. Users rarely realize this, and a common pattern is to split a string using a value from another column:

SELECT split(dir_path, location_path)[1] AS partition_name FROM t

Here, a table stores Hive partition metadata: dir_path is the full partition path (e.g., /data/warehouse/db.name/table/ds=2024-01-01) and location_path is the table path (e.g., /data/warehouse/db.name/table). The user wants to strip the table path prefix to get the partition name.

This works for simple paths. But location_path is interpreted as a regular expression, not a literal string. If it contains . — as in db.name — the . matches any character, not a literal dot. Characters like (, ), [, +, *, ?, and $ would also cause wrong results or errors.
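The mismatch is easy to reproduce with Python's `re` module standing in for Spark's regex-based split (an illustration, not Spark's implementation):

```python
import re

location_path = "/data/warehouse/db.name/table"

# The '.' in 'db.name' matches ANY character, so the delimiter also
# matches a path belonging to a completely different database:
other = "/data/warehouse/dbXname/table/ds=2024-01-01"

regex_split = re.split(location_path, other)  # '.' matched the 'X'
literal_split = other.split(location_path)    # no literal match, no split

print(regex_split)    # ['', '/ds=2024-01-01']  -- wrong table matched!
print(literal_split)  # the original path, unsplit
```

With the delimiter treated as a regex, the query silently strips a prefix it was never supposed to match; a literal string split leaves the unrelated path intact.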

A correct alternative that also executes faster uses simple string operations:

IF(starts_with(dir_path, location_path),
substr(dir_path, length(location_path) + 2)) as partition_name

This is a bit more verbose than split(dir_path, location_path)[1], but it is correct for all inputs and avoids regex compilation entirely.

Performance trap

Beyond correctness, there is a performance problem. Both LIKE and Spark's split use RE2 as the regex engine. RE2 is fast and safe, but compiling a regular expression can take up to 200x more CPU time than evaluating it.

When the pattern or delimiter is a constant, the regex is compiled once and reused for every row. The cost is negligible. But when the pattern comes from a column, a new regex may need to be compiled for every distinct value. A table with thousands of distinct location_path values means thousands of regex compilations — each one expensive and none of them necessary.

Velox limits the number of compiled regular expressions per function instance per thread of execution via the expression.max_compiled_regexes configuration property (default: 100). When this limit is reached, the query fails with an error.
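The idea behind the limit can be sketched as a compiled-regex cache with a hard cap (a simplified illustration; `BoundedRegexCache` is a hypothetical class, not Velox's actual implementation):

```python
import re

class BoundedRegexCache:
    """Per-function-instance cache of compiled regexes with a hard cap,
    similar in spirit to Velox's expression.max_compiled_regexes."""

    def __init__(self, max_entries: int = 100):
        self.max_entries = max_entries
        self.cache = {}

    def get(self, pattern: str):
        compiled = self.cache.get(pattern)
        if compiled is None:
            if len(self.cache) >= self.max_entries:
                # Fail loudly instead of compiling without bound.
                raise RuntimeError(
                    "Too many distinct regex patterns; rewrite the query to "
                    "use a constant pattern or literal string functions.")
            compiled = re.compile(pattern)
            self.cache[pattern] = compiled
        return compiled
```

Repeated lookups of the same pattern hit the cache and pay nothing; the query fails only when the number of distinct patterns exceeds the cap.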

Tempting but wrong fix

When users hit this limit, the natural reaction is to ask the engine developers to raise or eliminate the cap. A recent pull request proposed replacing the fixed-size cache with an evicting cache: when the limit is reached, the oldest compiled regex is evicted to make room for the new one.

This sounds reasonable, and the motivation is understandable — users migrating from Spark don't want to rewrite working queries. But it makes things worse:

  • It hides the correctness bug. The query no longer fails, so users never discover that their LIKE pattern or split delimiter is being interpreted as a regex and producing wrong results for inputs with special characters.
  • It makes the performance problem worse. With thousands of distinct patterns, the cache churns constantly — evicting one compiled regex only to compile another. The query runs, but dramatically slower than necessary, and the user has no indication why. In shared multi-tenant clusters, a single slow query like this can consume excessive CPU and affect other users' workloads.

The error is a feature, not a bug. It is an early warning that catches misuse before it leads to silently wrong results in production and prevents a single query from wasting shared cluster resources.

Right fix

For users: replace LIKE with literal string operations when checking for substrings. Use strpos(x, y) > 0 or contains(x, y) instead of x LIKE '%' || y || '%'. For Spark's split with literal delimiters, use substr or other string functions that don't involve regex.

For engine developers: optimize the functions to avoid regex when it isn't needed. Velox's LIKE implementation already does this. As described in James Xu's earlier blog post, the engine analyzes each pattern and uses fast paths — prefix match, suffix match, substring search — whenever the pattern contains only regular characters and _ wildcards. For simple patterns, this gives up to 750x speedup over regex. Regex is compiled only for patterns that truly require it, and these optimized patterns are not counted toward the compiled regex limit.

The same approach should be applied to Spark's split function. The engine can check whether the delimiter contains any regex metacharacters. If it doesn't, a simple string search can be used instead of compiling a regex. This would make queries like split(dir_path, location_path) both fast and correct — without users needing to change anything and without removing the safety net for cases that genuinely require regex.
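A sketch of such a fast path in Python (`split_smart` is a hypothetical helper illustrating the check, not engine code):

```python
import re

_REGEX_METACHARACTERS = set(".^$*+?()[]{}|\\")

def split_smart(value: str, delimiter: str):
    """Use plain string search when the delimiter contains no regex
    metacharacters; fall back to regex splitting otherwise."""
    if not any(ch in _REGEX_METACHARACTERS for ch in delimiter):
        return value.split(delimiter)
    return re.split(delimiter, value)

print(split_smart("a,b,c", ","))        # plain split: ['a', 'b', 'c']
print(split_smart("a1b22c", "[0-9]+"))  # regex split: ['a', 'b', 'c']
```

A plain delimiter never touches the regex engine, while delimiters that genuinely use regex syntax keep working as before.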

Takeaways

  • LIKE is not contains. The _ and % wildcards can silently corrupt results when the pattern comes from a column.
  • Spark's split treats delimiters as regex. Characters like . in column values are interpreted as regex metacharacters, not literal characters. Presto avoids this by separating split (literal) and regexp_split (regex).
  • When a query hits the compiled regex limit, the right response is to fix the query, not to raise the limit.
  • Engine developers should optimize functions to avoid regex when the input is a plain string, rather than making it easier to misuse regex at scale.

reduce_agg lambda aggregate function

· 5 min read
Masha Basmanova
Software Engineer @ Meta

Definition

Reduce_agg is the only lambda aggregate function in Presto. It allows users to define arbitrary aggregation logic using two lambda functions.

reduce_agg(inputValue T, initialState S, inputFunction(S, T, S), combineFunction(S, S, S)) → S

Reduces all non-NULL input values into a single value. inputFunction will be invoked for
each non-NULL input value. If all inputs are NULL, the result is NULL. In addition to taking
the input value, inputFunction takes the current state, initially initialState, and returns the
new state. combineFunction will be invoked to combine two states into a new state. The final
state is returned. Throws an error if initialState is NULL or inputFunction or combineFunction
returns a NULL.

One can think of reduce_agg as using inputFunction to implement partial aggregation and combineFunction to implement final aggregation. Partial aggregation processes a list of input values and produces an intermediate state:

auto s = initialState;
for (auto x : input) {
  s = inputFunction(s, x);
}

return s;

Final aggregation processes a list of intermediate states and computes the final state.

auto s = intermediates[0];
for (auto i = 1; i < intermediates.size(); ++i) {
  s = combineFunction(s, intermediates[i]);
}

return s;

For example, one can implement SUM aggregation using reduce_agg as follows:

reduce_agg(c, 0, (s, x) -> s + x, (s, s2) -> s + s2)
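The semantics can be modeled in Python to see how inputFunction and combineFunction interact (an illustrative model, not how Presto or Velox executes it; `reduce_agg` here is a hypothetical helper):

```python
from functools import reduce

def reduce_agg(values, initial_state, input_fn, combine_fn, num_partitions=4):
    """Fold non-null inputs into per-partition states with input_fn
    (partial aggregation), then merge the partial states with
    combine_fn (final aggregation)."""
    values = [v for v in values if v is not None]
    if not values:
        return None  # all-NULL input yields NULL
    # Partial aggregation: fold each partition's values into a state.
    partitions = [values[i::num_partitions] for i in range(num_partitions)]
    states = [reduce(input_fn, part, initial_state) for part in partitions if part]
    # Final aggregation: merge partial states.
    return reduce(combine_fn, states)

# SUM via reduce_agg, matching the SQL example above:
total = reduce_agg([1, 2, None, 3, 4], 0,
                   lambda s, x: s + x,
                   lambda s, s2: s + s2)
print(total)  # 10
```

The result is the same regardless of how the inputs are partitioned, which is exactly the property that lets the engine run partial aggregation in parallel.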

Implementation of AVG aggregation is a bit trickier. For AVG, state is a tuple of sum and count. Hence, reduce_agg can be used to compute (sum, count) pair, but it cannot compute the actual average. One needs to apply a scalar function on top of reduce_agg to get the average.

SELECT id, sum_and_count.sum / sum_and_count.count FROM (
SELECT id, reduce_agg(value, CAST(row(0, 0) AS row(sum double, count bigint)),
(s, x) -> CAST(row(s.sum + x, s.count + 1) AS row(sum double, count bigint)),
(s, s2) -> CAST(row(s.sum + s2.sum, s.count + s2.count) AS row(sum double, count bigint))) AS sum_and_count
FROM t
GROUP BY id
);

The examples of using reduce_agg to compute SUM and AVG are for illustrative purposes. One should not use reduce_agg if a specialized aggregation function is available.

One use case for reduce_agg we see in production is to compute a product of input values.

reduce_agg(c, 1.0, (s, x) -> s * x, (s, s2) -> s * s2)

Another example is to compute a list of top N distinct values from all input arrays.

reduce_agg(x, array[],
(a, b) -> slice(reverse(array_sort(array_distinct(concat(a, b)))), 1, 1000),
(a, b) -> slice(reverse(array_sort(array_distinct(concat(a, b)))), 1, 1000))

Note that this is equivalent to the following query:

SELECT array_agg(v) FROM (
SELECT DISTINCT v
FROM t, UNNEST(x) AS u(v)
ORDER BY v DESC
LIMIT 1000
)

Implementation

Efficient implementation of reduce_agg lambda function is not straightforward. Let’s consider the logic for partial aggregation.

auto s = initialState;
for (auto x : input) {
  s = inputFunction(s, x);
}

This is a data-dependent loop, i.e. the next loop iteration depends on the results of the previous iteration. inputFunction needs to be invoked on each input value x separately. Since inputFunction is a user-defined lambda, invoking inputFunction means evaluating an expression. And since expression evaluation in Velox is optimized for processing large batches of values at a time, evaluating expressions on one value at a time is very inefficient. To optimize the implementation of reduce_agg we need to reduce the number of times we evaluate user-defined lambdas and increase the number of values we process each time.

One approach is to

  1. convert all input values into states by evaluating inputFunction(initialState, x);
  2. split states into pairs and evaluate combineFunction on all pairs;
  3. repeat step (2) until we have only one state left.

Let’s say we have 1024 values to process. Step 1 evaluates the inputFunction expression on all 1024 values at once. Step 2 evaluates combineFunction on 512 pairs, then on 256 pairs, then 128, 64, 32, 16, 8, 4, 2, and finally 1 pair, producing a single state — 10 evaluations of combineFunction in total. Altogether, this implementation evaluates user-defined expressions 11 times, on a batch of values each time. This is a lot more efficient than the original implementation, which evaluates user-defined expressions 1024 times.

In general, given N inputs, the original implementation evaluates expressions N times, while the new one needs only about log2(N) evaluations.

Note that when N is not a power of two, splitting states into pairs may leave an extra state. For example, splitting 11 states produces 5 pairs plus one extra state. In this case, we set aside the extra state, evaluate combineFunction on the 5 pairs, then bring the extra state back for a total of 6 states and continue.
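The full scheme, including the odd-count handling, can be sketched in Python (a model of the algorithm, not Velox's actual code; `batched_reduce` is a hypothetical helper):

```python
def batched_reduce(values, initial_state, input_fn, combine_fn):
    """Batched strategy: one pass of input_fn over all inputs, then
    repeated pairwise combine rounds, setting one state aside whenever
    the count is odd. Returns the final state and the round count."""
    # Step 1: turn every input value into a state in a single batch.
    states = [input_fn(initial_state, x) for x in values]
    if not states:
        return initial_state, 0
    rounds = 0
    # Steps 2-3: combine pairs of states until a single state remains.
    while len(states) > 1:
        extra = [states.pop()] if len(states) % 2 == 1 else []
        states = [combine_fn(states[i], states[i + 1])
                  for i in range(0, len(states), 2)] + extra
        rounds += 1
    return states[0], rounds
```

Each round halves (roughly) the number of states, so the number of combineFunction evaluations grows logarithmically with the input size instead of linearly.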

A benchmark, velox/functions/prestosql/aggregates/benchmarks/ReduceAgg.cpp, shows that the initial implementation of reduce_agg is 60x slower than SUM, while the optimized implementation is only 3x slower. A specialized aggregation function will always be more efficient than the generic reduce_agg; hence, reduce_agg should be used only when a specialized aggregation function is not available.

Finally, a side effect of the optimized implementation is that it doesn't support applying reduce_agg to sorted inputs. That is, one cannot use reduce_agg to compute an equivalent of

	SELECT a, array_agg(b ORDER BY b) FROM t GROUP BY 1

The array_agg computation depends on order of inputs. A comparable implementation using reduce_agg would look like this:

	SELECT a,
	  reduce_agg(b, array[],
	    (s, x) -> concat(s, array[x]),
	    (s, s2) -> concat(s, s2)
	    ORDER BY b)
	FROM t GROUP BY 1

To respect ORDER BY b, reduce_agg would have to apply inputFunction to each input value one at a time using the data-dependent loop from above. As we saw, this is very expensive. The optimization we apply does not preserve the order of inputs and hence cannot support the query above. Note that Presto doesn't support applying reduce_agg to sorted inputs either.

Thank you Orri Erling for brainstorming and Xiaoxuan Meng and Pedro Eugênio Rocha Pedreira for reviewing the code.

array_sort lambda function

· 6 min read
Masha Basmanova
Software Engineer @ Meta

Presto provides an array_sort function to sort arrays in ascending order with nulls placed at the end.

presto> select array_sort(array[2, 5, null, 1, -1]);
_col0
---------------------
[-1, 1, 2, 5, null]

There is also an array_sort_desc function that sorts arrays in descending order with nulls placed at the end.

presto> select array_sort_desc(array[2, 5, null, 1, -1]);
_col0
---------------------
[5, 2, 1, -1, null]

Both array_sort and array_sort_desc place nulls at the end of the array.

There is also a version of array_sort function that takes a comparator lambda function and uses it to sort the array.

array_sort(array(T), function(T, T, int)) -> array(T)

A common use case is to sort an array of structs using one of the struct fields as the sorting key.

presto> select array_sort(array[row('apples', 23), row('bananas', 12), row('grapes', 44)],
-> (x, y) -> if (x[2] < y[2], -1, if(x[2] > y[2], 1, 0)));

_col0
---------------------------------------------------------------------------------------
[{f0=bananas, f1=12}, {f0=apples, f1=23}, {f0=grapes, f1=44}]

This is all very nice and convenient, but there is a catch.

The documentation says that the "comparator will take two nullable arguments representing two nullable elements of the array." Did you notice the word "nullable" in "nullable arguments" and "nullable elements"? Do you think it is important? It is OK if the answer is No or Not Really. It turns out this "nullable" thing is very important. The comparator is expected to handle null inputs and should not assume that inputs are not null or that nulls are handled automatically.

Why is it important to handle null inputs? Let’s see what happens if the comparator doesn’t handle nulls.

presto> select array_sort(array[2, 3, null, 1],
(x, y) -> if (x < y, -1, if (x > y, 1, 0)));
_col0
-----------------
[2, 3, null, 1]

The result array is not sorted! If subsequent logic relies on the array to be sorted the query will silently return wrong results. And if there is no logic that relies on the sortedness of the array then why waste CPU cycles on sorting?

Why is the array not sorted? That’s because the comparator returns 0 whenever x or y is null.

	x < y returns null if x or y is null, then
	x > y returns null if x or y is null, then
	the result is 0

This confuses the sorting algorithm: it sees that 1 == null, 2 == null, and 3 == null, yet 1 != 2 and 1 != 3. The algorithm assumes that the comparator function is written correctly, e.g. if a < b then b > a, and if a == b and b == c then a == c. A comparator function that doesn't handle nulls does not satisfy these rules and causes unpredictable results.
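The same failure is easy to reproduce outside SQL. Here is a Python sketch using functools.cmp_to_key to emulate the comparator-based sort (the helper names are hypothetical):

```python
from functools import cmp_to_key

def null_unsafe(x, y):
    # Mimics the SQL comparator: when x or y is NULL, both x < y and
    # x > y evaluate to NULL, so the if-expression falls through to 0.
    if x is None or y is None:
        return 0
    return -1 if x < y else (1 if x > y else 0)

def null_safe(x, y):
    # NULL compares greater than every non-NULL value.
    if x is None:
        return 0 if y is None else 1
    if y is None:
        return -1
    return -1 if x < y else (1 if x > y else 0)

print(sorted([2, 3, None, 1], key=cmp_to_key(null_unsafe)))  # not sorted
print(sorted([2, 3, None, 1], key=cmp_to_key(null_safe)))    # [1, 2, 3, None]
```

With the null-unsafe comparator, every element compares "equal" to None, so the algorithm never finds a reason to move anything; the null-safe comparator produces a properly sorted array with null at the end.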

To handle null inputs, the comparator function needs to be modified, for example, like so:

	(x, y) -> CASE WHEN x IS NULL THEN 1
WHEN y IS NULL THEN -1
WHEN x < y THEN -1
WHEN x > y THEN 1
ELSE 0 END
presto> select array_sort(array[2, 3, null, 1],
-> (x, y) -> CASE WHEN x IS NULL THEN 1
-> WHEN y IS NULL THEN -1
-> WHEN x < y THEN -1
-> WHEN x > y THEN 1
-> ELSE 0 END
-> );
_col0
-----------------
[1, 2, 3, null]

This is longer and harder to read, but the result array is sorted properly. The new comparator says that null is greater than any other value, so null is placed at the end of the array.

Note: when the comparator returns -1 for a pair (x, y), the sorting algorithm assumes that x should be placed before y.

Writing comparators correctly is not easy. Writing comparators that handle null inputs is even harder. Having no feedback when a comparator is written incorrectly makes it yet harder to spot bugs and fix them before a query lands in production and starts producing wrong results.

I feel that Presto’s array_sort function with a custom comparator is dangerous and hard to use, and I wonder if it makes sense to replace it with a safer, easier-to-use alternative:

array_sort(array(T), function(T, U)) -> array(T)

This version takes an array and a transform lambda function that specifies how to compute sorting keys from the array elements.

To sort an array of structs by one of the struct fields, one would write:

presto> select array_sort(array[row('apples', 23), row('bananas', 12), row('grapes', 44)],
x -> x[2])

_col0
---------------------------------------------------------------------------------------
[{f0=bananas, f1=12}, {f0=apples, f1=23}, {f0=grapes, f1=44}]

This version would sort the array by the sorting keys computed with the specified lambda, in ascending order, placing nulls at the end of the array.

A matching array_sort_desc function would sort in descending order placing nulls at the end of the array.

These functions are easier to write and read, and null handling happens automatically.
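The proposed behavior can be sketched in Python (`array_sort_by_key` is a hypothetical helper illustrating the intended semantics, not Velox code):

```python
def array_sort_by_key(arr, key_fn):
    """Transform-based array_sort: compute a sorting key for each
    element, sort ascending, and place elements whose key is None at
    the end (mirroring nulls-at-the-end semantics)."""
    return sorted(arr, key=lambda x: (key_fn(x) is None, key_fn(x)))

# Sort an array of (name, count) structs by the count field:
fruits = [("apples", 23), ("bananas", 12), ("grapes", 44), ("kiwi", None)]
print(array_sort_by_key(fruits, lambda x: x[1]))
# [('bananas', 12), ('apples', 23), ('grapes', 44), ('kiwi', None)]
```

The caller only specifies how to compute the key; ordering and null placement are handled by the function itself, so there is no comparator to get wrong.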

We implemented these functions in Velox.

We also added partial support for array_sort with a comparator lambda function. The expression compiler in Velox analyzes the comparator expression to determine whether it can be rewritten into the alternative version of array_sort. If so, it rewrites the expression and evaluates it. Otherwise, it throws an unsupported-expression error.

For example,

	array_sort(a, (x, y) -> if (x[2] < y[2], -1, if(x[2] > y[2], 1, 0)))

is re-written to

	array_sort(a, x -> x[2])

This rewrite allows Prestissimo and Presto-on-Spark-on-Velox to support common use cases and do so efficiently.

The rewrite handles a few different ways to express the same comparator. Some examples:

    // becomes array_sort(a, f(x))
(x, y) -> if(f(x) < f(y), -1, if(f(x) > f(y), 1, 0))

// becomes array_sort_desc(a, f(x))
(x, y) -> if(f(x) < f(y), 1, if(f(x) > f(y), -1, 0))

// becomes array_sort(a, f(x))
(x, y) -> if(f(x) < f(y), -1, if(f(x) = f(y), 0, 1))

// becomes array_sort(a, f(x))
(x, y) -> if(f(x) = f(y), 0, if(f(x) < f(y), -1, 1))

// becomes array_sort(a, f(x))
(x, y) -> if(f(y) < f(x), 1, if(f(x) < f(y), -1, 0))

Why didn’t we implement full support for comparator lambda functions in array_sort? Actually, we couldn’t think of an efficient way to do that in a vectorized engine. Velox doesn’t use code generation and interprets expressions. It can do that efficiently if it can process data in large batches. array_sort with custom comparator doesn’t lend itself well to such processing.

array_sort with a transform lambda works well in a vectorized engine. To process a batch of arrays, Velox first evaluates the transform lambda on all the elements of the arrays, then sorts the results.

For further reading, consider the Vectorized and performance-portable Quicksort blog post from Google.

Thank you Orri Erling for brainstorming and Xiaoxuan Meng for reviewing the code.