Today at work I had a funny bug to triage using Google BigQuery, and documentation in this regard is scarce so I thought maybe can be of use to others, even if I'm not 100% sure of the reasoning behind.
Anyway, here it goes: In BigQuery you can have arrays of structs, and there are quite a few operations you can do with those arrays. I was extending a query that builds an array from aggregating some values into a simple struct with a few fields. The resulting aggregate (built with ARRAY_AGG
) was stored (as items
in the sample code below) and later used (to populate a user defined function but that's unrelevant except for the detail that it expects either a NULL
or an ARRAY
with the {start, finish}
-like STRUCT
.
The original query was something like follows:
WITH
listing AS (
SELECT
ARRAY_AGG(
STRUCT (
TIMESTAMP(start) AS start,
TIMESTAMP(finish) AS finish
)
) AS items
FROM
table1
)
[...]
SELECT
listing1.items
[...]
listing1.items
can return two possible types:
- An
ARRAY
ofSTRUCT
withstart
andfinish
- An
ARRAY
ofNULL
instead ofSTRUCT
, or something that gets inferred asNULL
Now, my change was to add another source of items
, from a certain table2
and to be used in the same "final SELECT
" at the end.
BigQuery provides the ARRAY_CONCAT
function to concatenate two of more arrays containing the same type inside. So my first iteration was to simply do another named subquery, and concat upon going to use them.
I found that ARRAY_AGG
doesn't returns a "valid" empty array because if you try to ARRAY_CONCAT(ARRAY_AGG(...) AS a, ARRAY_AGG(...) AS b)
if either a
or b
have no data they'll cause the full concatenation to return "something NULL
". So I add a IFNULL(..., [])
check and voila! it works (note that a COALESCE(..., [])
would also work).
The following code is correct and produces the expected result:
WITH
listing1 AS (
SELECT
ARRAY_AGG(
STRUCT (
TIMESTAMP(start) AS start,
TIMESTAMP(finish) AS finish
)
) AS items
FROM
table1
),
listing2 AS (
SELECT
ARRAY_AGG(
STRUCT (
TIMESTAMP(start) AS start,
TIMESTAMP(finish) AS finish
)
) AS items
FROM
table2
)
[...]
SELECT
ARRAY_CONCAT(
IFNULL(listing1.items, []),
IFNULL(listing2.items, [])
)
[...]
Now, one of my colleages correctly pointed out that I could move the null-handling logic inside of the subquery, so that the subselect always contains/returns a valid array. Something like this:
WITH
listing1 AS (
SELECT
IFNULL(
ARRAY_AGG(
STRUCT (
TIMESTAMP(start) AS start,
TIMESTAMP(finish) AS finish
)
),
[]
) AS items
FROM
table1
),
listing2 AS (
SELECT
IFNULL(
ARRAY_AGG(
STRUCT (
TIMESTAMP(start) AS start,
TIMESTAMP(finish) AS finish
)
),
[]
) AS items
FROM
table2
)
[...]
SELECT
ARRAY_CONCAT(
listing1.items,
listing2.items
)
[...]
And when I tested the query... no results. It failed, same as if I had put no IFNULL()
check at all. I did a few tests and had no way of making it work other than leaving the checks where the ARRAY_CONCAT
was made:
SELECT
ARRAY_CONCAT(
IFNULL(listing1.items, []),
IFNULL(listing2.items, [])
)
[...]
This is just an hypothesis, but BigQuery does some sorts of type inference sometimes, and it might be at play here. For example when you do inserts from selects, if the first value is an INT64
it will set the column type to that, but if the first value is a FLOAT64
you'll get a float. I learned the hard way (by making mistakes) to try to CAST
whenever doing SELECT
into float rows of data that might contain some integer values, etc.
In the non-working-but-better-looking example, what I do is:
- Execute and store a few subqueries, which as mentioned at the beginning will each contain either an
ARRAY
of a certain struct, or whateverARRAY_AGG
returns when doesn't aggregates anything converted into an empty array[]
... but what's the struct of that empty array? probablyNULL
or simply "nothing". - Afterwards,
ARRAY_CONCAT
those subqueries, which need to be of same "type". But seems that it I did theIFNULL
check when calculating the subquery is not valid because the empty array seems to have some struct (or lack of) that doesn't matches with the other subqueries results'.
Instead, in the working-but-not-as-pretty example, what I'm doing is:
- Execute and store a few subqueries, each containing either an
ARRAY
of a certain struct, or whateverARRAY_AGG
's interpretation ofNULL
ARRAY_CONCAT
of the subqueries, but at that precise point, detecting theISNULL
of each array aggregates and casting those that proceed into empty array[]
... which my intuition thinks that BigQuery is able to infer that, as long as other subqueries on thatARRAY_CONCAT
return some values, they will define the expected struct and make that[]
really contain the same structure, just no values.
So, in short, when executing an ARRAY_CONCAT
it might be able to infer the types and make all empty arrays generated at that moment "of the same type", but if you pre-calculate each subquery's empty array value before, types don't match and you get no results.
Now, I think I could have played with CAST
, or maybe try creating an empty array with the correct structure with like SELECT ARRAY<STRUCT...>[]
... but it felt less natural. SQL and BigQuery heavily use NULL
s, so why fighting against them? It's indeed more clear always returning an array, but if the language (or platform) makes it hard to do it that way and there's an easy alternative, I think that sometimes it's best to go for the alternative and avoid trying to outsmart a query engine (and all the engineering effort behind it).
UPDATE: Tested doing a CAST
when performing the subqueries and it doesn't works! Quite interesting...
This what I tested:
WITH
listing1 AS (
SELECT
IFNULL(
ARRAY_AGG(
STRUCT (
TIMESTAMP(start) AS start,
TIMESTAMP(finish) AS finish
)
),
CAST([] AS ARRAY<STRUCT<TIMESTAMP,TIMESTAMP>>)
) AS items
FROM
table1
),
listing2 AS (
SELECT
IFNULL(
ARRAY_AGG(
STRUCT (
TIMESTAMP(start) AS start,
TIMESTAMP(finish) AS finish
)
),
CAST([] AS ARRAY<STRUCT<TIMESTAMP,TIMESTAMP>>)
) AS items
FROM
table2
)
[...]
SELECT
ARRAY_CONCAT(
listing1.items,
listing2.items
)
[...]
Tags: Development