Type inference complex scenarios - A BigQuery example

Today at work I had a funny bug to triage using Google BigQuery, and documentation in this regard is scarce so I thought maybe can be of use to others, even if I'm not 100% sure of the reasoning behind.

Anyway, here it goes: In BigQuery you can have arrays of structs, and there are quite a few operations you can do with those arrays. I was extending a query that builds an array from aggregating some values into a simple struct with a few fields. The resulting aggregate (built with ARRAY_AGG) was stored (as items in the sample code below) and later used (to populate a user defined function but that's unrelevant except for the detail that it expects either a NULL or an ARRAY with the {start, finish}-like STRUCT.

The original query was something like follows:

WITH
    listing AS (
        SELECT
            ARRAY_AGG(
                STRUCT (
                    TIMESTAMP(start) AS start,
                    TIMESTAMP(finish) AS finish
                )
            ) AS items
        FROM
            table1
    )

[...]

SELECT 
    listing1.items
[...]

listing1.items can return two possible types:

  • An ARRAY of STRUCT with start and finish
  • An ARRAY of NULL instead of STRUCT, or something that gets inferred as NULL

Now, my change was to add another source of items, from a certain table2 and to be used in the same "final SELECT" at the end. BigQuery provides the ARRAY_CONCAT function to concatenate two of more arrays containing the same type inside. So my first iteration was to simply do another named subquery, and concat upon going to use them.

I found that ARRAY_AGG doesn't returns a "valid" empty array because if you try to ARRAY_CONCAT(ARRAY_AGG(...) AS a, ARRAY_AGG(...) AS b) if either a or b have no data they'll cause the full concatenation to return "something NULL". So I add a IFNULL(..., []) check and voila! it works (note that a COALESCE(..., []) would also work).

The following code is correct and produces the expected result:

WITH
    listing1 AS (
        SELECT
            ARRAY_AGG(
                STRUCT (
                    TIMESTAMP(start) AS start,
                    TIMESTAMP(finish) AS finish
                )
            ) AS items
        FROM
            table1
    ),
    listing2 AS (
        SELECT
            ARRAY_AGG(
                STRUCT (
                    TIMESTAMP(start) AS start,
                    TIMESTAMP(finish) AS finish
                )
            ) AS items
        FROM
            table2
    )

[...]

SELECT 
    ARRAY_CONCAT(
        IFNULL(listing1.items, []),
        IFNULL(listing2.items, [])
    )
[...]

Now, one of my colleages correctly pointed out that I could move the null-handling logic inside of the subquery, so that the subselect always contains/returns a valid array. Something like this:

WITH
    listing1 AS (
        SELECT
            IFNULL(
                ARRAY_AGG(
                    STRUCT (
                        TIMESTAMP(start) AS start,
                        TIMESTAMP(finish) AS finish
                    )
                ),
                []
            ) AS items
        FROM
            table1
    ),
    listing2 AS (
        SELECT
            IFNULL(
                ARRAY_AGG(
                    STRUCT (
                        TIMESTAMP(start) AS start,
                        TIMESTAMP(finish) AS finish
                    )
                ),
                []
            ) AS items
        FROM
            table2
    )

[...]

SELECT 
    ARRAY_CONCAT(
        listing1.items,
        listing2.items
    )
[...]

And when I tested the query... no results. It failed, same as if I had put no IFNULL() check at all. I did a few tests and had no way of making it work other than leaving the checks where the ARRAY_CONCAT was made:

SELECT 
    ARRAY_CONCAT(
        IFNULL(listing1.items, []),
        IFNULL(listing2.items, [])
    )
[...]

This is just an hypothesis, but BigQuery does some sorts of type inference sometimes, and it might be at play here. For example when you do inserts from selects, if the first value is an INT64 it will set the column type to that, but if the first value is a FLOAT64 you'll get a float. I learned the hard way (by making mistakes) to try to CAST whenever doing SELECT into float rows of data that might contain some integer values, etc.

In the non-working-but-better-looking example, what I do is:

  1. Execute and store a few subqueries, which as mentioned at the beginning will each contain either an ARRAY of a certain struct, or whatever ARRAY_AGG returns when doesn't aggregates anything converted into an empty array []... but what's the struct of that empty array? probably NULL or simply "nothing".
  2. Afterwards, ARRAY_CONCAT those subqueries, which need to be of same "type". But seems that it I did the IFNULL check when calculating the subquery is not valid because the empty array seems to have some struct (or lack of) that doesn't matches with the other subqueries results'.

Instead, in the working-but-not-as-pretty example, what I'm doing is:

  1. Execute and store a few subqueries, each containing either an ARRAY of a certain struct, or whatever ARRAY_AGG's interpretation of NULL
  2. ARRAY_CONCAT of the subqueries, but at that precise point, detecting the ISNULL of each array aggregates and casting those that proceed into empty array []... which my intuition thinks that BigQuery is able to infer that, as long as other subqueries on that ARRAY_CONCAT return some values, they will define the expected struct and make that [] really contain the same structure, just no values.

So, in short, when executing an ARRAY_CONCAT it might be able to infer the types and make all empty arrays generated at that moment "of the same type", but if you pre-calculate each subquery's empty array value before, types don't match and you get no results.

Now, I think I could have played with CAST, or maybe try creating an empty array with the correct structure with like SELECT ARRAY<STRUCT...>[]... but it felt less natural. SQL and BigQuery heavily use NULLs, so why fighting against them? It's indeed more clear always returning an array, but if the language (or platform) makes it hard to do it that way and there's an easy alternative, I think that sometimes it's best to go for the alternative and avoid trying to outsmart a query engine (and all the engineering effort behind it).


UPDATE: Tested doing a CAST when performing the subqueries and it doesn't works! Quite interesting...

This what I tested:

WITH
    listing1 AS (
        SELECT
            IFNULL(
                ARRAY_AGG(
                    STRUCT (
                        TIMESTAMP(start) AS start,
                        TIMESTAMP(finish) AS finish
                    )
                ),
                CAST([] AS ARRAY<STRUCT<TIMESTAMP,TIMESTAMP>>)
            ) AS items
        FROM
            table1
    ),
    listing2 AS (
        SELECT
            IFNULL(
                ARRAY_AGG(
                    STRUCT (
                        TIMESTAMP(start) AS start,
                        TIMESTAMP(finish) AS finish
                    )
                ),
                CAST([] AS ARRAY<STRUCT<TIMESTAMP,TIMESTAMP>>)
            ) AS items
        FROM
            table2
    )

[...]

SELECT 
    ARRAY_CONCAT(
        listing1.items,
        listing2.items
    )
[...]
by Kartones @ 2020-11-19

Comments? Share via Twitter Share via Linkedin Share via Mastodon