Kartones Blog

Be the change you want to see in this world

Type inference complex scenarios - A BigQuery example

Today at work I had a funny bug to triage in Google BigQuery. Documentation on this topic is scarce, so I thought this might be of use to others, even if I'm not 100% sure of the reasoning behind it.

Anyway, here it goes: in BigQuery you can have arrays of structs, and there are quite a few operations you can do with those arrays. I was extending a query that builds an array by aggregating some values into a simple struct with a few fields. The resulting aggregate (built with ARRAY_AGG) was stored (as items in the sample code below) and later used to populate a user-defined function, but that's irrelevant except for the detail that it expects either a NULL or an ARRAY of {start, finish}-like STRUCTs.

The original query was something like the following:

WITH
    listing1 AS (
        SELECT
            ARRAY_AGG(
                STRUCT (
                    TIMESTAMP(start) AS start,
                    TIMESTAMP(finish) AS finish
                )
            ) AS items
        FROM
            table1
    )

[...]

SELECT 
    listing1.items
[...]

listing1.items can return two possible types:

  • An ARRAY of STRUCT with start and finish
  • An ARRAY of NULL instead of STRUCT, or something that gets inferred as NULL

Now, my change was to add another source of items, from a certain table2, to be used in the same "final SELECT" at the end. BigQuery provides the ARRAY_CONCAT function to concatenate two or more arrays containing the same element type. So my first iteration was to simply add another named subquery and concatenate both arrays at the point of use.

I found that ARRAY_AGG doesn't return a "valid" empty array: if you try ARRAY_CONCAT(a, b) where a and b come from ARRAY_AGG, and either of them aggregated no data, the whole concatenation returns "something NULL". So I added an IFNULL(..., []) check and voilà! It works (note that a COALESCE(..., []) would also work).

The following code is correct and produces the expected result:

WITH
    listing1 AS (
        SELECT
            ARRAY_AGG(
                STRUCT (
                    TIMESTAMP(start) AS start,
                    TIMESTAMP(finish) AS finish
                )
            ) AS items
        FROM
            table1
    ),
    listing2 AS (
        SELECT
            ARRAY_AGG(
                STRUCT (
                    TIMESTAMP(start) AS start,
                    TIMESTAMP(finish) AS finish
                )
            ) AS items
        FROM
            table2
    )

[...]

SELECT 
    ARRAY_CONCAT(
        IFNULL(listing1.items, []),
        IFNULL(listing2.items, [])
    )
[...]

Now, one of my colleagues correctly pointed out that I could move the null-handling logic inside the subqueries, so that each subselect always returns a valid array. Something like this:

WITH
    listing1 AS (
        SELECT
            IFNULL(
                ARRAY_AGG(
                    STRUCT (
                        TIMESTAMP(start) AS start,
                        TIMESTAMP(finish) AS finish
                    )
                ),
                []
            ) AS items
        FROM
            table1
    ),
    listing2 AS (
        SELECT
            IFNULL(
                ARRAY_AGG(
                    STRUCT (
                        TIMESTAMP(start) AS start,
                        TIMESTAMP(finish) AS finish
                    )
                ),
                []
            ) AS items
        FROM
            table2
    )

[...]

SELECT 
    ARRAY_CONCAT(
        listing1.items,
        listing2.items
    )
[...]

And when I tested the query... no results. It failed, the same as if I had put no IFNULL() check at all. I did a few tests and found no way of making it work other than leaving the checks at the point where the ARRAY_CONCAT is done:

SELECT 
    ARRAY_CONCAT(
        IFNULL(listing1.items, []),
        IFNULL(listing2.items, [])
    )
[...]

This is just a hypothesis, but BigQuery does some sort of type inference in certain cases, and it might be at play here. For example, when you do inserts from selects, if the first value is an INT64 the column type will be set to that, but if the first value is a FLOAT64 you'll get a float column. I learned the hard way (by making mistakes) to always CAST when SELECTing into float columns data that might contain some integer values, and the like.
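As a minimal, hypothetical sketch of that habit (the measurements table and value column are invented for illustration), an explicit CAST removes any dependency on whatever type the first rows happen to have:

SELECT
    -- force the column type instead of relying on what the first rows would infer
    CAST(value AS FLOAT64) AS value
FROM
    measurements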

In the non-working-but-better-looking example, what I do is:

  1. Execute and store a few subqueries, which, as mentioned at the beginning, will each contain either an ARRAY of a certain struct, or whatever ARRAY_AGG returns when it doesn't aggregate anything, converted into an empty array []... but what's the struct of that empty array? Probably NULL or simply "nothing".
  2. Afterwards, ARRAY_CONCAT those subqueries, which need to be of the same "type". But it seems that doing the IFNULL check when calculating each subquery is not valid, because the empty array ends up with some struct (or lack of one) that doesn't match the other subqueries' results.

Instead, in the working-but-not-as-pretty example, what I'm doing is:

  1. Execute and store a few subqueries, each containing either an ARRAY of a certain struct, or whatever ARRAY_AGG's interpretation of NULL is.
  2. ARRAY_CONCAT the subqueries, but at that precise point, check each aggregate for NULL and convert the ones that need it into an empty array []... My intuition is that BigQuery is able to infer that, as long as other subqueries in that ARRAY_CONCAT return some values, they define the expected struct and make that [] really have the same structure, just with no values.

So, in short, when executing an ARRAY_CONCAT, BigQuery might be able to infer the types and make all empty arrays generated at that moment "of the same type", but if you pre-calculate each subquery's empty array value beforehand, the types don't match and you get no results.

Now, I think I could have played with CAST, or maybe tried creating an empty array with the correct structure, something like SELECT ARRAY<STRUCT<...>>[]... but it felt less natural. SQL and BigQuery heavily use NULLs, so why fight against them? It's indeed clearer to always return an array, but if the language (or platform) makes it hard to do it that way and there's an easy alternative, I think that sometimes it's best to go for the alternative and avoid trying to outsmart a query engine (and all the engineering effort behind it).
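For reference, this is roughly what that typed-empty-array alternative would look like (just a sketch, reusing the columns from the examples above; I haven't verified whether it behaves any differently from the CAST attempt in the update below):

SELECT
    IFNULL(
        ARRAY_AGG(
            STRUCT (
                TIMESTAMP(start) AS start,
                TIMESTAMP(finish) AS finish
            )
        ),
        -- empty array literal with an explicit element type, instead of a bare []
        ARRAY<STRUCT<start TIMESTAMP, finish TIMESTAMP>>[]
    ) AS items
FROM
    table1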


UPDATE: I tested doing a CAST when building the subqueries and it doesn't work! Quite interesting...

This is what I tested:

WITH
    listing1 AS (
        SELECT
            IFNULL(
                ARRAY_AGG(
                    STRUCT (
                        TIMESTAMP(start) AS start,
                        TIMESTAMP(finish) AS finish
                    )
                ),
                CAST([] AS ARRAY<STRUCT<TIMESTAMP,TIMESTAMP>>)
            ) AS items
        FROM
            table1
    ),
    listing2 AS (
        SELECT
            IFNULL(
                ARRAY_AGG(
                    STRUCT (
                        TIMESTAMP(start) AS start,
                        TIMESTAMP(finish) AS finish
                    )
                ),
                CAST([] AS ARRAY<STRUCT<TIMESTAMP,TIMESTAMP>>)
            ) AS items
        FROM
            table2
    )

[...]

SELECT 
    ARRAY_CONCAT(
        listing1.items,
        listing2.items
    )
[...]

On having good examples when learning

In the early 2000s I dropped out of my university studies for multiple reasons, but one of the major ones was a lack of motivation. I was learning theory, tons and tons of theory (in quite a few cases, already obsolete), with some labs and assignments, but the exercises were usually dull and boring. That's why I decided to start working while I studied (and, in the end, to quit before finishing the degree), to find the motivation and drive that I was missing in my studies.

I've since then regretted not finishing, but not because I need the degree or anything like that. I think that a university degree shapes your brain, gives you some pillars over which to build everything else. You can learn about networks, data structures and operating systems on your own, but without the basics it won't be much different from driving a car without knowing how a combustion engine works: you use it as a tool but inside it is a black box. And I feel that in certain areas, it helps a lot to know how things work, not only how they are used.

Going back to the studies topic, one of the areas that I feel could be vastly improved is the examples and use cases in each subject. Sure, I can memorize all the syntax, keywords and rules of ANSI C, but that way of learning C is hard and likely to bore me. It's way better to learn each keyword by doing some exercise, then maybe try to build something bigger, then something more complex, and so on. In one of my elective subjects I learned some OpenGL basics, and we were given an assignment to render a rotating 3D Earth (with a single, nice texture). It really wasn't hard to build, but the "wow" feeling of watching that rotating 3D Earth was so great that I wished I had more assignments like it.

As I cannot go back in time and ask my teachers to give us harder but better examples and/or exercises, I'll instead just write a few cases that I came up with.


Operating systems

For learning how operating systems work, I see almost only two choices for the most rewarding and practical exercise: either studying MINIX, or directly learning Linux basics.

I was taught Windows NT back in the day and, while interesting on its own, I firmly think you should learn a free and open-source operating system available to everyone, and then decide and branch out to wherever you want to go.

Digital Electronics

Logic gates, multiplexers, EPROMs, high/low voltages... There are lots of logic circuit simulators, but what I would have really loved to see is how to build an actual PC component, like a VGA graphics card. It sounds incredibly complex, but just watch those videos: in less than one hour it will display a 100x75, 32-color image via a VGA cable. Simply amazing, and it exemplifies that, while reality is way, way more complex, the simple scenario is still cool (actually, it is awesome!), and the entry bar is just knowing basic electronics and logic gates, nothing really advanced.

Computer Architecture

In this subject I did build some basic programs in Assembler, but it is a pity that you get to peek at so many concepts yet only grasp them, having to just believe that your computer is simply a much more complex version of what you're told... or use one of those prehistoric 8086 emulators and visually inspect all the CPU registers and the like. If instead I had been tasked with learning how the Nintendo Game Boy handheld console hardware worked, I would have seen a slightly more complex but still degree-level hardware design, complete with I/O, a rough graphics system, RAM and ROM...

When I learned about the Game Boy internals a few years ago, I was surprised that I could understand almost everything. Its Assembler instruction set is small and simple; all the schematics, CPU cycle costs per instruction, interrupts, memory address ranges... everything is published (around 200 pages in a single PDF), and there are lots of emulators (even JavaScript ones) that you can use to see it in action and debug how it works.

Compilers

You're learning how a compiler works, how to translate to machine code, how to build an abstract syntax tree, what a symbol table is, etc., so to me, the best example to properly learn the subject is to build a compiler for a simple language.

This one I actually semi-did: back in the early 2000s we were tasked with building, in C, a compiler from a kind-of-Assembler language to real Assembler. "Transpiler" would be more correct in today's terms, as the initial output was a .ASM, but it was still a good exercise.

Maths

I wish I knew more maths, but I never take the time to dive deeper except for specific formulas. A good example to encourage someone to study them, at least for me, would be building a pretty simple software-based 3D renderer, then adding some rough physics. Even if it just rendered wireframe polygons without textures or colors, something like that would already have been quite appealing. In general, anything game-development related ends up touching some area of maths, so combining both seems interesting.


So there it goes. I'm pretty sure there are better examples, as mine are way too biased towards the topics I like, but... you get the idea 😉


Book Review: WRONG! Retro Games, You Messed Up Our Comic Book Heroes!

Review

WRONG! Retro Games, You Messed Up Our Comic Book Heroes! book cover

Title: WRONG! Retro Games, You Messed Up Our Comic Book Heroes!

Author(s): Chris Baker (Author), Matthew Waite (Illustrator)

I picked up this small ~150-page ebook because most videogame compilations are "best of" lists: best Nintendo games, best first-person shooters ever, best strategy games from the 90s... But a book about games that got superheroes wrong? Not only am I in, I'll probably even know one or two!

And indeed I was right: from a purple NES Batman, to a Superman who can't fly, to unrecognizable super-villains, the book nails it with the (few) titles that I knew, plus a lot more. It is written in a humorous tone, stating one or two facts about a game, then explaining why they're incorrect, and filling in additional details (or, a few times, mentioning why the game is fine despite its flaws). I liked the structure, the content (a must-read on a color device like a tablet to appreciate the pictures) and the videogames chosen.

My only complaint is that you will finish the book so quickly you'll want more. A fun recommended read.


ACID, BASE and CALM

Most developers will be familiar with the acronym ACID, which stands for Atomicity, Consistency, Isolation & Durability regarding database transactions. ACID is well known and applicable to single database instances, but cannot be upheld for distributed transactions. In CAP theorem terms, it chooses Consistency and Partition tolerance. Building an application with ACID capabilities mostly requires just wrapping critical paths in transactions and properly configuring the isolation level of the database system.
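As a minimal sketch of that last point (the accounts table is invented, and the exact transaction and isolation syntax varies between database engines), it usually boils down to something like this:

-- hypothetical money transfer: the relevant parts are the explicit isolation
-- level and wrapping both updates in a single transaction
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;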


With the NoSQL movement another acronym appeared, BASE: Basically Available, Soft state, Eventual consistency. It provides the Availability and Partition tolerance from CAP, but eventual consistency requires deeper architectural and even product changes in any system (e.g. your clients can see stale or transitional data, or fail to see recently created data).


Today, while reading about distributed systems and consistency, I came across a third acronym, CALM (Consistency As Logical Monotonicity). Apart from the obvious jokes around "keep CALM and ...", and while it is far from as easy as the article "Keeping CALM: When Distributed Consistency Is Easy" implies, it is interesting for two key points:

  • It proposes that "A problem has a consistent, coordination-free distributed implementation if and only if it is monotonic". By reasoning about program outcomes, rather than mutations to storage, and avoiding linearity and ordering, we can theoretically build software that doesn't need to coordinate at all (with the concept of "confluence" towards a final correct result); see the small SQL illustration after this list.
  • It observes that "Coordination-freeness is equivalent to availability under partition", thereby fulfilling all three CAP properties for those cases where the problem is monotonic.
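To make "monotonic" a bit more tangible, here is a small SQL illustration (the tables are invented): a query built only from selection, projection and joins never retracts an answer as new rows arrive, while one using negation (or aggregation) can, and that is what forces coordination.

-- monotonic: once an (order id, customer name) pair appears in the output,
-- inserting more rows into either table can never remove it
SELECT o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id;

-- non-monotonic: a customer reported as having no orders can disappear from the
-- result when a new order arrives, so nodes must coordinate before answering
SELECT c.name
FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);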

So, if we have found a solution to distributed agreement, why isn't it better known? Where's the catch? Well, non-linear and confluent operations are very hard to implement in imperative languages; they are better suited to functional programming. So much so that the theorem's authors created the Bloom programming language, with the purpose of building distributed systems software. So here the cost of implementation is huge, probably requiring a full rewrite of parts of the system (and, of course, in a functional paradigm).


I left out on purpose well-known consensus protocols like 2PC (Two-Phase Commit) and the Paxos family, because they bet strongly on agreement mechanics to achieve consistency, while BASE and CALM propose and/or provide approaches that "skip" or work around consensus.

But I also wanted to mention CALM because the ACM article is an interesting journey through traditional distributed systems problems, with examples using familiar scenarios like a distributed shopping cart. I'd like to keep at hand a list of terms and a few notes about the topic, and this small post does the job.


FIRE (Financial Independence, Retire Early) Misconceptions

I recently came across FIRE (Financial Independence, Retire Early), got interested, and while reading about it found some misleading and not-so-true claims, not in the theory but in posts and articles about "success stories", so I wanted to dump my thoughts on the subject.

Most definitions (sample) include the two critical pillars that support this movement: savings and investment. FIRE is about earning money, then using part of it to somehow build passive income source(s), so that you don't need to work. At least not actively.

And precisely in the "passive" adjective is where I see the problem, because it seems to be either confused with or just swapped for self-employment (probably for click-bait, because profiting from blog posts or related books is an income source too 😉). Let me showcase two examples:

a) Stock investing: if you invest passively, you contract an investor, a fund or a similar broker that deals with your assets and generates income for you. If you actively choose the stocks, what to buy, when to sell, check the markets, etc., then that's actual work, active work. You have become a small stock trader.

b) Real estate renting: if you invest passively, you buy a house (or more), then contract someone to handle the management (repairs, finding tenants, etc.) for a fee. If you're the one in charge of any problem or repair, even if it's just calling the insurance company, then it's not passive, it's an active job (a property management services company of one). This is easy to notice if you get more work with each property you add; if it were passive, adding one shouldn't have any effect other than on the money.

Yes, I've read there are variants of FIRE in which you keep a part-time job, but then that's not early retirement, it's partial retirement.


Until recently, I had for a few years a small property that I rented out and, while it mostly required little attention, when there were issues it did require time and effort. I totally understand why there are services that handle these scenarios, and I sometimes wonder if I should have taken that path. I didn't, just because overall it was easy to handle and near my home.

This is just my subjective interpretation, but few things in life are really free.

Freedom to spend your time while earning income usually has a cost, often in the form of somebody taking care of some assets, ensuring an income and dealing with any issue that might happen.

