Some database systems, like MySQL, allow using LIMIT in UPDATE queries, which is quite handy (and since LIMIT also accepts an offset in SELECT, things like pagination are dead simple), but not all major DB engines support it.
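For reference, MySQL accepts the cap directly in the UPDATE (table and column names here are just placeholders):

UPDATE mytable SET field=TRUE
WHERE field=FALSE
LIMIT 500;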
There aren't many use cases for this, but one is performing batched updates in resource-intensive or long-running queries, to avoid exhausting all resources or hitting the feared statement timeout error.
One RDBMS that doesn't allow UPDATE with LIMIT is PostgreSQL, but in that case there's also a solution, doing an UPDATE-SELECT:
UPDATE mytable SET field=TRUE
WHERE id IN ( SELECT id FROM mytable LIMIT 500 )
Of course, that subquery can be as simple or complex as you wish: you can ORDER BY, filter with WHERE field=FALSE, or apply a more complex rule. How you handle the logic is up to you, but you can replicate the behaviour.
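For example, a refined version that only picks rows that haven't been updated yet, lowest ids first (again with placeholder names), could be:

UPDATE mytable SET field=TRUE
WHERE id IN (
  SELECT id FROM mytable
  WHERE field=FALSE
  ORDER BY id
  LIMIT 500
);

Each run of this statement touches a fresh batch of 500 rows, because rows already updated no longer match the subquery.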
At CartoDB we're starting to face huge data imports that spawn single tables growing to more than 5GB before they're even done being transformed and adapted to the system, so limiting how many rows get georeferenced in a single batch, or avoiding running a GIS geometry transformation function over millions of rows at once (which can easily take quite a few minutes), is becoming a high priority.
Applying this idea of batching queries, I've come up with a pretty simple but working solution to batch any import-related query: just feed in some parameters and the query (optionally specifying where two placeholders go), and let it do the magic.
What I do to have fine control of which rows have already been updated is to add a temporary column to the table and drop it after the last batch finishes. It has an index (thanks for the idea, @luisico) because it will be accessed a lot:
ALTER TABLE table_name ADD column_name BOOLEAN DEFAULT FALSE;
CREATE INDEX idx_column_name ON table_name (column_name);
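Putting the pieces together, a single batch of the real work could then look something like this; it's only a sketch, where cartodb_id, the_geom and the ST_Transform call stand in for whatever the actual import step does:

-- Transform a batch of rows not yet processed and mark them as done
UPDATE table_name
SET the_geom = ST_Transform(the_geom, 4326),
    column_name = TRUE
WHERE cartodb_id IN (
  SELECT cartodb_id FROM table_name
  WHERE column_name = FALSE
  LIMIT 50000
);
-- Repeat until the UPDATE reports 0 affected rows, then clean up
ALTER TABLE table_name DROP COLUMN column_name;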
But, while doing some tests and checking running processes, I saw that Postgres was autovacuuming constantly during the import, which has a big impact on performance, so I made another optimization: disabling autovacuum until all import tasks are done and the table is ready:
-- Before starting the import tasks:
ALTER TABLE table_name SET (autovacuum_enabled=FALSE, toast.autovacuum_enabled=FALSE);
-- Once the table is fully imported and transformed:
ALTER TABLE table_name SET (autovacuum_enabled=TRUE, toast.autovacuum_enabled=TRUE);
After these changes, I was able to process batches of 50,000 rows, with normalizations and transformations, in ~1 minute, where before the same dataset was timing out with statement_timeout set to 5 minutes.
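For reference, that limit is the standard statement_timeout setting in PostgreSQL; the 5 minute value mentioned above corresponds to something like:

SET statement_timeout = '5min';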
Without batching, but keeping the index and autovacuum disabled, we can squeeze out more "rows processed per second", but we risk (and believe me, it happens) hitting a dataset big enough that the whole import ends up failing, so I think it's better to go a little slower and instead make sure there's some hard limit on resources.
The task is far from over, because when you remove one bottleneck another usually appears, so further steps in the import process now need to be improved to handle multi-GB datasets and/or several million rows, but where's the fun if there is no challenge?
Bonus: other resources I'm researching to optimize importing data into Postgres are PostgreSQL server tune-up, Background Writer and checkpoint configuration, and Write Ahead Log synchronization settings.
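As a pointer only, those areas map to postgresql.conf parameters along these lines (values here are purely illustrative, not recommendations; synchronous_commit = off in particular trades durability for import speed):

# Background Writer
bgwriter_delay = 200ms
bgwriter_lru_maxpages = 100
# Checkpoints
checkpoint_completion_target = 0.9
# Write Ahead Log
wal_buffers = 16MB
synchronous_commit = off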
Tags: Development