Kartones Blog

Be the change you wanna see in this world

Four Horsemen of the Python Apocalypse


I think I've found the four horsemen of the Apocalypse in the Python world. A combo that, while it will cause pain and destruction at first, will afterwards leave a much better codebase: stricter but uniform, and less prone to certain bugs.

Who are these raiders?

mypy: Not a newcomer to my life (see I & II). Each day I'm more convinced that any non-trivial Python project should embrace type hints, both as self-documentation and as a safety measure to reduce type-related bugs (a small example follows this list).

flake8: The classic; so useful, and almost always customized. It loses some of its power when used alone, and it needs to be configured to adapt to black.

isort: Automatically formats your imports. It supports quite a few settings of its own, but should also be configured to comply with black's rules.

black: The warmonger. Opinionated, radical, almost non-configurable, but PEP 8 compliant and with decent reasoning behind each and every rule it applies when auto-formatting files. It will probably make you scream in anger when it first modifies all your files, even some you didn't know your project had, even Django migrations and settings files 🤣... But it is the ultimate tool to cut out nitpicking and pointless discussions in pull request reviews. Everyone can focus on reviewing the code itself instead of how it looks.

pre-commit: isort and black are meant to run with either this tool or a similar one, instead of as a test (black even ignores stdout process piping). After some experiments, the truth is that it makes more sense to keep auto-formatters at a different level than test runners and linters, and as flake8 will also fail the pre-commit hook, I decided to move everything except mypy to pre-commit.
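
As a taste of what mypy buys you, here's a tiny, made-up sketch of the kind of bug that type hints surface at check time instead of at runtime:

```python
# Hypothetical example: with type hints in place, mypy flags the bad call
# below when you run the checker, long before the code hits production.

def monthly_price_cents(annual_price_cents: int) -> int:
    """Split an annual price (in cents) into a monthly one."""
    return annual_price_cents // 12


monthly_price_cents(119900)    # fine
monthly_price_cents("119900")  # mypy error: incompatible type "str"; expected "int"
```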


The Go programming language has, among other things, taken a great step by making a bold decision: it provides one official way to format your code, and it fixes the formatting automatically by itself (instead of just emitting warnings/errors).

I was reluctant to try black and isort because I was worried about the chaos they can cause. But again, reviewing code often means coding style discussions here and there, so, encouraged by a colleague, I decided to try them both at work (in a softer and more gradual way) and at home (going all in). Almost everybody will hate at least one or two changes black automatically performs, but it leaves no more room for discussion, as you can only configure the maximum line length. Period.

I ran black through my whole project once, but otherwise these tools only format created and modified files, which is good for big codebases.


It takes some time to configure all of the linters and formatters until you're able to do a few sweeps and finally commit, so here is an outline of my setup:

Mypy runs as a linter test, but the other three are set up as pre-commit hooks inside .pre-commit-config.yaml.
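
The skeleton of a .pre-commit-config.yaml wiring the three tools together looks roughly like this (an illustrative sketch, not my exact values; pin each rev to whatever version you use):

```yaml
# Illustrative sketch of a .pre-commit-config.yaml for black + isort + flake8.
repos:
  - repo: https://github.com/psf/black
    rev: 22.3.0  # pin to the version you actually use
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/isort
    rev: 5.10.1
    hooks:
      - id: isort
        args: ["--profile", "black"]  # make isort comply with black's rules
  - repo: https://github.com/PyCQA/flake8
    rev: 4.0.1
    hooks:
      - id: flake8
        args: ["--max-line-length", "88"]  # black's default line length
```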


Bulk Queries in MySQL vs PostgreSQL

Lately I've been reading a non-trivial amount of code diffs almost on a daily basis, so I'm learning a thing or two not only from the code itself, but also from the decisions taken and the whys behind those decisions.

A recent example that I asked about was the following: you notice there's a DB query that causes a MySQL deadlock timeout. The query operates over a potentially big list of items, and the engineer decided to split it into small chunks (let's say 10 items per chunk). [1]

My knowledge of MySQL is pretty much average; I know the usual differences between MyISAM and InnoDB, a few differences with respect to PostgreSQL, and not much more. And I'd say I still know more about PostgreSQL than MySQL (although I haven't actively used PG since 2016). But in general what I've often seen, learned and been told is to go for one bulk query instead of multiple small individual ones: you make fewer calls between processes and software pieces, do fewer data transformations, the query planner can be smarter as it knows the "full picture" of your intentions (e.g. operate with 1k items) and, who knows, maybe the rows you use have good data locality and are stored contiguously on disk or in memory, so they get loaded and saved faster. It is true you should keep your transactions scoped to the smallest surface possible, but at the same time the cost of opening and closing N transactions is bigger than doing it a single time, so there are advantages in that regard too.

With that "general" SQL knowledge, I went and read a few articles about the topic, and asked to the DB experts "Unlike other RDBMS, is it better in MySQL to chunk big queries?" And the answer is yes. MySQL's query planner is simpler than PostgreSQL's by design, and as JOINs sometimes hurt, a way to get some extra performance is delegating joining data to the application layer, or transforming the JOIN(s) into IN(s). So, to avoid lock contention and potential deadlocks, it is good to split into small blocks potentially large, locking queries, as this way other queries can execute in between. [2]
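
As a rough illustration of that chunking approach, here's a sketch assuming a DB-API 2.0 connection (e.g. mysqlclient's) and a hypothetical items table:

```python
# Sketch: splitting one big UPDATE ... WHERE id IN (...) into small chunks.
# Each chunk commits on its own, so row locks are held only briefly and
# other transactions can interleave, reducing deadlock risk.

CHUNK_SIZE = 10  # the chunk size from the example above


def mark_items_processed(connection, item_ids):
    for start in range(0, len(item_ids), CHUNK_SIZE):
        chunk = item_ids[start:start + CHUNK_SIZE]
        placeholders = ", ".join(["%s"] * len(chunk))
        with connection.cursor() as cursor:
            cursor.execute(
                f"UPDATE items SET processed = 1 WHERE id IN ({placeholders})",
                chunk,
            )
        connection.commit()  # release this chunk's locks before the next one
```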

I also learned that, when using row-level locking, InnoDB normally uses next-key locking, so for each record it also locks the gap before it (yes, it's the gap before, not after). [3]


This difference is very interesting because it affects your data access patterns. Even if you minimize transaction scope, ensure you have the appropriate indexes in place, tune your queries to be properly built, and follow other good practices, with MySQL transactions you need to take lock contention into account (more frequently than with other engines; not that you can't cause it with suboptimal queries anywhere else).

A curious fact is that this is the second time I've found MySQL to be noticeably different from other RDBMSs. Using Microsoft's SQL Server first, and then PostgreSQL, you are always encouraged to use stored routines (stored procedures and/or stored functions) because of the benefits they provide, one of them being higher performance. With MySQL even a database trigger hurts performance, and everybody avoids stored procedures because they perform worse than application logic making queries [4]. As for the why, I haven't had the time nor the will to investigate.

References:

[1]: Minimize MySQL Deadlocks with 3 Steps

[2]: What is faster, one big query or many small queries?

[3]: InnoDB Transaction Model and Locking

[4]: Why MySQL Stored Procedures, Functions and Triggers Are Bad For Performance


Course Review: 300+ Phrasal Verbs (Udemy)

300+ Phrasal Verbs - Spoken English Vocabulary 4 Conversation is yet another 3-hour Udemy course I just finished. As the name is quite informative (although a bit long, maybe for SEO reasons), there's no need to explain much about its intention, but we can talk about the form and the content. I don't aspire to see high-quality production values (some other courses do great with a normal HD camera plus a virtual screen or a simple slide deck), but this course sometimes gets way too informal. From uncorrected mistakes (left in as jokes) and sentence "slides" removed so quickly they are impossible to read, to a quarter of the content being "live" videos of mediocre quality (lower-quality audio, noticeable laptop fan noise, hardcoded subtitles...), it looks unprofessional. Also, a few sentences are too niche, two or three are not even really phrasal verbs (as admitted by the author himself), and there is some repetition between the normal content and the "live" last quarter.

It is still interesting to practice phrasal verbs, as there are many, but I'd rather have a more serious "teacher" who cares a little about production values.


Always send emails asynchronously

I still see from time to time, and even recently heard on a podcast, the advice to send emails synchronously after certain relevant actions. I knew this old approach was widely used back in the phpBB and PHP-Nuke days, when you couldn't do much asynchronous work on the web and had to do everything in the same process, thread and even the same HTTP request. I thought it was already forgotten and clearly labelled a bad practice. In my opinion it is only fine for your pet projects and hackathon-based ideas; under no circumstances should it be done for production-like work, even if it is just a prototype.

Using an example of a new user registration flow, let's quickly see the ways of sending a welcome email:

a) Synchronous email sending after the new-user-registered action. This is bad, as it makes the whole registration take more time (on the order of seconds), it makes the registration logic more complex, etcetera. No matter if you do it with a function, with a "post register handler" or whatever you call it: if the execution is sequential and inside the same process, it is not good. The only good thing about it is that it provides all the required info to the mail sending logic without any need to query other services.

b) Enqueue an asynchronous task to send the email, including the new user id. This is good, as the user registration flow will finish quickly, before the email is sent. But it could be improved, as the email sending task needs to pull user data from wherever it resides when it actually executes, so there will be extra calls inside your platform.

c) Enqueue an asynchronous task to send the email, including all required user data (as sketched below). If the "task queueing" message has all the info it requires (user id, email, first and last name are probably enough), then there's no need for the mailing service/task processor/etc. to even know where the user data lives. No extra service calls, no delays or timeouts... As long as the queue message exists, it can just be consumed and deleted without any additional piece (if all goes well, of course).

d) With event sourcing, pub/sub, or any similar event-based architecture, user registration would emit a UserRegisteredEvent and, assuming it was registered as a subscriber, our mailing/notification service would pick it up and send the corresponding email. This case can act like scenario b) if the notification service needs to pull the data from somewhere else, but we could also send the relevant fields in the message so it becomes scenario c).

In the end, there's one requirement and one decision: you must defer the execution of email sending, but you get to choose between lightweight event messages with few data fields plus extra communication steps between services, and fatter messages containing all the fields you expect other services will need in order to react, with no extra calls.
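
A minimal sketch of scenario c), assuming some task queue client exposing an enqueue(task_name, payload) call (a hypothetical interface; adapt it to Celery, RQ, SQS or whatever you use):

```python
# Hypothetical queue client: any broker with an enqueue(name, payload) call
# works the same way. The message carries everything the mailer needs, so
# the consumer never has to call back into the users service.

def on_user_registered(queue, user):
    queue.enqueue(
        "send_welcome_email",
        {
            "user_id": user.id,
            "email": user.email,
            "first_name": user.first_name,
            "last_name": user.last_name,
        },
    )
```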

Oh, and if it is "critical", then set up a high-priority messaging queue for certain emails, mark the event as critical=True so the notification service can move it to the top of the queue or process it immediately, or make any other architectural change you can come up with.

Just don't do a) 😉


Course Review: American English Pronunciation (Udemy)

Another small 3-hour course I've finished recently 🤓 American English Pronunciation has quite a self-explanatory name, and delivers what it promises: you'll get a lot of advice on how to speak, from general short and long vowel sounds and consonant sounds, to hard scenarios like diphthongs, "S and Z" or "CH and S", among others.

These small courses are starting to be quite valuable to me, because for the cost of one or two English lessons you get "focused lessons" on topics you might want to learn, improve or simply revisit. In this case it helped me practice pronunciation, and it repeated and exemplified tricky cases.

