Kartones Blog

Be the change you want to see in this world

Python Imports 101

Python imports are not that hard once you understand how they work internally. I needed to revisit the topic recently, and since Python is no longer my daily programming language, I thought it would be useful to write a short summary for my future self (and potential visitors).

Basics

The most common import scenarios are:

  • Module import, the whole module: import os
  • Module import, a submodule: import os.path as path or from os import path
  • Class/Function imports: from os.path import (abspath, dirname)

As the examples show, you can also alias imports using ... as ....

Importing from a file follows the same syntax:

Given the example:

/a.py
/folder/b.py
/c.py

From c.py you can do the following:

import folder.b
import a

Clear and simple, no problems so far.

Relative vs absolute imports

Given the structure:

/src/config_folder/config.py
/src/a.py
/src/run.py

You can reference your current package/module via . (as in from . import a), and add additional dots to traverse up to parent folders, e.g. from ..config_folder import config. However, the reference point can vary, you need to have a parent package, and things can get complicated as codebases grow and you move code around.

You can also reference your modules via absolute imports, by referencing a package path: from src.config_folder import config. But we will see that this can also be a bit complex at times (hint: you probably don't want that src. prefix in the import statement).
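
As a quick illustration (a rough sketch, not from the original post), imports inside run.py could look like this; note that the relative forms only work when run.py itself is executed as part of a package, e.g. python3 -m src.run from the parent folder of /src:

# /src/run.py

# Relative imports: they only work if run.py has a parent package,
# e.g. when launched as "python3 -m src.run" from the folder that contains /src
from . import a
from .config_folder import config

# Absolute import: the "src" package must be reachable through sys.path
from src.config_folder import config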

Import resolution

The path to module imports is resolved with the following logic (all of these end up in sys.path):

  • The sys.path value itself, which can be modified at runtime
  • The script location (the directory containing the launched script, not the directory you run it from)
  • The PYTHONPATH environment variable

We shouldn't mess with sys.path, so that leaves us two choices:

  • Always use relative imports: This will help with the second case, and most IDEs support updating import paths
  • Always run the code from few entry points, and/or rely on PYTHONPATH, and use absolute imports: This is what you will need to do in certain scenarios, like running a cronjob, but I also like to enforce it for non-trivial projects, like those with multiple configurations

The single most critical point is that the import resolution (or "root") is calculated by default from the launched script location. If you run python3 /a/b/c.py, the root folder to search for import modules is going to be /a/b/.

Using the previous section example:

  • to import config.py from run.py
  • if we are going to run python3 run.py from inside /src
  • then we should do from config_folder import config, omitting the src package, because we're already inside it
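
For illustration, run.py could then simply contain something like this (CONFIG_VALUE is just a hypothetical name defined inside config.py):

# /src/run.py, launched as "python3 run.py" from inside /src
from config_folder import config

print(config.CONFIG_VALUE)  # hypothetical attribute defined in config.py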

If we want to namespace each subproject (common practice for example in Django projects), we'd need to arrange our code to have an additional package level, for example like:

/src/myapp/config_folder/config.py
/src/myapp/a.py
/src/myapp/run.py

And we should run python3 myapp/run.py from the src folder... But if we try, it will still give us a ModuleNotFoundError: No module named 'myapp' error. Why? Because, if you remember, the root folder to search for imports becomes myapp, the directory containing run.py. This is why always setting PYTHONPATH is a good approach. The following will work if run from the src folder:

$ PYTHONPATH=. python3 myapp/run.py
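
With that layout and invocation, run.py can then use the full package path in its imports, for example:

# /src/myapp/run.py, launched as "PYTHONPATH=. python3 myapp/run.py" from /src
from myapp.config_folder import config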

Examples & Conclusion

I've created examples of the three most common scenarios for absolute imports and uploaded them to my GitHub's Python miscellaneous repository:

  • Import from a file in the same folder
  • Import from a file in a sub-folder
  • Import from a file in a sibling folder

The third one is often the source of headaches.

Note that I didn't create relative import examples, because a) I find absolute imports clearer, and b) I'm used to almost always running things specifying PYTHONPATH, and very often from a container (where the entry points are also very clearly defined).


Two Testing Anti-Patterns

In the early 2000s, with Extreme Programming's strong focus on testing as a critical aspect of software development, many of us were introduced to or became used to applying specific testing patterns that today are considered anti-patterns. Back then, some were not seen as bad, but often the reason was that we really had no other choice, as you mainly dealt with closed-source libraries and frameworks. Other than on Stack Overflow, I nowadays find it hard to find articles mentioning the topic, so here goes my contribution.

Testing Private Methods

You shouldn't do it [1]. Your public methods represent your class surface/API/interface, and private methods are implementation details; so, when testing private methods, you're coupling the test to internal implementation details, which should be free to change with as little friction as possible.

Instead, do one of the following:

  • Refactor some of the private logic into its own class, and use object composition: This way you test the logic in isolation and can use a mock when testing the class that now instantiates the refactored code
  • Focus on testing indirectly: Your goal is not 100% code coverage; your goal is testing an action, a behavior, or a concept. Focus on that and not on checking every tiny detail. Or else apply the previous point

With some languages having either poor or no encapsulation, it becomes a very appealing and easy way to "speed up writing tests", but you should remember that you're breaking object-oriented encapsulation: if the method was private, it was meant not to be used directly from the outside, not even from a unit test.

In the past, we relied either on Reflection to access some private methods, or on inheritance and polymorphism (when the language had good enough support), creating a child class that exposed public methods to ease testing and/or mocking. But today I advise against this and instead go for wrapping the external class and testing its public surface only. Most, if not all, scenarios can be covered by composition.

And, as a side note, for scenarios like JavaScript module exports, where access modifiers are only either exported or not exported at all (and anything not exported can't be tested via the module itself), there are specific techniques, like the Testables Named Export pattern.

Mocking Class Under Test (CUT) methods

Sometimes mentioned as "System Under Test", both represent the same wrong concept: you should never mock methods of the main class you're testing. If you need to do it, or think that doing so would simplify the tests, that's a clear signal of a refactor waiting to materialize: extracting the logic to another method or a different class.

There's really not much to it: If your class does A and B, and B is a private method only called from A, either you test everything when testing A (maybe ignoring the fact that you know there's a B method), or you extract B to a separate module/class, where it is ok to test it in isolation, and make A instantiate B, and then mock B when testing A.
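
As a rough sketch of that refactor in Python (class and method names are made up for illustration), where the extracted collaborator is tested in isolation and mocked when testing the main class:

from unittest import mock

class ReportFormatter:  # the extracted "B", now testable in isolation
    def format(self, rows):
        return "\n".join(str(row) for row in rows)

class ReportGenerator:  # the "A" class, which composes the formatter
    def __init__(self, formatter=None):
        self.formatter = formatter or ReportFormatter()

    def generate(self, rows):
        return self.formatter.format(rows)

# When testing ReportGenerator, mock the extracted collaborator:
formatter = mock.Mock()
formatter.format.return_value = "fake report"
assert ReportGenerator(formatter).generate([1, 2]) == "fake report"
formatter.format.assert_called_once_with([1, 2])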

Final Thoughts

I've heard at times some pushback comments like "but I shouldn't rewrite my code to conform to tests". While that point can theoretically be correct, what happens in practice is that testing often surfaces problems in your existing code. Testing is not the cause of why you need to change your code; it helps you identify the changes that need to be made.

If your code were simple, then it would be easy to test.

In my opinion, tests should aim to reproduce production conditions. We are already mocking, stubbing, and faking so many things (at times maybe too many); thus, we shouldn't take yet more shortcuts.

[1]: Sample reference: Unit Testing Principles, Practices, and Patterns book


Markov Model Python Example

With the current LLM wave, which is fascinating, I've begun reading about their basics. I have a draft or two of posts with small experiments I'm doing to replicate tiny pieces of those systems (text processing is a topic that, I'm not sure why, sparks my curiosity), but I remembered that not long ago I had written a simple Markov model: a Markov chain that generates variants of the sentences found in a text.

It reads a .txt file line by line, assuming each line is a sentence, and fills a Python dictionary with the "chain" of words that forms each sentence (plus the start and end of sentence delimiters).

Using a few sentences from my sample markov_quotes.txt file:

Be the change you want to see in this world
Be the person your dog thinks you are
Everything a person can imagine, others will do

When it finishes reading, it knows it can begin a sentence with "Be" or "Everything"; if it (randomly) chooses "Be", then it has only seen the word "the" after it, so it must follow; but the third word can be either "change" or "person"; if it chose "person", then the next word could be either "your" or "can"; and so it goes until it either picks an end-of-sentence delimiter or we reach the maximum number of words per sentence we've set up.
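
A minimal sketch of the idea (not the exact code from the repository) could look like this:

import random
from collections import defaultdict

START, END = "<s>", "</s>"
MAX_WORDS = 20  # maximum words per generated sentence

def build_chain(sentences):
    # For each word, store every word that has been seen following it
    chain = defaultdict(list)
    for sentence in sentences:
        words = [START] + sentence.split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain

def generate(chain):
    word, output = START, []
    while len(output) < MAX_WORDS:
        word = random.choice(chain[word])
        if word == END:
            break
        output.append(word)
    return " ".join(output)

chain = build_chain([
    "Be the change you want to see in this world",
    "Be the person your dog thinks you are",
    "Everything a person can imagine, others will do",
])
print(generate(chain))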

It could generate the sentence "Be the person can imagine, others will do .". It is incorrect, but the bigger the input text you feed it, the greater the variety and the better the chance of generating something that makes more sense.

As an example with the quotes file, running it a few times, sometimes produces funny philosopher quotes:

Experience is easy .

Dont teach them like a professional is right not improving .

Be the worst .

He who seek the things will never have to control complexity not a shorter letter .

Spend your own happiness and go home .

Boy Scout Rule Always leave the happiness and go home .

Learn the best and distribute the rules like a priority .

Work hard and practice something youve never have written a marvellous thing that you don't feel like doing them to ...

He who thinks you want to think .

Try to create it wrong because nobody sees it again .

I've also included a transcript of a TED talk, which again mostly generates gibberish, but at times almost looks correct.

Not precisely groundbreaking, but fun and illustrative of a basic "brute-force" method of creating new text.

You can find the Python code on my GitHub.


Browser Automation via Chromium

Browser automation has advanced a lot, not only regarding the frameworks and tools but also in the most fundamental piece: the browser itself. Google Chrome is now very mature, has the biggest market share (as of mid-2023), and complies well with web standards, so it is an excellent starting point for automation projects.

In this post, I'll mention the most relevant pieces you need to set it up.


Using Google's Chromium instead of the main Chrome has two main advantages:

  • Some of the Google-specific features are removed, and any Google APIs require API keys to function, so they will be disabled by default
  • Unlike with Chrome, it is easy to get previous builds, so you're not forced to always test only with the latest version

But otherwise, it is the same browser.

There is a handy latest build link to download Chromium: https://download-chromium.appspot.com

Be aware that those builds, under Linux, come without the Widevine (DRM) compilation flag, so even if you follow the steps below, it won't work with protected content.

The ungoogled-chromium-binaries GitHub project provides Linux binaries compiled with the DRM flag. From the releases page it is easy to pick either the latest version or a specific one:

https://github.com/clickot/ungoogled-chromium-binaries/releases/download/112.0.5615.165-1/ungoogled-chromium_112.0.5615.165-1.1_linux.tar.xz

An alternative site that hosts binaries for all platforms compiled with the DRM flag is: https://chromium.woolyss.com/

ChromeDriver is another critical piece, alongside an automation framework like WebDriverIO. It is easy to automate fetching a certain version via their download URLs:

https://storage.googleapis.com/chromium-browser-snapshots/Linux_x64/1109220/chromedriver_linux64.zip

As mentioned before, Chromium might come without the DRM library, Widevine. You can fetch specific versions via URLs like the following:

https://dl.google.com/widevine-cdm/4.10.2557.0-linux-x64.zip

Follow the instructions provided at the chromium-widevine GitHub project to set it up, which consists of extracting the files into a certain subfolder structure inside Chromium's main folder.

Another URL you'll use a lot when setting up Chromium automation is https://peter.sh/experiments/chromium-command-line-switches/, because it contains a complete list of the hundreds of command-line arguments/flags/switches. There is no official documentation, so this is really valuable.

For debugging errors thrown by the browser, you probably want to use the flags --enable-logging=stderr --v=1.
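
For example, a quick way to launch the browser with those flags from a Python script (the binary path below is just an assumption; point it to wherever you extracted Chromium):

import subprocess

CHROMIUM_BINARY = "/opt/chromium/chrome"  # assumption: adjust to your extracted Chromium path

# Launch Chromium with verbose logging sent to stderr
subprocess.run([
    CHROMIUM_BINARY,
    "--enable-logging=stderr",
    "--v=1",
    "https://example.com",
])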

Suppose you plan to run automated browsers in a Linux environment without a display (like a Docker container, or a CI instance installed without the X Server). In that case, you will probably want to use Xvfb (and xvfb-run).

Finally, if you are really brave, and have some spare time, you can manually download and compile Chromium from the source code, but it is time-consuming.

UPDATE #1: Added a friend's suggestion of another page containing binaries for all platforms (including Linux with the DRM flag).


Book Review: Pro Git

I've mentioned at least once my opinion that I would have preferred Mercurial to have won the distributed version control systems race, because its commands were way more consistent and easy. But as of today, git has come a long way, it is also a very powerful tool, and it won the battle. So I've fully embraced git and have been trying to level up lately.

I've had a git cheatsheet written for quite some time, and while it still does not cover everything (and won't), I've added a bunch of new content after reading the book.

Review

Pro Git book cover

Title: Pro Git

Author(s): Scott Chacon, Ben Straub

If I had to summarize this book quickly, I'd say: If you use git, you must read it.

I've read dozens of articles and tutorials with varying difficulty levels (the hardest at times being git's own documentation). From the first chapter, I found the explanations excellent. Everything is nicely explained, accompanied by examples, and any time the topic at hand might be non-trivial to understand, the authors also include helpful diagrams showing branches, commits, or whatever is needed.

Need to learn about the different states a file can be in (untracked, staged, committed, ...)? Check. Need to learn complex strategies to bring commits from some branches to others when all of them had changes? Check. Want to know how git stores commit references and even learn how to do low-level operations and other hardcore stuff? Check. To provide some context, the book is heavily focused on git itself, and GitHub is barely mentioned here and there, so you will learn to do things in a generic but proper way and then lean on services such as GitHub or GitLab to maintain your remote repositories, user accounts and the like. But if you want, the book also teaches you how to set up your own git servers (and even how the different available communication protocols work).

Over the ~520 pages, there's so much content, sometimes in so much detail, that I skipped most of the server management content. But now I know that it is explained there too, and if I need to, I can go back and check how to manage user credentials and push/pull repository permissions. I recommend reading, at a minimum, all the general chapters (which are around 50% of the book).

I wish I had read the book earlier, because now I know how git works internally, which helps me better understand any merge issue, any colleague asking "how do I xxxxx?", and how best to work with the tool.

Minor update: I just remembered to mention another remarkable feature of the book: it is freely available for download. So there's no excuse not to give it a try.

