Code Fat
musings by Bast on Friday March 14th, 2025
Here's a term for those of you discussing optimization, premature optimization, and bloat: code fat.
What is code fat?
Code Fat (noun):
- "good" unoptimized code.
- Unoptimized code that consumes a small fraction of actual time (compute and dev)
- Clearly and concisely written code that loses microbenchmarks but does not adversely affect overall runtime
Pretty straightforward. Python is full of these.
Code Fat:
results = []
for item in list:
    results.append(item)
Athletic code:
list_len = len(list)
results = [None] * list_len
i = 0
while i < list_len:
    results[i] = list[i]
    i += 1
It's faster, technically, and somewhat more so on larger list sizes (where the results list will need to repeatedly reallocate to fit more items). But it's not that much faster. Let's benchmark it (and also the list comprehension option):
import timeit

setup = """
list = [*range(100_000)]
"""

fat = """
results = []
for item in list:
    results.append(item)
"""

athletic = """
list_len = len(list)
results = [None] * list_len
i = 0
while i < list_len:
    results[i] = list[i]
    i += 1
"""

comprehension = """
results = [i for i in list]
"""

for code_option in (fat, athletic, comprehension):
    timer = timeit.Timer(setup=setup, stmt=code_option)
    print(code_option)
    timer.autorange(lambda loops, time: print(loops, time))
Results:
results = []
for item in list:
    results.append(item)
1 0.005206510999999997
2 0.012324658000000016
5 0.03544393699999998
10 0.09226821899999998
20 0.11041564699999995
50 0.28065099400000004
list_len = len(list)
results = [None] * list_len
i = 0
while i < list_len:
    results[i] = list[i]
    i += 1
1 0.008154285000000039
2 0.016715161000000034
5 0.04302381600000005
10 0.08488470700000006
20 0.18559788499999996
50 0.43635810099999994
results = [i for i in list]
1 0.0019061910000000015
2 0.0033083730000000866
5 0.009635229999999995
10 0.019966584999999926
20 0.039996417000000006
50 0.09783333299999986
100 0.21378716599999992
This is on an older Intel laptop with some other things running, so it might be a little noisy.
- Fat: 50 iterations in 0.280s
- Athletic: 50 iterations in 0.436s
- Comprehension: 50 iterations in 0.096s
blink
Wait, the athletic code is slower than a simple for loop?
It is, apparently. I ran it a few more times and got some significant variance (+/- ~20%) but that's nothing compared to the gulf in performance between the three different options.
On the face of it this may make no sense. The athletic option does significantly more efficient work. It doesn't lazily append to a list until it's full. It preallocates space. It uses a predictable iterator that can be vectorized–
Python doesn't vectorize loops.
"That's a contrived example," perhaps, but it's also an example for a different point than the one I stated up front. Athletic code has many weaknesses, not the least of which is that it's difficult to get right. Time spent writing complex code that is theoretically faster is wasted when reality reveals that it is not faster, and staggeringly (2x) so. The example is C-style optimized in a way that does not work in python (which I'll explain at the end of this post). This is on purpose: many people who lean on speed and optimization come from languages where this is important.
I wouldn't say that optimizing python is unimportant, but python trains a different subset of optimization compared to most languages. In python, the ability to offload computation to optimized C or BLAS code is paramount. The same thing done in python versus C can have drastically different performance:
words = ["bob", "jerry", "jill", "ox"] * 10_000
counts = {}
for word in words:
    if word not in counts:
        counts[word] = 1
    else:
        counts[word] += 1
print(counts)
Results:
- Python loop: 100 iterations in 0.353s
- Counter(): 100 iterations in 0.160s
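The Counter version, for comparison (CPython implements the counting loop itself in C, via the _collections module):

```python
from collections import Counter

# Same word list as above; Counter replaces the manual dict loop entirely.
words = ["bob", "jerry", "jill", "ox"] * 10_000
counts = Counter(words)
print(counts["bob"])  # → 10000
```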
Again, somewhat contrived. But Counter() gets you more, is clearer, and is faster. There's not always an easy solution, but frequently there's code you can leverage. And if there isn't, the python C api isn't as bad as you'd think. ctypes makes it almost trivial to translate structures and call C library DLLs in just pure python, no specialized build environment or tooling necessary. Of course, such things do exist and do a lot of legwork for complex cases. PyO3/maturin make this as simple as copy-paste.
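A minimal ctypes sketch, assuming a Unix-like system where find_library can locate the C math library:

```python
import ctypes
import ctypes.util

# Load libm and call its sqrt directly -- pure python, no build step.
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature: double sqrt(double).
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

print(libm.sqrt(9.0))  # → 3.0
```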
This is python's superpower. Not glue (although that is, in a sense, what being a glue language means), but the ability to segment your code base between hyperoptimizable languages and simple straightforwardness, according to your needs.
Now you may have noticed that I've been ignoring list comprehensions. These serve a similar role to LINQ in C#. They function as a general map/filter syntax:
result = [mapped(item) for item in source if filter(item)]
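For instance, with concrete stand-ins for mapped and filter (renamed keep here to avoid shadowing the builtin), the comprehension is equivalent to the explicit append loop:

```python
source = range(10)
mapped = lambda x: x * x      # the "map" step
keep = lambda x: x % 2 == 0   # the "filter" step

result = [mapped(item) for item in source if keep(item)]

# The equivalent explicit loop:
expected = []
for item in source:
    if keep(item):
        expected.append(mapped(item))

print(result)  # → [0, 4, 16, 36, 64]
```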
They're faster than for loops because they don't invoke all of the relevant machinery. A list comprehension cannot construct a list larger than the one it's given–this allows python to "read through" and preallocate space. It also doesn't support break statements or complex control flow (at least on the Python 3.7 these benchmarks ran on; 3.8+ does accept := in comprehensions):
>>> a = [(b := i) for i in range(3)]
  File "<stdin>", line 1
    a = [(b := i) for i in range(3)]
            ^
SyntaxError: invalid syntax
This allows it to be special cased. Of course, that also means that developers are tempted by the speed. Make your for loops faster by wrapping them in a comprehension? Technically possible… but this frequently leads to overly athletic code. It's possible to nest for loops in a comprehension:
a = [
    i
    for b in (range(3), range(2), range(4))
    for i in b
]
Don't do it. It's awkward, the syntax is backwards (you would expect i for i in b for b in (…), but that's invalid), and it rapidly leads to obtuse, unmaintainable code. Comprehensions should be simple and immediately apparent. Look at the stuff I've run into!
factor = lambda n: [div
    for ln in [[n]]
    for _, div in zip(iter(lambda: ln[0], 1), range(2, n + 1))
    for _ in iter(lambda: ln[0] % div == 0 and ln.__setitem__(0, ln[0] // div), False)][::-1]
# unrolled
def factor(n):
    for i in range(2, n + 1):
        while n % i == 0:
            n //= i
            yield i
        if n == 1:
            break
( Stolen from the pydis esoteric-python channel. No codebases were harmed in the creation of this challenge answer… I hope )
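A quick self-contained sanity check of the generator version (note range(2, n + 1): with range(2, n), a prime n would never be yielded at all):

```python
def factor(n):
    # Trial division: divide out each prime factor as it's found.
    for i in range(2, n + 1):
        while n % i == 0:
            n //= i
            yield i
        if n == 1:
            break

print(list(factor(360)))  # → [2, 2, 2, 3, 3, 5]
print(list(factor(97)))   # → [97]
```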
I can "tell from looking at it" that the top version is slower, but I'm writing a blog post, so let's spend the 5 minutes it takes to test it:
- One-line comprehension: 10_000 iterations in 0.267s
- Generator function: 10_000 iterations in 0.067s
It's always important to check whether your intuitions are right or wrong; that's how you train them.
As for why, the one-liner is clearly more complicated. It manually calls __setitem__ to get around the inability to assign to list indices inside a comprehension, and it makes a list three times (once for the comprehension, backwards; once inverting it with [::-1]; and once internally to manage the intermediary values). Can you optimize this? Almost certainly. Should you? Absolutely not.
You should take the free win that writing it the simple way gets you. And if that's not enough, well, there's numpy or in this case sympy, which leverages community effort and C integration to do this faster. Or, well, after you test it and realize it's not really faster… because it uses pretty much the same algorithm:
factors = {}
m = 2  # to initialize the if condition below
for p in sieve.primerange(2, x + 1):
    if m > 1:
        m, q = 0, x // p
        while q != 0:
            m += q
            q //= p
        factors[p] = m
The only big differences are in output format (which may or may not win for your use case) and using prime numbers for the computation. The prime number calculation is an iterator that implements the sieve of Eratosthenes. Faster, but not on small numbers due to the several layers of overhead. You, of course, can just implement this part in rust:
use pyo3::prelude::*;
use pyo3::types::PyDict;

#[pyfunction]
fn factor_int(py: Python<'_>, mut n: u128) -> PyResult<Py<PyDict>> {
    let result = PyDict::new_bound(py);
    for i in 2..=n {
        let mut count = 0u32;
        while n % i == 0 {
            n /= i;
            count += 1;
        }
        if count != 0 {
            result.set_item(i, count)?;
        }
        if n == 1 {
            break;
        }
    }
    Ok(result.into())
}

#[pymodule]
fn rust_factor(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(factor_int, m)?)
}
( I haven't actually compiled this, user beware )
Import it with import rust_factor and then call rust_factor.factor_int(your_number). Of course, you do lose the ability to factorize numbers greater than whatever PyO3 supports (either 18446744073709551616 or 340282366920938463463374607431768211456), but usually that's not a big loss. And I'm sure that, if you needed it, you could write some way to parse the data out of a PyInt without the readily-available automatic conversion, as well as the prime iterator, etc etc.
This is lean code. Not athletic code. It doesn't build muscle for show. It has everything it needs for what it needs, as measured and benchmarked in the real world and from projected uses. It's fast, but only through the appropriate leveraging of the tools at hand and not by contorting itself to take advantage of syntactical technicalities or performance hacks.
Write lean code. And when you can't afford the time to write the fast version, write fat code instead. The simple for loop can be lowered into C or rust with a few minutes' work when you determine it's needed, and not before. And in the meantime it keeps you warm and comfortable. Like a layer of fat, it's there for the bad times.
Why is that while loop slower, anyways?
Here's the disassembly for the example code above:
results = []
for item in list:
results.append(item)
2 0 BUILD_LIST 0
2 STORE_NAME 0 (results)
3 4 SETUP_LOOP 22 (to 28)
6 LOAD_NAME 1 (list)
8 GET_ITER
>> 10 FOR_ITER 14 (to 26)
12 STORE_NAME 2 (item)
4 14 LOAD_NAME 0 (results)
16 LOAD_METHOD 3 (append)
18 LOAD_NAME 2 (item)
20 CALL_METHOD 1
22 POP_TOP
24 JUMP_ABSOLUTE 10
>> 26 POP_BLOCK
>> 28 LOAD_CONST 0 (None)
30 RETURN_VALUE
# As an aside, we can speed up the above loop by binding "append = results.append" and calling it directly, which skips a lookup. But that's jank optimization tech.
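That trick looks like this (using a hypothetical data list; the attribute lookup happens once instead of once per iteration):

```python
data = list(range(100_000))
results = []
append = results.append  # bind the method once, outside the loop
for item in data:
    append(item)  # plain function call, no per-iteration attribute lookup
```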
list_len = len(list)
results = [None] * list_len
i = 0
while i < list_len:
    results[i] = list[i]
    i += 1
2 0 LOAD_NAME 0 (len)
2 LOAD_NAME 1 (list)
4 CALL_FUNCTION 1
6 STORE_NAME 2 (list_len)
3 8 LOAD_CONST 0 (None)
10 BUILD_LIST 1
12 LOAD_NAME 2 (list_len)
14 BINARY_MULTIPLY
16 STORE_NAME 3 (results)
4 18 LOAD_CONST 1 (0)
20 STORE_NAME 4 (i)
5 22 SETUP_LOOP 32 (to 56)
>> 24 LOAD_NAME 4 (i)
26 LOAD_NAME 2 (list_len)
28 COMPARE_OP 0 (<)
30 POP_JUMP_IF_FALSE 54
6 32 LOAD_NAME 1 (list)
34 LOAD_NAME 4 (i)
36 BINARY_SUBSCR
38 LOAD_NAME 3 (results)
40 LOAD_NAME 4 (i)
42 STORE_SUBSCR
7 44 LOAD_NAME 4 (i)
46 LOAD_CONST 2 (1)
48 INPLACE_ADD
50 STORE_NAME 4 (i)
52 JUMP_ABSOLUTE 24
>> 54 POP_BLOCK
>> 56 LOAD_CONST 0 (None)
58 RETURN_VALUE
results = [i for i in list]
2 0 LOAD_CONST 0 (<code object <listcomp> at 0x10c2db540, file "<dis>", line 2>)
2 LOAD_CONST 1 ('<listcomp>')
4 MAKE_FUNCTION 0
6 LOAD_NAME 0 (list)
8 GET_ITER
10 CALL_FUNCTION 1
12 STORE_NAME 1 (results)
14 LOAD_CONST 2 (None)
16 RETURN_VALUE
Disassembly of <code object <listcomp> at 0x10c2db540, file "<dis>", line 2>:
2 0 BUILD_LIST 0
2 LOAD_FAST 0 (.0)
>> 4 FOR_ITER 8 (to 14)
6 STORE_FAST 1 (i)
8 LOAD_FAST 1 (i)
10 LIST_APPEND 2
12 JUMP_ABSOLUTE 4
>> 14 RETURN_VALUE
Alright, so there's a lot to see here. The simple loop has relatively simple bytecode: iterate over the loop, load the list, load the method, call the method, repeat. The listcomp is split into two parts: the outer call, and then the comprehension as a separate code object within. The comprehension itself shows the signs of what I discussed above: it's far simpler and uses _FAST instruction variants (which shortcut to local name resolution) the entire time. LIST_APPEND is another fast opcode used for this case, skipping the need for CALL_METHOD.
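All of these listings come from the dis module, and you can reproduce them yourself (opcode names vary across CPython versions; the listings above are from 3.7, given SETUP_LOOP alongside LOAD_METHOD):

```python
import dis
import io

code = """
results = []
for item in data:
    results.append(item)
"""

# Capture the disassembly as a string instead of printing directly.
buf = io.StringIO()
dis.dis(compile(code, "<demo>", "exec"), file=buf)
print(buf.getvalue())
```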
Okay. So why is the second option slow?
Because of all the extra work it's doing. Comparisons, additions, and method calls are not as cheap in python as they are in C; they go through the same type resolution and method chains as everything else. We load list on every iteration, and we fetch and store i a dozen times. Sure, none of it costs much individually, but compared to FOR_ITER it's quite slow. A near-identical slowdown occurs even with just an empty counting loop:
for i in range(50):
    pass

i = 0
while i < 50:
    i += 1
- Range: 1000 iterations in 0.089s
- While: 1000 iterations in 0.271s
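Those numbers come from a timeit comparison along these lines (the post doesn't show its harness; this is a sketch, and n = 50 with 1000 runs is a guess from the results above):

```python
import timeit

n = 50
range_loop = "for i in range(n):\n    pass"
while_loop = "i = 0\nwhile i < n:\n    i += 1"

# Time each snippet 1000 times with n injected via globals.
times = {}
for name, stmt in (("range", range_loop), ("while", while_loop)):
    times[name] = timeit.timeit(stmt, globals={"n": n}, number=1000)
    print(name, times[name])
```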
# Range
2 0 SETUP_LOOP 16 (to 18)
2 LOAD_NAME 0 (range)
4 LOAD_NAME 1 (n)
6 CALL_FUNCTION 1
8 GET_ITER
>> 10 FOR_ITER 4 (to 16)
12 STORE_NAME 2 (i)
3 14 JUMP_ABSOLUTE 10
>> 16 POP_BLOCK
>> 18 LOAD_CONST 0 (None)
20 RETURN_VALUE
# While
2 0 LOAD_CONST 0 (0)
2 STORE_NAME 0 (i)
3 4 SETUP_LOOP 20 (to 26)
>> 6 LOAD_NAME 0 (i)
8 LOAD_NAME 1 (n)
10 COMPARE_OP 0 (<)
12 POP_JUMP_IF_FALSE 24
4 14 LOAD_NAME 0 (i)
16 LOAD_CONST 1 (1)
18 INPLACE_ADD
20 STORE_NAME 0 (i)
22 JUMP_ABSOLUTE 6
>> 24 POP_BLOCK
>> 26 LOAD_CONST 2 (None)
28 RETURN_VALUE
In a language like python, offloading effort to the interpreter is almost always worth it. The interpreter, after all, is running an entire layer of abstraction lower. At the very least the stack is being exercised less. In the for loop example, i
isn't even being read!