If you are working in image processing and using Python as a prototyping scripting language to test algorithms, you might have noticed that all the libs providing fast image interpolation methods (to either sub-sample or over-sample) work on 8-bit unsigned integers (uint8). This is quite annoying if you are working with floating point images.

PIL supports floating point interpolation, but only for single-layer images, so forget about RGB, and scipy.misc.imresize is deprecated and will soon be removed.

The drawback of uint8 is that it cannot store negative values, which matters if you are doing wavelets-style frequency separation. Of course, there are tricks like adding 127 and normalizing, but tricks are bad in general, and the floating point arithmetic capabilities of modern CPUs make any integer speed-up not worth the rounding errors. Let's be clean from the ground up.
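
To make the problem concrete, here is a minimal sketch of a frequency separation, assuming scipy is available (the blur radius and the random image are placeholders):

import numpy as np
from scipy.ndimage import gaussian_filter

img = np.random.rand(256, 256, 3).astype(np.float32)  # stand-in for a linear RGB image
low = gaussian_filter(img, sigma=(8, 8, 0))           # low-frequency layer
high = img - low                                      # high-frequency layer is signed by nature

print(high.min())  # negative values, which uint8 would silently clip or wrap around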

The other drawback, which people often forget, is that working in uint8 assumes your pixels are non-linearly encoded with an OETF (opto-electrical transfer function), improperly called a gamma, which is meant to alleviate quantization errors in the lower bits (hence the darker tones) that produce banding in smooth gradients in low lights.
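
A quick way to convince yourself: count how many of the 256 uint8 codes land in the darkest 20% of the linear range, with and without the encoding (a pure 1/2.4 power is used here as a stand-in for the actual sRGB OETF):

import numpy as np

codes = np.arange(256) / 255.       # the 256 available uint8 code values, normalized
print((codes < 0.2).sum())          # 51 codes cover the darkest 20% without OETF
print((codes**2.4 < 0.2).sum())     # 131 codes once the OETF is decoded: finer steps in shadows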

But performing convolutions, thus interpolations and linear maps, on OETF-encoded imagery does not retain the physical consistency of the pixel values with the scene light emission, and will mess up gradients, possibly producing edge artifacts:

You see here that the image rotated without the OETF (undo the “gamma”, apply the linear map, redo the “gamma”, and save) doesn't show the dark fringe of the one rotated with the OETF on (apply the linear map and save). This will not show up in every image you process, but just because you don't see it doesn't mean it doesn't happen. So, again, in the spirit of working cleanly, let's work with the OETF off. Which implies working in float, because uint8 without an OETF will fail. So we need a good (and fast) way to interpolate float RGB images. Among all the interpolation methods, bilinear is the most efficient of the ok-ish ones: it won't give you the precision of a Lanczos 3, but it's good enough for low-level image manipulation like guided filters and such.
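
In code, that clean workflow looks like this minimal sketch, where linear_map stands for whatever resize or rotation you apply, and a pure 2.4 power approximates the OETF:

import numpy as np

def apply_in_linear(img_uint8, linear_map):
    linear = (img_uint8 / 255.)**2.4                  # undo the "gamma"
    out = linear_map(linear)                          # rotation, scaling, convolution...
    return (out**(1 / 2.4) * 255.).astype(np.uint8)   # redo the "gamma" and requantize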

Great! Now, the code please

Okay, okay, here is the Python:

import numpy as np

def interpolate_bilinear(array_in, width_in, height_in, array_out, width_out, height_out):
    for i in range(height_out):
        for j in range(width_out):
            # Relative coordinates of the pixel in output space
            x_out = j / width_out
            y_out = i / height_out

            # Corresponding absolute coordinates of the pixel in input space
            x_in = x_out * width_in
            y_in = y_out * height_in

            # Nearest neighbours coordinates in input space
            x_prev = int(np.floor(x_in))
            x_next = x_prev + 1
            y_prev = int(np.floor(y_in))
            y_next = y_prev + 1

            # Sanitize bounds - no need to check for < 0
            x_prev = min(x_prev, width_in - 1)
            x_next = min(x_next, width_in - 1)
            y_prev = min(y_prev, height_in - 1)
            y_next = min(y_next, height_in - 1)

            # Distances between neighbour nodes in input space
            Dy_next = y_next - y_in
            Dy_prev = 1. - Dy_next  # because next - prev = 1
            Dx_next = x_next - x_in
            Dx_prev = 1. - Dx_next  # because next - prev = 1

            # Interpolate over the 3 RGB layers
            for c in range(3):
                array_out[i, j, c] = (Dy_prev * (array_in[y_next, x_prev, c] * Dx_next
                                                 + array_in[y_next, x_next, c] * Dx_prev)
                                      + Dy_next * (array_in[y_prev, x_prev, c] * Dx_next
                                                   + array_in[y_prev, x_next, c] * Dx_prev))

    return array_out

And here is the pure C, with OpenMP pragmas to vectorize, parallelize and such:

#include <math.h>    // for floorf
#include <stddef.h>  // for size_t

static inline void interpolate_bilinear(const float *const restrict in, const size_t width_in, const size_t height_in,
                                        float *const restrict out, const size_t width_out, const size_t height_out,
                                        const size_t ch)
{
#ifdef _OPENMP
#pragma omp parallel for simd collapse(2) default(none) \
  schedule(simd:static) aligned(in, out:64) \
  firstprivate(in, out, width_out, height_out, width_in, height_in, ch)
#endif
  for(size_t i = 0; i < height_out; i++)
    for(size_t j = 0; j < width_out; j++)
    {
      // Relative coordinates of the pixel in output space
      const float x_out = (float)j / (float)width_out;
      const float y_out = (float)i / (float)height_out;

      // Corresponding absolute coordinates of the pixel in input space
      const float x_in = x_out * (float)width_in;
      const float y_in = y_out * (float)height_in;

      // Nearest neighbours coordinates in input space
      size_t x_prev = (size_t)floorf(x_in);
      size_t x_next = x_prev + 1;
      size_t y_prev = (size_t)floorf(y_in);
      size_t y_next = y_prev + 1;

      // Sanitize bounds - no need to check for < 0 since size_t is unsigned
      x_prev = (x_prev < width_in) ? x_prev : width_in - 1;
      x_next = (x_next < width_in) ? x_next : width_in - 1;
      y_prev = (y_prev < height_in) ? y_prev : height_in - 1;
      y_next = (y_next < height_in) ? y_next : height_in - 1;

      // Memory addresses of neighbouring pixels in the input buffer
      const size_t Y_prev = y_prev * width_in;
      const size_t Y_next = y_next * width_in;
      const float *const Q_NW = in + (Y_prev + x_prev) * ch;
      const float *const Q_NE = in + (Y_prev + x_next) * ch;
      const float *const Q_SE = in + (Y_next + x_next) * ch;
      const float *const Q_SW = in + (Y_next + x_prev) * ch;

      // Distances between neighbour nodes in input space
      const float Dy_next = (float)y_next - y_in;
      const float Dy_prev = 1.f - Dy_next; // because next - prev = 1
      const float Dx_next = (float)x_next - x_in;
      const float Dx_prev = 1.f - Dx_next; // because next - prev = 1

      // Interpolate over ch layers
      float *const pixel_out = out + (i * width_out + j) * ch;

      #pragma unroll
      for(size_t c = 0; c < ch; c++)
      {
        pixel_out[c] = Dy_prev * (Q_SW[c] * Dx_next + Q_SE[c] * Dx_prev) +
                       Dy_next * (Q_NW[c] * Dx_next + Q_NE[c] * Dx_prev);
      }
    }
}

Notice that the C version takes the number of channels as a parameter, because the function is inlined later, while the Python version hardcodes 3 channels. Be sure to compile the C code with -O3 -fopenmp -march=native -ffast-math to get decent performance, and use input and output image buffers aligned on 64-byte addresses so the vectorization works on AVX2 platforms.
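
If you want to call the C version from Python for benchmarking, a minimal ctypes wrapper could look like the following sketch. The file names and the compile line are assumptions to adapt to your toolchain, and the static inline qualifiers must be dropped so the symbol gets exported:

# Compile first, e.g.:
#   gcc -O3 -fopenmp -march=native -ffast-math -shared -fPIC bilinear.c -o bilinear.so
import ctypes
import numpy as np

lib = ctypes.CDLL("./bilinear.so")
float_image = np.ctypeslib.ndpointer(dtype=np.float32, flags="C_CONTIGUOUS")
lib.interpolate_bilinear.argtypes = [float_image, ctypes.c_size_t, ctypes.c_size_t,
                                     float_image, ctypes.c_size_t, ctypes.c_size_t,
                                     ctypes.c_size_t]
lib.interpolate_bilinear.restype = None

# Usage: lib.interpolate_bilinear(buf_in, w_in, h_in, buf_out, w_out, h_out, 3)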

Looping over pixels in Python, really?

We are not done yet. Let's run that code and interpolate a 4872 × 3233 × 3 image down to 1218 × 808 × 3 (a 4× down-scale along each axis, i.e. 16× fewer pixels):

from PIL import Image

# Load the image
im = Image.open("../img/153412.jpg")
width_2 = im.width // 4
height_2 = im.height // 4

# Go to normalized float and undo the gamma
# Note: sRGB gamma is not a pure power TF, but that will do
im2 = (np.array(im) / 255.)**2.4

# Interpolate in float64
out = np.zeros((height_2, width_2, 3))
out = interpolate_bilinear(im2, im.width, im.height, out, width_2, height_2)

# Redo the gamma and save back to uint8
out = (out**(1/2.4) * 255.).astype(np.uint8)
Image.fromarray(out)

34.7 s ± 812 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

O.U.C.H. Life is too short for unoptimised code. This is why you want to use Python libs only as an API wrapping not-so-nice-looking loop operations written in C and compiled, which is what Numpy actually is. But C is another low-level flavor of high-level headache, and you don't want to deal with that when prototyping. So, you want nice-looking Python code at C speed?

There is this awesome project, Numba, that compiles Python just-in-time to machine code through LLVM and is meant to work with Numpy. Here, since our loop is almost pure C (no fancy pythonic features), there is very little to change. If you have ever used Cython, you know it combines all the burden of pure C with all the burden of Python, without the full performance of C: it's actually faster to simply write and compile the C function above and wrap it in a Python module. Cython seems like a good intention started without clearly writing down the specifications first; it's messy and awful to use. But thanks to Numba's awesomeness, we only need to replace the range function that produces the iterable with prange, which enables parallel execution of the loop, then add a @jit decorator before the function declaration:

from numba import jit, prange

@jit(nopython=True, fastmath=True, nogil=True, cache=True, parallel=True)
def interpolate_bilinear(array_in, width_in, height_in, array_out, width_out, height_out):
    for i in prange(height_out):
        for j in prange(width_out):
            ...

I'll let you google the args, but basically we turn off every Python thingy (notably the Global Interpreter Lock, which prevents true parallel thread execution in Python), and turn on parallelization, vectorization and loop reordering. The first time you call the function, it gets compiled for the input types you feed it. And then, on the next runs:

6.81 ms ± 188 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
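
(When timing this yourself, remember the first call includes the JIT compilation; a warm-up call on a tiny dummy buffer, as sketched here, keeps it out of the measurement:)

dummy_in = np.zeros((2, 2, 3))
dummy_out = np.zeros((1, 1, 3))
interpolate_bilinear(dummy_in, 2, 2, dummy_out, 1, 1)  # triggers compilation for float64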

Now, if instead of 64-bit floats (Numpy's default), you feed it 32-bit floats (which is still more than enough):

3.81 ms ± 231 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
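
Feeding it float32 is just a matter of casting the buffers before the call, for instance:

im2 = im2.astype(np.float32)
out = np.zeros((height_2, width_2, 3), dtype=np.float32)
out = interpolate_bilinear(im2, im.width, im.height, out, width_2, height_2)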

Of course, the compilation takes some time, but the compiled code is cached, so if you call the function several times, it will only be compiled once. In this case, it takes an additional 2 ms to compile, on top of the figures above. It is still 9121 times faster on an Intel® Xeon® CPU E3-1505M v6 @ 3.00GHz × 8, pretty great for a simple line of decorator, isn't it? Notice that Numba can also turn this kind of function into a CUDA kernel to offload it to your Nvidia GPU with little additional work.

However, keep in mind that copying the image buffer from RAM (host) to vRAM (GPU device) is quite expensive and, for this lightweight operation, the overhead is probably not worth it. You would have to stack several filters like this (and ensure the image is copied only once at the beginning and once at the end of the pipeline) to make GPU computing worth the memory round-trip. Also, the code is compiled for your particular CPU architecture, so it is well optimized for your machine, but not necessarily portable. If you want to compile ahead-of-time, to distribute the compiled bits in a Python module, that's doable too, but requires extra steps.
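
For the record, the CUDA path is a kernel-style rewrite rather than a one-line change; here is a minimal sketch of what it could look like with numba.cuda, assuming im2 was cast to float32 as above (the grid sizing is deliberately simplistic):

import math
import numpy as np
from numba import cuda

@cuda.jit
def interpolate_bilinear_cuda(array_in, array_out):
    # One thread per output pixel
    i, j = cuda.grid(2)
    height_out = array_out.shape[0]
    width_out = array_out.shape[1]
    height_in = array_in.shape[0]
    width_in = array_in.shape[1]
    if i >= height_out or j >= width_out:
        return

    # Same arithmetic as the CPU version
    x_in = j / width_out * width_in
    y_in = i / height_out * height_in
    x_prev = min(int(math.floor(x_in)), width_in - 1)
    x_next = min(x_prev + 1, width_in - 1)
    y_prev = min(int(math.floor(y_in)), height_in - 1)
    y_next = min(y_prev + 1, height_in - 1)
    Dy_next = y_next - y_in
    Dy_prev = 1. - Dy_next
    Dx_next = x_next - x_in
    Dx_prev = 1. - Dx_next
    for c in range(3):
        array_out[i, j, c] = (Dy_prev * (array_in[y_next, x_prev, c] * Dx_next
                                         + array_in[y_next, x_next, c] * Dx_prev)
                              + Dy_next * (array_in[y_prev, x_prev, c] * Dx_next
                                           + array_in[y_prev, x_next, c] * Dx_prev))

# Copy to the device once, compute, copy back once
d_in = cuda.to_device(im2)
d_out = cuda.device_array((height_2, width_2, 3), dtype=np.float32)
threads = (16, 16)
blocks = ((height_2 + 15) // 16, (width_2 + 15) // 16)
interpolate_bilinear_cuda[blocks, threads](d_in, d_out)
out = d_out.copy_to_host()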

What about removing the gamma?

Now, you might wonder what difference it makes to interpolate with or without the gamma/OETF. Let's see:

At this size, and for this type of hazy image, it doesn't make a noticeable difference. But if we compute the difference between both methods and enhance it, it becomes obvious:

[Image: enhanced difference between both methods]

This difference mask clearly shows discrepancies around edges, especially the contrasted ones, and even in the middle of bright areas. So be extra careful, because it could blow up in your face, and in case of doubt, always ask yourself: “what would light emissions do?”. Respecting light transport is always the sane way to get natural-looking results, because light is our visual ground truth, and it is surely not gamma-encoded. Especially here, working with a linear interpolation and only along the principal directions of the 4 interpolation vertices, we cannot account for the changes in gradient curvature that the gamma introduces (possibly not along the principal directions).
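
For reference, a difference mask like the one above can be reproduced along these lines; the exact enhancement used for the figure is unknown, so a plain normalization stands in for it:

# Interpolate once in linear light and once directly on the OETF-encoded values
encoded = np.array(im) / 255.
out_linear = interpolate_bilinear(encoded**2.4, im.width, im.height,
                                  np.zeros((height_2, width_2, 3)),
                                  width_2, height_2)**(1 / 2.4)
out_encoded = interpolate_bilinear(encoded, im.width, im.height,
                                   np.zeros((height_2, width_2, 3)),
                                   width_2, height_2)

diff = np.abs(out_linear - out_encoded)
diff /= diff.max()  # "enhance" by stretching the differences to the full range
Image.fromarray((diff * 255.).astype(np.uint8))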

Final thoughts

I come from scientific computation in mechanical engineering, with Matlab, Python, Simulink and the like. Nobody taught me compiled, lower-level languages like C, and I put a lot of effort into avoiding them, afraid of losing my mind in pointer arithmetic and memory fetch latency. For many years, I tried to write optimized, fast Python relying on abstraction layers. And I failed. Until I finally had to learn C, SSE2 intrinsics and OpenCL, because of darktable.

Having extensively hacked pixel operations using parallelization and vectorization, through OpenMP pragmas and x86 vector intrinsics, with direct access to memory buffers, now I understand… I understand that, to write fast code in Python, you need to think in C and write it the way you would in C, or call the right Numpy functions the right way, knowing what you want to offload to the internal C library and what you can do at the pythonic level.

While projects like Numba or Cython, promising performance on par with pure C while being as easy to write as pure Python, are very cool, in the end the skill of writing actually fast code requires a low-level understanding. And since people like me learn programming the “easy” way, with interpreted languages, they may never be able to write fast code. Which is a shame, because scripting languages are the de-facto standard for most applicative and server-side layers.

So, in the end, I fear everyone is writing shitty, heavily under-optimized code because it's easier and nobody has time for low-level optimization in this fast-paced digital world. Which would be fine if digital did not rely on electricity to run. I fear we are wasting a lot of precious energy with all that, and people keep buying more powerful CPUs just to allow devs to keep procrastinating on optimization. Strange world.

Horse image courtesy of someone on darktable.fr forum. Sorry, I forgot who.