If you are working in image processing and use Python as a prototyping script language to test algorithms, you might have noticed that all the libs providing fast image interpolation methods (to either sub-sample or over-sample) work on 8-bit unsigned integers (uint8). This is quite annoying if you are working with floating-point images.
PIL supports floating-point interpolation, but only for single-layer images, so forget about RGB, and scipy.misc.imresize is deprecated and soon to be removed.
The drawback of uint8 is that it cannot store negative values, which matters if you are doing wavelets-style frequency separation. Of course, there are tricks like adding 127 and normalizing, but tricks are bad in general, and the floating-point arithmetic capabilities of modern CPUs make any integer speed-up not worth the rounding errors. Let's be clean from the ground up.
The other drawback, which people often forget, is that working in uint8 always assumes your pixels are non-linearly encoded with an OETF (opto-electronic transfer function), improperly called a gamma, which is meant to alleviate quantization errors in the lower bits (hence the lower lights), errors that produce banding in smooth gradients in low lights. But performing convolutions, thus interpolations and linear maps, on OETF-encoded imagery does not retain the physical consistency of the pixel garbage with the scene light emission, and it will mess up gradients, possibly producing edge artifacts:
You can see here that the image rotated without the OETF (undo the "gamma", apply the linear map, redo the "gamma", and save) doesn't have the dark fringe of the one rotated with the OETF on (apply the linear map and save). This will not show up in every image you process, but it's not because you don't see it that it doesn't happen. So, again, in the spirit of working cleanly, let's work with the OETF off, which implies working in float, because uint8 without an OETF will fail with banding. So we need a good (and fast) way to interpolate float RGB images. Among all the interpolation methods, bilinear is the most efficient of the ok-ish ones. It won't give you the precision of a Lanczos 3, but it's good enough for low-level image manipulation like guided filters and such.
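To make the "undo the gamma, apply the linear map, redo the gamma" recipe concrete, here is a minimal sketch, assuming a pure power law with γ = 2.2 as a stand-in for the real OETF (real ones, like sRGB, also have a linear toe near black); the helper names are mine:

```python
import numpy as np

def oetf_decode(im, gamma=2.2):
    # Encoded values -> linear light (simplified power law, no linear toe)
    return im ** gamma

def oetf_encode(im, gamma=2.2):
    # Linear light -> encoded values
    return im ** (1.0 / gamma)

# The clean workflow for any linear map (rotation, interpolation, convolution):
# linear = oetf_decode(im)            # undo the "gamma"
# linear = some_linear_map(linear)    # hypothetical linear operation
# im_out = oetf_encode(linear)        # redo the "gamma" and save
```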
Great! Now, the code please
Okay, okay, here is the Python:
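As a sketch of the idea (the function name, argument order and border handling are my assumptions), a plain nested loop over output pixels, on a Numpy float array of shape (height, width, 3):

```python
import numpy as np

def bilinear_interpolation(im, out_h, out_w):
    """Bilinear resampling of a float RGB image of shape (in_h, in_w, 3)."""
    in_h, in_w = im.shape[0], im.shape[1]
    out = np.empty((out_h, out_w, 3), dtype=im.dtype)

    # Scaling ratios between the input and output grids
    y_ratio = in_h / out_h
    x_ratio = in_w / out_w

    for i in range(out_h):
        for j in range(out_w):
            # Position of the output pixel in input coordinates
            y = i * y_ratio
            x = j * x_ratio

            # The 4 nearest input pixels, clamped at the borders
            y0 = int(y)
            x0 = int(x)
            y1 = min(y0 + 1, in_h - 1)
            x1 = min(x0 + 1, in_w - 1)

            # Fractional distances to the top-left vertex
            dy = y - y0
            dx = x - x0

            for c in range(3):
                out[i, j, c] = (im[y0, x0, c] * (1 - dx) * (1 - dy) +
                                im[y0, x1, c] * dx * (1 - dy) +
                                im[y1, x0, c] * (1 - dx) * dy +
                                im[y1, x1, c] * dx * dy)
    return out
```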
And here is the pure C with OpenMP pragmas to vectorize, parallelize and such:
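Again as a hedged sketch rather than the definitive listing (the signature, names and the hypothetical file name bilinear.c are assumptions), with row-major interleaved buffers:

```c
// Compile with e.g.: gcc -O3 -fopenmp -mtune=native -ffast-math -c bilinear.c
#include <stddef.h>

static inline void bilinear_interpolation(const float *const restrict in,
                                          float *const restrict out,
                                          const size_t in_h, const size_t in_w,
                                          const size_t out_h, const size_t out_w,
                                          const size_t ch)
{
  const float y_ratio = (float)in_h / (float)out_h;
  const float x_ratio = (float)in_w / (float)out_w;

  // Parallelize the two outer loops over threads
  #pragma omp parallel for collapse(2) schedule(static)
  for(size_t i = 0; i < out_h; i++)
    for(size_t j = 0; j < out_w; j++)
    {
      const float y = i * y_ratio;
      const float x = j * x_ratio;

      // The 4 nearest input pixels, clamped at the borders
      const size_t y0 = (size_t)y, x0 = (size_t)x;
      const size_t y1 = (y0 + 1 < in_h) ? y0 + 1 : in_h - 1;
      const size_t x1 = (x0 + 1 < in_w) ? x0 + 1 : in_w - 1;
      const float dy = y - (float)y0, dx = x - (float)x0;

      // Vectorize over channels
      #pragma omp simd
      for(size_t c = 0; c < ch; c++)
        out[(i * out_w + j) * ch + c] =
            in[(y0 * in_w + x0) * ch + c] * (1.f - dx) * (1.f - dy)
          + in[(y0 * in_w + x1) * ch + c] * dx * (1.f - dy)
          + in[(y1 * in_w + x0) * ch + c] * (1.f - dx) * dy
          + in[(y1 * in_w + x1) * ch + c] * dx * dy;
    }
}
```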
Notice that the C version takes the number of channels as a parameter, because the function is inlined later, while the Python version hard-codes 3 channels. Be sure to compile the C code with -O3 -fopenmp -mtune=native -ffast-math to get decent performance, and use input and output image buffers aligned on 64-byte boundaries so the vectorization works on AVX2 platforms.
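For the alignment part, a minimal C11 sketch (the sizes are just examples), since aligned_alloc requires the size to be a multiple of the alignment:

```c
#include <stdlib.h>

int main(void)
{
  const size_t out_h = 808, out_w = 1218, ch = 3;
  size_t size = out_h * out_w * ch * sizeof(float);
  size = (size + 63) & ~(size_t)63;       // round up to a multiple of 64
  float *out = aligned_alloc(64, size);   // 64-byte aligned buffer
  if(!out) return 1;
  // ... hand the buffer to the interpolation function, then:
  free(out);
  return 0;
}
```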
Looping over pixels in Python, really?
We are not done yet. Let's launch that code and interpolate a 4872 × 3233 × 3 image down to 1218 × 808 × 3 (a down-scale by a factor of 16 in pixel count):
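For instance, with a naive timing harness (the random test image and timing method are mine, not the original benchmark):

```python
import time
import numpy as np

im = np.random.rand(3233, 4872, 3)  # stand-in for the test image, float64

start = time.perf_counter()
small = bilinear_interpolation(im, 808, 1218)
print(time.perf_counter() - start, "s")
```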
O.U.C.H. Life is too short for unoptimized code. This is why you want to use Python libs only as API wrappers over not-so-nice-looking loop operations written in C and compiled, which is what Numpy actually is. But C is another low-level flavor of high-level headache, and you don't want to deal with that when prototyping. So, you want nice-looking Python code at C speed?
There is this awesome project, Numba, which compiles Python just-in-time to machine code through LLVM and is designed to work with Numpy. Here, since our loop is almost pure C (no fancy pythonic features), there is very little to change. If you have ever used Cython, you know that it combines all the burden of pure C with all the burden of Python, without the full performance of C. It's actually faster to simply write and compile the C function above and wrap it in a Python module. Cython seems like a good intent started without clearly writing down the specifications first; it's messy and awful to use. But thanks to Numba's awesomeness, we only need to replace the range function that produces the iterable with prange, which allows parallel loop unrolling, then add a @jit decorator before the function declaration:
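Which gives something like this (same body as before; the exact decorator arguments shown are my assumptions, matching the description below):

```python
import numpy as np
from numba import jit, prange

@jit(nopython=True, nogil=True, parallel=True, fastmath=True, cache=True)
def bilinear_interpolation(im, out_h, out_w):
    in_h, in_w = im.shape[0], im.shape[1]
    out = np.empty((out_h, out_w, 3), dtype=im.dtype)
    y_ratio = in_h / out_h
    x_ratio = in_w / out_w

    for i in prange(out_h):  # prange: Numba spreads the rows over threads
        for j in range(out_w):
            y = i * y_ratio
            x = j * x_ratio
            y0 = int(y)
            x0 = int(x)
            y1 = min(y0 + 1, in_h - 1)
            x1 = min(x0 + 1, in_w - 1)
            dy = y - y0
            dx = x - x0
            for c in range(3):
                out[i, j, c] = (im[y0, x0, c] * (1 - dx) * (1 - dy) +
                                im[y0, x1, c] * dx * (1 - dy) +
                                im[y1, x0, c] * (1 - dx) * dy +
                                im[y1, x1, c] * dx * dy)
    return out
```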
I let you google the args, but basically we turn off every Python thingy (notably the Global Interpreter Lock, which prevents true multi-threaded execution in Python) and turn on parallelization, vectorization, and loop reordering. The first time you call the function, it gets compiled for the input types you feed it. And then, the next calls:
Now, if instead of 64-bit floats (Numpy's default), you feed it 32-bit floats (which are still more than enough):
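Something like this, where the cast is the only change (Numba compiles a separate float32 specialization on the first call):

```python
import numpy as np

im32 = np.random.rand(3233, 4872, 3).astype(np.float32)  # 32-bit input
small32 = bilinear_interpolation(im32, 808, 1218)
```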
Of course, the compilation takes some time, but the compiled code is cached, so if you call the function several times, it is only compiled once. In this case, it takes an additional 2 ms to compile, on top of the figures above. It is still 9121 times faster on an Intel® Xeon® CPU E3-1505M v6 @ 3.00GHz × 8, which is pretty great for a single line of decorator, isn't it? Notice that Numba can also turn this function into a CUDA kernel to offload it to your Nvidia GPU without much additional work from you.
However, keep in mind that copying the image buffer from RAM (host) to vRAM (GPU device) is quite expensive, and for this lightweight operation, that overhead is probably not worth it. You would have to stack several filters like this (and make sure the image is copied only at the beginning and at the end of the pipeline) to make GPU computing worth the memory round-trip. Also, the function is compiled for your particular CPU architecture, so the code is well optimized for your machine but not necessarily portable. If you want to compile ahead-of-time, to distribute the compiled bits in a Python module, that's doable too, but it requires extra steps.
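For the record, here is a hedged sketch of what a numba.cuda variant could look like (one thread per output pixel, same math as above; this is my adaptation, not the original code, and it needs a CUDA-capable GPU):

```python
import numpy as np
from numba import cuda

@cuda.jit
def bilinear_kernel(im, out):
    i, j = cuda.grid(2)  # one thread per output pixel
    out_h, out_w = out.shape[0], out.shape[1]
    if i >= out_h or j >= out_w:
        return
    in_h, in_w = im.shape[0], im.shape[1]
    y = i * in_h / out_h
    x = j * in_w / out_w
    y0 = int(y)
    x0 = int(x)
    y1 = min(y0 + 1, in_h - 1)
    x1 = min(x0 + 1, in_w - 1)
    dy = y - y0
    dx = x - x0
    for c in range(3):
        out[i, j, c] = (im[y0, x0, c] * (1 - dx) * (1 - dy) +
                        im[y0, x1, c] * dx * (1 - dy) +
                        im[y1, x0, c] * (1 - dx) * dy +
                        im[y1, x1, c] * dx * dy)

# Copy to the device once, compute, copy back once
im = np.random.rand(3233, 4872, 3).astype(np.float32)
d_im = cuda.to_device(im)
d_out = cuda.device_array((808, 1218, 3), dtype=np.float32)
threads = (16, 16)
blocks = ((808 + 15) // 16, (1218 + 15) // 16)
bilinear_kernel[blocks, threads](d_im, d_out)
small = d_out.copy_to_host()
```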
What about removing the gamma?
Now, you might wonder what difference it makes to interpolate with or without gamma/OETF. Let's see:
At this size, and for this type of hazy image, it doesn't make a noticeable difference. But if we compute the difference between both methods and enhance it, it becomes obvious:
This difference mask clearly shows discrepancies around edges, especially the contrasted ones, and even in the middle of bright areas. So be extra careful, because this could blow up in your face, and in case of doubt, always ask yourself: "what would light emissions do?". Respecting light transport is always the sane way to get natural-looking results, because light is our visual ground truth, and it is surely not gamma-encoded. Especially here, working with a linear interpolation, and only along the principal directions of the 4 interpolation vertices, we cannot account for the changes in gradient curvature that the gamma introduces (possibly not along the principal directions).
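To see why, a quick numeric sanity check, again assuming a pure power law with γ = 2.2 as a stand-in for the real OETF. Take two neighboring pixels at linear intensities 0.25 and 1.0 and interpolate halfway between them:

```python
gamma = 2.2
a, b = 0.25, 1.0  # linear light intensities of two neighboring pixels

# Interpolating in linear light (what the scene's light would do):
mid_linear = (a + b) / 2  # 0.625

# Interpolating the OETF-encoded values, then decoding back to linear:
mid_encoded = ((a ** (1 / gamma) + b ** (1 / gamma)) / 2) ** gamma  # ~0.557

print(mid_linear, mid_encoded)  # the encoded path lands noticeably darker
```

That darkening on bright-to-dark transitions is exactly the kind of dark fringe we saw on the rotated image earlier.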
Final thoughts
I come from scientific computation in mechanical engineering, with Matlab, Python, Simulink and the like. Nobody taught me compiled, lower-level languages like C, and I made a lot of effort to avoid learning them and losing my mind in pointer arithmetic and memory-fetch latencies. For many years, I tried to write optimized, fast Python relying on abstraction layers. And I failed. Until I finally had to learn C, SSE2 intrinsics and OpenCL, because of darktable.
Having extensively hacked pixel operations using parallelization and vectorization, through OpenMP pragmas and x86 vector intrinsics, with direct access to memory buffers, now I understand… I understand that, to write fast code in Python, you need to think in C and write it the same way you would in C, or call the right Numpy functions the right way, knowing what you want to offload to the internal C library and what you can do at the pythonic level.
While projects like Numba or Cython, which promise performance on par with pure C while being as easy to write as pure Python, are very cool, in the end the skill of writing actually fast code requires a low-level understanding. And since people like me learn programming the "easy" way, with interpreted languages, they will never be able to write fast code. Which is a shame, because scripting languages are the de facto standard for most application layers and server-side layers.
So, in the end, I fear everyone is writing shitty, heavily under-optimized code because it's easier and nobody has time for low-level optimization in this fast-paced digital world. Which would be fine if digital did not rely on electricity to run. I fear we are wasting a lot of precious energy with all that, and people keep buying more powerful CPUs just to allow devs to keep procrastinating on optimization. Strange world.
Horse image courtesy of someone on darktable.fr forum. Sorry, I forgot who.