How hard would it be to display the contents of an image file on the screen? You just load the image pixels somehow, perhaps using a readily available library, and then display those pixels on the screen. Easy, right? Well, not quite, as it turns out.
I may have some experience with this,1 because I made an image viewer that displays images in the terminal emulator. But why do such a thing? There are countless image viewers already available, including ones that work in terminal emulators, so why write yet another one? That's an excellent question! As always2, the answer is that no other viewer was good enough for me.
For example, catimg uses stb_image to load images. While stb_image is an outstanding library that can be integrated very quickly, it doesn’t really excel in the number of image formats it supports. There’s the baseline of JPEG, PNG, GIF, plus a few other more or less obscure formats.
Another example is viu, which again is limited to the well-known baseline of three “web” formats, with the modern addition of WebP. Following the dependency graph of the program shows that the image loading library it uses should support more formats, but ultimately I’m interested in what the executable I have on my system can do, not what some readme says.
The overall situation is that there is widespread expectation and support for viewing PNG files (1996), JPEG files (1992) and GIF files (1987). So… what happened? Did image compression research fizzle out in the 21st century? Of course not. There’s JPEG XL (2022), AVIF (2019), HEIC (2017),3 WebP (2010). The question now is, why is there no wide support for these image codecs in software?4 Because nobody uses them. And why is nobody using them?5 Because there’s no software support.
So maybe these new formats just aren’t worth it, maybe they don’t add enough value to be supported? Fortunately, that’s easy to answer with the following image. Which of the quality + size combinations do you prefer?6
But that’s not all. There is a variety of image formats that are arguably intended for more specialized use. And these formats are old, too. Ericsson Texture Compression (ETC) was developed in 2005, while Block Compression (BC) and OpenEXR date back to 1999. BC is supported by all desktop GPUs, and virtually all games use it. ETC is supported by all mobile GPUs. So why is it nearly impossible to find an image viewer for them?7
And speaking of texture compression, I also have an ETC/BC codec which is limited in speed by how fast the PNG files can be decoded. There are some interesting observations if you look into it. For example, PNG has two different checksums to calculate, one at the zlib data stream level, and the second at the PNG data level. Another one is that zlib is slooow. The best you can do is replace zlib with zlib-ng, which provides some much-needed speed improvements. Yet how much better would it be to replace the zlib (deflate) compression in PNG files with a more modern compression algorithm, such as Zstd or LZ4?8 The PNG format even supports this directly with a “compression type” field in the header, but there’s only one value it can be set to. And it’s not going to change, because then you’d have to update every single program that can load a PNG file to support it. Which is hopeless.9
Interlude
Some time ago, I submitted a bunch of patches for KDE that were largely ignored, which, let’s say, annoyed me a bit. I went off to write my own Wayland compositor. In the meantime, KDE got better enough to not bother me as much, and the exciting task of figuring out how to draw two textured triangles with Vulkan, potentially using two different GPU drivers at the same time, became rather tedious, so the project was shelved.
But one of the things I had to do was write an image loader. To show the desktop background or the mouse cursor instead of just colored rectangles. Little things like that really do make a difference.
Mouse cursors
Loading mouse cursors is not yet available in vv because it requires some special handling with animations, hot spots, and so on. Extending this very specialised implementation to the general image loading functionality requires some thought, and at the moment it’s hard to test if it would even work as intended, so there you have it.
It’s still an interesting enough topic to talk about.
Xcursor
To load Xcursor images, as used in the X Window System, you can use the Xcursor library, or one of its forks (?), like the wayland-cursor library. The downside of this approach is that the file format remains a bit of a mystery. I did not want it to be a mystery.
It turned out that the Xcursor files are simple enough to parse on your own in about 20 lines of code. You need to read the file header, then the table of contents structure, and then individual images (more than one forming an animation), where the image data is RGBA 32-bit pixels.
struct XcursorHdr
{
uint32_t magic;
uint32_t header;
uint32_t version;
uint32_t ntoc;
};
struct XcursorToc
{
uint32_t type;
uint32_t subtype;
uint32_t pos;
};
struct XcursorImage
{
uint32_t width;
uint32_t height;
uint32_t xhot;
uint32_t yhot;
uint32_t delay;
};
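A minimal sketch of that parsing flow, using the structs above; error handling is omitted, fread/fseek stand in for whatever I/O you prefer, and the small 16-byte chunk header that (to my understanding) precedes each image at toc.pos is simply skipped:
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr uint32_t XcursorImageType = 0xfffd0002;

void LoadXcursor( const char* path )
{
    FILE* f = fopen( path, "rb" );
    XcursorHdr hdr;
    fread( &hdr, sizeof( hdr ), 1, f );
    std::vector<XcursorToc> toc( hdr.ntoc );
    fread( toc.data(), sizeof( XcursorToc ), hdr.ntoc, f );
    for( auto& entry : toc )
    {
        if( entry.type != XcursorImageType ) continue;   // skip comment chunks
        fseek( f, entry.pos + 16, SEEK_SET );            // skip the per-chunk header
        XcursorImage img;
        fread( &img, sizeof( img ), 1, f );
        std::vector<uint32_t> pixels( img.width * img.height );
        fread( pixels.data(), sizeof( uint32_t ), pixels.size(), f );   // 32-bit pixel data
        // ... hand the frame (and its hot spot and delay) to whoever needs it
    }
    fclose( f );
}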
KDE recently started using SVG mouse cursors, which is a step in the right direction, but it’s also generally not related to the previous Xcursor file format.
Side adventure: does it even work?
Looking at the wlroots implementation of cursor loading, I noticed some problems with Xcursor path handling and theme inheritance. It turns out that it’s trivial to reliably crash KDE by placing a cursor theme that inherits itself in a known location. It was also possible to segfault any wlroots-based compositor by doing the same thing, but in a completely undocumented directory, because the path handling was kind of bad.10 The wlroots path problem seems to have been fixed since, but only by accident, as the commit message describes it as a “cosmetic change”.
Side adventure: cursor types
The type of cursor determines what shape it should have. It can be the usual arrow pointer, or the hourglass-like busy indicator, the I-beam to indicate text entry, the directional arrows to indicate something is resizable. You get the idea. Hilariously enough, the currently agreed-upon set of cursor types (in Wayland, for example) is what was selected for the needs of web development.
Before modern standardization, it was all a hodgepodge of mostly wrong ideas and bad implementations. Here’s what I think is the complete list of original X cursors, taken from some old book:
This list may be a bit too symbolic here and there, so let’s take a look at more recent (well, 2003 recent) renditions of some of the most absurd entries:
- A boat,
- A reload icon, but a cursor,
- A copyright infringement of some 50s TV show,
- A jolly roger,
- A sailboat,
- A space shuttle,
- A Cryengine logo,
- A USS Enterprise.
Good? Okay, so how would you implement an X cursor theme on a modern system? Easy, just draw a bunch of cursors, then add a semi-random collection of symlinks on top of them, because nothing is standardized or documented. To show a “question mark” cursor, GNOME will refer to help, but KDE might want whats_this. That’s just a simple example, but the list is much longer, and even includes some hashes, because why not.11
Here’s how ridiculous all this is, summed up in one bug report:
Windows cursors
The Windows .cur cursor file is basically the .ico icon file, with some minor changes. It contains a header, an image directory (for animated cursors), and then the image data. The image data is a Windows Bitmap payload, so you can just use the existing BMP file loader you have.
Well, not exactly.
The BMP height is double the icon height, because the BMP payload contains both the image color data and the alpha mask. To make things worse, the color and alpha halves very likely will use different bit depths12. Alpha is always one bit per pixel, and the bit depth just changes in the middle of a data stream.
Then all the extra quirks come out that you have to take into account. For example, you really don’t want to touch the alpha channel when the color payload is 32 bpp. And the image data may be in the raw format, whatever that means, I surely don’t know. Or it could be a PNG file. Also, make sure that the RGB order is correct and that the image is not flipped vertically. And that you calculate the 4-byte aligned 1-bit stride correctly for cursors that are not power-of-two in size (for some rare 48 px cursors). And of course, there may be several versions of the bitmap data in a cursor, for different numbers of bits per pixel, or sizes, so choose the appropriate one! Clear as mud, right?
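For the mask stride alone, a small helper shows the arithmetic; the 32-bit row padding and bottom-up storage are standard BMP conventions, and mask is assumed to point at the start of the 1-bit AND mask data:
#include <cstdint>

// Look up one bit of the 1 bpp AND mask. Rows are padded to 32 bits and stored
// bottom-up, with the most significant bit being the leftmost pixel.
bool MaskBit( const uint8_t* mask, uint32_t width, uint32_t height, uint32_t x, uint32_t y )
{
    const uint32_t stride = ( ( width + 31 ) / 32 ) * 4;
    const uint8_t byte = mask[ ( height - 1 - y ) * stride + x / 8 ];
    return ( byte >> ( 7 - ( x % 8 ) ) ) & 1;
}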
Now you have the cursor image loaded, and you can view it at 1:1 scale, as on the left in the image below. But what happens when you scale the image up, as in the image on the right?13
Where are those annoying glowing pixels coming from? It turns out that this mouse pointer is actually repurposed from another type of cursor, with some pixels masked out. It does not matter when you draw the cursor 1:1, but with the filtering applied when the cursor is scaled, the colors bleed out, so you need to take care of that too.
For what it’s worth, animated Windows cursors don’t need to store repeated frames more than once, unlike X cursors, which is nice. It is implemented by using an optional animation index table, which just references the actual image data.
Side adventure: RIFF
Animated Windows cursors are stored in RIFF containers. RIFF is basically a collection of chunks, each consisting of a FourCC, a 32-bit data size, and the data payload. The byte order is the difference between the original Amiga IFF and Microsoft’s RIFF.
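In code, walking such a container is little more than this (a sketch, with the chunk dispatch and any nested LIST handling left out):
#include <cstdint>
#include <cstdio>

struct ChunkHeader
{
    char fourcc[4];     // chunk identifier, e.g. "anih" or "icon" in an .ani file
    uint32_t size;      // payload size; little-endian in RIFF, big-endian in IFF
};

void WalkChunks( FILE* f, long end )
{
    while( ftell( f ) < end )
    {
        ChunkHeader hdr;
        if( fread( &hdr, sizeof( hdr ), 1, f ) != 1 ) break;
        // ... dispatch on hdr.fourcc here, then skip the word-aligned payload
        fseek( f, ( hdr.size + 1 ) & ~1u, SEEK_CUR );
    }
}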
This format was invented in 1985, and at the time it must have seemed like a fantastic thing. It was general enough to store any kind of data, it could be extended in the future, it was easy to parse and load. What’s not to like?
In practice, this is all a misfeature that should never have happened.
- Having a generic container for everything doesn’t make sense. It’s a cool concept on paper, but when you have to write file loaders, it turns out that you have very specific needs, and none of those needs play well with “anything can happen lol”.
- Extending the data format may have been a nice idea in the 1980s, but we now know that file formats are eternal and you can’t just add or change something because all previously compatible loaders will break.
- Nothing is easy to load or parse when you technically have to create a table of contents of all the chunk types in the container and then load them all in the right order. I suspect that all files out there follow the unwritten agreement that the chunks should be present in parsing order, since doing anything more complicated would have been too much for the memory-constrained machines of the 1980s. But that’s a strong assumption, and there may be valid files that would require parsing chunks in a non-linear fashion.
Side adventure: where’s my gamma?
When I hooked up the loaded mouse cursors into my compositor, I noticed something was not right with how they looked. Then I started seeing it everywhere, including the Desktop Mode on the Steam Deck.
After some debugging, the problem was narrowed down to only be happening on AMD GPUs, and Tom Forsyth provided a reasonable explanation.
Image loading libraries
With the mouse cursors done, I moved on to implementing image loading functionality for the purpose of having a desktop background in my compositor. I wanted support for modern formats like AVIF or JPEG XL, and it all went fairly quickly and relatively smoothly. Include a library, follow the documentation, get an image, rinse and repeat.
Back in June 2023, I did a little review of the various image format loading libraries available. (It turned out to be completely wrong, but let’s not spoil the surprise.)
stb_image tier: libwebp
Call a function. You’re done.
Good tier: libheif
A bunch of functions you need to call in sequence, all presented as a neat and short program in the README.
Okay tier: libpng, libjpeg
You have to go through a narrated guide on what to call. You can get the work done, but you have to wade through super important stuff like low-quality decoding in case you want to run on Amiga.
Bad tier: libjxl
You get doxygen docs, so you don’t know where to start. The example program is your best bet, but it is convoluted and outputs each color channel as a float.
The documentation is sometimes vague. To check if a file might be JPEG XL, you have to provide “the beginning of the file”. The function may fail and ask for more data. The documentation never specifies how much is needed. Reading the source code shows that it’s 12 bytes.
The decoding process requires repeated calls to JxlDecoderProcessInput. The documentation lists several return codes you must handle. You start to build up a mental map of how all of this is supposed to work, and then you realize that some of it does not make any sense at all. Then you read the source code and find out that the decoding function can actually return a much wider variety of return codes, but the documentation does not tell you that.
At this point, Aras chimed in with an interesting tidbit about libtiff.
Interlude
On April Fool’s Day 2024, I was looking for a good image viewer on Linux, and it turns out they are all bad in one way or another. For context, I’m running the Wayland desktop with fractional scaling, and that’s a minefield of sorts.14 Here’s an example.
The image I’m looking at here is completely transparent, with only the four corners marked to indicate where it begins and ends. The checkerboard pattern is drawn in the background by the viewer to indicate, by convention, where the transparent area is. So why does it only cover part of the image? Because of fractional scaling, or maybe DPI scaling in general; I have not checked how it behaves with 200% scale.
That may be a silly issue, but there’s also one that is much more impactful, as shown below.
This is… what?! Is my image viewer lying to me? Is it not able to display an image correctly, the only functionality that needs to be 100% reliable in an image viewer?
It can be hard to see exactly what is happening with an image that has random content, so I created a pixel checkerboard test pattern image and opened it in gwenview. This is what it displayed instead of a smooth surface:15
As for the other image viewers, it’s all a roll of the dice. One will depend on Qt being built with the right options, otherwise it won’t support JPEG XL. Another will not be able to load a 16k×16k PNG file, because who uses such absurdly large images? Yet another will pan the image at 5 FPS because somehow everything is software rendered.
In other cases, support for various image formats will be lost over time due to maintenance overhead and because no one seems to care.
April Fool’s Image Viewer
I set out to write my own image viewer. I had already done the image loading, so only the output side needed to be handled. It would talk directly to the Wayland compositor, and use the appropriate protocol to handle fractional scaling in the right way to get a 1:1 pixel representation of images on the screen. I had previously done something similar for the Tracy Profiler,16 so it should be easy.
Long story short, I can now view the images properly, but it doesn’t use Vulkan in quite the right way, and it would require a lot of polishing to get it where it should be, so I never released it.
By the way, going with what seemed like a “smart” name that played on already existing abbreviations only resulted in me being confused as to whether the image codec was called AVIF or AFIV. Not recommended.
vv
Fast forward half a year, and I finally got annoyed enough with viu’s lack of support for modern image codecs to decide to do something of my own. How hard can it be? I have an image loader waiting to be used, and I just have to figure out how to display the pixels in the terminal.
Unicode block elements
There is a Unicode block that contains block elements (yay for semantic overloading!). Here are some examples of these elements: █ ▇ ▆ ▅ ▄ ▃ ▂ ▁. Or maybe something like that: ░ ▒ ▓. You get the point, these are meant to make crude graphics possible.
The most interesting block element is the one that’s half filled and half empty: ▄, or perhaps its negative counterpart: ▀. Typically, the font used in a terminal is more or less twice as high as it is wide, so by setting the foreground and background colors appropriately and using these half-block characters, you can get two very large squarish pixels.17
The pipeline is thus as follows: load the image, query the terminal size18, resize the image to fit the terminal19, print out the pixels by emitting ANSI color codes to set the two halves of the half-block characters, and this results in the following output:
It may be worth noting that Chafa uses more of the block symbols, but the printouts it makes look ugly to me, like a JPEG image compressed with very low quality.
Side adventure: ANSI escape sequences
The concept of setting terminal attributes dates back to the 1970s, and it really feels like a very old technology now. The nitty-gritty details can be read in the document XTerm Control Sequences, but the general gist is that you print a special sequence of bytes, and that changes something about how the text is displayed on the terminal.
You’ve probably seen or even used things like “print ESC[31m (or \033[31m, or \x1b[31m) to set the color to red”, but how exactly does that work? Well, I don’t really know, because the Control Sequences document is way too complicated, and at the same time it avoids defining some stuff that was probably assumed to be common knowledge, but here’s my understanding of things.
The ESC element is “escape code 27” (straight out of the ASCII table) and indicates the start of a control sequence you want to send. You have to literally print out 27 as a byte, which can be done either by writing it as \033, or \x1b, and probably some other way too.
The ESC [ sequence is the Control Sequence Introducer, or CSI. Supposedly it’s also 0x9b as a byte, but nobody uses it that way?
The CSI Pₘ m sequence sets Character Attributes. Pₘ is any number of single parameters (Pₛ), separated by the ; character. Example single parameters for character attributes are 1 to make the text bold, 4 to enable underline, 31 to set the foreground color to red, and so on.
Thus, the above ESC[31m ANSI sequence can be decoded as CSI 31 m, which sets the foreground color to red. Similarly, ESC[31;4m contains two different parameters, and it sets the foreground color to red and also enables text underlining.
I think it’s safe to say that most people would assume that there are only 16 colors available on the terminal, more or less based on the 1981 CGA text mode color palette. The eight parameters 30 to 37 set the basic color to be used, and then you can also use the 1 bold parameter to make the selected color brighter.
It is possible to use the more advanced 256-color mode in the terminal, where you choose the color from the palette of predefined colors. It still uses the Character Attribute ANSI sequence, but with the parameter set to 38:5:Pₛ, where Pₛ is the color index.20
Finally, there’s the true-color mode, where the parameter is 38;2;Pr;Pg;Pb. The parameter values in this command are, of course, the RGB color values.21
In vv I just assume that the terminal you have is capable of true-color display. This was standardized in 1994, so if the terminal of your choice doesn’t support it, I don’t really care.
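Putting the half-block trick and the true-color sequences together, the printing loop can be sketched like this, with Pixel as a stand-in RGB triplet and the image assumed to be already resized to fit the terminal:
#include <cstdint>
#include <cstdio>

struct Pixel { uint8_t r, g, b; };

// Two image rows per text row: the upper pixel becomes the background color,
// the lower pixel the foreground color of the '▄' half-block character.
void PrintImage( const Pixel* px, uint32_t width, uint32_t height )
{
    for( uint32_t y = 0; y + 1 < height; y += 2 )
    {
        for( uint32_t x = 0; x < width; x++ )
        {
            const auto& up = px[ y * width + x ];
            const auto& dn = px[ ( y + 1 ) * width + x ];
            printf( "\033[48;2;%d;%d;%dm\033[38;2;%d;%d;%dm▄",
                up.r, up.g, up.b, dn.r, dn.g, dn.b );
        }
        printf( "\033[0m\n" );   // reset attributes at the end of each line
    }
}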
Sixel
Sixel is another ancient technology, originally introduced in DECwriter IV in 1982. It was largely forgotten, only to be rediscovered recently. It is now more or less widely supported, but is lacking in some key areas. One of these is the limited number of available colors, which requires the use of dithering.
To get the control sequences used to output sixel images, you can use the libsixel library. I use it as a fallback, but that code path is not in great shape, because I don’t really have the means to test it properly, and the library itself is largely undocumented.
Kitty graphics protocol
This is the main driver for outputting graphics in vv. It is relatively well documented, has support in the terminals I use, and the image data is both true-color and supports an alpha channel.
The Kitty protocol is also very over-engineered, with a lot of seemingly unnecessary features. But if you only want to write an application that uses certain parts of the protocol, you don’t really need to worry about all the extra stuff you won’t be using.
To send the RGBA pixel data to the terminal, you start with a header message that specifies the image size and some other image data details. Then you send the data, first compressing it with deflate22, then encoding it with base64. The payload must be split into 4KB chunks.
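A rough sketch of that chunked transmission, assuming hypothetical Deflate() and Base64() helpers exist elsewhere; the key names (a=T to transmit and display, f=32 for RGBA, o=z for deflate-compressed data, s/v for the size, m to mark continuation chunks) follow my reading of the protocol documentation:
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helpers, not shown here.
std::vector<uint8_t> Deflate( const uint8_t* data, size_t size );
std::string Base64( const std::vector<uint8_t>& data );

void KittySendImage( const uint8_t* rgba, uint32_t width, uint32_t height )
{
    const auto payload = Base64( Deflate( rgba, width * height * 4 ) );
    size_t pos = 0;
    bool first = true;
    while( pos < payload.size() )
    {
        const auto chunk = payload.substr( pos, 4096 );   // 4 KB chunks
        pos += chunk.size();
        const int more = pos < payload.size() ? 1 : 0;
        if( first )
        {
            printf( "\033_Ga=T,f=32,o=z,s=%u,v=%u,m=%d;%s\033\\", width, height, more, chunk.c_str() );
            first = false;
        }
        else
        {
            printf( "\033_Gm=%d;%s\033\\", more, chunk.c_str() );
        }
    }
}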
This results in a full color image being displayed in the terminal.
Side adventure: the terminal, again
As it turns out, the terminal control sequences are not a one-way road. The terminal may want to respond to your query. For example, if you send the CSI c Send Device Attributes sequence, the terminal may respond with CSI ? 1 ; 2 c to indicate that it is “VT100 with Advanced Video Option” (whatever that means).
As a side note, you can use this query mechanism to determine if your terminal supports sixel images, or the kitty graphics protocol.23 I was very surprised to discover that the viu viewer also supports the kitty protocol, as it only printed sixel images on my Konsole terminal. It turns out that viu doesn’t do proper detection, and only enables the functionality if the TERM environment variable is set to kitty. Which it isn’t on Konsole, even though Konsole supports the protocol. That’s not what the variable is for!
Getting back to the topic, in order to be able to read the terminal’s response, you have to do some magic first. This code is just following what I found somewhere, and it seems to work as intended, so I didn’t dig into it too much.
#include <array>
#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

constexpr std::array termFileNo = { STDIN_FILENO, STDOUT_FILENO, STDERR_FILENO };

int OpenTerminal( struct termios& save )
{
    // Find a file descriptor that actually refers to a terminal and open it.
    int fd = -1;
    for( auto termfd : termFileNo )
    {
        if( isatty( termfd ) )
        {
            auto name = ttyname( termfd );
            if( name )
            {
                fd = open( name, O_RDWR );
                if( fd != -1 ) break;
            }
        }
    }
    if( fd < 0 ) return -1;
    // Save the current terminal state so it can be restored later.
    if( tcgetattr( fd, &save ) != 0 )
    {
        close( fd );
        return -1;
    }
    // Switch to non-canonical mode with no echo and non-blocking reads.
    struct termios tio = save;
    tio.c_lflag &= ~( ICANON | ECHO );
    tio.c_cc[VMIN] = 0;
    tio.c_cc[VTIME] = 0;
    if( tcsetattr( fd, TCSANOW, &tio ) != 0 )
    {
        close( fd );
        return -1;
    }
    return fd;
}

void CloseTerminal( int fd, struct termios& save )
{
    // Restore the saved terminal state.
    tcsetattr( fd, TCSAFLUSH, &save );
    close( fd );
}
This will give you a terminal file descriptor to write to and read from. Since you don’t know if the terminal will respond, you should poll for data being available before reading (or you will deadlock). You should also use a sufficiently long timeout period, as you may be running on a slow ssh connection, and the data may appear after quite a delay. Yeah, this is really an ancient technology, not really suited for modern use cases.
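For reference, the read side can look roughly like this, using the descriptor returned by OpenTerminal() above and poll() to avoid blocking forever on a terminal that never answers:
#include <poll.h>
#include <string>
#include <unistd.h>

// Read a terminal response, returning an empty string on timeout.
std::string ReadResponse( int fd, int timeoutMs = 1000 )
{
    std::string response;
    char buf[256];
    for(;;)
    {
        pollfd pfd = { fd, POLLIN, 0 };
        if( poll( &pfd, 1, timeoutMs ) <= 0 ) break;   // timeout or error
        const auto len = read( fd, buf, sizeof( buf ) );
        if( len <= 0 ) break;
        response.append( buf, len );
        if( response.back() == 'c' ) break;   // e.g. the device attributes response ends with 'c'
    }
    return response;
}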
iTerm2 protocol
This is another true-color graphics display protocol that is available across many terminals now. It feels more like a joke of a platform-specific hack than a proper protocol, as it specifies the image data to be transferred as “any image format that macOS supports”.
On the topic of “which protocol is the best”, I think I can recommend the typical FOSS SNAFU discussion. The only upside, compared to the Wayland situation, is that there’s already a good protocol implementation existing that I can just use, instead of waiting for these endless debates to finally figure out the most simple and obvious things.
No, seriously, the idea that every terminal should implement support for every image format on its own, and do it in the right way (which we have yet to cover), is just something that will never happen.
High Dynamic Range
There are certain formats, such as OpenEXR, that contain image data that cannot be displayed on an SDR display. Consider the following sample EXR image, where the RGB color channels are displayed directly as they are stored in the image file (this is also how gwenview displays it):
Overall, the image is too dark. There are also large areas where the color is clipped, creating ugly oversaturated solid blobs of the same color. That’s the thing about HDR images. They can contain a lot of fine detail in dark areas. They can also contain very bright lights. None of this fits with what an SDR display can show. To view the image properly, the dynamic range must be compressed to what the monitor can display.
The process of doing this is called tone mapping. There are many different algorithms for doing this, and choosing a particular one is a matter of taste. For vv, I went with the PBR Neutral operator, following Aras’ recommendation. The main reason for choosing it was that implementing it required only a little bit of math, instead of using a rather large lookup table. The other suggested tone mapping operators were Tony McMapface and AgX.
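Purely as an illustration of the general shape of such an operator (this is not the PBR Neutral curve that vv uses), the classic Reinhard formula compresses unbounded linear values into the 0 to 1 range:
// Simple Reinhard tone mapping, applied per channel to linear color values.
float Reinhard( float c )
{
    return c / ( 1.f + c );
}
PBR Neutral follows the same general idea, while being much more careful about not shifting hues and not desaturating the image.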
It is important to note the way image data is specified in EXR files. The color values are in linear space, corresponding to measurements of the amount of photons from the given source in the given time24. Double the number of photons, double the brightness, double the linear value.
But this is not how humans perceive light (or sound). The response of our visual system is more like an exponential curve, and that’s another big topic of gamma correction that I’m not going to get into. Long story short, the linear color values (tone mapping works with linear values) have to be converted to the “exponential” sRGB color space in order to be displayed correctly by the monitor.
With the tone mapping and the sRGB conversion, the image has a lot more detail, it is brighter in the dark areas, and the bright areas do not have their colors crushed.
How important is it to handle the HDR images correctly? Why bother with HDR when you have the SDR monitor anyway? Isn’t it more of a movie thing anyway?
Well, it is what you make it. These HDR movies have to be made somehow, and to properly view the HDR movie data or HDR still images, you need an HDR monitor and a proper HDR pipeline in your operating system. Or what if you want to watch an HDR video on youtube that works in the confines of your web browser? Or maybe the HDR content is embedded directly into the web page you are viewing, like below?25
As far as games are concerned, the 2005 release of Half-Life 2: Lost Coast was commonly known as the showcase of HDR rendering, and HDR has only become more widespread since then. You simply must render with high dynamic range to make things look realistic. The game’s rendered output is typically tone-mapped for display on an SDR monitor, but recent game releases allow you to bypass that step and deliver the HDR output directly to HDR monitors.
Linear to sRGB
The conversion from linear space to sRGB is commonly approximated with a $1/2.2$ power function. This is incorrect; the actual conversion is as follows:
$$ L' = \begin{cases} 12.92 * L & \text{if $L < 0.0031308$,} \\ 1.055 * L^{1/2.4} - 0.055 & \text{if $L \geq 0.0031308$.} \end{cases} $$
Have a look at the images below for the difference.
While the trees at the bottom of the image are easier to see in the approximated version, the camera noise and color banding are much more visible. The linear part of the transform was explicitly designed to minimize this. Also note how the correct transformation results in more saturated colors.
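In code, the piecewise sRGB transfer function above is short:
#include <cmath>

// Linear light to sRGB, per channel, input and output in the 0 to 1 range.
float LinearToSrgb( float l )
{
    if( l < 0.0031308f ) return 12.92f * l;
    return 1.055f * std::pow( l, 1.f / 2.4f ) - 0.055f;
}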
Vector images
I added support for vector images (SVG, PDF) in the second major release of vv. This required a different approach, because unlike raster images, which are loaded at their native size and then rescaled to fit the available space on the terminal, vector images have no innate size26 and should be rendered at the terminal size.
The libraries for both formats (librsvg, poppler) are quite easy to use, and both use the cairo library to do the actual drawing of the vector data.
Side adventure: GPL
The poppler library is licensed under the GPL. I don’t want my code to be GPLed.
As everyone on the internet will tell you, if you use a GPL library, your program must be GPL as well. If you don’t want that, there is the LGPL license. This is common knowledge.
Well, I have actually read the text of the GPL, and gave it some thought. And I call bullshit.
5. Conveying Modified Source Versions.
You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions:
(…)
c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. (…)
This is the text that gives GPL the “viral” property. Makes sense, right? If you do a work (your program) based on the Program (in this case, some GPL library), you must also license the entire work (your program) under the GPL. When you link your program to the GPL library, you are surely basing your program on that library, right?
That’s not how it works. Let’s go back to the license text.
0. Definitions.
(…)
To “modify” a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a “modified version” of the earlier work or a work “based on” the earlier work.
(…)
To “convey” a work means any kind of propagation that enables other parties to make or receive copies. (…)
What is the title of section 5 again? “Conveying Modified Source Versions”. Do you copy or modify anything when you link? Yes, you do! You need to get the function prototypes from the headers, and you need to know the name of the library you are linking against!
The problem is that these are statements of fact, and thus do not fall under copyright protection, which excludes them from being modifications. This is what the license text says. And it would be true even if the license did not explicitly say so.
It’s interesting to think about what is actually copyrightable. This text provides some commentary on the subject. Or think about phone books. You cannot copyright a list of names and phone numbers, because those are just simple facts. However, if you were to assign some kind of rating to each person (e.g., on a nice-rude axis), then that rating would be copyrightable. Notably, the library headers will contain comments, and these do fall under the copyright. Whether this is significant, if all you do is machine processing which ignores the comments, is a matter to discuss for the lawyers.
If that does not convince you, perhaps a quote from the US Supreme Court will:
Google’s copying of the Java SE API, which included only those lines of code that were needed to allow programmers to put their accrued talents to work in a new and transformative program, was a fair use of that material as a matter of law
Anyway, here is what vv does to link to the GPLed poppler library. This was done using only the documentation.
#include <dlfcn.h>
#include <cairo.h>   // for cairo_t
#include <glib.h>    // for GError

// Function pointer types matching the poppler-glib API, written from the docs.
typedef void*(*LoadPdf_t)( int, const char*, GError** );
typedef void*(*GetPage_t)( void*, int );
typedef void(*GetPageSize_t)( void*, double*, double* );
typedef void(*RenderPage_t)( void*, cairo_t* );

auto lib = dlopen( "libpoppler-glib.so", RTLD_LAZY );
LoadPdf_t LoadPdf = (LoadPdf_t)dlsym( lib, "poppler_document_new_from_fd" );
GetPage_t GetPage = (GetPage_t)dlsym( lib, "poppler_document_get_page" );
GetPageSize_t GetPageSize = (GetPageSize_t)dlsym( lib, "poppler_page_get_size" );
RenderPage_t RenderPage = (RenderPage_t)dlsym( lib, "poppler_page_render_for_printing" );
This is probably a bit exaggerated in how evasive it is, but it works. Go and argue that vv is a poppler derivative when you don’t even need to have poppler installed to build the program. Explain how someone could be bound by a license to something they do not use.
The other interesting legal document is Directive 2009/24/EC of the European Parliament and of the Council of 23 April 2009 on the legal protection of computer programs, which has some interesting things to say about interfaces and interoperability and how much say the copyright holder has in these things.
Finally, remember that the only legal text you are bound by (if at all) is the actual text of the GPL license. It does not matter what Stallman says he intended with the GPL, or what the GPL FAQ says should happen in some fantasy land.
Animation
The common knowledge is that if you want to have an animated image, you use the GIF format. While this may no longer be true, as most websites now deliver animated images as MP4 videos, those same websites will lie to you and tell you that it’s GIF this or GIF that, because that’s what people understand as an animated image.
It turns out that there are many other ways to deliver animated images. One example of this would be the APNG format, but I have never seen it out there in the wild. Another example is the WebP format. Unlike animated GIF or PNG images, which you can just load and get the first frame, animated WebP images will not load with the basic loader library. You have to use a different library, with a different API, none of which is well documented, just to display anything at all for some WebP images. Considering how disjointed all of this is, and that there’s no overall design in sight, the animation feature was probably done as a cool and quick hack, and there was no one around with enough awareness to prevent it from happening.
WebP is one of the I-frame image codecs. In a proper video codec, I-frames are used to represent a complete video frame without references to other frames. This allows you to seek in the video, or make sure the video quality is high enough when there’s a quick scene change. Video codecs will also use P and B frames to encode only the changes between the frames in the video, at a much lower bitrate cost. The downside of this delta compression is that these frames are based on the content of previous (or next) frames, so you can’t just quickly seek to any one of them, you have to decode all the previous frames, starting with an I-frame.
In essence, someone thought “hey, we have all these advanced image codecs for keyframes in videos, why don’t we use them to encode images?” and WebP was created, based on the VP8 I-frame format. HEIC images use HEVC/H.265 I-frame encoding. And AVIF images use AV1 I-frame encoding. There’s also the VVC image format based on H.266, but adoption seems to be very limited as of now.
To summarize, the WebP image format is based on parts of the VP8 video codec. The animation support in WebP is not done by adding back the parts that were removed from the video codec, the very thing it is now trying to replicate. It is done by concatenating multiple WebP images together in some sort of container format. Try to see the logic in that.
Side adventure: Kitty protocol revisited
Kitty provides proper support for animating images in the terminal. You would probably think that the animation would have to be done by repeatedly sending the individual image frames that would replace what you had on display before. This isn’t the case with the Kitty protocol. Instead, you define an animation by providing all of its frames, and then start the animation loop. The animation will continue to play even if your program is stopped, as in the video below.
One big practical drawback is that support for this type of animation is currently very limited. For example, the KDE Konsole or Ghostty don’t support it at all.
HDR everywhere
At this point I thought that there are many ways to store the HDR image data, and the only format where I have complete control over the pipeline is OpenEXR. It would be nice to use the same algorithms for all formats instead of having a random selection of output processing.
I started working on this with the RGBE or Radiance HDR format. The stb_image loader only does the linear to sRGB conversion if you want the typical 8 bits per color channel data, but it also provides a way to get the raw HDR floating point color values that you can put into the tone mapping pipeline. The results are shown below.
Then I looked for some HDR AVIF files,27 and this is what I got. The images were bland, desaturated, lacking contrast. They looked like this not only in my image viewer, but also in KDE Dolphin’s thumbnails, in Gwenview image viewer, in Okular PDF viewer, in GIMP. This is obviously wrong! Yet none of these fairly popular applications were able to get it right?
At this point @rfnix@piaille.fr chimed in with talk about “PQ-encoded”, “ICC profile” or maybe “CICP/nclx” not being handled properly. And what is that anyway, why should I know, I just want to load the RGB pixels from an image and have them look good on the screen. Yeah, it doesn’t work that way, as it turns out. The simple API the libraries give you is a trap that will not output the correct data in the more advanced cases.
Color management
Let’s ignore HDR completely for the time being. The color management problem applies just as well to standard SDR images, such as JPEGs. Take a look at the images below.28
If I were to show you just one of these pictures, any one of these pictures, you would say that it is a good picture. But each picture has different colors! Which one is right? How would you know? Maybe it doesn’t matter that much, the difference is quite small and only noticeable if you have another picture to compare it with…
The Rec. 709 one is what it should look like. It defines the color primaries and white point used in sRGB that your browser and display expect. The conversion would be an identity in this case.
And it really does matter a great deal, as you can see from the images below.
Not quite the same, right? But they are the same if you do your color management right!29
Okay, but wait, why does this thing even exist? Why would anyone save a photo in a different “color space”? What is a color space anyway? Aren’t you just going to show a picture on the screen anyway? To understand this, let’s take a look at the chromaticity diagram.
The colored area represents all the colors that a human can see. The triangular shapes on the graph correspond to the color ranges (gamut) that can be represented by different color spaces.30 As you can see, the sRGB gamut is quite limited in what it can show. Have you ever wondered why photos of a vibrant sunset sky always look a bit bland compared to the real thing? This is why. The monitor simply cannot reproduce the right colors.31
Working with a non-sRGB gamut gives you more room to maneuver in image processing, even if the resulting image will be displayed on an sRGB display. Or maybe you have a fancy screen that can display all those extra colors. Or maybe you are proofing images for printing, in which case you need to target the CMYK color space, which has a significant area outside the sRGB gamut (but within the Adobe RGB gamut).
As it turns out, the normalized 0 to 1 ranges of the red, green, and blue channels don’t mean anything by themselves. They are all defined in the context of a color space, and to get the right values to display on the screen, you need to convert these color values from the image color space to the display color space.
The reason why “nobody” seemingly cares32 about this is the same as for gamma correction. This is computer graphics, and it’s all highly subjective (e.g., which digital camera manufacturer has the best color reproduction? Fight!), it’s not obviously wrong in most cases, and it requires some in-depth knowledge to even realize things are not correct.
Here’s a much more dramatic example of what happens when you don’t care about color management. This is a test image that wants things to look wrong, if handled incorrectly.
Still not enough? Here’s a nice example of what can happen even with the venerable JPEGs.
And it gets even better. Here’s a specially crafted single PNG image that can be displayed in 10 different ways, depending on the correctness and quirks of the image processing software.
The color space definition is stored in images in the form of ICC color profiles. For the purpose of handling them, it is enough to know that they are just some binary objects of some size that you need to pass from the image to the color management library via some function calls.
To implement color management in vv33 I decided to use the Little CMS library. It’s quite easy to use, and already a dependency for a lot of software. First you need to create input and output color profiles, for example by using cmsOpenProfileFromMem() to load an ICC color profile embedded in the image and cmsCreate_sRGBProfile() to create an sRGB profile for display on the screen. Then you use both profiles to create a transform with cmsCreateTransform(), which can be applied to pixel data with cmsDoTransform(). The transform seems to be thread-safe, and the input and output buffers can be the same (if the data types match). At least I haven’t had any problems using it in such a configuration.
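A minimal sketch of that flow, assuming the ICC blob and an interleaved 8-bit RGBA buffer are already at hand:
#include <cstdint>
#include <lcms2.h>

// Convert pixels from the embedded ICC profile to sRGB, in place.
void ApplyIcc( const void* iccData, uint32_t iccSize, void* pixels, uint32_t pixelCount )
{
    auto in = cmsOpenProfileFromMem( iccData, iccSize );
    auto out = cmsCreate_sRGBProfile();
    auto transform = cmsCreateTransform( in, TYPE_RGBA_8, out, TYPE_RGBA_8, INTENT_PERCEPTUAL, 0 );
    cmsDoTransform( transform, pixels, pixels, pixelCount );
    cmsDeleteTransform( transform );
    cmsCloseProfile( in );
    cmsCloseProfile( out );
}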
HDR once again
Okay, so those not-quite-right-looking HDR images I showed you earlier are probably in some kind of HDR color space that needs to be handled properly, and that’s it, right?
Well, no, not really. That would be a reasonable assumption, but instead we have to write what feels like half of a video codec processing pipeline. Wait, what?
The image above was retrieved by calling the following libheif function, where we request the image to be decoded as RGB because that’s what we want to display on the screen in the end. Another reasonable assumption would be that the library will return an image that looks correct in such a case.
heif_image* img;
heif_decode_image( hnd, &img, heif_colorspace_RGB, heif_chroma_interleaved_RGBA, nullptr );
This gives us data that is 8 bits per channel, which is not what we want for an HDR image. It is possible to get the 16-bit RGB data with the heif_chroma_interleaved_RRGGBBAA_LE enum value, but it is all horrible and wrong and I can’t even remember the details of what I did to try to get it to work. Let’s not talk about it.34
The other available option is to request heif_colorspace_YCbCr. Let’s see what it is.
Luminance
We need to start with some definitions that can often be quite confusing. As mentioned earlier, we can measure the amount of photons hitting, say, the light-sensitive “pixels” of a camera. This linear measure is called luminance and is expressed in nits, or candelas per square meter. However, this measure is not really useful when we need to encode things as an electrical signal, which is just a voltage, such as between 0V and 5V. So for practical applications, the relative luminance is used, denoted as Y. Its value is calculated by dividing the luminance value by the maximum expected value of luminance, and it results in a 0 to 1 range that can be easily encoded.
If we have a color image, we can calculate the relative luminance of its pixels (i.e., convert it to grayscale) by assigning certain weights to the color channels, corresponding to how humans perceive color, and summing them. For correct results, the color channel values should be linear, not gamma compressed. Here are the weights as defined in Rec. 709 (other standards may use different weights):
$$ Y = 0.2126 * R + 0.7152 * G + 0.0722 * B $$
For the sake of practicality, and because getting things exactly right isn’t always the first priority, there is also luma, denoted as Y’. This is the weighted sum (the coefficients are the same as for relative luminance) of the gamma-compressed color channels. In reality, the terms luma and luminance are often mixed up, even in the specialist literature, and Y’ is often written simply as Y.
Television
The television signal was originally transmitted as a series of black and white images. Expanding the signal to include color, while maintaining compatibility with existing TV sets, required some smart thinking.
To make a long story short, the already existing black and white luma signal was supplemented with two color difference channels, calculated as follows:
$$ U = B' - Y' \\ V = R' - Y' $$
As you probably figured out, the ’ is an indication of gamma compression. The exact formula is a little bit different, and there are a couple of coefficients in there that I have left out, but none of that is important here.
This encoding is known as Y’UV, or simply YUV, and you can see what it looks like in the image below.
The human visual system is much more sensitive to brightness changes than to color changes, so it was possible to reduce the resolution of the color difference channels to reduce bandwidth requirements. This is known as chroma subsampling.
The following image shows some common subsampling patterns that you will find in use.
Chroma subsampling can be used in many different places. Image or video compression is one such place. Displaying on a TV or monitor is another. If your display subsamples the color signal, you probably will not see the degradation in photos or movies, because they may already have been subsampled, and otherwise the color distribution would be smoothed over many pixels. But if you were to look at a typical computer screen with a lot of text or fine iconography in such a display configuration, you would immediately notice that everything looks wrong. There are some examples of this on the RTINGS web site.
The YUV encoded image can be converted back to RGB using an inverse transform.
YCbCr
This is another color image encoding, but it is very similar to YUV. Technically it should be called Y’CbCr, but nobody cares, and you might as well encounter it as YCC, or simply YUV. Yeah. Just like YUV, YCbCr consists of the luma channel and two color channels that can be subsampled.
As mentioned earlier, libheif gives us the option to load the image with the heif_colorspace_YCbCr option, and this is the color space we get. Unlike the RGB output, where the color channels are interleaved together in the RGBA pattern35, YUV/YCbCr is a planar format, where each channel is stored disjointly in memory. This sort of makes sense, because chroma may be subsampled, and there would be no way to interleave such color planes.36 Well, at least we can request libheif to always return the planes as 4:4:4, even if they are subsampled in the image.37
The video image codecs store the plane values as 8, 10, or 12 bit integer values which can be normalized to 0 to 1 range floating point values for further processing. And while we’re at it, there’s another quirk to take care of.
While in digital signal processing the full signal range, e.g. 0 to 255 in the case of 8 bits, is not a problem, things are different in the analog world. Using the full range of values would cause the signal to overshoot or undershoot the allowable signal levels (see Gibbs phenomenon), so the range is limited by scaling and offsetting the Y to the 16 to 235 range and the UV to the 16 to 240 range. In order to perform the normalization correctly, it is necessary to check whether the image uses the full range or the limited range, and adjust accordingly if necessary.
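A sketch of that normalization for 8-bit planes; the limited-range divisors (219 for luma, 224 for chroma) fall straight out of the ranges above, and full range is assumed to simply map the whole 0 to 255 scale:
#include <cstdint>

// Normalize 8-bit Y'CbCr samples to Y in [0,1] and Cb/Cr centered around 0.
void Normalize( uint8_t y8, uint8_t cb8, uint8_t cr8, bool fullRange, float& y, float& cb, float& cr )
{
    if( fullRange )
    {
        y  = y8 / 255.f;
        cb = ( cb8 - 128.f ) / 255.f;
        cr = ( cr8 - 128.f ) / 255.f;
    }
    else
    {
        y  = ( y8 - 16.f ) / 219.f;     // 16..235 maps to 0..1
        cb = ( cb8 - 128.f ) / 224.f;   // 16..240 maps to roughly -0.5..0.5
        cr = ( cr8 - 128.f ) / 224.f;
    }
}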
HDR profiles
We discussed color profiles earlier, and libheif gives you the ability to query for an ICC color profile. But it also has a function to query for an nclx color profile. So which one should you use? Are the two related in any way? Can you use one instead of the other? Little CMS has no support for nclx, so how do you use it? The documentation is very light on this.
As it turns out, an image can have none of these, one of these, or both at the same time. And you have to deal with them accordingly, in different ways, even when they overlap.
We already covered the ICC profile, but what is an nclx profile? Let’s take a look at the corresponding libheif struct (with some fields omitted):38
struct heif_color_profile_nclx
{
enum heif_color_primaries color_primaries;
enum heif_transfer_characteristics transfer_characteristics;
enum heif_matrix_coefficients matrix_coefficients;
uint8_t full_range_flag;
};
The full_range_flag indicates whether the YUV values are limited or full range. Notably, if the flag is not present (i.e. if there is no nclx profile), the value of the flag is assumed to be false.
Color primaries
The color_primaries field rings a bell, but why is it an enum? Let’s see what it looks like.
enum heif_color_primaries
{
heif_color_primaries_ITU_R_BT_709_5 = 1,
heif_color_primaries_unspecified = 2,
heif_color_primaries_ITU_R_BT_470_6_System_M = 4,
heif_color_primaries_ITU_R_BT_470_6_System_B_G = 5,
heif_color_primaries_ITU_R_BT_601_6 = 6,
heif_color_primaries_SMPTE_240M = 7,
heif_color_primaries_generic_film = 8,
heif_color_primaries_ITU_R_BT_2020_2_and_2100_0 = 9,
heif_color_primaries_SMPTE_ST_428_1 = 10,
heif_color_primaries_SMPTE_RP_431_2 = 11,
heif_color_primaries_SMPTE_EG_432_1 = 12,
heif_color_primaries_EBU_Tech_3213_E = 22
};
Oh my… What do you even do with this? The answer can be found in the H.273 specification document (commonly known as “coding-independent code points”, or CICP).
Ah, so these are the color primaries coordinates for the red, green, and blue channels, along with the white point location. These are given in xy coordinates, just like in the chromaticity diagram shown earlier. It looks simple now! You can just drop this into the color management library and get a color profile from it.
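With Little CMS, such a profile can be built with cmsCreateRGBProfile(). A rough sketch, using the Rec. 2020 primaries and the D65 white point as an example, and a linear tone curve on the assumption that the transfer function is handled separately:
#include <lcms2.h>

// Build a color profile from chromaticity coordinates, here Rec. 2020.
cmsHPROFILE ProfileFromPrimaries()
{
    const cmsCIExyY white = { 0.3127, 0.3290, 1.0 };    // D65
    const cmsCIExyYTRIPLE prim = {
        { 0.708, 0.292, 1.0 },   // red
        { 0.170, 0.797, 1.0 },   // green
        { 0.131, 0.046, 1.0 }    // blue
    };
    cmsToneCurve* linear = cmsBuildGamma( nullptr, 1.0 );
    cmsToneCurve* curves[3] = { linear, linear, linear };
    auto profile = cmsCreateRGBProfile( &white, &prim, curves );
    cmsFreeToneCurve( linear );
    return profile;
}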
Matrix coefficients
Okay, but the color transformation is in RGB, yet we still have the YCbCr values, don’t we? How do we convert these, is there some sort of function to handle this? To know how to perform the YCbCr to RGB transformation, we need to take a look at the matrix_coefficients field, which can have the following values:
enum heif_matrix_coefficients
{
heif_matrix_coefficients_RGB_GBR = 0,
heif_matrix_coefficients_ITU_R_BT_709_5 = 1,
heif_matrix_coefficients_unspecified = 2,
heif_matrix_coefficients_US_FCC_T47 = 4,
heif_matrix_coefficients_ITU_R_BT_470_6_System_B_G = 5,
heif_matrix_coefficients_ITU_R_BT_601_6 = 6,
heif_matrix_coefficients_SMPTE_240M = 7,
heif_matrix_coefficients_YCgCo = 8,
heif_matrix_coefficients_ITU_R_BT_2020_2_non_constant_luminance = 9,
heif_matrix_coefficients_ITU_R_BT_2020_2_constant_luminance = 10,
heif_matrix_coefficients_SMPTE_ST_2085 = 11,
heif_matrix_coefficients_chromaticity_derived_non_constant_luminance = 12,
heif_matrix_coefficients_chromaticity_derived_constant_luminance = 13,
heif_matrix_coefficients_ICtCp = 14
};
Again, the correct way to handle all these options is explained in the H.273 specification. It is a very dense, math-heavy section that spans six pages, which I will spare you from looking at. And most of it isn’t even needed at all, because only a few of these values are used in practice, see the table below.39
Knowing this secret sauce makes things simple again. The 0 value indicates that the YCbCr values are actually GBR, and only require a channel swap to get RGB. The 1, 5, 6, and 9 values use the same equation, differing only in the coefficients used for multiplication. The 14 value does not operate in the YCbCr color space and is therefore irrelevant for libheif.
float a, b, c, d;
switch( matrix )
{
case Conversion::BT601:
    a = 1.402f;
    b = -0.344136f;
    c = -0.714136f;
    d = 1.772f;
    break;
case Conversion::BT709:
    a = 1.5748f;
    b = -0.1873f;
    c = -0.4681f;
    d = 1.8556f;
    break;
case Conversion::BT2020:
    a = 1.4746f;
    b = -0.16455312684366f;
    c = -0.57135312684366f;
    d = 1.8814f;
    break;
}
for( auto& px : image )
{
    px.r = px.Y + a * px.Cr;
    px.g = px.Y + b * px.Cb + c * px.Cr;
    px.b = px.Y + d * px.Cb;
}
Transfer characteristics
The last thing to cover in the nclx profile is the transfer_characteristics field.
enum heif_transfer_characteristics
{
heif_transfer_characteristic_ITU_R_BT_709_5 = 1,
heif_transfer_characteristic_unspecified = 2,
heif_transfer_characteristic_ITU_R_BT_470_6_System_M = 4,
heif_transfer_characteristic_ITU_R_BT_470_6_System_B_G = 5,
heif_transfer_characteristic_ITU_R_BT_601_6 = 6,
heif_transfer_characteristic_SMPTE_240M = 7,
heif_transfer_characteristic_linear = 8,
heif_transfer_characteristic_logarithmic_100 = 9,
heif_transfer_characteristic_logarithmic_100_sqrt10 = 10,
heif_transfer_characteristic_IEC_61966_2_4 = 11,
heif_transfer_characteristic_ITU_R_BT_1361 = 12,
heif_transfer_characteristic_IEC_61966_2_1 = 13,
heif_transfer_characteristic_ITU_R_BT_2020_2_10bit = 14,
heif_transfer_characteristic_ITU_R_BT_2020_2_12bit = 15,
heif_transfer_characteristic_ITU_R_BT_2100_0_PQ = 16,
heif_transfer_characteristic_SMPTE_ST_428_1 = 17,
heif_transfer_characteristic_ITU_R_BT_2100_0_HLG = 18
};
In practice, the options to cover are value 13, which is just the linear to sRGB transfer function we covered earlier, and the Perceptual Quantize (16) and Hybrid Log-Gamma (18) transfer functions used for HDR images.
The reason for these HDR transfer functions to exist is the amount of bits available. While the OpenEXR format can simply store the linear color values as 32-bit or 16-bit (IEEE 754-2008 half-precision) floating point numbers, the video codec formats only have 10 or 12 integer bits to represent the dynamic range, and as we already know, the human visual system’s response to light stimulus is exponential. This does not play well with the evenly spaced integer values.
The PQ transfer function, as the name implies, is based on the characteristics of human perception. It distributes the available integer values over a wide dynamic range, scaling from 0.0001 nits to 10000 nits. The same function is used to transfer each of the color channels.
#include <algorithm>
#include <cmath>

// SMPTE ST 2084 (PQ) EOTF: maps a normalized code value N to a luminance value.
float Pq( float N )
{
    // Constants as defined in the ST 2084 specification.
    constexpr float m1 = 0.1593017578125f;
    constexpr float m1inv = 1.f / m1;
    constexpr float m2 = 78.84375f;
    constexpr float m2inv = 1.f / m2;
    constexpr float c1 = 0.8359375f;
    constexpr float c2 = 18.8515625f;
    constexpr float c3 = 18.6875f;
    // N^(1/m2), then the rational expression raised to 1/m1, scaled to nits.
    const auto Nm2 = std::pow( std::max( N, 0.f ), m2inv );
    return 10000.f * std::pow( std::max( 0.f, Nm2 - c1 ) / ( c2 - c3 * Nm2 ), m1inv ) / 255.f;
}
The HLG transfer function is designed to be backwards compatible with SDR displays by combining an SDR gamma curve with an HDR logarithmic curve for color values above 1.0. The HDR images I found were all using the PQ characteristic, so the HLG support in vv is probably not implemented the right way. But what can you do if you can’t test it properly?
Where’s my profile?
Going through my test images, I noticed that one of them was a bit odd. It did not have the nclx profile, and according to the AVIF CICP spec, the default matrix coefficient should be 6, or Rec. 601. However, when inspecting the image with avifdec -i or ffprobe, the reported coefficient was 1, or Rec. 709. How can this be?
When you load the image with libheif, you get the image handle, on which you can call the heif_image_handle_get_nclx_color_profile() function to get the nclx profile. However, in some cases the profile may only be available after decoding the image, with the heif_image_get_nclx_color_profile() function. So you have to take this into account.
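A sketch of that fallback, with hnd and img being the handle and decoded image from the earlier snippet:
heif_color_profile_nclx* nclx = nullptr;
if( heif_image_handle_get_nclx_color_profile( hnd, &nclx ).code != heif_error_Ok )
{
    // Some files only expose the profile on the decoded image.
    heif_image_get_nclx_color_profile( img, &nclx );
}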
Putting it all together
To recap, the HDR processing pipeline with libheif is as follows:
- Load the image as YCbCr,
- Convert the integer plane values to floating point, possibly taking into account the limited range adjustment,
- Perform the conversion from YCbCr to RGB, following the matrix coefficients from the nclx profile,
- Do the color management, either by loading the ICC profile or by using the values from the nclx profile,40
- Linearize the color values by applying the nclx transfer function,
- Tone map the image,
- Convert from linear color space to sRGB.
That’s certainly something, isn’t it? But how does it look now? Check out the picture below to see how much better it is.
More examples can be seen in the following post.
It’s too slow
Having things render correctly is very nice, but it also made everything a bit slow. One of my test images has a resolution of 9504×6336, and it took over 8 seconds to load. This was unacceptable.
Multithreading Link to heading
The most obvious thing to do was to parallelize the image processing. All the calculations I do are local to a single pixel and don't depend on the neighbors, so the work should be embarrassingly parallel.
To manage the jobs, I used my TaskDispatch class, which I developed while working on etcpak. It is very simple and had better CPU usage than the other job dispatchers I compared it to a decade ago.
I started by parallelizing each of the processing steps separately to follow the serial way of doing things. Then, after a bit of profiling, I realized that this was quite inefficient. The test image I used was 9504 * 6336 * 4 channels * 4 bytes = 918 MB. So the parallelized color management function had to load that 918 MB, process it, and store the 918 MB back into memory. Then the parallelized transfer function had to load the 918 MB again, do the necessary math, and store the 918 MB again. And so on. No wonder the profiler showed that half of the execution time was in the memory instructions, since everything had to go through RAM all the time.
In order to make this better, I have completely changed the way the loader works. The parallelization is now done at the top level, and each job starts by loading a chunk of YCbCr data, then does all the necessary processing in a small temporary buffer, step by step, and finally writes the finished image section to the output bitmap. The chunks are small enough to fit into the cache, and the memory instructions no longer flare up when profiling.
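As a rough illustration of the idea, here is a sketch of the fused approach. It is not the actual vv code: the real loader works on cache-sized chunks with a small temporary buffer per step, while this collapses everything into a per-pixel callback just to keep the example short.

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Split the image into row ranges and run the whole pipeline on each range in
// one pass, so intermediate values never make a full-image round trip to RAM.
// PerPixel stands in for the fused YCbCr unpack -> RGB -> color management ->
// transfer function -> tone mapping -> sRGB chain.
template<typename PerPixel>
void ProcessImage( const float* src, float* dst, size_t width, size_t height, PerPixel&& perPixel )
{
    const size_t workers = std::max<size_t>( 1, std::thread::hardware_concurrency() );
    const size_t rowsPerJob = ( height + workers - 1 ) / workers;

    std::vector<std::thread> pool;
    for( size_t job = 0; job < workers; job++ )
    {
        const size_t rowBegin = job * rowsPerJob;
        const size_t rowEnd = std::min( height, rowBegin + rowsPerJob );
        if( rowBegin >= rowEnd ) break;
        pool.emplace_back( [=, &perPixel] {
            for( size_t r = rowBegin; r < rowEnd; r++ )
            {
                const float* in = src + r * width * 4;
                float* out = dst + r * width * 4;
                // every processing step happens here, back to back, while the
                // data is still hot in cache
                for( size_t x = 0; x < width; x++ ) perPixel( in + x * 4, out + x * 4 );
            }
        } );
    }
    for( auto& t : pool ) t.join();
}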
SIMD Link to heading
Going wide with calculations is another obvious thing to do. My compiler (clang) was already nice enough to vectorize the integer to float YCbCr conversion, along with the YCbCr to RGB routine. Using gcc may not give you the same results, but I don’t really want to bother doing something that has already been done for me.
Since, for reasons, we are still largely tied to the 2003 CPU baseline that only guarantees SSE2, vv is built with the -march=native compiler option to be able to use anything more modern. The other choice is to implement dynamic dispatch, so that a generically built binary can use features from more advanced architecture levels, depending on what the CPU supports. But I really don't want to make the code more complicated just because the whole industry is afraid to stop supporting CPUs released before 2013, when Haswell with AVX2 and FMA became available.
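For completeness, this is roughly what the dynamic dispatch route looks like with function multi-versioning in gcc (and reasonably recent clang); a minimal sketch, not something vv actually does.

#include <cstddef>

// The compiler emits one clone of the function per listed target and picks the
// best one when the program starts, based on what the CPU reports. The body
// stays plain scalar code; the avx2 and avx512f clones get auto-vectorized
// with the wider instruction sets.
__attribute__((target_clones("default", "avx2", "avx512f")))
void MultiplyAdd( float* dst, const float* a, const float* b, size_t n )
{
    for( size_t i = 0; i < n; i++ ) dst[i] += a[i] * b[i];
}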
The power function Link to heading
The PQ function requires an implementation of the std::pow() function to be available. Implementing it for wide processing requires some complex math that I did not want to deal with. That's something that should be readily available in some sort of a SIMD math library, right?
The first thing I tried was based on some code that has been around for a while. The SSE2 implementation, which does 3 pow() calculations at once,41 actually turned out to be slower than the serial one. Bummer.
Replacing the series of mul + add instructions with fma (fused multiply-add) tipped the scales in favor of SIMD, albeit slightly. Extending the implementation to AVX2 gave the expected 2× speedup, but the AVX512 version produced artifacts. I later found out that I somehow missed the correct rounding option when reading the SIMD documentation and used the wrong one, resulting in NaN output. But it didn’t matter. I was not happy with the performance of that version.
At this point, Aras chimed in again and recommended another implementation. I wrote a bit of code based on it, and somehow managed to make the same rounding option mistake again, which produced a rather psychedelic image. But I did not know that at that time.
Side adventure: how does it actually work? Link to heading
At this point, I reluctantly decided that maybe it’s time to stop trying random code from the web, and maybe I should actually understand how this should work. This is what I found out before I realized what the real problem was.
First, the power function can be expressed as a combination of the exp() and log() functions. This was already clear from prior research and from reading the SIMD implementations.
The first implementation I talked about above used this formula and stuck to using $e$ as the base. This unfortunate decision required some unnecessary back and forth transformations of the numbers. The exponential function and the logarithm can actually be combined together using any base, so the equation can be rewritten as:
$$ x^{y} = 2^{y * \log_2 x} $$
This is important when we consider how IEEE 754 floating-point numbers are encoded. To recap, a 32-bit float consists of (counting from the most significant bit):
- 1 bit sign, S,
- 8 bits exponent, E,
- 23 bits mantissa, M.
And the number value is calculated as follows.42
$$ (-1)^S * 1.M * 2^{E-127} $$
Now let's try to make some sense of this. The sign bit is irrelevant for us, the color values are never negative. Writing the exponent as $E-127$ is only necessary to decode the binary encoding of the number; we will simplify it to just $E$ and assume the bias has already been removed. The mantissa value written as $1.M$ is actually always in the 1 to 2 range. When this 1 to 2 range is multiplied by $2^E$, the range changes to, for example, 0.5 to 1, or 2 to 4, and so on.
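To make this concrete, take the value 6.5. It is stored with $S = 0$, a biased exponent of $E = 129$, and mantissa bits $M = 101\,0\ldots0$, so $1.M = 1.101_2 = 1.625$ and the decoded value is:
$$ (-1)^0 * 1.625 * 2^{129-127} = 1.625 * 4 = 6.5 $$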
With this knowledge in hand, let’s take a look at another mathematical identity.
$$ \log (a*b) = \log a + \log b $$
And guess what, the floating point numbers follow the $a*b$ formula. Let's put it into the equation.
$$ \log_2 (1.M * 2^E) = \log_2 1.M + \log_2 2^E $$
Which can be simplified as follows.
$$ \log_2 (1.M * 2^E) = \log_2 1.M + E $$
As it turns out, with base 2, all we have to do is calculate the logarithm of the mantissa and add the value of the exponent to get the logarithm of the whole floating-point number. And since the mantissa will always be in the range 1 to 2 (or 0.5 to 1 if the exponent is biased), we can approximate its logarithm quite accurately with a polynomial function.
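A scalar sketch of this trick might look as follows. The quadratic here is just a rough fit to illustrate the structure, not the polynomial used by any particular SIMD library, and it only handles positive, normal floats.

#include <bit>
#include <cstdint>

// Approximate log2(x): pull the exponent straight out of the bit pattern and
// approximate log2 of the mantissa (in the [1, 2) range) with a polynomial.
// Widening this to SIMD lanes is a mechanical transformation.
float ApproxLog2( float x )
{
    const uint32_t bits = std::bit_cast<uint32_t>( x );
    const int e = int( ( bits >> 23 ) & 0xFF ) - 127;   // unbiased exponent
    // keep the mantissa bits, force the exponent field to 127 (i.e. 2^0),
    // which reinterprets 1.M as a float in [1, 2)
    const float m = std::bit_cast<float>( ( bits & 0x007FFFFF ) | 0x3F800000 );
    // quadratic approximation of log2(m) on [1, 2)
    const float p = ( -0.34484843f * m + 2.02466578f ) * m - 1.67487759f;
    return float( e ) + p;
}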
The exponential function is also approximated by a polynomial and some similar tricks, but I have not had a close look at the implementation details.
PQ transfer function Link to heading
Once the power function was taken care of, implementing the PQ transform as a SIMD function was fairly straightforward. The scalar version of the code took 1.56 s to run. The SSE 4.1 + FMA version took only 528 ms, which is about a 3× speedup. Nice!
Widening the SIMD code was trivial because there is no crosstalk between lanes, which would otherwise require costly shuffles and permutes. The AVX2 version ran at 276 ms, while the AVX512 version required only 145 ms.
Finally, by enabling parallelization, which we have already discussed, the run time was further reduced to a mere 31 ms. That’s only 2% of the original 1.56 seconds!
Tone mapping Link to heading
Both the PBR Neutral operator and the linear to sRGB transfer function use conditional execution. Since both functions operate on packs of 4, 8, or even 16 values, I simply compute both sides of the condition and then merge the results depending on the outcome of the pipelineable comparison operation.
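As an example of the pattern, here is a sketch of the linear to sRGB transfer in SSE 4.1. The wide pow is replaced with a scalar stand-in to keep the snippet self-contained; it is not the actual vv code.

#include <cmath>
#include <immintrin.h>

// Stand-in for a proper 4-wide pow, only here to make the example compile.
static __m128 Pow4( __m128 x, float e )
{
    alignas( 16 ) float v[4];
    _mm_store_ps( v, x );
    for( float& f : v ) f = std::pow( f, e );
    return _mm_load_ps( v );
}

// Linear to sRGB for 4 values at once: compute both branches of the piecewise
// function and merge them with a mask instead of branching per value.
__m128 LinearToSrgb( __m128 v )
{
    const __m128 lo = _mm_mul_ps( v, _mm_set1_ps( 12.92f ) );
    const __m128 hi = _mm_sub_ps(
        _mm_mul_ps( _mm_set1_ps( 1.055f ), Pow4( v, 1.f / 2.4f ) ),
        _mm_set1_ps( 0.055f ) );
    // lanes with v <= 0.0031308 take the linear segment, the rest take the power curve
    const __m128 mask = _mm_cmple_ps( v, _mm_set1_ps( 0.0031308f ) );
    return _mm_blendv_ps( hi, lo, mask );
}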
Results Link to heading
Before the optimizations the test image took more than 8 seconds to load. With the SIMD code paths and multithreading in place, the same image can be loaded in just 0.8 seconds.
Decoding the YCbCr planes with libheif takes 248 ms. The YCbCr packing, normalization, color management, PQ transfer, and tone mapping pipeline takes 355 ms. Resizing the image takes 111 ms, compressing with zlib 19 ms, and writing to the terminal 127 ms.
In conclusion Link to heading
It turns out that viewing a random cat picture from the internet is quite an involved process. The number of things you have to cover makes you quite interdisciplinary.
On the other hand, it's not that hard. The original implementation of the loaders for my compositor took me maybe two or three days. It took about 5 days to release the first version of vv, basically starting from scratch with only the image loader library available; during those 5 days I even extended it with support for OpenEXR, TIFF and RAW images, along with tone mapping.
And yet I still can’t open a BC or ETC texture in virtually any other image viewer, somehow.
The next steps for vv would be to check the color management pipeline for correctness. Support for color profiles will need to be added to a lot of formats that still do not have it. I may have already announced that the JPEG XL loader supports it correctly, but it turns out that the code paths are not executed. It may need a similar treatment to libheif. Oh well.
Loading animated GIFs would be nice, but the libgif interface looks like it was designed for 16-bit computers and their limitations. And it probably was. Mouse cursors need to be ported to the general image loader interface. Adding support for Windows .ico icons would also be nice.
I will most likely want to add support for a few more tone mapping operators.
Then maybe do some work on the graphical viewer to make it usable and reliable enough. After all, vv is just a thin wrapper over the image loading library, which does all the heavy lifting.
Time will tell.
Of course, the more experience you have, the more you realize how little you actually know. ↩︎
It’s kind of obvious in retrospect, but in the long run it’s easier to find a thing that irritates you and solve it than it is to do some “for fun” side project that just withers away after a while. ↩︎
Both AVIF and HEIC use the HEIF container format. HEIF files can also store some other image formats. Sometimes HEIC and HEIF are used interchangeably, which only adds to the confusion. ↩︎
Yes, you can probably list a non-trivial number of programs that do support these formats, but then it turns out that you can’t set your desktop wallpaper to a JPEG XL file, or that the HEIC image from your iPhone can’t be uploaded to your web gallery, or a chat application, and at that point, why even bother when you end up having to do lossy recompression? ↩︎
Another interesting thing to consider is that we have been conditioned over the decades to expect certain types of compression artifacts in JPEG images. Modern codecs have the same data size constraints as before, they just make different decisions about where to allocate their bits. Combining these two observations, it’s not surprising that the blocky and ringing JPEG image may be perceived as better quality than a modern codec that eliminates the block structure and ringing, at the cost of more blurry smooth areas, such as fine skin or hair detail. ↩︎
There’s a small glimmer of hope here. All major browsers do support the AVIF image format. Most of the images on this blog are actually AVIF, as the size reduction is significant, with barely noticeable quality reduction. ↩︎
I’ve seen some very spotty support for BC images in a small selection of programs. It was not usable in practice due to the number of issues encountered. The best way to view these images is either to use RenderDoc (which is a GPU debugging tool, not an image viewer), or a GUI to a texture compression utility, such as PVRTexTool, which, again, is not really an image viewer. ↩︎
I did some quick tests of different compression algorithms on PNG data, and the results are not pretty for zlib. People will complain that AI and cryptocurrency will burn the planet, but when you have simple solutions to reduce both the amount of image data transferred on the web and the time it takes to decode it, suddenly no one cares. It baffles me to no end. ↩︎
See also: Arithmetic encoding of JPEG files. It was a patent-encumbered algorithm, so no one wanted to use it. Nowadays, some programs can output arithmetically encoded JPEGs, but then you hit a wall, because something something, 10% size reduction is not worth it, Mozilla marks it as wontfix. “[Adding support] means we’ve created a fragmented market, since we’ll load images that no other browser does.” You can’t make this shit up. ↩︎
When a C library says it’s “about 60,000 lines of code you were going to write anyway,” consider how much of that budget is spent on things you take for granted in other languages: freeing memory, correct destruction of objects, checking for failed allocations (operator new throws), cleaning up already initialized data if something fails in the middle of construction, manual string manipulation, custom implementations of basic data structures like lists, vectors, and so on. ↩︎
Have you ever wondered why customizing the mouse cursor used to be common on Windows, but rarely seen on Linux? Yes, I’m sure it’s because Windows is for kids and playing games, unlike Linux, which is only for professional workstations. ↩︎
Sure, the 30 year old Windows 95 pointer arrow cursor uses only two colors, white and black, so it can be 1 bit per pixel. ↩︎
Why would you want to scale the mouse cursor up? There may be several reasons, and some of them may even be valid use cases. In this case, however, it’s completely a “Wayland is stupid” thing. The way it handles DPI scaling, to be exact, but let’s not go on yet another adventure with that. ↩︎
Nope, still not going into the Wayland side adventure. ↩︎
The browser will likely resize this image, which will distort how it’s presented. Moiré patterns may appear. Even at 1:1, your monitor may not be able to display this correctly, for example, there may be some sort of color tint, or the pixels may look like they are walking. What I’m showing here are random vertical and horizontal lines, and one diagonal line. ↩︎
I’ve been told on several occasions, by several people, that I see and hear more than other people. Which seems a bit absurd to me, since it’s the difference between doing something exactly right and just doing it so it seems okay. Like the whole “DPI scaling should only be integer because you can just downscale the 200% image to get the fractional scaling” super-broken approach, which was the only one Wayland supported until very recently. What you got out of it was shimmer all over the place, which you can see on the vertical lines in the video below. ↩︎
And what if you’re using a font that doesn’t provide those Unicode blocks? Well, I don’t really care. Why would you want to selectively use modern software if your whole setup is antiquated? ↩︎
It was a bit surprising to me that to get the terminal size you have to issue an ioctl with the TIOCGWINSZ op code, which is handled by the kernel. Well, I guess terminals are really an ancient piece of technology. ↩︎
To resize the image, I simply used the stb_image_resize library. No need to reinvent the wheel here, although it might be useful to find something that can be parallelized in the future. ↩︎
I was never sure when such advanced features would be available with the terminal I use. Most programs default to the safe 16-color palette, and you may need to set your TERM variable to xterm-256color or something (how would that work in a terminal that is not xterm?). Then, when you run tmux, it may have different capabilities than your terminal of choice. It’s a hard thing to figure out. ↩︎
ISO-8613-6 specifies that the separator here should be :, not ;, and that there is an additional Pi color space identifier (which should be ignored), but the version with the ; separators is used for Konsole compatibility, for some reason. ↩︎
The compression step is optional. It feels like compressing the data would only add unnecessary delay, only to have the terminal immediately decompress it. But it turns out that writing data to the terminal can be unreasonably slow, so reducing the amount of data you have to transfer far outweighs the cost of compression. ↩︎
It is very fun when some program (cough tmux cough) goes against the spec and gives an invalid response, passing the optional parameter indicating sixel support in a message whose grammar doesn’t allow optional parameters. Sigh. ↩︎
As always, there is a lot of nuance here. Humans do not have a linear response to color stimuli, and the low energy photons are more frequent than the high energy ones (this is literally quantum physics), and my head is starting to hurt now. Look, all I need to know is that the maximum brightness of an HDR display is measured in nits, or candelas per square meter, and everything else is probably related to that. ↩︎
HDR support in browsers is very poor at the time of this article’s writing, and it’s likely that the image you see here is incorrect in one way or another. Which, when you think about it, is just another argument for making the HDR pipeline work as intended everywhere, isn’t it? ↩︎
Technically, they may have a “100%” size, and that can have its uses, but let’s not get sidetracked. ↩︎
Surprisingly, it’s not that easy to find good examples of HDR images on the web, especially if they were to cover all the possible combinations of things you have to do to process them correctly. I was able to find a reasonable amount of AVIF HDR images, but HEIC HDR is nowhere to be found. ↩︎
These are HDR images, and they don’t look quite right because they weren’t processed to be displayed on the SDR screen. It does not matter in this case. ↩︎
Okay, maybe not exactly the same, because you have to move and quantize already quantized values. But close enough to look the same! ↩︎
There are three color channels: red, green, and blue. The color primaries define where on the graph each channel is at its maximum. Since there are three such points, the resulting shape is a triangle. The location at which all of the channels are at their maximum level is defined by the white point. ↩︎
HDR displays make this look better, not only because they can display a higher dynamic range, but also because they work with a wider color gamut. ↩︎
Here’s how the CIE XYZ image is loaded by both GIMP and Krita: ↩︎
It’s still work in progress at this moment. ↩︎
Okay, fine, if you really insist, I might remember something. Like how there is no way to determine how many bits are used in RGB color channels. Or how the color profile is all wrong when RGB data is requested. No, it does not work. ↩︎
Using 32-bit RGBA even for three-channel RGB images makes sense because, at the cost of increased memory usage, we make the pixels easy to address and load by the CPU. With RGBA, the entire pixel can be loaded or stored with a single memory instruction, and the power-of-two alignment allows for easy (or even automatic) conversion to SIMD processing. With RGB, each channel would have to be loaded or stored separately, and that is not what we want to do. ↩︎
As a general rule. Existing implementations may use a semi-planar format where the Y data is stored in one plane and the UV components are stored interleaved in the second plane. Another example is the “YUV2” format, which packs 4:2:2 data with a pattern of Y0 U Y1 V. ↩︎
This saves us another headache about where the sample point is placed and how to interpolate the values correctly. ↩︎
This has its origins in a rather old Quicktime thing from 1999. ↩︎
The color primaries are provided by libheif in the heif_color_profile_nclx struct as values you can use directly, without having to write out a table for each enum value, so this was not relevant before. ↩︎
As the color primaries in the nclx profile are only available as a limited selection of predefined values, it is possible that nclx contains only the best possible approximation of the ICC profile data. ↩︎
Four, actually, but the alpha channel has to go through its own path, so the result is discarded. ↩︎
For a more understandable explanation, you can refer to the Game Engine Black Book: Wolfenstein 3D by Fabien Sanglard. ↩︎