The previous post about Tracy was quite interesting to some people. Let’s build on that and look at some more fascinating details. This time, we will take a tour of various internal systems to understand how data is processed to be presented to users.
Client side
All of the events that Tracy displays come from the profiled application. So it seems reasonable to begin our journey by looking at what happens with the markup you insert into your program, and how it is transformed into data that can be sent to the profiler.
At its core, the client part of Tracy is essentially a queue. More specifically, there are actually two queues. One is a lock-free concurrent queue, and the other is serialized.
Lock-free queue
The lock-free queue, a modified version of Cameron Desrochers’ excellent moodycamel::ConcurrentQueue, allows for instantaneous collection of events as they happen. This is a “Multiple Producer” queue, which means that any two threads can write to it at the same time without having to synchronize with each other. This characteristic is very important if you want to minimize the performance impact on the profiled code1.
To see how the markup you write interacts with this queue, let’s take a look at how plots are handled. To draw a plot, the manual instructs you to use the TracyPlot macro. This macro is expanded as follows:
#define TracyPlot(name, val) tracy::Profiler::PlotData(name, val)
So basically you are doing some function calls2 here. Here’s what the function you’re calling looks like:
inline void PlotData(const char* name, int64_t val)
{
    TracyLfqPrepare(QueueType::PlotDataInt);
    MemWrite(&item->plotDataInt.name, (uint64_t)name);
    MemWrite(&item->plotDataInt.time, GetTime());
    MemWrite(&item->plotDataInt.val, val);
    TracyLfqCommit;
}
Both TracyLfqPrepare and TracyLfqCommit3 are macros used to start and finish adding a new queue item. Their existence is just an implementation detail. I won’t bore you with a listing of what they do (you can check it out directly in the code), but the gist of it is that both of these macros share a certain common state. If they were function calls instead, that state would have to be reloaded from memory at the place of TracyLfqCommit. This is avoided by storing the state on the stack in the first macro and referencing those variables in the second macro4.
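To make this concrete, here is a minimal sketch of how such a macro pair could look. The names used here (GetThreadProducerToken, BeginEnqueue, EndEnqueue) are placeholders, not Tracy’s actual internals; the point is only that the first macro declares locals at the call site and the second one reuses them, so nothing has to be reloaded from memory in between.
// Sketch only: the prepare macro leaves 'token' and 'item' live on the stack,
// and the commit macro uses them directly.
#define LfqPrepareSketch( _type ) \
    auto token = GetThreadProducerToken(); \
    auto item = token->BeginEnqueue(); \
    MemWrite( &item->hdr.type, _type )
#define LfqCommitSketch token->EndEnqueue( item )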
The MemWrite template function is just a wrapper over memcpy that uses type information to get the amount of data to copy without having to specify it manually.
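For reference, such a wrapper can be as simple as the sketch below (close in spirit to what Tracy does, though not copied verbatim from its sources):
#include <string.h>

template<typename T>
inline void MemWrite( void* ptr, T val )
{
    // sizeof(T) provides the copy size; memcpy is fine even for unaligned destinations
    memcpy( ptr, &val, sizeof( T ) );
}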
Queue items
The item pointer is created on the stack by the TracyLfqPrepare macro. It points to a data structure that we can fill directly in the queue. It contains the following fields:
struct QueueItem
{
    QueueHeader hdr;
    union
    {
        QueueThreadContext threadCtx;
        QueueZoneBegin zoneBegin;
        ...
        QueuePlotDataInt plotDataInt;
        ...
    };
};
The QueueHeader field is essentially the following enumeration (ignore the mismatched types, it’s just another implementation detail of no importance):
enum class QueueType : uint8_t
{
    ZoneText,
    ZoneName,
    Message,
    ...
    PlotDataInt,
    ...
};
There are currently 111 different types of queue items, most of which represent some kind of event5.
With the header indicating the type, we know which data structure immediately follows in the QueueItem. Here are some examples6:
struct QueueThreadContext
{
    uint32_t thread;
};

struct QueueZoneBegin
{
    int64_t time;
    uint64_t srcloc; // ptr
};

struct QueuePlotDataInt
{
    uint64_t name; // ptr
    int64_t time;
    int64_t val;
};
The data in QueueItem is packed tightly with the #pragma pack(1) directive. For this reason, MemWrite must be used to set the data values in the above code. Some platforms that should be supported may not be able to handle unaligned access to variables, but memcpy will always work.
All queue items are 32 bytes in size, with padding at the end if necessary. This may not seem like much, but it has been enough for all needs so far. Allocation of items in the queue is done in blocks and is thus amortized. Blocks in which all items have been dequeued are reused, reducing memory allocation pressure7.
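This size expectation is the kind of thing that is easy to guard with a compile-time check, along these lines (a hypothetical assertion, shown for illustration):
static_assert( sizeof( QueueItem ) == 32, "queue items are expected to be 32 bytes" );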
Item data
The PlotData function writes three values into the QueuePlotDataInt data structure (the header has already been set in the TracyLfqPrepare call).
The name field, stored as a uint64_t, is actually a pointer to the string containing the plot name. This pointer is used as a unique identifier on the server side. The client will not attempt to send the contents of the pointed-to string on its own. If the server does not know what the pointer refers to, it must explicitly ask the client for the value8.
The time field contains a value returned by the GetTime function. This can be the current value of the Time Stamp Counter register on x86, a CLOCK_MONOTONIC_RAW reading, or any other high-resolution clock available on the machine, as long as the selected time source is used consistently.
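Sketched out, the selection of a time source can look roughly like this (simplified to two options; the real logic in Tracy is more elaborate, and the fallback shown here assumes a Linux-style clock_gettime):
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__)
#  include <x86intrin.h>
#endif

inline int64_t GetTimeSketch()
{
#if defined(__x86_64__)
    return int64_t( __rdtsc() );    // raw Time Stamp Counter reading, in TSC ticks
#else
    struct timespec ts;
    clock_gettime( CLOCK_MONOTONIC_RAW, &ts );    // monotonic clock, immune to NTP adjustments
    return int64_t( ts.tv_sec ) * 1000000000ll + ts.tv_nsec;
#endif
}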
The last field, val, contains the plot value set by the user.
And that’s really all there is to it. While markup implementation for zones or locks may seem more complicated, it does essentially the same thing. It’s just wrapped up in a class or two to make it more convenient to use.
Serialized queue
While a lock-free queue is quite fast, it has some drawbacks. For example, the order of items is only preserved within a thread. It is not possible to see how events are ordered across two or more threads. Sometimes, however, this information is necessary for things to work properly. For example, you may have one thread allocating memory and a second thread freeing that memory. Under special circumstances9 it is possible that the first thread always gets the same memory address for each new allocation. If this stream of allocation and free events, all carrying the same address, is sent without a global order, the profiler won’t really be able to understand what is happening. While the free events may eventually arrive on the server side, the timestamp accuracy may not be enough to properly distinguish one event from another, and things go downhill pretty quickly from there. Note that since there is no synchronization between threads, you may also get the free events for allocations you don’t know about yet! The same applies to handling lock events10.
To enable serialization, a second queue is used that is protected by a mutex. Here’s an example of how it can be used, where we want to notify the server that we are waiting for some lock in the client application:
inline void BeforeLock()
{
    auto item = Profiler::QueueSerial();
    MemWrite(&item->hdr.type, QueueType::LockWait);
    MemWrite(&item->lockWait.thread, GetThreadHandle());
    MemWrite(&item->lockWait.id, m_id);
    MemWrite(&item->lockWait.time, Profiler::GetTime());
    Profiler::QueueSerialFinish();
}
Note that in this case we also need to store the thread handle in the queue item. This is something specific to the serialized queue, and we will take a closer look at it later.
Main profiler loop
Now that we know how queue items are added, we need to take a look at how they are dequeued. During initialization, Tracy spools up its own worker thread where most of the processing is done11.
The main loop here periodically checks whether there are any outstanding items in either queue. If there are, it starts processing the data. The QueueItem structures, already tightly packed, are copied to a temporary buffer, without the unused padding at the end. When the buffer is full, it is compressed using the LZ4 algorithm and then sent to the server12.
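A simplified sketch of that send path is shown below. Tracy actually uses the streaming LZ4 interface, so that matches can also reference previously sent buffers; the one-shot call and the Socket placeholder are used here only to illustrate the idea.
#include <lz4.h>

struct Socket { void Send( const void* data, int len ); };    // placeholder for the real network socket

// buf contains up to 64 KB of tightly packed queue items
void SendBufferSketch( Socket& sock, const char* buf, int size )
{
    char dst[LZ4_COMPRESSBOUND( 64*1024 )];
    const int csz = LZ4_compress_default( buf, dst, size, sizeof( dst ) );
    sock.Send( &csz, sizeof( csz ) );    // length prefix so the server knows how much to decompress
    sock.Send( dst, csz );
}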
Preprocessing data
The LZ4 compression algorithm is extremely fast and can be very efficient if the data is prepared just right. To understand this, you need to be familiar with the principles of data compression. To make things simple, the compressed data stream basically contains two types of tokens:
- Literals, which are just sequences of bytes to output.
- Matches, which reference a byte sequence that was previously decompressed13.
Now let’s take a closer look at the plot item example we talked about earlier. The following data is required for each item:
- Plot name pointer.
- Plot value.
- Timestamp.
The name pointer, while random, is of no concern to us. It is unique to each plot and will appear repeatedly in the same form. Usually the number of plots you use is rather limited.
The plot value is completely up to the user, and it’s impossible to process it in any meaningful way because we simply don’t know what the next value might be.
The timestamp is much more interesting. The time resolution we are interested in is nanoseconds, and 2³² ns is only 4.29 seconds. The TSC register base frequency is typically even higher than 1 GHz, so its lower 32 bits can roll over in about 1.3 seconds. This is basically random, incompressible data that needs to be sent as a literal.
To make things better, we can calculate deltas between consecutive timestamps and send that value instead. This process is then reversed on the server and the original values are restored. The data we have to compress will now contain many more zero bytes, which will increase the ratio of matches to literals, thus increasing compression efficiency14.
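A minimal sketch of this transform, with assumed names (Tracy keeps separate reference times for the differently ordered streams):
#include <stdint.h>

// Client side: send the difference from the previously sent timestamp.
static int64_t refTimeSend = 0;
int64_t EncodeTimestamp( int64_t t )
{
    const int64_t delta = t - refTimeSend;
    refTimeSend = t;
    return delta;    // small value, mostly zero bytes in its upper part
}

// Server side: accumulate the deltas to restore the original values.
static int64_t refTimeRecv = 0;
int64_t DecodeTimestamp( int64_t delta )
{
    refTimeRecv += delta;
    return refTimeRecv;
}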
While I do not have the measurements for the savings in the network stream, I can tell how the same transformation affected the size of the traces saved to disk:
aa.tracy (0.4.1) {18987 KB} -> aa.tracy (0.4.2) {10140 KB} 53.40% size change
asset-new.tracy (0.4.1) {97163 KB} -> asset-new.tracy (0.4.2) {78402 KB} 80.69% size change
big.tracy (0.4.1) {2246014 KB} -> big.tracy (0.4.2) {943083 KB} 41.99% size change
darkrl.tracy (0.4.1) {31818 KB} -> darkrl.tracy (0.4.2) {15897 KB} 49.96% size change
long.tracy (0.4.1) {1583550 KB} -> long.tracy (0.4.2) {1225316 KB} 77.38% size change
q3bsp-mt.tracy (0.4.1) {1414323 KB} -> q3bsp-mt.tracy (0.4.2) {949855 KB} 67.16% size change
self.tracy (0.4.1) {399648 KB} -> self.tracy (0.4.2) {197713 KB} 49.47% size change
So I’d say it was pretty rewarding.
Thread optimization
The lock-free queue used by Tracy stores data in per-thread buffers. When dequeuing items, data from multiple threads can get mixed up. We often need to know which thread the event came from, so some queue items need to store the thread identifier. We saw this in the serialized queue example.
With a simple change to the lock-free queue, we can restrict each dequeue operation to take place within a single thread. We can also extract the buffer thread identifier stored internally in the queue and use it to notify the server of the current thread context. This is done with the ThreadContext event, which indicates that all subsequent events will be within the specified thread. When we do things this way, we no longer need to store the thread identifier in the queue items. This makes enqueuing a tiny bit faster, because there’s one less field to write. It reduces the size of the data stream, because there’s no need to include the thread id over and over again. Finally, it frees up space in the queue items for additional data if it’s needed.
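Sketched with assumed names, the dequeue side of this idea looks roughly as follows: whenever the dequeued data starts coming from a different thread’s buffer, a ThreadContext item is emitted first, and everything after it is attributed to that thread.
static uint32_t s_lastThread = 0;    // thread the most recently dequeued items belong to

void NotifyThreadContextSketch( uint32_t producerThread )
{
    if( producerThread == s_lastThread ) return;    // still the same thread, nothing to announce
    s_lastThread = producerThread;
    QueueItem item;
    MemWrite( &item.hdr.type, QueueType::ThreadContext );
    MemWrite( &item.threadCtx.thread, producerThread );
    AppendToSendBuffer( &item, sizeof( QueueHeader ) + sizeof( QueueThreadContext ) );    // assumed helper
}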
This optimization does not make sense in the serialized queue, which by design contains intermixed items from multiple threads.
Fat queue items
Some queue events have a “fat” variant. For example, consider the following message for submitting a lock name:
struct QueueLockName
{
    uint32_t id;
};

struct QueueLockNameFat : public QueueLockName
{
    uint64_t name; // ptr
    uint16_t size;
};
In this case, only the QueueLockName event will ever appear on the wire. The QueueLockNameFat message can only be used internally by the client for storage of the lock name string data, which is allocated separately. When the client is about to send the QueueLockName event, it will decode the “fat” appendix and send a special type of message indicating a string transfer, SingleStringData. This is immediately followed by the raw string data. The server stores this data in a temporary buffer for use when the actual lock name event arrives.
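A hedged sketch of this split is shown below; the helper names and the union member are assumed, and the real client code is organized differently, but the order of operations is the important part.
void SendLockNameSketch( const QueueLockNameFat& fat )
{
    // The string payload travels first: a SingleStringData message followed by the raw bytes.
    SendSingleString( (const char*)fat.name, fat.size );    // assumed helper
    // Then the slim event goes out; the server matches it with the string it just buffered.
    QueueItem item;
    MemWrite( &item.hdr.type, QueueType::LockName );
    MemWrite( &item.lockName.id, fat.id );
    AppendToSendBuffer( &item, sizeof( QueueHeader ) + sizeof( QueueLockName ) );    // assumed helper
}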
Server queries
As mentioned earlier, the client part of Tracy uses pointers to data (e.g., plot names or source locations) as identifiers. It does not keep track of what the server knows. The server must send a query to the client when it receives an identifier it hasn’t seen before. Handling these queries is part of the profiler’s main loop. The currently used server queries are listed below:
enum ServerQuery : uint8_t
{
    ServerQueryTerminate,
    ServerQueryString,
    ServerQueryThreadString,
    ServerQuerySourceLocation,
    ServerQueryPlotName,
    ServerQueryFrameName,
    ServerQueryParameter,
    ServerQueryFiberName,
    // Items above are high priority. Split order must be preserved. See IsQueryPrio().
    ServerQueryDisconnect,
    ServerQueryCallstackFrame,
    ServerQueryExternalName,
    ServerQuerySymbol,
    ServerQuerySymbolCode,
    ServerQuerySourceCode,
    ServerQueryDataTransfer,
    ServerQueryDataTransferPart
};
Most names are self-explanatory. It’s worth noting that these queries implement a primitive form of Quality of Service. This is necessary to prevent long-running queries from blocking access to data that needs to be retrieved quickly and that is essential for the profiler to remain usable.
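The comment in the enumeration above hints at how this works: because the high-priority entries are grouped before the split point, the priority check can be a single comparison, roughly like this:
inline bool IsQueryPrio( ServerQuery type )
{
    return type < ServerQuery::ServerQueryDisconnect;    // everything before the split point is high priority
}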
The server uses the following data structure to send queries. The number of queries is relatively small, so this data stream is not compressed.
struct ServerQueryPacket
{
    ServerQuery type;
    uint64_t ptr;
    uint32_t extra;
};
The ptr field usually contains the data pointer that the client sent as an identifier. The extra field is rarely used. Currently only the ServerQuerySymbolCode query uses it to send the symbol size, as it is not kept by the client.
The ServerQueryDataTransfer and ServerQueryDataTransferPart messages are seldom used to send long strings of data15. Currently, this functionality is used to transfer the names of files that the server wants the client to send back.
Server side
To recap the previous post, the server part of Tracy has to handle large amounts of data. It does this by tightly packing the data structures and using custom-sized data types, such as Int24, Int48, or the 6-byte pointers.
Custom data types
Special care has been taken to make these types work as efficiently as possible. For example, the following code is used to convert an Int48 value back to int64_t:
uint8_t m_val[6];
int64_t Val() const
{
    int16_t hi;
    memcpy(&hi, m_val+4, 2);
    uint32_t lo;
    memcpy(&lo, m_val, 4);
    return (int64_t( uint64_t(hi) << 32)) | lo;
}
This results in the following assembly on clang, gcc, and msvc:
movsx rax, word ptr [rdi + 4]
mov ecx, dword ptr [rdi]
shl rax, 32
or rax, rcx
ret
A naive implementation16 of the above code would look like this:
int64_t Val() const
{
    int64_t v = 0;
    memcpy(&v, m_val, 6);
    return v;
}
While clang is able to compile this to similarly optimal code, both gcc and msvc spill onto the stack for some reason:
mov eax, DWORD PTR [rdi]
mov QWORD PTR [rsp-8], 0
mov DWORD PTR [rsp-8], eax
movzx eax, WORD PTR [rdi+4]
mov WORD PTR [rsp-4], ax
mov rax, QWORD PTR [rsp-8]
ret
Vectors
Tracy implements a custom variant of std::vector that takes only 11 bytes. This is quite an improvement over the 24 bytes that the STL version usually takes.
template<typename T>
class Vector
{
    short_ptr<T> m_ptr;
    uint32_t m_size;
    uint8_t m_capacity : 7;
    uint8_t m_magic : 1;
};
The capacity is expressed in powers of two. The magic bit indicates whether the vector contains pointers or the raw data, as discussed in the previous post.
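The short_ptr member is the 6-byte pointer type mentioned earlier. A sketch of the underlying idea (simplified, with assumed member names): on x86-64, canonical user-space addresses fit in the lower 48 bits, so the top two bytes never have to be stored.
#include <stdint.h>
#include <string.h>

template<typename T>
class short_ptr_sketch
{
    uint8_t m_ptr[6];    // low 48 bits of the pointer

public:
    short_ptr_sketch( T* ptr ) { uint64_t v = (uint64_t)ptr; memcpy( m_ptr, &v, 6 ); }

    operator T*() const
    {
        uint64_t v = 0;
        memcpy( &v, m_ptr, 6 );    // the upper bytes of v stay zero
        return (T*)v;
    }
};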
This vector can optionally be extended to a SortedVector variant that knows if its contents are sorted or not. If not, it can be sorted on demand using the in-place merge sort algorithm, which takes advantage of the fact that some (often significant) part of the vector is already sorted.
Event dispatch
The main worker loop reads data from the network17 and dispatches the processing of events depending on the value of QueueHeader. While the events are being processed, a global lock on the profiler data is held. The UI rendering thread also competes for this lock.
Dispatch is basically a big switch that calls the appropriate handler functions.
void Worker::Process(const QueueItem& ev)
{
    switch(ev.hdr.type)
    {
    case QueueType::ThreadContext:
        ProcessThreadContext(ev.threadCtx);
        break;
    case QueueType::ZoneBegin:
        ProcessZoneBegin(ev.zoneBegin);
        break;
    ...
    }
}
Processing zones
The following code (slightly simplified) is executed when a QueueZoneBegin event is received:
void Worker::ProcessZoneBegin(const QueueZoneBegin& ev)
{
    auto zone = m_slab.Alloc<ZoneEvent>();
    CheckSourceLocation(ev.srcloc);
    const auto start = TscTime(RefTime(m_refTimeThread, ev.time));
    zone->SetStartSrcLoc(start, ShrinkSourceLocation(ev.srcloc));
    zone->SetEnd(-1);
    zone->SetChild(-1);
    NewZone(zone);
}
To provide memory for storing zone data, the custom block allocator m_slab is used. The CheckSourceLocation function checks if the ev.srcloc identifier (a client-side pointer) has been seen before. If not, an empty entry is created for it and a ServerQuerySourceLocation is sent to the client. The eventual response message is used to fill in the blanks in the just created source location entry. The function and filename in this response are themselves pointer identifiers, so additional queries may need to be sent by the server18.
The timestamp is converted from the delta to the actual value (RefTime) and then converted to the nanosecond range (TscTime)19. The 64-bit pointer source location identifier is compressed to 16 bits in the ShrinkSourceLocation function20. Both values are stored in the zone by the SetStartSrcLoc function21. Then the end timestamp is set to an invalid value and the child vector index is set to indicate that the zone has no children.
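As the footnote explains, the start timestamp and the source location index share a single uint64_t. Sketched out (the field name and exact bit layout are my assumptions, not taken from the sources), the packing looks something like this:
#include <stdint.h>

struct ZoneEventSketch
{
    uint64_t _start_srcloc;    // assumed field: 48-bit start time + 16-bit source location index

    void SetStartSrcLoc( int64_t start, int16_t srcloc )
    {
        // a single register store sets both values at once
        _start_srcloc = ( uint64_t( start ) << 16 ) | uint16_t( srcloc );
    }
};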
The NewZone function does a little bit of accounting to take care of the nested structure of the zones:
void Worker::NewZone(ZoneEvent* zone)
{
    auto td = GetCurrentThreadData();
    const auto ssz = td->stack.size();
    if(ssz == 0)
    {
        td->stack.push_back(zone);
        td->timeline.push_back(zone);
    }
    else
    {
        auto& back = td->stack.data()[ssz-1];
        if(!back->HasChildren())
        {
            back->SetChild(int32_t(m_data.zoneChildren.size()));
            m_data.zoneChildren.push_back(Vector<short_ptr<ZoneEvent>>(zone));
        }
        else
        {
            const auto backChild = back->Child();
            m_data.zoneChildren[backChild].push_back_non_empty(zone);
        }
        td->stack.push_back_non_empty(zone);
    }
}
We start by getting the current thread data, td. This was set with the QueueThreadContext event. Then, if the zone stack of the thread is empty, we add the zone as a top-level element on the timeline. If there are already zones on the stack, we get the last zone, back, and add the currently handled zone to the parent’s vector of children. Note that child vectors are stored externally, so two more bytes can be saved in each zone by using only a 32-bit child index22. Finally, regardless of the branch taken, the current zone is put on the stack23.
Some code has been omitted here for brevity, most notably the code for caching temporary child vectors. Why are the vectors temporary? Because it is more efficient to use the “magic” bit and store data directly in the vector, without the pointer indirection (note that the data type is short_ptr<ZoneEvent>). But this only works reasonably well up to a certain size of the vector.
Ending zones
The data structure for announcing the end of a zone is very simple.
struct QueueZoneEnd
{
    int64_t time;
};
It is handled in the following way:
void Worker::ProcessZoneEnd(const QueueZoneEnd& ev)
{
    auto td = GetCurrentThreadData();
    auto zone = td->stack.back_and_pop();
    const auto timeEnd = TscTime(RefTime(m_refTimeThread, ev.time));
    zone->SetEnd(timeEnd);
}
Again, a lot of code is not shown here, but the most important things are. We need to do two things: set the zone end time and remove it from the stack. And that’s it.
Wrapping things up
So there you have it, a complete description of how the markup you add to the profiled application ends up as usable data in the Tracy UI. We’ve already seen how this data is searched for things to draw in the previous post, so go read that if you haven’t already.
1. The impact is so small that some developers ship their games with Tracy always enabled24.
2. Macros are used to allow complete removal of profiler code from the application with just one compile-time define.
3. “Lfq” stands for lock-free queue.
4. You might think that the problem could be solved by simply inlining the two queue handling functions instead. The compiler will then see that the same memory location is being read twice and reduce it to just one read. Unfortunately, it is not that simple. There may be some intermediate memory writes that the compiler has to assume may be aliased to the memory location being read. So it has to read the value a second time, even if we are absolutely sure that it will not change. Hence the need for macro trickery.
5. The order in this enumeration is important because it divides things into three groups, some of which may require special processing. For example, you may need to free some memory that contains the string you are sending, or you may need to perform timestamp processing. Determining which group the item belongs to by doing a simple numeric comparison on the group boundary in a dispatch function is much faster than using a big switch.
6. These are simplified to hide some of the implementation details involved. Things such as QueuePlotDataInt enabling just one of the possible data transfer options, with float and double being the others. A small amount of inheritance is used to extract the common data fields, which is omitted here.
7. For memory allocation purposes, Tracy uses a fast custom allocator, rpmalloc.
8. Remember that the plot name identifier (pointer) is sent with each new plot item. It is only when the first item arrives that the server does not know the pointed-to content.
9. Which may happen more often than you think.
10. Initially, lock data was captured asynchronously, and the timeline was reconstructed during rendering on the server. However, it turned out that on some platforms, due to the accuracy of the timestamps, it was not possible to determine which thread actually got the lock. This was fixed with serialization, but the lock events are still handled by the server in the old way. It can be rewritten to be orders of magnitude faster.
11. There may be a few other worker threads that are used for more specialized, potentially blocking tasks that should not interfere with the execution of the main profiler thread. These include: the system tracing thread, the image compression thread, and the thread used to process symbol information.
12. It is faster to compress the data and send a small number of bytes than it is to send a large number of bytes that are not compressed.
13. In the case of the LZ4 algorithm, the data window from which matches can be copied is 64 KB.
14. Quantization of timestamp values should also be considered. It’s quite difficult for the hardware to measure things at the nanosecond level, especially if there is a need to cover multi-core, multi-package, or multi-socket systems. For example, current AMD hardware will report TSC values quantized to 10 ns. This measurement contradicts the hardware manual, but has been confirmed by AMD engineers as an implementation compromise. Quantized time deltas are much easier to encode than full timestamps, where the quantization is essentially hidden, unless we need to send two identical timestamps.
15. Each request can only transfer 12 bytes of data. While this is inefficient, it makes it easier to calculate how much data can be placed in the network buffer25.
16. This is not an equivalent because it doesn’t do the sign extension, but it’s close enough to show what the problem is.
17. A separate thread is used to read data from the network and decompress it.
18. Note that you can have multiple functions in a single file. The filename identifier will be identical in each source location and only needs to be queried once. The function names are unique, and a separate query is required for each of them.
19. As mentioned earlier, the client may be taking readings in the TSC register base clock domain, so a conversion may be necessary. For performance reasons this is only done on the server.
20. The term “compressed” can be a bit misleading. You may now be thinking of some bit shuffling patterns that would somehow make it possible to reduce 64 bits to a quarter of the size without the possibility of conflicts. That would be a red herring. To “decompress” 16-bit values back to 64 bits, you would use a uint64_t[65536] array. The “compression” is done by simply keeping track of the array occupancy, with an additional unordered map to avoid duplicates.
21. It so happens that both the start timestamp and the 16-bit source location are stored in a single uint64_t field. It is better to pack these two in registers and store them with a single write than to use two functions that would have to read the existing value from memory, update only some of the bits, and then write it back. Implementation details, again.
22. Zones without children are actually in the majority. It was rather counterintuitive to me when I found out that only about 15% of the zones have children.
23. Another implementation detail: the push_back_non_empty method of Vector does not check if the vector is empty, since we already checked that. This saves a branch.
24. It may have to be enabled by some sort of command-line parameter or internal console switch, but the code is there, ready to do the job.
25. When you send data over the network, the kernel will only allow a limited number of bytes in the send buffer. If this limit is exceeded, the program execution is blocked. The other side must send a notification that the previous data has been received before more data can be sent over the wire. So if both sides of the connection wanted to send large amounts of data, they would lock each other out, each waiting for the other side to read something. For this to work properly, at least one side must limit the amount of data it wants to send.