Instead of encoding a protobuf directly into the thread-local buffer, we could record a simple struct for each event and generate the protobuf only when flushing to the file.
This would reduce overhead in the tracing functions but make flushes slower, because the encoding work moves from each tracing call to the flush. It is a trade-off rather than a clear win.