That was difficult to read, smelt very AI assisted though the message was worthwhile, it could've been shorter and more to the point.
A few things I've been thinking about recently:
- we have authentication everywhere in our stack, so I've started including the user id on every log line. This makes getting a holistic view of what a user experienced much easier.
- logging an error as a separate log line to the request log is a pain. You can filter for the trace, but it makes it hard to surface "show me all the logs for 5xx requests and the error associated" - it's doable, but it's more difficult than filtering on the status code of the request log
- it's not enough to just start including that context, you have to educate your coworkers that it's now present. I've seen people making life hard for themselves because they didn't realize we'd added this context
A post on this topic feels incomplete without a shout-out to Charity Majors - she has been preaching this for a decade, branded the term "wide events" and "observability", and built honeycomb.io around this concept.
Also worth pointing out that you can implement this method with a lot of tools these days. Both structured Logs or Traces lend itself to capture wide events. Just make sure to use a tool that supports general query patterns and has rich visualizations (time-series, histograms).
> A post on this topic feels incomplete without a shout-out to Charity Majors
I concur. In fact, I strongly recommend anyone who has been working with observability tools or in the industry to read her blog, and the back story that lead to honeycomb. They were the first to recognize the value of this type of observability and have been a huge inspiration for many that came after.
Could you drop a few specific posts here that you think are good for someone (me) who hasn't read her stuff before? Looks like there's a decade of stuff on her blog and I'm not sure I want to start at the very beginning...
I've learned more from Charity about telemetry than from anyone else. Her book is great, as are her talks and blog posts. And Honeycomb, as a tool, is frankly pretty amazing
> Logs were designed for a different era. An era of monoliths, single servers, and problems you could reproduce locally.
I worked with enterprise message bus loggers in semiconductor manufacturing context wherein we had thousands of participants on the message bus. It generated something like 300-400 megabytes per hour. Despite the insane volume we made this work really well using just grep and other basic CLI tools.
The logs were mere time series of events. Figuring out the detail about specific events (e.g. a list of all the tools a lot visited) required writing queries into the Oracle monster. You could derive history from the event logs if you had enough patience & disk space, but that would have been very silly given the alternative option. We used them predominantly to establish a casual chain between events when the details are still preliminary. Identifying suspects and such. Actually resolving really complicated business usually requires more than a perfectly detailed log file.
The presentation is fantastic and I loved the interactive examples!
Too bad that all of this effort is spent arguing something which can be summarised as "add structured tags to your logs"
Generally speaking my biggest gripe with wide logs (and other "innovative" solutions to logging) is that whatever perceived benefit you argue for doesn't justify the increased complexity and loss of readability.
We're throwing away `grep "uid=user-123" application.log` to get what? The shipping method of the user attached to every log? Doesn't feel an improvement to me...
P.S. The checkboxes in the wide event builder don't work for me (brave - android)
Horrid advice at the end about logging every error, exception, slow request, etc if you are sampling healthy requests.
Taking slow requests as an example, a dependency gets slower and now your log volume suddenly goes up 100x. Can your service handle that? Are you causing a cascading outage due to increased log volumes?
Recovery is easier if your service is doing the same or less work in a degraded state. Increasing logging by 20-100x when degraded is not that.
I do not see how logging could bottleneck you in a degraded state unless your logging is terribly inefficient. A properly designed logging system can record on the order of 100 million logs per second per core.
Are you actually contemplating handling 10 million requests per second per core that are failing?
Yea that was my thought too. I like the idea in principle, but these magic thresholds can really bite you. It claims to be P(99), probably off some historical measurement, but that's only true if it's dynamically changing. Maybe this could periodically query the OTEL provider for the real number to at least limit the time window of something bad happening.
My impression was that you would apply this filter after the logs have reach your log destination, so there should be no difference for your services unless you host your own log infra, in which case there might be issues on that side. At least that's how we do it with Datadog because ingestion is cheap but indexing and storing logs long term is the expensive part.
One thing this is missing: Standardization and probably the ECS' idea of "related" fields.
A common problem in a log aggregation is the question if you query for user.id, user_id, userID, buyer.user.id, buyer.id, buyer_user_id, buyer_id, ... Every log aggregation ends up being plagued by this. You need standard field names there, or it becomes a horrible mess.
And for a centralized aggregation, I like ECS' idea of "related". If you have a buyer and a seller, both with user IDs, you'd have a `related.user.id` with both id's in there. This makes it very simple to say "hey, give me everything related to request X" or "give me everything involving user Y in this time frame" (as long as this is kept up to date, naturally)
I actually wrote my bachelors on this topic, but instead of going the ECS route (which still has redundant fields in different components) I went in the RDF direction. That system has shifted towards more of a middleware/database hybrid (https://github.com/triblespace/triblespace-rs). I always wonder if we actually need logging if we had more data-oriented stacks where the logs fall out as a natural byproduct of communication and storage.
I always wondered why we didnt have some kind of fuzzy english words search regexes/tool, that is robust to keyboard typing mistakes, spelling mistake, synonyms, plural, conjugation etc.
I agree with this statement: "Instead of logging what your code is doing, log what happened to this request." but the impression I can't shake is that this person lacks experience, or more likely has a lot of experience doing the same thing over and over.
"Bug parts" (as in "acceptable number of bug parts per candy bar") logging should include the precursors of processing metrics. I think what he calls "wide events" I call bug parts logging in order to emphasize that it also may include signals pertaining to which code paths were taken, how many times, and how long it took.
Logging is not metrics is not auditing. In particular processing can continue if logging (temporarily) fails but not if auditing has failed. I prefer the terminology "observables" to "logging" and "evaluatives" to "metrics".
In mature SCADA systems there is the well-worn notion of a "historian". Read up on it.
A fluid level sensor on CANbus sending events 10x a second isn't telling me whether or not I have enough fuel to get to my destination (a significant question); however, that granularity might be helpful for diagnosing a stuck sensor (or bad connection). It would be impossibly fatiguing and hopelessly distracting to try to answer the significan question from this firehose of low-information events. Even a de-noised fuel gauge doesn't directly diagnose my desired evaluative (will I get there or not?).
Does my fuel gauge need to also serve as the debugging interface for the sensor? No, it does not. Likewise, send metrics / evaluatives to the cloud not logging / observables; when something goes sideways the real work is getting off your ass and taking a look. Take the time to think about what that looks like: maybe that's the best takeaway.
I espouse a "grand theory of observability" that, like matter and energy, treats logs, metrics, and audits alike. At the end of the day, they're streams of bits, and so long as no fidelity is lost, they can be converted between each other. Audit trails are certainly carried over logs. Metrics are streams of time-series numeric data; they can be carried over log channels or embedded inside logs (as they often are).
How these signals are stored, transformed, queried, and presented may differ, but at the end of the day, the consumption endpoint and mechanism can be the same regardless of origin. Doing so simplifies both the conceptual framework and design of the processing system, and makes it flexible enough to suit any conceivable set of use cases. Plus, storing the ingested logs as-is in inexpensive long-term archival storage allows you to reprocess them later however you like.
Saying they are all the same when no fidelity is lost is missing the point. The only distinction between logs, traces, and metrics is literally what to do when fidelity is lost.
If you have insufficient ingestion rate:
Logs are for events that can be independently sampled and be coherent. You can drop arbitrary logs to stay within ingestion rate.
Traces are for correlated sequences of events where the entire sequence needs to be sampled to be coherent. You can drop arbitrary whole sequences to stay within ingestion rate.
Metrics are pre-aggregated collections of events. You pre-limited your emission rate to fit your ingestion rate at the cost of upfront loss of fidelity.
If you have adequate ingestion rate, then you just emit your events bare and post-process/visualize your events however you want.
> Logs were designed for a different era. An era of monoliths, single servers, and problems you could reproduce locally. Today, a single user request might touch 15 services, 3 databases, 2 caches, and a message queue. Your logs are still acting like it's 2005.
Logs are fine. The job of local logs is to record the talk of a local process. They are doing this fine. Local logs were never meant to give you a picture of what's going on some other server. For such context, you need a transaction tracing that can stitch the story together across all processes involved.
Usually, looking at the logs at right place should lead you to the root cause.
One of the points the author is trying to make (although he doesn't make it well, and his attitude makes it hard to read) is that logs aren't just for root-causing incidents.
When properly seasoned with context, logs give you useful information like who is impacted (not every incident impacts every customer the same way), correlations between component performance and inputs, and so forth. When connected to analytical engines, logs with rich context can help you figure out things like behaviors that lead to abandonment, the impact of security vulnerability exploits, and much more. And in their never-ending quest to improve their offerings and make more money, product managers love being able to test their theories against real data.
The problem statement in this article sounds weird. I thought in 2025 everyone logs at least thread id and context id (user id, request id etc), and in microservice architecture at least transaction or saga id. You don’t need structured logging, because grep by this id is sufficient for incident investigation. And for analytics and metrics databases of events and requests make more sense.
But does it? Or is it bad logging, or excessive logging, or unsearchable logs?
A client of mine uses SnapLogic, which is a middleware / ETL that's supposed run pipelines in batch mode to pass data around between systems. It generates an enormous amount of logs that are so difficult to access, search and read that they may as well don't exist.
We're replacing all of that with simple Python scripts that do the same thing and generate normal simple logs with simple errors when something's truly wrong or the data is in the wrong format.
Terse logging is what you want, not an exhaustive (and exhausting) torrent of irrelevant information.
AI slop blogvert. The first example is disingenuous btw. Everyone these days uses requestIDs to be able to query all log lines emanated by a single request, usually set by the first backend service to receive the request and then propagated using headers (and also set in the server response).
There isn't anything radical about his proposed solutions either. Most log storage can be set with a rule where all warning logs or above can be retained, but only a sample of info and debug logs.
Nothing in this article is something that most major software companies don't already do.
Comments in this thread reinforce that if you haven't worked at a Big Tech company, there is a big chunk of random knowledge that you are missing that would round you out as a software engineer. Nothing in TFA should surprise someone with "FAANG" on their resume.
The "key insight" is also flawed. The reason why we log at every step is because sometimes your request never completes and it could be for 1000 reasons but you really need to know how far it got in your system. Logging only at the end is happy path thinking.
You might also need different systems for low-cardinality, low-latency production monitoring (where you want to throw alerts quickly and high cardinality fields would just get in the way), and medium to long term logging with wide events.
Also if you're going to log wide events, for the sake of the person querying them after you, please don't let your schema be an ad hoc JSON dict of dicts, put some thought into the schema structure (and better have a logging system that enforces the schema).
Tangential, but I wonder if the given example might be straying a step too far? Normally we want to keep sensitive data out of logs, but the example includes a user.lifetime_value_cents field. I'd want to have a chat with the rest of the business before sticking something like that in logs.
Gonna go on a tangent here. Why the single purpose domain? Especially since the author has a blog. My blog is full of links to single post domains that are no longer.
From what I gather: This is referring to Web sites or other HTTP applications which are internally implemented as a collection of separate applications/ micro-services?
> Logs were designed for a different era. An era of monoliths, single servers, and problems you could reproduce locally. Today, a single user request might touch 15 services, 3 databases, 2 caches, and a message queue. Your logs are still acting like it's 2005.
If a user request is hitting that many things, in my view, that is a deeply broken architecture.
> If a user request is hitting that many things, in my view, that is a deeply broken architecture.
If we want it or not, a lot of modern software looks like that. I am also not a particular fan of building software this way, but it's a reality we're facing. In part it's because quite a few services that people used to build in-house are now outsourced to PaaS solutions. Even basic things such as authentication are more and more moving to third parties.
The reason we end up with very complex systems I don't think is because of incentives between "managers and technicians". If I were to put my finger to it, I would assume it's the very technicians who argued themselves into a world where increased complexity and more dependencies is seen as a good thing.
The framing is not, though. Why does it have to sound so dramatic and provocative? It’s insulting to its audience. Grumpiness, in the long term, is a career-limiting attitude.
I've generally found that structured logs that include a correlation ID make it quite easy to narrow down the general area or exact cause of problems. Usually (in enterprise orgs) via Splunk or Datadog.
Where I've had problems it's usually been one of:
There wasn't anything logged in the error block. A comment saying "never happens" is often discovered later :)
Too much was logged and someone mandated dialing the logging down to save costs. Sigh.
A new thread was started and the thread-local details including the correlation ID got lost, then the error occurred downstream of that. I'd like better solutions for that one.
Edit: Incidentally a correlation ID is not (necessarily) the same thing as a request ID. An API often needs to allow for the caller making multiple calls to achieve an objective; 5 request IDs might be tied to a single correlation ID.
I see more and more blog posts that contain interactive elements. Despite the general enshittification of the average blog and the internet, this feels like a 'modern' touch that actually adds something valuable to the sufficient ad-free no-popups old blog style.
This article is attacking a strawman. It makes up terrible logs and then says they are bad. Even if this was a single monolith the logs still don't include even something like a thread id, to avoid mixing different requests together.
That doesn't sound like a good plan. You're coupling logging with business logic. I don't want to have to think if i change a debug string am i going to break something.
Slapping on OpenTelemetry actually will solve your problem.
Point #1 isn't true, auto instrumentation exists and is really good. When I integrate OTel I add my own auto instrumentors wherever possible to automatically add
lots of context. Which gets into point #2.
Point #2 also isn't true. It can add business context in a
hierarchal manner and ship wide events. You shouldn't have to tell every span all the information again. Just where it appears naturally the first time.
Point #3 also also isn't true because OTel libs make it really annoying to just write a log message and very strongly pushes you into a hierarchy of nested context managers.
Like the author's ideal
setup is basically using OTel
with Honeycomb. You get the querying and everything. And unlike rawdogging wide events all your traces are connected, can span multiple services and do timing for you.
A few things I've been thinking about recently:
- we have authentication everywhere in our stack, so I've started including the user id on every log line. This makes getting a holistic view of what a user experienced much easier.
- logging an error as a separate log line to the request log is a pain. You can filter for the trace, but it makes it hard to surface "show me all the logs for 5xx requests and the error associated" - it's doable, but it's more difficult than filtering on the status code of the request log
- it's not enough to just start including that context, you have to educate your coworkers that it's now present. I've seen people making life hard for themselves because they didn't realize we'd added this context
Also worth pointing out that you can implement this method with a lot of tools these days. Both structured Logs or Traces lend itself to capture wide events. Just make sure to use a tool that supports general query patterns and has rich visualizations (time-series, histograms).
I concur. In fact, I strongly recommend anyone who has been working with observability tools or in the industry to read her blog, and the back story that lead to honeycomb. They were the first to recognize the value of this type of observability and have been a huge inspiration for many that came after.
Yep, I'm a fan.
I worked with enterprise message bus loggers in semiconductor manufacturing context wherein we had thousands of participants on the message bus. It generated something like 300-400 megabytes per hour. Despite the insane volume we made this work really well using just grep and other basic CLI tools.
The logs were mere time series of events. Figuring out the detail about specific events (e.g. a list of all the tools a lot visited) required writing queries into the Oracle monster. You could derive history from the event logs if you had enough patience & disk space, but that would have been very silly given the alternative option. We used them predominantly to establish a casual chain between events when the details are still preliminary. Identifying suspects and such. Actually resolving really complicated business usually requires more than a perfectly detailed log file.
Too bad that all of this effort is spent arguing something which can be summarised as "add structured tags to your logs"
Generally speaking my biggest gripe with wide logs (and other "innovative" solutions to logging) is that whatever perceived benefit you argue for doesn't justify the increased complexity and loss of readability.
We're throwing away `grep "uid=user-123" application.log` to get what? The shipping method of the user attached to every log? Doesn't feel an improvement to me...
P.S. The checkboxes in the wide event builder don't work for me (brave - android)
Taking slow requests as an example, a dependency gets slower and now your log volume suddenly goes up 100x. Can your service handle that? Are you causing a cascading outage due to increased log volumes?
Recovery is easier if your service is doing the same or less work in a degraded state. Increasing logging by 20-100x when degraded is not that.
Are you actually contemplating handling 10 million requests per second per core that are failing?
Besides, a little local on-disk buffering goes a long way, and is cheap to boot. It’s an antipattern to flush logs directly over the network.
A common problem in a log aggregation is the question if you query for user.id, user_id, userID, buyer.user.id, buyer.id, buyer_user_id, buyer_id, ... Every log aggregation ends up being plagued by this. You need standard field names there, or it becomes a horrible mess.
And for a centralized aggregation, I like ECS' idea of "related". If you have a buyer and a seller, both with user IDs, you'd have a `related.user.id` with both id's in there. This makes it very simple to say "hey, give me everything related to request X" or "give me everything involving user Y in this time frame" (as long as this is kept up to date, naturally)
"Bug parts" (as in "acceptable number of bug parts per candy bar") logging should include the precursors of processing metrics. I think what he calls "wide events" I call bug parts logging in order to emphasize that it also may include signals pertaining to which code paths were taken, how many times, and how long it took.
Logging is not metrics is not auditing. In particular processing can continue if logging (temporarily) fails but not if auditing has failed. I prefer the terminology "observables" to "logging" and "evaluatives" to "metrics".
In mature SCADA systems there is the well-worn notion of a "historian". Read up on it.
A fluid level sensor on CANbus sending events 10x a second isn't telling me whether or not I have enough fuel to get to my destination (a significant question); however, that granularity might be helpful for diagnosing a stuck sensor (or bad connection). It would be impossibly fatiguing and hopelessly distracting to try to answer the significan question from this firehose of low-information events. Even a de-noised fuel gauge doesn't directly diagnose my desired evaluative (will I get there or not?).
Does my fuel gauge need to also serve as the debugging interface for the sensor? No, it does not. Likewise, send metrics / evaluatives to the cloud not logging / observables; when something goes sideways the real work is getting off your ass and taking a look. Take the time to think about what that looks like: maybe that's the best takeaway.
I espouse a "grand theory of observability" that, like matter and energy, treats logs, metrics, and audits alike. At the end of the day, they're streams of bits, and so long as no fidelity is lost, they can be converted between each other. Audit trails are certainly carried over logs. Metrics are streams of time-series numeric data; they can be carried over log channels or embedded inside logs (as they often are).
How these signals are stored, transformed, queried, and presented may differ, but at the end of the day, the consumption endpoint and mechanism can be the same regardless of origin. Doing so simplifies both the conceptual framework and design of the processing system, and makes it flexible enough to suit any conceivable set of use cases. Plus, storing the ingested logs as-is in inexpensive long-term archival storage allows you to reprocess them later however you like.
If you have insufficient ingestion rate:
Logs are for events that can be independently sampled and be coherent. You can drop arbitrary logs to stay within ingestion rate.
Traces are for correlated sequences of events where the entire sequence needs to be sampled to be coherent. You can drop arbitrary whole sequences to stay within ingestion rate.
Metrics are pre-aggregated collections of events. You pre-limited your emission rate to fit your ingestion rate at the cost of upfront loss of fidelity.
If you have adequate ingestion rate, then you just emit your events bare and post-process/visualize your events however you want.
Logs are fine. The job of local logs is to record the talk of a local process. They are doing this fine. Local logs were never meant to give you a picture of what's going on some other server. For such context, you need a transaction tracing that can stitch the story together across all processes involved.
Usually, looking at the logs at right place should lead you to the root cause.
When properly seasoned with context, logs give you useful information like who is impacted (not every incident impacts every customer the same way), correlations between component performance and inputs, and so forth. When connected to analytical engines, logs with rich context can help you figure out things like behaviors that lead to abandonment, the impact of security vulnerability exploits, and much more. And in their never-ending quest to improve their offerings and make more money, product managers love being able to test their theories against real data.
Not if I have anything to say about it.
>Your logs are still acting like it's 2005.
Yeah, because that's just before software development went absolutely insane.
But does it? Or is it bad logging, or excessive logging, or unsearchable logs?
A client of mine uses SnapLogic, which is a middleware / ETL that's supposed run pipelines in batch mode to pass data around between systems. It generates an enormous amount of logs that are so difficult to access, search and read that they may as well don't exist.
We're replacing all of that with simple Python scripts that do the same thing and generate normal simple logs with simple errors when something's truly wrong or the data is in the wrong format.
Terse logging is what you want, not an exhaustive (and exhausting) torrent of irrelevant information.
I do think "logs are broken" is a bit overstated. The real problem is unstructured events + weak conventions + poor correlation.
Brilliant write up regardless
There isn't anything radical about his proposed solutions either. Most log storage can be set with a rule where all warning logs or above can be retained, but only a sample of info and debug logs.
Nothing in this article is something that most major software companies don't already do.
Comments in this thread reinforce that if you haven't worked at a Big Tech company, there is a big chunk of random knowledge that you are missing that would round you out as a software engineer. Nothing in TFA should surprise someone with "FAANG" on their resume.
The "key insight" is also flawed. The reason why we log at every step is because sometimes your request never completes and it could be for 1000 reasons but you really need to know how far it got in your system. Logging only at the end is happy path thinking.
Also if you're going to log wide events, for the sake of the person querying them after you, please don't let your schema be an ad hoc JSON dict of dicts, put some thought into the schema structure (and better have a logging system that enforces the schema).
Gonna go on a tangent here. Why the single purpose domain? Especially since the author has a blog. My blog is full of links to single post domains that are no longer.
i do not see a product upsell anywhere.
if it's an ad for the author themselves, then it's a very good one.
If a user request is hitting that many things, in my view, that is a deeply broken architecture.
If we want it or not, a lot of modern software looks like that. I am also not a particular fan of building software this way, but it's a reality we're facing. In part it's because quite a few services that people used to build in-house are now outsourced to PaaS solutions. Even basic things such as authentication are more and more moving to third parties.
Yes. Most software is bad
The incentives between managers and technicians are all wrong
Bad software is more profitable, over the time frames managers care about, than good software
Fighting complexity is deeply unpopular.
The framing is not, though. Why does it have to sound so dramatic and provocative? It’s insulting to its audience. Grumpiness, in the long term, is a career-limiting attitude.
Where I've had problems it's usually been one of:
There wasn't anything logged in the error block. A comment saying "never happens" is often discovered later :)
Too much was logged and someone mandated dialing the logging down to save costs. Sigh.
A new thread was started and the thread-local details including the correlation ID got lost, then the error occurred downstream of that. I'd like better solutions for that one.
Edit: Incidentally a correlation ID is not (necessarily) the same thing as a request ID. An API often needs to allow for the caller making multiple calls to achieve an objective; 5 request IDs might be tied to a single correlation ID.
The best way to equip logs to tell the truth is to have other parts of the system consume them as their source of truth.
Firstly: "what the system does" and "what the logs say" can't be two different things.
Secondly: developers can't put less info into the logs than they should, because their feature simply won't work without it.
Seeing logging as debugging is flawed imo. A log is technically just a record of what happened in your database.
Point #1 isn't true, auto instrumentation exists and is really good. When I integrate OTel I add my own auto instrumentors wherever possible to automatically add lots of context. Which gets into point #2.
Point #2 also isn't true. It can add business context in a hierarchal manner and ship wide events. You shouldn't have to tell every span all the information again. Just where it appears naturally the first time.
Point #3 also also isn't true because OTel libs make it really annoying to just write a log message and very strongly pushes you into a hierarchy of nested context managers.
Like the author's ideal setup is basically using OTel with Honeycomb. You get the querying and everything. And unlike rawdogging wide events all your traces are connected, can span multiple services and do timing for you.