Good article, thanks for sharing. I've been working on one part of this problem space for quite a while too. I want the ability to drill down directly into latency causes and the underlying application components' threads' wall-clock time, instead of having to correlate various system-wide utilization metrics and manually connect the dots.
I'm using eBPF-based dimensional data analysis, starting from the bottom up (every system is a bunch of threads, including distributed systems) and moving up from there. This doesn't replace existing distributed tracing approaches for an end-to-end request view, but it gives you deep observability all the way down to each service's underlying threads' wall-clock time (where they're blocked or sleeping and why, etc).
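To make the thread-sampling idea concrete, here's a minimal Python sketch (not the actual xcapture implementation, which uses eBPF collectors): it periodically walks /proc and counts threads by (state, wchan), i.e. "how many threads are in which state, waiting in which kernel function", which is the kind of wall-clock signal I mean. The sample_threads helper and output format are my own, just for illustration:

    import os, time
    from collections import Counter

    def sample_threads():
        """One wall-clock sample: (state, wchan) for every thread on the host."""
        samples = Counter()
        for pid in filter(str.isdigit, os.listdir('/proc')):
            task_dir = f'/proc/{pid}/task'
            try:
                tids = os.listdir(task_dir)
            except OSError:
                continue  # process exited between listdir() and here
            for tid in tids:
                try:
                    with open(f'{task_dir}/{tid}/stat') as f:
                        # /proc stat format is "pid (comm) state ...";
                        # comm may contain spaces, so split after the last ')'
                        state = f.read().rpartition(')')[2].split()[0]
                    with open(f'{task_dir}/{tid}/wchan') as f:
                        # kernel function the thread is sleeping in, if any
                        wchan = f.read().strip() or '-'
                except OSError:
                    continue  # thread exited mid-sample
                samples[(state, wchan)] += 1
        return samples

    # Sample once per second; each (state, wchan) count approximates seconds
    # of wall-clock time spent in that state across all threads.
    while True:
        for (state, wchan), n in sample_threads().most_common(10):
            print(f'{n:6d}  {state}  {wchan}')
        print('---')
        time.sleep(1)

Sampling /proc like this has visible overhead and misses short-lived states between samples; doing the sampling in the kernel via eBPF avoids most of that and lets you attach far richer dimensions to each sample.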
At this year's P99CONF I will launch the first GA release of my (open source) 0x.tools xcapture eBPF collectors, along with a reference TUI tool (xtop) that demonstrates dimensional performance modeling on these new thread-sampling signals.
https://www.linkedin.com/pulse/observability-broken-its-time...
A couple of 1-minute asciicasts of xtop are here: https://tanelpoder.com/posts/xcapture-xtop-beta/