Moving a large-scale metrics pipeline from StatsD to OpenTelemetry / Prometheus

(medium.com)

76 points | by jmarbach 2 days ago

10 comments

dig1 2 days ago
> The irony is that this may be a $0 revenue user for Grafana Labs.
Why is that ironic? Since Mimir is open-source, $0 revenue users are expected. AFAIK, Grafana Labs relies heavily on Go, TypeScript, and Linux without necessarily being their top financial contributor. They could have kept Mimir proprietary, like Splunk, but whether that would have attracted the same level of adoption or community contribution is another matter.
- camel_gopher 1 day ago
  Grafana knows their open source products are eating into revenue. Expect a corresponding strategy to offset that.
  - skrtskrt 1 day ago
    It already exists, it’s their Bring Your Own Cloud offering.
    It’s to retain customers that grew big enough on Grafana Cloud to justify having their own in-house team run the tools instead. So Grafana offers them pricing under which Grafana engineers operate the platform within the customer’s cloud account. Very large customers still don't have to operate the stack or build/hire for the expertise, and they save some money.
    Sure some companies are big enough to make it worth it and still want to run their own OSS observability stack, but it’s generally not going to be popular with executive decision-makers, so it likely will remain rare. And if they do run it, Grafana still benefits from their contributions to AGPL code.
    On the low-spending end, OSS users not buying cloud would not really be a serious revenue concern. They just don’t spend enough. You use cloud if you have super broad product usage, so you don’t have to run and maintain Grafana, Mimir, Loki, Tempo, Pyroscope, k6, etc. all yourself. If you don’t want or need all that, you run Loki+Grafana yourself and enjoy.
codeduck 2 days ago
> given Prometheus’s widespread adoption and proven reliability in diverse environments.
I have used Prometheus a lot. Reliable is not a word I would associate with it.
- pahae 2 days ago
  I set up a fairly large Prom-based architecture which I later on migrated to VictoriaMetrics (VM) so I think I can chime in here.
  Both Prom and VM are exceptionally stable in my opinion, even at _very_ large scale. At times I had a single, not overly large instance (Prom, later VM) scraping 2 million samples/s without any issues, on top of fairly spiky query loads.
  However, if something does go wrong, the single most impactful difference between VM and Prom is simply the difference in startup time. Prometheus with 2TB of metrics takes _forever_ to start up. We're talking up to 2 hours on SSD while VM just... starts.
  - porridgeraisin 1 day ago
    Yeah, at my previous job we used both as well. The transition from Prom to VM was "ongoing", and from the time I joined to the time I left we did parallel writes to both. Never faced issues with either. If I remember correctly, we wrote from services to a Kafka queue first, and then a consumer took that and pushed it to (both) the metrics endpoint(s).
- hagen1778 2 days ago
  What do you use instead of Prometheus?
  - codeduck 2 days ago
    Given a choice, VictoriaMetrics. It has proven itself time and time again at scale, and requires a very low support investment.
qmarchi 10 hours ago
Interesting choice to go with Prometheus directly, especially when other TSDBs have "native" support for OTLP ingestion.
blueybingo 1 day ago
the zero injection fix for sparse counters is the most underrated part of this writeup -- injecting a synthetic zero on first flush to anchor the cumulative baseline is actually a pretty elegant solution to a problem that bites almost every team migrating from delta-based systems to prometheus, and the fact that they centralized it in the aggregation tier rather than pushing the fix to every instrumentation callsite is exactly the right call.
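To make the idea concrete, here is a minimal sketch of zero injection in a delta-to-cumulative aggregator. It is not the article's implementation: `CumulativeAggregator`, `naive_increase`, and `run` are hypothetical names, and placing the synthetic zero one tick before the first real sample is an assumption. `naive_increase` is a deliberately simplified stand-in for PromQL's `increase()` (last sample minus first, no extrapolation or reset handling).

```python
from dataclasses import dataclass, field


@dataclass
class CumulativeAggregator:
    """Turns per-flush deltas into cumulative counter samples (hypothetical)."""
    inject_zero: bool = True
    totals: dict = field(default_factory=dict)

    def flush(self, series, delta, ts):
        """Return the (timestamp, value) samples to emit for this flush."""
        samples = []
        if series not in self.totals:
            self.totals[series] = 0.0
            if self.inject_zero:
                # Assumption: the synthetic zero lands one tick before the
                # first real sample, anchoring the cumulative baseline.
                samples.append((ts - 1, 0.0))
        self.totals[series] += delta
        samples.append((ts, self.totals[series]))
        return samples


def naive_increase(samples):
    """Last minus first sample in the window (simplified increase())."""
    if len(samples) < 2:
        return 0.0
    return samples[-1][1] - samples[0][1]


def run(inject_zero):
    """Simulate a sparse counter: 3 events at t=60, then 2 at t=120."""
    agg = CumulativeAggregator(inject_zero=inject_zero)
    out = []
    out += agg.flush("errors_total", 3, ts=60)
    out += agg.flush("errors_total", 2, ts=120)
    return out


if __name__ == "__main__":
    print(naive_increase(run(inject_zero=False)))  # first 3 events are lost
    print(naive_increase(run(inject_zero=True)))   # all 5 events are counted
```

Without the zero, the first sample the backend ever sees is already 3, so a rate-style query has no baseline to subtract from and the first increment disappears. Doing this once in the aggregation tier means no instrumentation callsite has to know about it.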
- valyala 1 day ago
  Another approach to this issue is the increase_pure() function from MetricsQL - https://docs.victoriametrics.com/metricsql/#increase_pure . Of course, you need to switch to VictoriaMetrics, since Mimir doesn't support this function.
- hagen1778 1 day ago
  I was under the impression that the zero injection problem was solved by Start Timestamp in the OpenMetrics 2.0 spec - see https://prometheus.io/docs/specs/om/open_metrics_spec_2_0/#s...
jameson 2 days ago
Curious why the team chose Grafana Mimir over VM cluster?
- esafak 1 day ago
  How are these substitutes? Mimir is a time series database.
  edit: I understood virtual machine :)
  - igor47 1 day ago
    So is VictoriaMetrics?
valyala 1 day ago
I wonder why Airbnb uses vmagent for streaming aggregation but didn't switch from Mimir to VictoriaMetrics. This could save them a lot on infrastructure and operations costs, as in the cases of Roblox, Spotify, Grammarly and others - https://docs.victoriametrics.com/victoriametrics/casestudies...
- jmarbach 1 day ago
  Could you share a little more about your involvement with VictoriaMetrics? A good faith disclosure goes a long way.
  - valyala 1 day ago
    I'm a core developer at VictoriaMetrics. This information is one click away - just click my name here.
awoimbee 2 days ago
Directly emitting metrics using OTLP instead of having the OTel receiver scrape the metrics endpoint is interesting. I never made that move because the Prometheus metrics endpoint works and is so simple, and it's what most projects (e.g. Kubernetes) use.
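For readers who haven't looked at one, a scrape endpoint just returns plain text. The sketch below renders a counter in the Prometheus text exposition format using only the standard library; `render_counter`, `escape_label`, and the metric and label names are made up for illustration, and a real service would normally use a client library rather than hand-rolling this.

```python
def escape_label(value):
    """Escape backslash, double quote, and newline in a label value."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_counter(name, help_text, samples):
    """Render one counter family as Prometheus text exposition format.

    `samples` maps a tuple of (label_name, label_value) pairs to a value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        if labels:
            rendered = ",".join(
                f'{key}="{escape_label(val)}"' for key, val in labels
            )
            lines.append(f"{name}{{{rendered}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    body = render_counter(
        "http_requests_total",
        "Total HTTP requests.",
        {(("code", "200"),): 7, (("code", "500"),): 1},
    )
    print(body, end="")
```

Serving that string from a `/metrics` route is the whole integration on the application side, which is a large part of why pull-based scraping stays the default in many projects.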
- igor47 1 day ago
  A long time ago, I introduced dogstatsd at Airbnb. We had already been using vanilla statsd (with no tag support -- cardinality lived in the metric name!) and this was a low cost migration. More than a decade later, I'm assuming it was difficult to track down and refactor all the places that statsd calls were emitted and using OTLP was an easier route. This is a great example of how technical decisions compound over time.
zbentley 1 day ago
> Initially, we anticipated that the edge case would have minimal impact, given Prometheus’s widespread adoption and proven reliability in diverse environments. However, as we migrated more users, we started seeing this issue more frequently, and it stalled migration.
That's a very professional way of saying "Wait, everyone just lives with this? What the fuck?!"
Many such cases in the Prometheus ecosystem.