Hi everyone,
I recently joined raintank and I will be working with @torkelo, @mattttt, and you, on alerting support for Grafana.
From the results of the Grafana User Survey it is obvious that alerting is the most commonly missed feature for Grafana.
I have worked on/with a few alerting systems in the past (nagios, bosun, graph-explorer, etsy's kale stack, ...) and I'm excited about the opportunity in front of us:
we can take the best of said systems, but combine them with Grafana's focus on a polished user experience, resulting in a powerful alerting system, well-integrated and smooth to work with.
First of all, terminology sync:
- alerting: executing logic (threshold checks or more advanced) to know the state of an entity. (ok, warning, critical)
- notifications: emails, text messages, posts to chat, etc to make people aware of a state change
- monitoring: this term covers everything about monitoring (data collection, visualizations, alerting) so I won't be using it here.
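To make the split between alerting and notifications concrete, here is a minimal Go sketch. The names (State, CheckThreshold, ShouldNotify) and the threshold values are made up for illustration and are not existing Grafana code: alerting turns a value into a state, and a notification is only warranted when that state changes.

```go
package main

import "fmt"

// State is the result of evaluating alerting logic for one entity.
type State int

const (
	OK State = iota
	Warning
	Critical
)

func (s State) String() string {
	switch s {
	case OK:
		return "ok"
	case Warning:
		return "warning"
	case Critical:
		return "critical"
	}
	return "unknown"
}

// CheckThreshold is the "alerting" half: it turns a value into a state.
// warn and crit are hypothetical thresholds, with crit >= warn.
func CheckThreshold(value, warn, crit float64) State {
	switch {
	case value >= crit:
		return Critical
	case value >= warn:
		return Warning
	}
	return OK
}

// ShouldNotify is the "notifications" half: only a state change
// is worth making people aware of.
func ShouldNotify(prev, cur State) bool {
	return prev != cur
}

func main() {
	prev := OK
	for _, v := range []float64{10, 85, 90, 99} {
		cur := CheckThreshold(v, 80, 95)
		if ShouldNotify(prev, cur) {
			fmt.Printf("notify: %s -> %s (value %v)\n", prev, cur, v)
		}
		prev = cur
	}
}
```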
I want to spec out requirements, possible implementation ideas and their pros and cons. With your feedback, we can adjust, refine and choose a specific direction.
General thoughts:
- integration with existing tools vs built-in: there are some powerful alerting systems out there (bosun, kale) that deserve integration.
Many alerting systems are more basic (define expression/threshold, get notification when breached), for those it seems integration is not worth the pain (though I won't stop you)
The integrations are a long term effort. I think the low hanging fruit ("meet 80% of the needs with 20% of the effort") can be met with a system
that is more closely tied to Grafana, i.e. compiled into the grafana binary.
That said, a lot of people confuse separation of concerns with "must be different services".
If the code is sane, it'll be decoupled packages but there's nothing necessarily wrong with compiling them together. i.e. you could run:
- 1 grafana binary that does everything (grafana as you know it + all alerting features) for simplicity
- multiple grafana binaries in different modes (visualization instances and alerting instances) even highly available/redundant setups if you want to, using an external worker queue
That said, we don't want to reinvent the wheel: we want alerting code and functionality to integrate well with Grafana, but if high-quality code is compatible, we should use it. In fact, I have a prototype that leverages some existing bosun code. (see "Current state")
- polling vs stream processing: they have different performance characteristics,
but they should be able to take the same or similar alerting rule definitions (thresholds, boolean logic, ..), they mostly are about how the actual rules are executed and don't
change much about how rules are defined. Since polling is much simpler and should be able to scale fairly far this should IMHO be our initial focus.
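As a rough illustration of the point that polling and stream processing can share one rule definition, here is a small Go sketch. Rule, PollOnce and Stream are hypothetical names, and the rule itself (mean of a window above a threshold) is just a stand-in for whatever logic we end up supporting.

```go
package main

import "fmt"

// Rule is a hypothetical rule definition: breach when the mean of the
// last Window points exceeds Threshold.
type Rule struct {
	Window    int
	Threshold float64
}

// Breached evaluates the rule against one window of points.
func (r Rule) Breached(points []float64) bool {
	if len(points) == 0 {
		return false
	}
	sum := 0.0
	for _, p := range points {
		sum += p
	}
	return sum/float64(len(points)) > r.Threshold
}

// PollOnce is the polling executor: on each tick it asks the datasource
// for the most recent window and evaluates the rule.
func PollOnce(r Rule, fetch func(n int) []float64) bool {
	return r.Breached(fetch(r.Window))
}

// Stream is the stream-processing executor: points are pushed in as they
// arrive and the same rule is re-evaluated over a sliding window.
type Stream struct {
	rule Rule
	buf  []float64
}

func NewStream(r Rule) *Stream { return &Stream{rule: r} }

// Push adds one point, trims the window and reports whether the rule is breached.
func (s *Stream) Push(p float64) bool {
	s.buf = append(s.buf, p)
	if len(s.buf) > s.rule.Window {
		s.buf = s.buf[1:]
	}
	return s.rule.Breached(s.buf)
}

func main() {
	series := []float64{1, 2, 9, 9, 9}
	rule := Rule{Window: 3, Threshold: 5}

	// fetch stands in for a query against the tsdb.
	fetch := func(n int) []float64 {
		if n > len(series) {
			n = len(series)
		}
		return series[len(series)-n:]
	}
	fmt.Println("polling:", PollOnce(rule, fetch))

	s := NewStream(rule)
	for _, p := range series {
		fmt.Println("stream:", p, s.Push(p))
	}
}
```

Both executors call the same Breached method; only the way data reaches the rule differs.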
Current state
The raintank/grafana version currently has an alerting package
with a simple scheduler, a worker bus (in-process or rabbitmq-based), an alert executor and email notifications.
It uses the bosun expression libraries which gives us the ability to evaluate arbitrarily complex expressions (use several metrics, use boolean logic, math, etc).
This package is currently raintank-specific but we will merge a generic version of this into upstream grafana. This will provide an alert execution platform but notably still missing are
- an interface to create and manage alerting rules
- state management (acknowledgements etc)
these are harder problems, which I hope to tackle with your input.
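For readers who have not seen the package, the scheduler / bus / executor split could be sketched roughly as below. This is not the actual raintank code: Job, Result, Schedule and Execute are illustrative names, and a buffered Go channel plays the role of the in-process bus.

```go
package main

import (
	"fmt"
	"sync"
)

// Job is one scheduled evaluation of one rule. Field names are illustrative.
type Job struct {
	RuleID int
	Tick   int
}

// Result is what an executor reports back for a job.
type Result struct {
	Job      Job
	Breached bool
}

// Schedule emits one job per rule per tick onto the bus, then closes it.
func Schedule(ruleIDs []int, ticks int, bus chan<- Job) {
	for t := 0; t < ticks; t++ {
		for _, id := range ruleIDs {
			bus <- Job{RuleID: id, Tick: t}
		}
	}
	close(bus)
}

// Execute drains the bus with the given number of worker goroutines,
// applying eval to each job and collecting the results.
func Execute(bus <-chan Job, workers int, eval func(Job) bool) []Result {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results []Result
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range bus {
				r := Result{Job: job, Breached: eval(job)}
				mu.Lock()
				results = append(results, r)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return results
}

func main() {
	// In-process bus; an external queue could sit in the same place.
	bus := make(chan Job, 16)
	go Schedule([]int{1, 2, 3}, 2, bus)

	// eval is a placeholder for the real alert executor.
	results := Execute(bus, 2, func(j Job) bool { return j.RuleID == 2 })
	fmt.Println(len(results), "jobs executed")
}
```

Because the executor only sees a channel of jobs, swapping the in-process bus for an external queue should not change the executor itself.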
Requirements, Future implementations
First off, I think bosun is a pretty fantastic system for alerting (not so much for visualization)
You can make your alerting rules as advanced as you want, and it enables you to fine-tune over time, backtest on historical data, so you can get them just right.
And it has a good state machine.
In theory we could just compile bosun straight into grafana, and leverage bosun via its REST api instead of Golang api, but then we have less fine-grained control and
for now I feel more comfortable trying out piece by piece (piece meaning golang package) and make the integration decision on a case by case basis. Though the integration
may look different down the road based on experience and as we figure out what we want our alerting to look like.
Either way, we don't just want great alerting. We want great alerting combined with great visualizations, notifications with context, and a smooth workflow where you can manage
your alerts in the same place you manage your visualizations. So it needs to be nicely integrated into Grafana. To that end, there's a few things to consider:
- some visualized metrics (metrics plotted on graphs) are not alerted on
- some visualized metrics are alerted on:
- A: with simple threshold checks: easy to visualize alerting logic
- B: with more advanced logic: (e.g. look at standard deviation of the series being plotted, compare current median against historical median, etc): can't easily be visualized next
to the input series
- some metrics used in alerting logic are not to be visualized
Basically, there's a bunch of stuff you may want visualized (V), and a bunch of stuff you want alerts on (A), and V and A have some overlap.
I need to think about this a bit more and wonder what y'all think.
There will definitely need to be 1 central place where you can get an overview of all the things you're alerting on, irrespective of where those rules are defined.
There are a few more complications which I'll explain through an example sketch of what alerting could look like:

let's say we have a timeseries for requests (A) and one for erroneous requests (B) and this is what we want to plot.
we then use fields C,D,E to put stuff that we don't want to plot.
C contains the formula for ratio of error requests against the total.
we may for example want to alert (see E) if the median of this ratio over the last 5min is more than 1.5 times what the ratio was in the same 5-minute period last week, and also
if the errors seen in the last 5min are worse than the errors seen since 2 months ago until 5min ago.
notes:
- some queries use different timeranges than what is rendered
- in addition to processing by tsdb (such as Graphite's sum(), divide() etc which return series) we need to be able to reduce series to single numbers. fairly easy to implement (and in fact currently the bosun library does this for us)
- we need boolean logic (bosun also gives us this)
- in this example the expression only uses variables defined within the same panel, but it might make sense to include expressions of other panels/graphs.
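To show what "reduce series to single numbers" plus boolean logic amounts to, here is a hand-rolled Go sketch of something like expression E. It does not use the bosun library. Median, Ratio and Alert are names I made up, the input data is invented, and reading "worse" as "higher median" and joining the two conditions with AND are my assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// Median reduces a series to a single number. It returns 0 for an empty series.
func Median(series []float64) float64 {
	if len(series) == 0 {
		return 0
	}
	s := append([]float64(nil), series...) // copy so the caller's data is not reordered
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid]
	}
	return (s[mid-1] + s[mid]) / 2
}

// Ratio computes errors/requests point by point (field C in the sketch).
// Points with zero requests yield 0 to avoid dividing by zero.
func Ratio(errors, requests []float64) []float64 {
	n := len(errors)
	if len(requests) < n {
		n = len(requests)
	}
	out := make([]float64, n)
	for i := 0; i < n; i++ {
		if requests[i] != 0 {
			out[i] = errors[i] / requests[i]
		}
	}
	return out
}

// Alert is a simplified take on expression E: the recent error ratio is more
// than 1.5 times last week's AND recent errors exceed the long-term baseline.
// Each argument is a series queried over its own time range.
func Alert(nowRatio, lastWeekRatio, nowErrors, baselineErrors []float64) bool {
	ratioWorse := Median(nowRatio) > 1.5*Median(lastWeekRatio)
	errorsWorse := Median(nowErrors) > Median(baselineErrors)
	return ratioWorse && errorsWorse
}

func main() {
	requests := []float64{100, 100, 100}
	nowErrors := []float64{10, 12, 11}
	lastWeekErrors := []float64{4, 5, 6}
	baselineErrors := []float64{5, 6, 4, 5}

	fire := Alert(
		Ratio(nowErrors, requests),
		Ratio(lastWeekErrors, requests),
		nowErrors,
		baselineErrors,
	)
	fmt.Println("alert:", fire)
}
```

Note that the four inputs cover three different time ranges, none of which has to match what the panel renders.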
other ponderings:
- do we integrate with current grafana graph threshold settings (which are currently for viz only, not for processing)? if the expression is a threshold check, we could automatically
display a threshold line
- using the letters is a bit clunky, could we refer to the aliases instead? like #requests and #errors?
- what if the expressions are stats.$site.requests and stats.$site.errors, and we want to have separate alert instances for every site (but only set up the rule once)? what if we only want it for a select few of the sites? what if we want different parameters based on which site? bosun actually supports all these features, and we could expose them though we should probably build a UI around them.
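The per-site question can be pictured as plain template expansion. The Go sketch below is not how bosun does it. Expand and ThresholdFor are hypothetical helpers, and the site names and threshold values are invented. It sets up a rule once and produces one instance per site, optionally for a subset, with per-site parameters.

```go
package main

import (
	"fmt"
	"strings"
)

// Expand returns one concrete query per site by substituting $site in tmpl.
// If only is non-empty, sites not listed in it are skipped.
func Expand(tmpl string, sites []string, only map[string]bool) map[string]string {
	out := make(map[string]string)
	for _, site := range sites {
		if len(only) > 0 && !only[site] {
			continue
		}
		out[site] = strings.ReplaceAll(tmpl, "$site", site)
	}
	return out
}

// ThresholdFor returns the per-site override if there is one, else the default.
func ThresholdFor(site string, overrides map[string]float64, def float64) float64 {
	if v, ok := overrides[site]; ok {
		return v
	}
	return def
}

func main() {
	sites := []string{"ams", "nyc", "sfo"}

	// Only alert on a select few sites.
	queries := Expand("stats.$site.errors", sites, map[string]bool{"ams": true, "sfo": true})

	// Different parameters for one of them.
	overrides := map[string]float64{"sfo": 0.10}

	for _, site := range sites {
		q, ok := queries[site]
		if !ok {
			continue
		}
		fmt.Println(q, "threshold", ThresholdFor(site, overrides, 0.05))
	}
}
```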
I think for an initial implementation every graph could have two fields, like so:
- warn:
  - expression
  - notification settings (email, http hook, ..)
- crit:
  - expression
  - notification settings
where the expression is something like what I put in E in the sketch.
for logic/data that we don't want to visualize, we just toggle off the visibility icon.
grafana would replace the variables in the formulas, execute the expression (with the current bosun-based executor). results (state changes) could be fed into something like elasticsearch and displayed via the annotations system.
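A minimal Go sketch of that initial implementation, under heavy assumptions: Level, GraphAlert, StateChange and Track are hypothetical names, and Expr is a plain Go function standing in for an expression string that the bosun-based executor would evaluate. The state changes it returns are the kind of records that could be stored and shown as annotations.

```go
package main

import "fmt"

// Level pairs an expression with where to send notifications.
// Expr stands in for a parsed expression over the panel's variables (A, B, C, ...).
type Level struct {
	Expr   func(vars map[string]float64) bool
	Notify []string
}

// GraphAlert holds the two proposed per-graph fields.
type GraphAlert struct {
	Warn Level
	Crit Level
}

// Evaluate returns the most severe matching state and its notification targets.
func (g GraphAlert) Evaluate(vars map[string]float64) (string, []string) {
	switch {
	case g.Crit.Expr != nil && g.Crit.Expr(vars):
		return "crit", g.Crit.Notify
	case g.Warn.Expr != nil && g.Warn.Expr(vars):
		return "warn", g.Warn.Notify
	}
	return "ok", nil
}

// StateChange is a hypothetical record that could be stored and shown as an annotation.
type StateChange struct {
	From, To string
}

// Track evaluates successive samples and returns only the state changes.
func Track(g GraphAlert, samples []map[string]float64) []StateChange {
	prev := "ok"
	var changes []StateChange
	for _, vars := range samples {
		cur, _ := g.Evaluate(vars)
		if cur != prev {
			changes = append(changes, StateChange{From: prev, To: cur})
			prev = cur
		}
	}
	return changes
}

func main() {
	g := GraphAlert{
		Warn: Level{
			Expr:   func(v map[string]float64) bool { return v["C"] > 0.05 },
			Notify: []string{"email"},
		},
		Crit: Level{
			Expr:   func(v map[string]float64) bool { return v["C"] > 0.10 },
			Notify: []string{"email", "http hook"},
		},
	}
	samples := []map[string]float64{
		{"C": 0.01}, {"C": 0.07}, {"C": 0.07}, {"C": 0.20}, {"C": 0.01},
	}
	for _, c := range Track(g, samples) {
		fmt.Printf("%s -> %s\n", c.From, c.To)
	}
}
```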
Thoughts?
Do you have concerns or needs that I didn't address?