Here at AppFirst, we are proud of our extensive alerting system and how customizable it is. Users can set alerts on any data source they have in AppFirst, whether it’s application data, log data, StatsD metrics, or any partner data points. Alerts can cover anything from CPU usage to the number of running processes. When we see any of these metrics rise above (or fall below) the static, user-defined thresholds, we email, send push notifications, or send text messages to our users.
But while the fixed threshold setup has pleased many of our users, it does not necessarily apply to everyone.
Let’s say you run an e-commerce website and are trying to determine whether your systems are performing normally. Naturally, you’d want to monitor CPU and memory usage, but your system’s performance is highly dependent on user activity, which can be affected by events such as holiday sales on your goods. Assuming you are not giving away free donuts with every purchase, you would expect a gradual increase in web activity. This means that using the same static threshold for CPU or memory does not make sense, since what’s “normal” will change over time. What you really need is a rolling window that tracks what is normal as demand increases or decreases.
Dealing with this scenario is difficult with most basic alerting products on the market. That’s why we’re excited to introduce our new alerting feature called Analytical Alerting. We’ve released this to all of our customers and it can be actively selected in the alert management page by changing the alert type from “Single Threshold” to “Out of Band.”
When “Out of Band” is selected, additional horizontal lines will be rendered on the graph. A blue line represents the mean, while the orange lines represent the mean plus or minus the standard deviation times the band value.
The “Out of Band” alert allows you to set a trigger when the value goes Above, Below, or Outside the orange boundaries displayed above. Choosing the “Outside” option will trigger an alert if the value is above or below the user-defined band of what’s normal. The grey “band” between the orange lines represents what is normal for that metric. This band changes over time based on the rolling calculation window of the last X hours or days, which is also user-definable.
Choose “Above” or “Below” if you only care about the upper or lower bounds of the band.
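To make the three options concrete, here is a minimal sketch of how the trigger check could look once the band boundaries are known. This is an illustration only, not our production code: the names `Direction` and `is_triggered` are hypothetical, and the sketch assumes a value sitting exactly on a boundary does not trigger.

```python
from enum import Enum


class Direction(Enum):
    """The three trigger options for an Out of Band alert (hypothetical names)."""
    ABOVE = "above"
    BELOW = "below"
    OUTSIDE = "outside"


def is_triggered(value, lower, upper, direction):
    """Return True if `value` violates the band in the chosen direction.

    `lower` and `upper` are the two orange boundaries of the band.
    Assumption: a value exactly on a boundary does not trigger.
    """
    above = value > upper
    below = value < lower
    if direction is Direction.ABOVE:
        return above
    if direction is Direction.BELOW:
        return below
    # OUTSIDE: either boundary counts
    return above or below


if __name__ == "__main__":
    # Band of 20..80: a reading of 95 is above, a reading of 5 is below.
    print(is_triggered(95, 20, 80, Direction.ABOVE))    # True
    print(is_triggered(5, 20, 80, Direction.ABOVE))     # False
    print(is_triggered(5, 20, 80, Direction.OUTSIDE))   # True
```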
So how exactly are we calculating this data? At AppFirst we store the history for every metric, so based on the user-defined window, we calculate the mean and standard deviation over that time period. If you refer to the figure above, we are calculating the mean and standard deviation of CPU for the server “frontenddev” over the last 1 hour and multiplying the standard deviation by a band value of 2.19. If, at any minute, we notice that the CPU for this server rises above the upper boundary, given by mean + standard_deviation * band_value, then the alert is triggered. It is important to note that the mean and standard deviation change over time, since the window always references the most recent historical data.
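For readers who like to see the arithmetic, the rolling calculation can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not our actual implementation: the class name `RollingBand` is hypothetical, the window is expressed as a number of per-minute samples (60 for one hour), the population standard deviation is used, and each new value is compared against the band computed from the samples that came before it.

```python
from collections import deque
from statistics import fmean, pstdev


class RollingBand:
    """Rolling mean/standard-deviation band over the most recent samples."""

    def __init__(self, window_size, band_value):
        # deque with maxlen drops the oldest sample automatically,
        # so the window always holds the most recent history.
        self.samples = deque(maxlen=window_size)
        self.band_value = band_value

    def bounds(self):
        """Return (lower, upper) = mean -/+ standard_deviation * band_value."""
        mean = fmean(self.samples)
        std = pstdev(self.samples)  # assumption: population standard deviation
        offset = std * self.band_value
        return mean - offset, mean + offset

    def observe(self, value):
        """Check `value` against the current upper boundary, then record it.

        Returns True when the value rises above mean + std * band_value.
        With fewer than two samples there is no meaningful band yet.
        """
        triggered = False
        if len(self.samples) >= 2:
            _, upper = self.bounds()
            triggered = value > upper
        self.samples.append(value)
        return triggered


if __name__ == "__main__":
    band = RollingBand(window_size=60, band_value=2.19)
    for i in range(60):
        band.observe(10 if i % 2 == 0 else 12)  # one hour of steady CPU readings
    print(band.bounds())     # mean 11, standard deviation 1
    print(band.observe(13))  # inside the band
    print(band.observe(20))  # above the upper boundary
```

Because the window slides forward with every sample, a gradual rise in CPU shifts the band upward with it, while a sudden spike still lands outside.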
The bell curve above represents what a normal distribution should look like. As you can see, about 95.4% of the metric’s values will fall within two standard deviations of the mean, but you may adjust the band value to your liking depending on your data.
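If you want a feel for how the band value maps to coverage, the share of a normal distribution lying within k standard deviations of the mean is erf(k / sqrt(2)). The small helper below (the name `coverage` is ours, purely for illustration) assumes the metric really is close to normally distributed, which many real metrics are not, so treat the numbers as a rough guide.

```python
import math


def coverage(band_value):
    """Fraction of a normal distribution within `band_value` standard
    deviations of the mean (both sides combined)."""
    return math.erf(band_value / math.sqrt(2))


if __name__ == "__main__":
    for k in (1.0, 2.0, 2.19, 3.0):
        print(f"band value {k}: {coverage(k):.4f}")
```

A band value of 2 gives roughly 0.954, matching the bell curve, and the 2.19 used in the earlier example comes out at about 0.97.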
This is just the beginning of our transition to providing smarter analytics and we can’t wait to show you what we have in store for you next year.
As always, if you have any questions, feel free to contact us. We wish everyone happy holidays.