When Your Website Slows Down, Don’t Frown

Turn that frown upside down

It can be a real nail-biter to figure out why your site took a turn for the worse. It may not have slipped into its sluggish habits intentionally, but it did. The condition is not permanent. You just have to get your site back on track. What trainer can whip your site back into shape, you ask? The one and only Deterministic Root Cause Algorithm.

This lean, mean, mathematical fighting machine tracks down the reason for high response time. In an N-tier architecture, the problem can lie with the web server, the database server, or another server component. To pinpoint the problem, we determine which server is the source of the high response time. The algorithm traces backward from server to server, checking at each hop for a large increase in response time. As soon as the large increase no longer shows up on the next server in the chain, we can zero in on the last affected server as the root cause of the high response time.

Here is the skinny: The slowness initially affects a server. Then, it propagates back toward the web server. We determine which resources had increased usage, and which specific processes were using these resources.

The Technical Details:

Alerts for high response time are set up automatically on each server. By default, an alert fires when the response time stays at 1 second or more for 5 consecutive minutes. As a user, you can then configure the threshold and duration to suit your needs.
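To make the threshold-and-duration idea concrete, here is a minimal sketch of how such a check could look. The function name `alert_triggered`, its parameters, and the choice to treat "reached" as "greater than or equal to" are assumptions for illustration, not the product's actual code.

```python
from typing import Sequence

# Values taken from the text: 1 second for 5 consecutive minutes.
DEFAULT_THRESHOLD_SECONDS = 1.0
DEFAULT_DURATION_MINUTES = 5


def alert_triggered(per_minute_art: Sequence[float],
                    threshold: float = DEFAULT_THRESHOLD_SECONDS,
                    duration: int = DEFAULT_DURATION_MINUTES) -> bool:
    """Return True if each of the last `duration` minutes reached `threshold`.

    `per_minute_art` holds one average response time (in seconds) per minute,
    oldest first. Both names are hypothetical.
    """
    if duration < 1:
        raise ValueError("duration must be at least 1 minute")
    if len(per_minute_art) < duration:
        return False  # not enough data yet to cover the full duration
    recent = per_minute_art[-duration:]
    # Assumption: "reached" means at or above the threshold.
    return all(art >= threshold for art in recent)
```

A user who wants a stricter alert could call it as `alert_triggered(samples, threshold=0.5, duration=3)`.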

When the average response time (ART) is calculated for a server, we consider the processes that we have intercepted and summarized. We want an accurate picture, so we only consider connections where that machine is the server, and not ones where it is a client. Once we see the impact on the server, we can fine-tune the alert so the issue does not affect the connecting clients.
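The server-side-only rule can be sketched in a few lines. The `ConnectionSummary` record and its fields are hypothetical, and weighting the average by request count is an assumption, since the text does not say how the summaries are combined.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ConnectionSummary:
    """Hypothetical summary of one intercepted client-to-server connection."""
    client: str
    server: str
    request_count: int
    total_response_seconds: float


def average_response_time(machine: str,
                          summaries: Iterable[ConnectionSummary]) -> Optional[float]:
    """ART for `machine`, counting only connections where it is the server."""
    total_time = 0.0
    total_requests = 0
    for summary in summaries:
        if summary.server != machine:
            continue  # skip connections where the machine is only a client
        total_time += summary.total_response_seconds
        total_requests += summary.request_count
    if total_requests == 0:
        return None  # nothing intercepted, so no ART can be recorded
    return total_time / total_requests
```

Returning `None` rather than zero keeps "no data" distinct from "very fast", which matters for the limitations discussed later.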

The Algorithm:

Let’s break it down:

1. If the threshold is reached for the full length of the set duration, an alert is triggered. We calculate the ART over all connections to the server, then compare the ART for each minute of the duration to the ART for the previous hour. If the ART for every minute is at least 50% higher than it was in the preceding hour, the root cause algorithm comes into play. If not, we trigger a normal resource alert, similar to the alerts for high memory and CPU usage.

2. Starting from the server that raised the alert, we trace its connections to other servers, that is, the connections in which this machine acts as a client. For example, if the initial server is A and it connects to server B, we get the ART for server B while server A is connected to it. If the value for each minute of the duration is at least 25% higher than in the previous hour, we trace on to server B. If no connection shows that increase (you got it), server A itself is the root cause.

3. If there are multiple servers to trace, each with at least a 25% increase in response time, we follow every path. This can result in multiple root cause servers and/or multiple paths to a single root cause server. You want the report on all of them? You got it! If there are more than three paths to a server at the end of running the algorithm and we are not confident of the cause, we will keep you in the loop with a regular resource alert email until we can identify the cause.

4. Once we determine the root cause(s), we examine resource usage on the root cause server. We determine which resources had a large percent increase in usage. Then for each of these resources, we find which processes had the largest usage increase. Knowing this, we can report, for each root cause server, which resources had a large usage increase and which specific processes are the root cause.
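The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the production algorithm: the data layout (`server_art`, `edge_art`), every function name, and the `min_fraction` cutoff for a "large" usage increase are hypothetical. Only the 50%, 25%, and three-path figures come from the text.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# (ART for the previous hour, ART for each minute of the alert duration)
Sample = Tuple[float, Sequence[float]]

INITIAL_INCREASE = 0.50     # step 1: 50% over the previous hour
DOWNSTREAM_INCREASE = 0.25  # step 2: 25% over the previous hour
MAX_CONFIDENT_PATHS = 3     # step 3: more than three paths lowers confidence


def rose_by(sample: Sample, fraction: float) -> bool:
    """True if every minute is at least `fraction` above the previous hour."""
    baseline, minutes = sample
    if baseline <= 0 or not minutes:
        return False  # no usable data (see Limitations)
    return all(m >= baseline * (1 + fraction) for m in minutes)


def find_root_causes(start: str,
                     server_art: Dict[str, Sample],
                     edge_art: Dict[Tuple[str, str], Sample],
                     ) -> Optional[Dict[str, List[List[str]]]]:
    """Map each root cause server to the paths that lead to it.

    `server_art[s]` is the ART over all connections to s.
    `edge_art[(a, b)]` is the ART of server b while client a is connected.
    Returns None when step 1 fails and a normal resource alert applies.
    """
    if start not in server_art or not rose_by(server_art[start], INITIAL_INCREASE):
        return None
    roots: Dict[str, List[List[str]]] = {}
    stack = [[start]]
    while stack:
        path = stack.pop()
        current = path[-1]
        # Steps 2 and 3: follow every connection that shows the 25% increase.
        slow_downstream = [
            server for (client, server), sample in edge_art.items()
            if client == current and server not in path
            and rose_by(sample, DOWNSTREAM_INCREASE)
        ]
        if not slow_downstream:
            roots.setdefault(current, []).append(path)  # trace ends here
        for server in slow_downstream:
            stack.append(path + [server])
    return roots


def is_confident(paths: List[List[str]]) -> bool:
    """Step 3: more than three paths to one server means low confidence."""
    return len(paths) <= MAX_CONFIDENT_PATHS


def largest_increases(before: Dict[str, float], after: Dict[str, float],
                      min_fraction: float) -> List[Tuple[str, float]]:
    """Step 4: names whose usage rose by at least `min_fraction`, largest first.

    Works for resources or processes. Entries with no earlier usage are
    skipped because a percent increase is undefined for them.
    """
    increases = []
    for name, new in after.items():
        old = before.get(name, 0.0)
        if old > 0 and (new - old) / old >= min_fraction:
            increases.append((name, (new - old) / old))
    return sorted(increases, key=lambda item: item[1], reverse=True)
```

In a three-tier setup where `web` calls `app` and `app` calls `db`, a slowdown that shows up on both hops would end the trace at `db` with the single path `web`, `app`, `db`. The same `largest_increases` helper could then be run twice on that server, once over resources and once over the processes using each flagged resource.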

Limitations:

As with everything, there are limitations. The ART is calculated from data collected from intercepted and summarized processes. If a server's processes are not intercepted, things just do not add up. In particular, when a server runs too many short-lived processes, they cannot be summarized. Where the data is available, you can count on us to crunch the numbers and isolate the problem.

This kind of limitation exists with the Postgres database. Because so much of its work is done in short-lived processes, we often cannot record any ART value for Postgres. If the problem lies with this database, there is no way to trace it.

While we cannot guarantee that the resources and processes we report are the root cause, the evidence points in their direction. And if they are not, we will not stop until we find it.

The Formula:

The greater the increase in response time and resource usage, the more likely it is to lead us to the accurate root cause.