With Halloween just a few days away, we’re here to give an early scare of our own! First, check out our Monitor Horror Story Infographic: The Haunted House of Application Performance! You’ll find some scary monitoring statistics and IT’s own monitoring costume party.
And then, check out the winners of our Horror Story contest. Hopefully these scary tales won’t ruin the true spirit of Halloween!
Make sure a light’s on when you’re reading these horror stories.
First Place Story
When monitoring is out and away, the ghouls are at play
Company: Random Software Company
So I was hired as a Sr. Network Engineer for this company. From my first day working there, without any knowledge of the infrastructure, I was asked to fix tons of little gremlins in the network and their systems. From day one I heard complaints about slowness, connection issues, application disconnects, VPN issues, and more. At that point there had not been a network person on staff for over a year. They had one person on their IT staff who had no time to do anything other than fix printers and user applications.
During my first week there I started looking over the network, web servers, and back-end database servers. It was very clear that they had no monitoring in place at all: no way to tell if servers were overutilized or if databases were functioning properly, and no monitoring on any of the network devices. By my second week I was hit hard with the realization that there was just nothing in place, yet they wanted me to fly in like a superhero and fix all their problems in the blink of an eye.
So my first task was to implement some sort of monitoring to give me insight into the inner workings of the company’s network and systems. I evaluated a lot of different products that week, including AppFirst. Right away I found a ton of problems, including the fact that the MPLS network was set up so that all inbound traffic entered from the internet in the Midwest, was routed out to a very small remote office in California, and then sent back to the Midwest. Every website request hit our firewall in the Midwest, traveled to California, and then traveled back to the Midwest to reach the web servers, for no reason at all. Fixing this issue alone resolved all of the application issues.
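A detour like that shows up plainly in traceroute output, and even a crude latency check can confirm it. Here’s a minimal Python sketch of the idea; the hostnames are hypothetical stand-ins, not the company’s actual endpoints:

```python
import socket
import time

def connect_latency_ms(host: str, port: int = 443, samples: int = 5) -> float:
    """Average TCP connect time to host:port, in milliseconds."""
    total = 0.0
    for _ in range(samples):
        start = time.perf_counter()
        # A plain TCP handshake is enough to measure the network path.
        with socket.create_connection((host, port), timeout=5):
            pass
        total += time.perf_counter() - start
    return total / samples * 1000

# Hypothetical endpoints: the public entry point vs. the web servers directly.
for host in ("www.example.com", "web01.internal.example.com"):
    print(f"{host}: {connect_latency_ms(host):.1f} ms")
```

A request to a server in the same region that takes a coast-to-coast round trip is a strong hint that traffic is being hairpinned.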
“…they want me to fly in like a superhero and fix all their problems in the blink of an eye.”
From there, I focused on the web servers and back-end servers to see what other problems were occurring. I noticed right away that multiple servers had very little or no hard drive space left, which was causing issues, and that resources were being overutilized on our database machines as well. I quickly put an action plan together to resolve those issues.
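Checks along these lines would have flagged both problems early. This is a minimal sketch using the third-party psutil library; the thresholds are made-up examples, not values from the story:

```python
import psutil  # third-party: pip install psutil

# Made-up thresholds for illustration only.
DISK_FREE_MIN_PCT = 10
CPU_MAX_PCT = 90
MEM_MAX_PCT = 90

disk = psutil.disk_usage("/")
free_pct = 100 - disk.percent
if free_pct < DISK_FREE_MIN_PCT:
    print(f"WARNING: only {free_pct:.1f}% disk space free on /")

cpu = psutil.cpu_percent(interval=1)
if cpu > CPU_MAX_PCT:
    print(f"WARNING: CPU at {cpu:.1f}%")

mem = psutil.virtual_memory()
if mem.percent > MEM_MAX_PCT:
    print(f"WARNING: memory at {mem.percent:.1f}%")
```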
To make a very long story short, I have been at this company for about two months now and have fixed a ton of problems, going from no monitoring at all to what we have in place today.
Second Place Story
Server ghouls haunt bulk ingestion
I was supposed to monitor an Ingestion Server that was performing a bulk ingestion through an EC2 instance, with around 200 GB of data to be ingested into another server.
Since it was a huge amount of data and the ingestion would take another day to complete, I kept the ingestion going, and the logs looked healthy. I decided I’d log in early the next morning to check the ingestion status. During this time, log files were supposed to be created automatically by the ingestion, with each day’s file named log_dd-mm-yyyy.txt for that day’s date. It was a staging server, and the code was supposed to be supplied for UAT in a day or two.
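For reference, a daily name in that scheme is just a date format string away; a quick Python sketch:

```python
from datetime import date

# The story's convention: log_dd-mm-yyyy.txt for the current day.
log_name = date.today().strftime("log_%d-%m-%Y.txt")
print(log_name)  # e.g. log_28-08-2013.txt on 28 August 2013
```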
I logged in early the next morning to check the ingestion status. I was totally puzzled as I couldn’t make out what was happening:
- The previous day’s log file, log_27-08-2013.txt, showed everything going well until around 11 pm, with no logs thereafter.
- The current day’s log file, log_28-08-2013.txt, had been created but contained no data.
- The ingestion process was running with no errors.
- The server logs showed no errors.
- The system never went down.
- Nearly 150 GB of data was still to be ingested and was not progressing at all.
- None of the logs showed any updates as to why the ingestion was not progressing.
Since the delivery was urgent, I stopped the ingestion on the instance and restarted it. To my horror, the ingestion was not progressing at all. I tried running ingestion on other instances, and it worked fine.
Then something hit me, and I went back to check the ingestion logs. They still showed nothing, with 0 KB of space used by the logs. Wait!!! Space? 0 KB? 150 GB of data still remaining?
I immediately checked the disk space and found zero space available. Whoaa!!!
What actually happened was that while performing the ingestion, the server created a duplicate copy of the data on the same instance, and that copy remained there until the entire ingestion completed. By midnight, the ingestion had consumed around 250 GB of disk space and the disk was full. I immediately attached a bigger volume to the instance and restarted the ingestion. Thankfully it completed in a few hours, and that saved me from big trouble!!!
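The lesson generalizes: if a job temporarily duplicates its input on the same volume, verify up front that roughly twice the dataset’s size is free. A minimal standard-library sketch; the path and the 2x factor are assumptions drawn from this story, not the actual ingestion code:

```python
import shutil

def preflight_disk_check(path: str, dataset_bytes: int, factor: float = 2.0) -> None:
    """Refuse to start if the volume can't also hold the working copy."""
    free = shutil.disk_usage(path).free
    needed = int(dataset_bytes * factor)  # input + temporary duplicate
    if free < needed:
        raise RuntimeError(
            f"Need ~{needed / 1e9:.0f} GB free on {path}, "
            f"only {free / 1e9:.0f} GB available"
        )

# Hypothetical usage for the story's 200 GB ingestion:
preflight_disk_check("/data", dataset_bytes=200 * 10**9)
```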
Third Place Story
The unknown ghoul is still on the loose
We’re a Nagios shop and every now and then we get alerts for machines that simply don’t exist. We’ve paid people to check our configurations and no one can find anything wrong.
The latest server ghost appeared a few nights ago, and every admin was paged for a machine that was down. We have a relatively new admin who flipped out because he thought we had added a new server that he couldn’t reach. He got the datacenter on the phone immediately to have them look into it; needless to say, the technician thought we were out of our minds for making them look into a server that never existed. We checked the other machines, thinking that wires had gotten crossed somewhere, and they were all fine.
So we still don’t have any idea what even triggered Nagios to alert everyone.
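If you’re hunting a phantom like this in your own Nagios setup, one starting point is to sweep the object configuration for host definitions whose addresses no longer resolve. A rough Python sketch; the config path is an assumption, and real deployments often spread host definitions across many files:

```python
import re
import socket

CONFIG = "/usr/local/nagios/etc/objects/hosts.cfg"  # assumed path

with open(CONFIG) as f:
    text = f.read()

# Crude parse: pull host_name/address pairs out of each "define host" block.
for block in re.findall(r"define\s+host\s*\{(.*?)\}", text, re.DOTALL):
    name = re.search(r"host_name\s+(\S+)", block)
    addr = re.search(r"address\s+(\S+)", block)
    if not (name and addr):
        continue
    try:
        socket.gethostbyname(addr.group(1))
    except socket.gaierror:
        print(f"ghost? {name.group(1)} -> {addr.group(1)} does not resolve")
```

Nagios’s own preflight check (`nagios -v nagios.cfg`) is also worth running, though it validates syntax rather than whether the hosts it describes still exist.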
Have any scary stories you want to share before the holiday? Let us know in the comments!