As a startup, it is painful to read about the trials and tribulations of other startups, but the reality is that things happen and life isn’t always rosy. The most important thing is to learn from what happened and move on. Last week Foursquare experienced one of those painful moments. It is just as important in such moments to communicate and be transparent: users understand that things happen, and they are more likely to stick with a company if they feel it is being truthful, owning up to the problem and learning from it. The Foursquare and MongoDB teams are to be commended for their transparency.
If you read the post explaining the cause of Foursquare’s outages and the actions taken, you’ll see that the root cause was running out of memory. They had 66GB of RAM on a machine, and the data they kept in RAM grew to 67GB. It is a simple thing and it happens all the time: memory, CPU and disk usage grow beyond our expectations.
But the result doesn’t have to be the same as it was for Foursquare. If Foursquare had been using AppFirst to monitor their servers, they would have been alerted before it was too late. Not only would they have known the server was running out of memory, they would also have known which process or set of processes in their application was causing memory usage to increase significantly.
Out of the box, AppFirst creates default alerts on CPU, disk and memory, so when the Foursquare system reached 80% memory utilization they would have known about it. They could have taken immediate action to add memory, especially since they are running on EC2; that is the promise of cloud computing. Then, with the immediate disaster avoided, they could have drilled into their code knowing exactly which part to focus on, because AppFirst would have identified the process or set of processes causing the issue. But they weren’t using AppFirst, so they had no visibility into memory utilization and hence went down. The length of the outage was the result of a chain of effects that began with running out of memory. Again, if Foursquare had been using AppFirst, they wouldn’t have gone down in the first place AND wouldn’t have experienced a multi-day outage.
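To make the idea concrete, the 80% memory alert described above can be sketched as a simple threshold check plus a ranking of processes by memory use. This is only an illustrative sketch, not AppFirst’s actual implementation or API: the function names `memory_alert` and `top_consumers`, the sample process names and the per-process figures are all hypothetical. Only the 66GB machine size and the 80% threshold come from the discussion above.

```python
from typing import Dict, List, Tuple

GB = 1024 ** 3  # bytes per gigabyte


def memory_alert(used_bytes: int, total_bytes: int, threshold: float = 0.80) -> bool:
    """Return True once memory utilization reaches the alert threshold."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes >= threshold


def top_consumers(rss_by_process: Dict[str, int], n: int = 3) -> List[Tuple[str, int]]:
    """Return the n processes using the most memory, largest first."""
    ranked = sorted(rss_by_process.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    total = 66 * GB  # machine size mentioned in the post
    # Hypothetical per-process memory usage, for illustration only.
    usage = {"mongod": 52 * GB, "nginx": 1 * GB, "cron": 0}
    used = sum(usage.values())
    if memory_alert(used, total):
        for name, rss in top_consumers(usage, n=1):
            print(f"ALERT: memory at {used / total:.0%}, top consumer: {name}")
```

On a 66GB machine the 80% threshold is crossed at about 52.8GB, which leaves room to act well before the data outgrows RAM.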
As startups we move fast, but it is important to keep the basics in mind, like monitoring our servers and applications. This is extremely easy to do with AppFirst. Since we are a SaaS-based service, all you need to do is download and install a collector on your servers, which takes minutes. You’ll see real-time application and infrastructure data within minutes. You won’t be flying blind like Foursquare was.