This week at AppFirst – 10/21/10

Here are the latest updates at AppFirst this week:

New Features 

  • Added duration to alerts: you can now set how long an alert condition must hold true before the alert triggers
  • Faster run-time: the whole website should be much faster now, especially if you are on a slow connection
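
To illustrate the alert duration feature above, here is a minimal sketch in Python of how a duration-gated alert can behave. It is not AppFirst's implementation; the class name `DurationAlert`, its fields, and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DurationAlert:
    """Hypothetical alert that fires only after a sustained breach."""
    threshold: float                      # e.g. 80.0 for 80% utilization
    duration_s: float                     # how long the breach must last
    _breach_start: Optional[float] = None

    def observe(self, timestamp_s: float, value: float) -> bool:
        """Feed one sample; return True once the breach has lasted duration_s."""
        if value < self.threshold:
            # Condition no longer true: forget the earlier breach.
            self._breach_start = None
            return False
        if self._breach_start is None:
            self._breach_start = timestamp_s
        return timestamp_s - self._breach_start >= self.duration_s


# Assumed samples: (seconds, percent). The dip at t=240 resets the timer.
alert = DurationAlert(threshold=80.0, duration_s=300)
samples = [(0, 85), (120, 90), (240, 70), (300, 88), (600, 91)]
fired = [alert.observe(t, v) for t, v in samples]
print(fired)  # [False, False, False, False, True]
```

A short spike that clears before the duration elapses never triggers, which is the point of adding a duration to an alert.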

Bug Fixes

  • Fix some cases of false positives on collector down alerts
  • Fix slowness of Application creation/editing dialog for large tags
  • Fix minor bugs with users whose free trial has expired

Have you noticed a problem that needs fixing, or do you have a great idea for a new feature on AppFirst? Let us know! We rely on your feedback to make a product that works for everyone!

We’re Up!

Thank you for your patience! Although we told you we’d need a 24-hour maintenance window this weekend, we were in fact unavailable for only 1.5 hours. You haven’t lost any data; it’s all there. We are sure you will appreciate the peppiness of the UI, so feel free to add as many servers as you have. Tell all your friends that AppFirst is the only monitoring service that delivers!

Thanks to Mike Schroll, one of our incredible customers and supporters, for his insight into how to make DNS propagation happen quickly!

And dev team, you ROCK!

Maintenance Window This Weekend

Thank you for your continued support of AppFirst! To keep up with growing needs, we have undertaken a redesign of our backend infrastructure and architecture that delivers better performance and scalability. After several weeks of testing, we are ready to move it into production.

This weekend AppFirst will be taken down on Saturday, Oct 16, 2010 at noon ET and will be back up on Sunday, Oct 17, 2010 at noon ET. It may be back sooner than that, depending on DNS propagation. During this 24-hour period you will not receive any alerts. You won’t lose any data: the collector will continue to collect data and store it on your servers until it can reconnect with our backend and send it all up. On Sunday afternoon you’ll be able to access AppFirst the same way you always have, at wwws.appfirst.com.
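
For the curious, the collect-now, send-later behavior described above can be sketched as a simple store-and-forward queue. This is only an illustration, not the actual collector: the class `StoreAndForward`, the `send` callback, and the `FakeBackend` stand-in are hypothetical, and the sketch keeps the backlog in memory, whereas a real collector would presumably persist it.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class StoreAndForward:
    """Hypothetical buffer: keep samples locally until the backend accepts them."""

    def __init__(self, send: Callable[[Dict], bool]) -> None:
        self._send = send                      # returns True on successful upload
        self._pending: Deque[Dict] = deque()

    def record(self, sample: Dict) -> None:
        """Store a new sample, then try to drain the backlog in order."""
        self._pending.append(sample)
        self.flush()

    def flush(self) -> int:
        """Send oldest-first; stop at the first failure. Returns count sent."""
        sent = 0
        while self._pending and self._send(self._pending[0]):
            self._pending.popleft()
            sent += 1
        return sent

    @property
    def backlog(self) -> int:
        return len(self._pending)


class FakeBackend:
    """Stand-in for a backend that is down during maintenance."""

    def __init__(self) -> None:
        self.online = False
        self.received: List[Dict] = []

    def send(self, sample: Dict) -> bool:
        if self.online:
            self.received.append(sample)
        return self.online


backend = FakeBackend()
collector = StoreAndForward(backend.send)
for i in range(3):                 # backend offline: samples pile up locally
    collector.record({"seq": i})
print(collector.backlog)           # 3
backend.online = True              # backend returns: backlog drains in order
collector.flush()
print([s["seq"] for s in backend.received])  # [0, 1, 2]
```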

If you have any questions or concerns, please let us know. Sorry for the inconvenience, but we are sure you will be very happy with the results, which position us to serve you well as your infrastructure grows!

Monitoring to Prevent Outages That Foursquare Experienced

As a startup, it is painful to read about the trials and tribulations of other startups, but the reality is that things happen and life isn’t always rosy. The most important thing is to learn from what happened and move on. Last week Foursquare experienced one of those painful moments. Another important element in those challenging moments is to communicate and be transparent: users understand that things happen, and they are more likely to ride it out if they feel the company is being truthful, owning up and learning from the situation. The Foursquare and MongoDB teams are to be commended for their transparency.

If you read the post explaining the cause of Foursquare’s outages and the actions taken, you’ll see that the root cause was running out of memory. They had 66GB of RAM on a machine, and the data they kept in RAM grew to 67GB. It’s a simple thing, and it happens all the time: memory, CPU and disk usage grow beyond our expectations.

But the result doesn’t have to be the same as it was with Foursquare. If Foursquare had been using AppFirst to monitor their servers, they would have been alerted before it was too late. Not only would they have known the server was running out of memory, but they would also have known which process or set of processes in their application was causing memory usage to increase significantly.

Out of the box, AppFirst creates default alerts on CPU, disk and memory, so when the Foursquare system reached 80% memory utilization they would have known about it. They could have taken immediate action to increase memory, especially since they are running on EC2; that is the promise of cloud computing. Then, with the immediate disaster avoided, they could have drilled into their code knowing exactly which part to focus on, because AppFirst would have identified the process or set of processes causing the issue. But they weren’t using AppFirst, so they had no visibility into memory utilization and hence went down. The length of the outage resulted from a chain effect that followed from the root cause of running out of memory. Again, if Foursquare had been using AppFirst, they wouldn’t have gone down in the first place AND wouldn’t have experienced a multi-day outage.
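
To make the numbers concrete: on a 66GB machine, an 80% memory alert trips at 52.8GB, well before the 67GB point where things went wrong. The sketch below, in Python, shows the idea of pairing a threshold check with a per-process breakdown. It is an illustration only, not AppFirst's code; the function `memory_alert` and the per-process figures are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def memory_alert(
    total_gb: float,
    used_by_process_gb: Dict[str, float],
    threshold_pct: float = 80.0,
) -> Optional[Dict[str, object]]:
    """Return alert details if total usage is at or above threshold_pct, else None."""
    used_gb = sum(used_by_process_gb.values())
    used_pct = 100.0 * used_gb / total_gb
    if used_pct < threshold_pct:
        return None
    # Rank processes by memory so the biggest consumer is listed first.
    ranked: List[Tuple[str, float]] = sorted(
        used_by_process_gb.items(), key=lambda kv: kv[1], reverse=True
    )
    return {"used_pct": round(used_pct, 1), "top_processes": ranked[:3]}


# Hypothetical breakdown on a 66GB machine (figures invented for illustration).
usage = {"mongod": 54.0, "nginx": 0.5, "sshd": 0.1}
alert = memory_alert(total_gb=66.0, used_by_process_gb=usage)
print(alert["used_pct"])             # 82.7
print(alert["top_processes"][0][0])  # mongod
```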

As startups we are moving fast, but it is important to keep in mind the basics, like monitoring our servers and applications. That is extremely easy to do with AppFirst: since we are a SaaS-based service, all you need to do is download and install a collector on your servers, which takes minutes. You’ll see real-time application and infrastructure data within minutes. You won’t be flying blind like Foursquare was.

Why Should You Be Writing Custom Probes?

CA Nimsoft is telling customers and prospects that they must write custom probes in order to have rich monitoring. What are they thinking? You have much more valuable things to do with your limited time and resources. Going that route requires not only an initial investment of your time, but also the continuous maintenance nightmare of keeping your custom probes updated. CA Nimsoft thinks it’s fine to put the engineering and support burden on their customers. Don’t let them do it!

With AppFirst there is no need for you to write custom probes; we do all the engineering and support for you. We know you have a business to run, so you can leave the monitoring to us. Just by installing our collector on your servers (which takes a minute or so), you will immediately be able to access all your application and infrastructure data, so you won’t miss anything. Any app, any language, anywhere it is running: complete visibility.