About Michael Forhan

Michael is a Systems Engineer with AppFirst. He's been a developer and systems administrator since Linux kernel release 1.2. Michael's experience includes operations in web gaming, healthcare, news media, and on submarines in the US Navy. He is an avid tinkerer who loves learning, solving problems and socializing.

The Value of Context in Web Scale IT

In the world of monitoring we frequently hear about miraculous solutions provided by a single agent: how every problem can be solved with transaction tracing, user experience monitoring, and an understanding of how the software pipeline performs as it applies to the business.

All of these are great starting points, but today I’ll go through the case where the software pipeline is affected by the environment in which it lives, and application monitoring only shows the symptoms.

For example, my dashboard this morning:

AppFirst Dashboard Frontend Response Time Spike

The specific item in question, “Frontend Response Time,” shows a high value for a period this morning.

Frontend response time spike

And as I look into this value, it becomes evident that this is being caused by something external to the software stack.

As a quick bit of background, the eCommerce solution I’m running (Magento) uses files to keep track of each user as they access the site. That is normal activity and not excessively risky on its own. However, it means thousands of session files accumulate, and they quickly eat up all the available inodes if no action is taken.

Session files eat up all available inodes if no action is taken

Now, as part of normal maintenance, removing these excess files is easy enough. However, a simple ‘rm -f’ is insufficient (there are far too many files for the shell’s argument list). This is easily remedied with xargs:

Removing excessive files with xargs
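For reference, the command in the screenshot boils down to something like the following sketch. The session directory path is an assumption based on Magento’s default layout; adjust it for your own install.

# Magento's default session directory (an assumption; adjust for your install)
SESSION_DIR=/var/www/magento/var/session

# A plain 'rm -f sess_*' fails here because the expanded argument list is too
# long, so feed the file names to rm in batches via xargs instead
find "$SESSION_DIR" -type f -name 'sess_*' -print0 | xargs -0 rm -f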

However, execution like this takes time, and that time is visible in the data we’ve collected from the system, graphed in-line with the web stack performance data.

System and web stack performance graph

The blue line is the eCommerce response time, the purple line is the system disk utilization, and the orange line is the number of files being accessed by xargs. You can see the clear impact of the file removal on the response time, as well as the reduction in storage usage. As soon as the files have been removed, the response time recovers.

If you’d like to stop treating symptoms and see how the environmental context affects your business, sign up for a free 30-day trial at http://www.appfirst.com/signup/.

Managing the Cloud with Devices

With the ongoing transition of IT organizations from physical to virtual environments, the desire for complete visibility across the stack becomes ever more important. AppFirst has worked hard to bring complete, real-time visibility to applications and resources, and we’ve taken another step forward by offering deeper visibility into the hypervisor.

Our initial release of “Devices” and support for VMware’s ESXi showcases a single point of access for understanding your full-stack application and the virtualized environment in which it lives.

This is important for a couple of key reasons:

  1. Visibility into application resources breaks down in a virtual environment when stolen time increases – even at 10% stolen time, CPU metrics become almost meaningless, making things much more difficult to manage.

  2. Telling a consistent story about your virtualized environment is critical to maintaining and improving it. Visibility into the hypervisor with ESXi lets you compare information at the VM level against the hypervisor itself to see how much is going on between the two.

Devices support goes beyond just hypervisors and ESXi, though. The goal of Devices is to provide full insight across agent-based and agentless monitoring with a consistent UI. For practical purposes, this means you no longer have to query edge network devices by browsing through a server, for instance. You don’t need to locate and manage plugins on hosts and pass that knowledge through your organization as “tribal” knowledge, something known to administrators only through their own interactions. You can now see your routers and other agentless devices as their own services with their own service measurements.

So if you’re using VMware ESXi, check out our help documentation for how to get started with Devices. We’d love to know what you think and how we can make it easier to get visibility into your hypervisor.

Big Data Growing Pains with HBase

There has been a lot going on at AppFirst over the last week, and with the data delays on the console, many of our users have been feeling it as well.

We’ve built a robust pipeline to handle the data streaming into our backend from thousands of servers. Along the way, we’ve learned a tremendous amount about the limitations of various software, including Nginx, RabbitMQ, Redis and now HBase.

HBase is the core of the large-volume data store for our public SaaS offering, and over the last month we’ve been reaching disk usage limits that have required frequent maintenance and capacity management. This is nothing new in the operations world, but what we’ve learned is that HBase doesn’t take to server removals as nicely as we believed.

During the last week we’ve been upgrading capacity and performing data maintenance with more regularity. Adding storage should be transparent to HBase. However, the process of taking a node down, adding disks, and rejoining it to the cluster causes enough of a disturbance to back up our queues.

We are actively working to find new ways to improve this flow and minimize the impact to our customers. It is important to note that while there was a delay in data processing to our backend, there was no data loss – our queueing systems managed this condition as expected. Once HBase was back online, it handled data storage very well.

Additionally, over the weekend we found that our implementation of Zookeeper was performing a significant amount of disk read/write activity. Since Zookeeper is an in-memory distributed coordination and synchronization tool, heavy disk I/O is a serious operational concern.

When examining the source of this disk I/O, we found that our Zookeeper was configured to use 9 nodes. Apache recommends that Zookeeper run in a 3-, 5-, or 7-node configuration. As you can see from the graph, once we updated the configuration to run on a 5-node setup, disk I/O fell dramatically.
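For anyone curious what that change looks like in practice, a five-node ensemble is defined in zoo.cfg roughly as follows. The hostnames and paths here are placeholders, not our actual configuration.

# zoo.cfg sketch for a five-node ensemble (hostnames and paths are placeholders)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.internal:2888:3888
server.2=zk2.internal:2888:3888
server.3=zk3.internal:2888:3888
server.4=zk4.internal:2888:3888
server.5=zk5.internal:2888:3888

Each node also needs a matching myid file in its dataDir, so shrinking an ensemble means removing the extra server.N entries from every node’s config and restarting the remaining nodes.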

We are aware of the impact this configuration and maintenance work has on our users and infrastructure. We have been developing a new storage model that improves our data maintenance capability, and we continue to analyze the impact of system configurations. These changes will result in a much smoother experience.

Network Monitoring over SNMP

SNMP is a simple, easy way of monitoring your network and infrastructure. With a basic understanding of the MIB for your device and its OID tree, you can easily capture the data to visualize your network activity across various parts of your organization. Whether you are monitoring network endpoints, traffic volumes, or even wireless signal strength, there is a lot of capability built into SNMP that AppFirst can easily tap into.

As a remote worker, I’m constantly working from wireless networks, and it’s important for me to understand if my connectivity issues are related to the wireless interface, the upstream network, or network traffic volume.

In this example, I’ll show how easy it is to collect SNMP data as polled data, and how to expand the collection to show WiFi connectivity per device. This is particularly helpful in understanding whether you need to expand wireless coverage in your organization, change channels, or provision new equipment. It may also give you visibility into unauthorized wireless repeaters in your organization by monitoring signal-to-noise trends throughout your deployment.

The core of SNMP monitoring is understanding the MIB tree for your device. As a remote worker, I use consumer-grade hardware (a Linksys WRT54G) on which I’ve installed DD-WRT. This lets me view my network at a depth not normally provided by the hardware as shipped, and at a much lower price point. The upside is that the MIB hierarchy is a standard, so the same approach applies to enterprise-level devices.

The first part of monitoring is understanding the values you want to monitor. To do this, I’ve opened up my OID tree with a walk of my device using the iReasoning MIB browser. With a little help from the DD-WRT forums, I’ve determined that the values I’m interested in are:

SNMP

These two addresses give me a breakdown of the SNR (signal-to-noise ratio) for each of the devices on my network, keyed by MAC address. A MAC address on its own would only be moderately useful, even if I maintained a list of them; I find it more valuable to translate it to an IP address, so I need a script to cross-reference this data. The MIB item that maps each IP address assignment to its MAC address is located by my walk in the MIB browser:

MIB Browser

As illustrated above, you can see that it’s part of ipNetToMediaEntry, or 1.3.6.1.2.1.4.22.1, with a breakdown of each IP address and its associated MAC address. With these pieces of information I built a simple Linux shell script that uses snmpwalk to loop through my DHCP address pool and collect the MAC addresses, then correlates those with the earlier entries for MAC address and SNR for each of the wireless clients on my network. When executed, the shell script returns the following:

/usr/share/appfirst/plugins/libexec/check_snr.sh OK: 2 | addresses=2; 192.168.1.135=61; 192.168.1.137=35

This output contains “performance data” after the pipe character, which is mapped into Correlate and onto the dashboard via the multi-value widget.
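To give a feel for the approach, here is a rough sketch of what such a script could look like. It is not the exact script behind the output above: the community string, router address, and especially the SNR table OID are placeholders you would replace with the values from your own MIB walk.

#!/bin/bash
# check_snr.sh -- a sketch only, not the exact script from the screenshots.
# The community string, router address and SNR table OID are placeholders;
# substitute the values found in your own MIB walk.

COMMUNITY="public"
ROUTER="192.168.1.1"
ARP_OID="1.3.6.1.2.1.4.22.1.2"   # ipNetToMediaPhysAddress: maps each IP to its MAC
SNR_OID="1.3.6.1.4.1.99999.1"    # placeholder for the per-client SNR entries

# Walk both tables once; ARP lines end in "...192.168.1.135 = STRING: aa:bb:cc:dd:ee:ff"
arp_table=$(snmpwalk -v 2c -c "$COMMUNITY" "$ROUTER" "$ARP_OID")
snr_table=$(snmpwalk -v 2c -c "$COMMUNITY" "$ROUTER" "$SNR_OID")

# Pull out "ip mac" pairs, look each MAC up in the SNR walk (this sketch assumes
# the MAC appears in those lines), and emit Nagios-style output with one
# "ip=snr" entry per wireless client after the pipe character.
printf '%s\n' "$arp_table" \
  | sed -n 's/.*\.\([0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\) = STRING: \(.*\)/\1 \2/p' \
  | {
      count=0
      perfdata=""
      while read -r ip mac; do
          snr=$(printf '%s\n' "$snr_table" | grep -i "$mac" | awk '{print $NF}' | head -n 1)
          if [ -n "$snr" ]; then
              count=$((count + 1))
              perfdata="$perfdata; $ip=$snr"
          fi
      done
      echo "OK: $count | addresses=$count$perfdata"
    }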

AppFirst Multi Value Widget

I can also see a history of my wireless signal for each address assigned in my network so I can track connection problems internal to my network when they occur.

Network Correlation

Security Tools of the Future

Today I read an article, ‘Security Tools of the Future,’ by Edward Haletky. Edward makes great points about the limits of fingerprinting technologies with regard to system exploitation. He recommends several products in his post, including AppFirst, as part of the new breed for getting better security-specific logs out of clouds. While AppFirst does a great job of this in itself (try it for yourself), our true value proposition comes in the form of Edward’s first point: “…we need first to understand what is considered normal behavior…” And once we have this, Edward continues, we can then get to the abnormal.

Defining normal behavior is part of the package at AppFirst. With our process-based data collection, we can define what is normal from the interactions on a user’s system, the applications they use, and the network resources they consume.

Application Footprint

An example of application footprinting from the AppFirst web console

It is this normal activity that can then be used to detect changes in those applications and in resource utilization, and to identify the ‘abnormal’ in a statistical way. This goes beyond the traditional insight of virus detection and fingerprinting to a deeper level of security awareness, reaching toward a category previously impossible to manage: exploitation through social engineering.

This is a category that has always been difficult to address because of the human element. By using normalized comparisons across your organization, you can look for the ‘edge cases’ of system utilization and find individuals who may be leaking data, willingly or unwillingly, through their impact on your organization’s infrastructure.

Application Resource Utilization

Comparison of resource utilization over two periods

It is from this baseline of abnormal usage, or the standard deviation of usage across your organization, that you can start building insight into how your resources are used. And because we collect at the process level, we can also add to this insight each user’s footprint on the network and the internet.

Application Socket Usage and Traffic per IP

Breakdown of socket usage with traffic per IP for an application

Lastly, we can show you those connection points and their total usage, including data sent/received from each location and the processes that own those connections. All of this with zero configuration.

I agree with Edward Haletky in that we’ll see more tools in the future for solving security problems in the organization that can’t be caught by traditional virus and spyware detection tools. AppFirst’s platform is leading the charge in helping these innovative organizations change for the better: understanding what is normal for your application and organization by revealing your users’ fingerprint in your infrastructure.

Redefining “Application”

In any discussion involving the term ‘application,’ what is the first thing that comes to mind? For many, it’s a traditional piece of software like Microsoft Office, Adobe Creative Suite or perhaps server software like Apache. In the world of AppFirst, we accept those traditional definitions, but we’ve decided to take it a step further.

The application is the heart of your business. It consumes all the resources of your company – both in business and technical operations. However, the application is not a single piece of software or server. The application is a collection of components which frequently includes commercial offerings: load balancers, databases, and queuing systems. It spans many servers, and often spans locations and occasionally other organizations as well.

Having such a widely distributed resource list presents an interesting challenge when it comes to knowing the true footprint of your application. This is especially important for hardware provisioning, system migrations and planning for scalability. AppFirst meets this challenge by revealing the true footprint of your application through a multi-dimensional analysis.

“Knowing your application’s resource requirements and use-cycle lets you plan for transition with a solid foundation.”

What do I mean by multi-dimensional analysis? By examining your application at the process level, we get a tremendous amount of information about the interfacing resources. We gain insight into whether components of your application are memory-bound, network-bound, or CPU-bound. We show you this for each component across the entire pipeline of information that makes up your application. By thoroughly defining this pipeline, you can also see the resource requirements in aggregate.

Now, there are other solutions that show you snapshots of your application, giving you “worst case” scenarios and trying to prepare you with reports. The underlying problem is that these fail to examine the full breadth of your application over your use-cycle: your application throughout high-load, maintenance, and normal-load periods. Knowing your application’s resource requirements and use-cycle lets you plan for transition with a solid foundation.

Ultimately, business decisions are made around the footprint of the application, whether provisioning servers, preparing for the dreaded move to “the cloud,” or just providing a business case to request new hardware. Our multi-dimensional analysis will give you solid information to make the case and provide evidence in both the preparation and post-purchase phases.

Instrumenting your Application on AppFirst using Windows Performance Counters

Windows Performance Counters are a great way of instrumenting your application with metrics that Microsoft already provides through their performance counter system. I’ll walk through instrumenting a .Net web application called “nopCommerce.”

The first thing we need to do is find the Windows Performance Counter we want. While you can use the command-line tool ‘typeperf.exe,’ I like seeing the graphs to verify I have the right counter.
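If you do want to stay on the command line, typeperf can list the available counters and confirm that one reports data. The object and instance names below match this example, but treat them as placeholders for your own setup.

REM List the counters and instances available under the SQL Express Databases object
typeperf -qx "MSSQL$SQLEXPRESS:Databases"

REM Sample the Transactions/sec counter a few times to confirm it reports values
typeperf "\MSSQL$SQLEXPRESS:Databases(nopcommerce956)\Transactions/sec" -sc 5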

With Performance Monitor open, add a new chart item:

With the Add Counter window open, find the counter in the list. In this example, I’m using MSSQL$SQLEXPRESS:Databases:

The next step is to select the instance you want information for. To populate the instance window you need to click on the counter name:

My instance was nopcommerce. If you add the chart now, you’ll get ALL of the counters in the Databases container for that instance. If you only want specific metrics, select the + to drill down to the individual counters for that container.

You can Ctrl-click on either instances or counters to add multiple items to your chart. For this example I’m only interested in Transactions/sec:

At this point you can add the counter to your graph and verify it is providing the values you want to see. The next step is to verify the full path to your counter. In perfmon, select the Properties button at the top of the graph:

The properties window will display the full path to the counter:

With this information, we have all we need to tell AppFirst to collect this counter. We drop the leading \ and replace the \ before the specific metric with an =. The line should read: “MSSQL$SQLEXPRESS:Databases(nopcommerce956)=Transactions/sec”. We add this line to our polled-data config file, located under Administration – Collectors. Select Edit for the collector you want to collect counters on:

Select the Polled Data Config tab, and add the line as highlighted:

Save the file. On its next delivery, the collector will see the change, sync it to the system, and read the new options. At this point I recommend going to Workbench – Servers:

Under the polled data status area on your server, you’ll see the counter show up as soon as the process is complete:

And that’s it. Try it for yourself by signing up for AppFirst today!

Lessons from Distill

Last week was Engine Yard’s Distill 2013 Cloud Conference. It was two days of keynotes, breakout sessions and food with some tremendous players in business and development. From Nolan Bushnell, founder of Atari, to Richard Rodger, the COO of NearForm, there were a lot of great keynotes and sessions on development, a vision for the cloud, and having a bit of fun as geeks.

What I found interesting was a common thread among sessions about modular, fast-paced development, whether it was Fred George’s talk on Anarchy, Richard Rodger’s talk on Rapid Development or Richard Watson’s talk on Cloud-Aware Applications.

The cloud is a new and drastically different way to see your business as an infrastructure. We have done away with our own data centers and hardware in pursuit of flexibility, elasticity and scalability. But it isn’t just about hardware. The startup culture has thrived in this new dynamic: companies like Netflix, which were born in the cloud, have used it as a way to redefine operations from “mean time between failures” to “mean time to recovery.” They were raised on unstable resources, and instead of lamenting the failure of the cloud, they rewrote how they approached infrastructure and, in turn, changed the meaning of software development.

Infrastructure gave way to IaaS and PaaS, and for all the criticism, it is likely here to stay. This change isn’t happening alone. Startups have found ways to create a cloud of their development teams. In many of the conversations I had, the developers and the companies were people spread to the wind. They weren’t commuting into one office; they were national or international and relied on collaborative tools to build those interoffice bonds. The only thing a developer needs is a computer and internet access, and the startup is moments from giving birth to a new product. Google sees innovation as a scientific measure of interactions. Even the CEO of Yahoo has pulled back the company’s remote workforce. But what I saw at Distill was successful startups with people spread from Japan to the Netherlands, Australia to South Dakota. Is this a fundamental shift? I wouldn’t be able to say, but it may come across our screens some day as DaaS – Development as a Service.

I’ve mentioned IaaS and maybe someday DaaS, but a lot of the talk in sessions and over cocktails was about development itself. What is the best way to deploy onto the cloud, what is the best way to write reusable code, how do we respond rapidly to business change? The answer seemed to be micro-services. By creating small individual pieces, programs that do one thing very well, we reduce overhead and increase scalability. We give meaning to elasticity. By developing hundreds to thousands of small programs, removing deep levels of connascence (thank you, Jim Weirich), and leveraging developer-driven re-examination and refactoring, we create responsive, elastic cloud applications.

Ultimately I came away from Distill with two things: 1 – Distill was a great name for summarizing the state of change in development. 2 – The cloud is much more than a collection of servers you can rent when you need them. By breaking down infrastructure, software stacks and developers, we have created an environment where not only is the application responsive and dynamic, but so is the infrastructure that supports it.