Monitoring IPMI Sensor Data

A customer of ours recently told us about their use case for extending the AppFirst collector to support the data center team. Ops was already using AppFirst’s platform to collect, aggregate and correlate massive amounts of application & operations data for monitoring & troubleshooting. However, there was a glaring concern about how long it would take the data center guys (and gals) to log into their own system and check the current status and history of their hardware data.

In turn, they extended the AppFirst collector to capture IPMI sensor data to monitor the environmental trends of their physical hardware. Taken together they were able to see key metrics from the physical hardware all the way up to business performance.

Specifically, the data center team needed to track a few key metrics and analyze them over time to see whether the hardware was trending toward failure.

  1. Fans: Are the system’s fans about to fail, working too hard or experiencing degradation?

  2. Temperature: Is the system temperature anything but stable?

  3. Voltage: Has a power supply gone bad? Of particular interest is the Power Good signal, which alerts a computer to an improper power supply and so prevents it from attempting to operate on improper voltages and damaging itself (1).

How to get started

First, a little background. AppFirst’s collector supports the ingestion of multiple types of additional data: logs, StatsD and polled data. Polled data can come from scripts that emit Nagios-format output, collecting data from APIs, management interfaces, SNMP, IPMI or any other available source.

Collecting IPMI Data

AppFirst supports server hardware equipped with a BMC, with a variety of hardware sensor data that can be monitored using the IPMI standard. IPMI sensor data can be gathered in-band (through the host O/S), or out-of-band (through a dedicated LAN connection independent of the system processor and host O/S). A free tool, FreeIPMI, is available to examine IPMI data, and AppFirst has written a series of scripts that allow you to take advantage of AppFirst’s big-data store and visualization tools to easily monitor this data.

Here is an example that uses in-band communications (out-of-band also works). You can do this two ways: with polled data or with log data.

To run either method you will need to install FreeIPMI and OpenIPMI on your system.

You can build FreeIPMI from source, or you may be able to find an RPM package compatible with your OS.

Before moving any further along, make sure that you can execute the ipmi-sensors command. The output looks like this (but much longer):

[appfirst@servername: ~]$ sudo /usr/sbin/ipmi-sensors
ID | Name            | Type       | Reading | Units | Event
1  | Pwr Unit Status | Power Unit | N/A     | N/A   | 'OK'
2  | IPMI Watchdog   | Watchdog 2 | N/A     | N/A   | 'OK'
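Once the command runs, a quick sanity check is to filter the table for rows whose Event column is anything other than 'OK'. The following is a minimal sketch, assuming the six-column, pipe-delimited layout shown above (other FreeIPMI versions or options may print different columns); the function name non_ok_sensors is hypothetical and not part of FreeIPMI or AppFirst.

```shell
# Print "name: event" for every sensor whose Event column is not 'OK'.
# Reads ipmi-sensors style output on stdin.
# Assumed layout: ID | Name | Type | Reading | Units | Event
non_ok_sensors() {
    awk -F'|' -v ok="'OK'" '
        {
            # Trim the padding around every column.
            for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
        }
        $1 == "ID" { next }                       # skip the header row
        NF >= 6 && $6 != ok { print $2 ": " $6 }  # report anything not OK
    '
}
```

To use it, pipe the real command into the function, for example: sudo /usr/sbin/ipmi-sensors | non_ok_sensors. No output means every listed sensor reported 'OK'.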

Polled Data Configuration:

Once you have ipmi-sensors working, you will need a Perl script, check_ipmi_sensors, to format the output to be Nagios-compatible. The script is currently installed along with the collector, so it should already be located here:

/usr/share/appfirst/plugins/libexec/check_ipmi_sensors

If not, it’s available here: https://github.com/appfirst/nagios-plugins/blob/master/check_ipmi_sensors

You can test that it’s working by executing this command:

sudo /usr/share/appfirst/plugins/libexec/check_ipmi_sensors -t fan

and you should get output that looks like this:

fan OK | System_Fan_1=2156.00;System_Fan_2=2156.00;System_Fan_3=2156.00;System_Fan_4=2156.00;Processor_1_Fan=2450.00
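If you want to inspect the individual readings while testing, the metrics after the "|" can be split into one name and value per line. This is a minimal sketch that assumes the semicolon-separated name=value form shown in the sample output above; the function name perfdata_to_lines is hypothetical.

```shell
# Split the metrics portion of the plugin output into "name value" lines.
# Reads the plugin output on stdin.
perfdata_to_lines() {
    # 1. Drop the status text up to and including the first "|".
    # 2. Put each semicolon-separated metric on its own line.
    # 3. Keep only well-formed name=value pairs.
    sed 's/^[^|]*|[ ]*//' | tr ';' '\n' | awk -F'=' 'NF == 2 { print $1, $2 }'
}
```

For example: sudo /usr/share/appfirst/plugins/libexec/check_ipmi_sensors -t fan | perfdata_to_lines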

From your AppFirst UI, select Admin → Setup → Polled Data from the top menu, and locate your collector. Click on the Server Hostname and add lines similar to these to the config file:

command[sensor_fan]=/usr/share/appfirst/plugins/libexec/check_ipmi_sensors -t fan
command[sensor_temp]=/usr/share/appfirst/plugins/libexec/check_ipmi_sensors -t temperature
command[sensor_voltage]=/usr/share/appfirst/plugins/libexec/check_ipmi_sensors -t voltage

The -t parameter is the sensor type. To get a list of sensor types, execute:

sudo /usr/sbin/ipmi-sensors -L

Note that your hardware may not report all of these.
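If you monitor several sensor types, the config lines can be generated rather than typed by hand. The sketch below assumes the plugin path used above and single-word sensor type names; the function name polled_config_lines and the sensor_<type> key naming are illustrative choices (the example above uses sensor_temp for temperature, for instance).

```shell
# Print one polled-data config line per sensor type given as an argument.
polled_config_lines() {
    plugin=/usr/share/appfirst/plugins/libexec/check_ipmi_sensors
    for type in "$@"; do
        # Build a config key from the type, replacing unusual characters.
        key=$(printf '%s' "$type" | tr -c 'A-Za-z0-9' '_')
        printf 'command[sensor_%s]=%s -t %s\n' "$key" "$plugin" "$type"
    done
}
```

Running polled_config_lines fan temperature voltage prints three lines that can be pasted into the config file.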

Save the file. The collector can take up to five minutes to pick up the new configuration and start polling the device. In the Correlate tool, you will soon find the new polled data commands available to display.

Log Data Configuration:

You can monitor the output of any command as log data by simply capturing its output and appending it to a file, and then configuring the collector to monitor that file as a log file. One way to do that would be to execute the command as a cron job. You would also want to configure logrotate to prevent the excess consumption of disk space.
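The capture-and-append approach can be sketched as a small wrapper that cron calls. This is an illustration only, not an AppFirst script: the function name append_output and the timestamp separator line are assumptions, and log rotation is still left to logrotate.

```shell
# Run a command and append its output to a log file,
# preceded by a timestamp line so each run is easy to tell apart.
# Usage: append_output LOGFILE COMMAND [ARGS...]
append_output() {
    logfile=$1
    shift
    {
        printf '=== %s ===\n' "$(date '+%Y-%m-%dT%H:%M:%S')"
        "$@"
    } >> "$logfile" 2>&1
}
```

A cron job could then call something like append_output /var/log/ipmi-sensors.log /usr/sbin/ipmi-sensors from a script that defines the function.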

An alternative is to use our simple Bash script, poll2log, and leave the storage to us. It simulates log rotation, and only stores the output from one execution. This script should have been installed when you installed your collector, and should be located here:

/usr/share/appfirst/plugins/libexec/poll2log

But if not, you can find it here: https://github.com/appfirst/nagios-plugins/blob/master/poll2log

Add a line like this to your crontab (probably as root):

*/5 * * * * /usr/share/appfirst/plugins/libexec/poll2log "/usr/local/sbin/ipmi-sensors --output-sensor-state" /var/log/ipmi-sensors.log

From your AppFirst dashboard, select Administration | Logs, and click the Add Log button (upper left). Find your server in the pull-down list, set the Type to “File”, and the File Path to /var/log/ipmi-sensors.log. Save the configuration, and give the collector a few minutes to receive the configuration. In the Correlate tool, you will soon find the new log files that can be displayed.

Out-of-Band Configuration:

You can monitor the IPMI sensor data from another machine that has a collector installed. This allows you to see the sensor data even when the target machine does not have a collector installed. You simply need to add the host information to the command line:

command[sensor_fan]=/usr/share/appfirst/plugins/libexec/check_ipmi_sensors -h 10.7.7.7 -s "-u username -p password" -t fan

Alternatively, the username can be supplied through an environment variable. Set it first:

export IPMI_USER=user

and then pass "-u $IPMI_USER" in place of the literal username.
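When several remote hosts are involved, the out-of-band config lines can be built from values held in environment variables. The sketch below is an illustration under stated assumptions: IPMI_USER follows the text above, while IPMI_PASSWORD, the function name oob_config_line and the key naming scheme are hypothetical. The -h, -s and -t options are used exactly as in the example above.

```shell
# Print an out-of-band polled-data config line for one host and sensor type.
# Usage: oob_config_line HOST TYPE
# Credentials are read from IPMI_USER and IPMI_PASSWORD (the latter is an
# assumed variable name, not something the plugin defines).
oob_config_line() {
    host=$1
    type=$2
    plugin=/usr/share/appfirst/plugins/libexec/check_ipmi_sensors
    # Stop early if the credentials are not in the environment.
    : "${IPMI_USER:?set IPMI_USER first}"
    : "${IPMI_PASSWORD:?set IPMI_PASSWORD first}"
    # Build a config key such as fan_10_7_7_7 from the type and host.
    key=$(printf '%s_%s' "$type" "$host" | tr -c 'A-Za-z0-9' '_')
    printf 'command[sensor_%s]=%s -h %s -s "-u %s -p %s" -t %s\n' \
        "$key" "$plugin" "$host" "$IPMI_USER" "$IPMI_PASSWORD" "$type"
}
```

The printed line still contains the credentials in clear text, just like the hand-written example, so treat the resulting config accordingly.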

Links that may be of interest:

Intel BMC Web Console Users Guide: http://download.intel.com/support/motherboards/server/sb/intel_rmm4_ibwc_userguide_r2_72.pdf

(1) Wikipedia: http://en.wikipedia.org/wiki/Power_good_signal

The Road to Comprehensive Monitoring & Immediate Insight

Sasha Jeltuhin, CTO of Bridgevine, presented at the Gartner Data Center, Infrastructure and Operations Summit today in Las Vegas. During his case study presentation, Sasha explained:
  1. Their makeover of infrastructure, software-upgrades and real-time application monitoring
  2. The key audiences and requirements for data collection and monitoring
  3. Why they selected AppFirst’s application operations platform
  4. Results to date

Below are the highlights from the presentation and reactions from those in the room…


Learn more about the requirements for web-scale monitoring in this brief, featuring Gartner research.

 

Bridgevine Selects AppFirst for Critical Application Data and Insights

AppFirst, the leader in unified application operations visibility, today announced that Bridgevine has selected its technology to deliver critical system insights across the entire stack. The implementation allows Bridgevine to make critical forward-thinking decisions with significantly improved visibility.


Bridgevine is a technology company that is committed to ensuring its clients never encounter a negative customer experience or disjointed interaction. Selecting the right technology partner to provide granular visibility across applications and infrastructure was a strategic decision for enabling the highest level of business uptime. After identifying detailed requirements and testing multiple solutions, they chose AppFirst.

“We drive 50+ million consumer touch points and enable millions of digital enrollments every year from our national catalogue of service providers,” said Sasha Jeltuhin, CTO of Bridgevine. “Due to the scale of our transactions and their importance to our business, we needed a miss nothing approach toward proactive monitoring of the platform, complex integrations with service providers and metric collection. This ensures we have a deterministic view of our environment and can see the details of every event across the stack, regardless of when they occur. As a company focused on delivering high quality services, legacy polled metric collection missed our requirements for data velocity, quality, operational efficiency and immediate insight.”

“With AppFirst, Bridgevine can quickly identify and solve for the root cause of any application or operational issue—whether it is in-house or from a third party service,” said David Roth, CEO of AppFirst. “The end result is a significantly improved customer experience and an improved bottom line.”

With AppFirst, organizations can see every interaction between the operating system and application components, regardless of when they occur, without impacting application performance. Installed in seconds, AppFirst’s patented collection technology eliminates the need for multiple agents or byte-code injection. All information is aggregated and time-synchronized in a big data platform to provide collaborative visibility across all groups. This delivers multiple benefits such as:

  • Dynamic topology mapping of applications and their associated server infrastructure.
  • Visualization of network data and transaction flows, along with their low level event details.
  • Deterministic information for capacity management, improved asset utilization and more.

About Bridgevine

Bridgevine, Inc. is the leading provider of customer acquisition and retention solutions to enterprise customers. The company’s SaaS-based platform is used by the leading companies in communications, entertainment, home security, and energy resulting in 50+ million consumer interactions annually. For more information, visit www.bridgevine.com.

Announcing The Enterprise-Grade Platform & Predictive Analytics Partnership

Today we are unveiling the enterprise version of the AppFirst platform with the option to securely run on-premise, in the cloud or in hybrid scenarios. This next-generation of our platform comes with many other exciting announcements, including a partnership with Accretive to deliver a predictive analytics offering to help enterprises expose, anticipate and intervene in IT risks across organizational silos. This partnership, coupled with AppFirst’s enterprise-grade platform, provides unparalleled insight and predictive capabilities across all layers of an organization’s application stack.

Starting with Accretive, over the coming weeks and months we will announce a series of enterprise focused applications leveraging this platform, consisting of those developed by AppFirst, various partners and the enterprise community.

The Platform: A Universal Timeline of Events

AppFirst’s enterprise platform represents a paradigm shift from traditional monitoring and logging tools. It establishes a complete and universal timeline of events across an enterprise application environment – at a sub-millisecond level. This includes every application call, system event, log file entry, configuration change, third party application or custom code event, and data from thousands of plug-ins. This is accomplished through a patented data collection and aggregation methodology that provides an unmatched level of visibility – all correlated into a universal timeline that can live in the cloud, on-premise or in a hybrid environment. Armed with this platform, enterprises are now able to achieve granular and continuous visibility between IT and the business.

“AppFirst is redefining what’s possible for executives looking to optimize the technology that runs their business. And it starts with setting a single sub-millisecond timeline across the entire enterprise stack. Unless you start with perfect data, you cannot achieve true visibility,” said David Roth, CEO and Co-Founder of AppFirst. “We are applying deep predictive analytics and applications on top of the most detailed data in the world.  Taken together, this finally offers CIOs the ability to truly manage cost, risk and capacity without fingerpointing or guessing.”

New capabilities of the AppFirst platform include:

  • Real-time service topology mapping
  • Predictive analytics
  • On-premise deployments

Real-time service topology mapping

Customers now have the ability to auto-discover and map their service and application topologies in real-time. AppFirst auto-installs under new custom or third-party application components as they come into your environment. Therefore, you’ll always have an up-to-date view of what is running and how it is running across your enterprise.


Predictive analytics

By partnering with Accretive to update predictive analytic models in real-time, IT and business leaders can identify system limits, predict issues across the enterprise, and perform what-if analysis on how changing complexity will impact performance quality and cost.

“During the past 12-18 months, the use cases for our platform have extended far beyond incident response and troubleshooting,” continued Roth. “Our partnership with Accretive is the direct result of enterprise organizations looking to holistically optimize the business from the top – all the way down to the database or web server. We are now able to reset the standard for how organizations bridge business to IT.”

“Only the combination of AppFirst and Accretive provides business leaders with the real-time and forward-thinking visibility required for true IT continuity and optimization,” said Dr. Nabil Abuelata, CEO and Founder of Accretive. “Unlike traditional data warehousing and big data analytics technologies, it’s now possible to feed and update predictive models in real-time with a unified dataset to understand what is happening now and in the future. This capability delivers tremendous value to both the core business and IT operations. We are excited to work with AppFirst to change the way enterprises think about optimizing the entire technology chain that supports their business.”

On-premise deployments

In addition to the core SaaS & Private SaaS offering, customers can easily deploy AppFirst behind their firewall. This is critical for enterprise customers in highly regulated environments and in countries where data needs to securely reside within their borders.

The AppFirst enterprise platform is now available for customers worldwide. To learn more:

Chat with a Field Engineer or Sign up for a 30 day free trial

View your Server Data faster

Many users loved the level of detail at which they could view server data in the old version of our product – they could quickly choose a server from a list, and immediately see a wide variety of information about processes, logs, metrics, and alerts for that particular server.

We’ve brought back many of the features of the old servers page into our new user interface, making it even faster to view important server metrics and data.

Server metrics, log messages, triggered alerts, and polled data are organized into tabs. We’ve also added a new section – Collector Info – which tells you which collector version the server is running and when the collector last uploaded data.

You can use your left and right arrow keys to quickly switch between tabs.

The dropdown above the list of servers lets you filter servers by running/not running, as well as by any server tags or server groups that are configured. You can use your up and down arrow keys to quickly cycle through the list of servers.

The new servers page is available today!

The Value of Context in Web Scale IT

Frequently in the world of monitoring we hear about the miraculous solutions provided by a single agent: how every problem can be solved with transaction tracing, user experience monitoring and understanding the performance of the software pipeline as it applies to business.

All of these are great starting points, but today I’ll go through the case where the software pipeline is affected by the environment in which it lives, and application monitoring only shows the symptoms.

For example, my dashboard this morning:

AppFirst Dashboard Frontend Response Time Spike

The specific item in question, “Frontend Response Time” has a high value for a period this morning.

Frontend response time spike

And as I look into this value, it becomes evident that this is being caused by something external to the software stack.

As a quick background, the eCommerce solution I’m running (Magento) uses files to keep track of each user as they access the site. This is a normal activity that isn’t excessively risky. However, it leaves thousands of session files behind, which quickly eat up all available inode space if no action is taken.

Session files eat up all available inode space if no action is taken

Now, performing normal maintenance, removing these excessive files is easy enough. However, a simple ‘rm -f’ command is insufficient (too many items). This is easily remedied with xargs:

Removing excessive files with xargs
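A plain rm with a wildcard fails here because the shell expands the pattern into more arguments than a single command can accept. The command from the original screenshot is not reproduced in the text, so the following is only a sketch of the general find and xargs technique; the function name purge_files, the session directory and the sess_* pattern in the usage note are assumptions, not the author's exact command.

```shell
# Delete regular files in one directory that match a name pattern.
# find streams the names instead of expanding a huge glob, and xargs
# packs them into as many rm invocations as needed.
# Usage: purge_files DIRECTORY PATTERN
purge_files() {
    dir=$1
    pattern=$2
    # -print0 / -0 keep file names with spaces or newlines intact.
    find "$dir" -maxdepth 1 -type f -name "$pattern" -print0 | xargs -0 rm -f
}
```

For example, purge_files /path/to/magento/var/session 'sess_*' would clear session files from that (hypothetical) directory while leaving everything else in place.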

However, execution like this takes time. Time that we can see with the data we’ve collected from the system, and graphed in-line with the web stack performance data.

System and web stack performance graph

The blue line is the ecommerce response time, the purple line is the system disk utilization, and the orange line is the number of files being accessed by xargs. You can see the clear correlation between the file removal and the response time, as well as the reduction in storage space. As soon as the files have been removed, the response time recovers.

If you’d like to stop treating symptoms and see how the environmental context affects your business, sign up for a free 30 day trial at http://www.appfirst.com/signup/.