A customer of ours recently told us about their use case for extending the AppFirst collector to support the data center team. Ops was already using AppFirst’s platform to collect, aggregate and correlate massive amounts of application & operations data for monitoring & troubleshooting. However, there was a glaring concern about how long it took the data center guys (and gals) to log into their own system and check the current status and history of their hardware data.
So they extended the AppFirst collector to capture IPMI sensor data and monitor the environmental trends of their physical hardware. With everything in one place, they were able to see key metrics from the physical hardware all the way up to business performance.
Specifically, the data center team needed to track a few key metrics and analyze them over time to see whether any component was trending toward failure:
Fans: Are the system’s fans about to fail, working too hard or experiencing degradation?
Temperature: Is the system temperature anything but stable?
Voltage: Has a power supply gone bad? Of particular interest is the Power Good signal, which the power supply asserts once its output voltages are in range, so that the computer does not attempt to operate on improper voltages and damage itself (1).
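One simple way to make "trending toward failure" concrete is to fit a straight line to recent readings and look at its slope. The sketch below is illustrative only: the readings, the threshold and the function names are hypothetical, and it is not a description of how AppFirst's tools compute trends.

```python
from statistics import mean


def slope(readings):
    """Least-squares slope of the readings against their sample index."""
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two readings")
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(readings)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, readings))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def is_degrading(readings, threshold):
    """True when readings fall faster than `threshold` units per sample."""
    return slope(readings) < -threshold


# Hypothetical fan speeds (RPM), one per polling interval.
fan_rpm = [2156, 2150, 2141, 2130, 2122, 2109]
print(slope(fan_rpm), is_degrading(fan_rpm, threshold=5.0))
```

A slope near zero suggests a stable sensor. A steadily negative slope on a fan, or a steadily positive one on a temperature sensor, is the kind of drift worth a closer look.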
How to get started
First, a little background. AppFirst’s collector supports the ingestion of several types of additional data: logs, StatsD and polled data. Polled data comes from scripts that emit Nagios-format output, which lets you collect data from APIs, management interfaces, SNMP, IPMI or any other available source.
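To give a feel for what a polled-data script produces, here is a minimal sketch that prints one status line with metrics after a pipe character. The layout mirrors the check_ipmi_sensors sample output shown later in this post; the function name and the reading are hypothetical, and whether the collector accepts other layouts is not something this sketch assumes.

```python
# Conventional Nagios plugin exit codes, for reference.
EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def format_check(name, status, metrics):
    """Build 'name STATUS | label=value;label=value;' from a dict of metrics."""
    perfdata = "".join(f"{label}={value:.2f};" for label, value in metrics.items())
    return f"{name} {status} | {perfdata}"


# Hypothetical reading; a real script would query its data source here.
print(format_check("fan", "OK", {"System_Fan_1": 2156.0}))
```

A real plugin would normally also exit with the code matching its status, for example by passing EXIT_CODES["OK"] to sys.exit.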
Collecting IPMI Data
AppFirst supports server hardware equipped with a BMC (baseboard management controller), which exposes a variety of hardware sensor data that can be monitored using the IPMI standard. IPMI sensor data can be gathered in-band (through the host OS) or out-of-band (through a dedicated LAN connection independent of the system processor and host OS). A free tool, FreeIPMI, is available to examine IPMI data, and AppFirst has written a series of scripts that let you use AppFirst’s big-data store and visualization tools to easily monitor this data.
Here is an example that uses in-band communication (out-of-band works too). You can do this in two ways:
With polled data
With log data
To run either method, you will need to install FreeIPMI and OpenIPMI on your system.
You can build FreeIPMI from source, or you may be able to find an RPM package compatible with your OS.
Before moving any further along, make sure that you can execute the ipmi-sensors command. The output looks like this (but much longer):
[appfirst@servername: ~]$ sudo /usr/sbin/ipmi-sensors
ID | Name | Type | Reading | Units | Event
1 | Pwr Unit Status | Power Unit | N/A | N/A | 'OK'
2 | IPMI Watchdog | Watchdog 2 | N/A | N/A | 'OK'
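If you want to work with this table in a script of your own, the pipe-delimited layout is easy to split into records. The following is a minimal sketch that assumes the column layout shown above; the function name is hypothetical and the sample text is just the two rows from this post.

```python
def parse_ipmi_sensors(text):
    """Turn pipe-delimited ipmi-sensors output into a list of dicts."""
    lines = [ln for ln in text.splitlines() if "|" in ln]
    if not lines:
        return []
    header = [col.strip() for col in lines[0].split("|")]
    rows = []
    for ln in lines[1:]:
        cells = [cell.strip() for cell in ln.split("|")]
        if len(cells) != len(header):
            continue  # skip anything that does not match the header layout
        rows.append(dict(zip(header, cells)))
    return rows


sample = """\
ID | Name | Type | Reading | Units | Event
1 | Pwr Unit Status | Power Unit | N/A | N/A | 'OK'
2 | IPMI Watchdog | Watchdog 2 | N/A | N/A | 'OK'
"""
for row in parse_ipmi_sensors(sample):
    print(row["Name"], row["Event"])
```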
Polled Data Configuration:
Once you have ipmi-sensors working, you will need a Perl script, check_ipmi_sensors, to format the output to be Nagios-compatible. The required script is currently installed along with your collector, and it is located here:
If it is not there, it is available here: https://github.com/appfirst/nagios-plugins/blob/master/check_ipmi_sensors
You can test that it’s working by executing this command:
sudo /usr/share/appfirst/plugins/libexec/check_ipmi_sensors -t fan
and you should get output that looks like this:
fan OK | System_Fan_1=2156.00;System_Fan_2=2156.00;System_Fan_3=2156.00;
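For a quick sanity check of that line in a script, you can split it into the status text and the individual readings. This is a sketch that assumes exactly the layout shown above (label=value pairs separated by semicolons after the pipe), which is narrower than general Nagios performance data; the function name is hypothetical.

```python
def parse_check_output(line):
    """Split 'status text | label=value;label=value;' into (status, metrics)."""
    status, _, perfdata = line.partition("|")
    metrics = {}
    for item in perfdata.strip().split(";"):
        label, sep, value = item.partition("=")
        if not sep:
            continue  # empty trailing item, or something without a value
        metrics[label.strip()] = float(value)
    return status.strip(), metrics


line = "fan OK | System_Fan_1=2156.00;System_Fan_2=2156.00;System_Fan_3=2156.00;"
status, metrics = parse_check_output(line)
print(status, metrics)
```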
From your AppFirst UI, select Admin → Setup → Polled Data from the top menu, and locate your collector. Click on the Server Hostname and add lines similar to these to the config file:
command[sensor_fan]=/usr/share/appfirst/plugins/libexec/check_ipmi_sensors -t fan
command[sensor_temp]=/usr/share/appfirst/plugins/libexec/check_ipmi_sensors -t temperature
command[sensor_voltage]=/usr/share/appfirst/plugins/libexec/check_ipmi_sensors -t voltage
The -t parameter is the sensor type. To get a list of sensor types, execute:
sudo /usr/sbin/ipmi-sensors -L
Note that your hardware may not report all of these.
Save the file, and give the collector up to five minutes to pick up the new config and start polling the device. In the Correlate tool, you will soon find the new polled data commands available for display.
Log Data Configuration:
You can monitor the output of any command as log data by capturing its output, appending it to a file, and configuring the collector to monitor that file as a log file. One way to do that is to run the command as a cron job. You would also want to configure logrotate to keep the file from consuming too much disk space.
An alternative is to use our simple bash script, poll2log, and leave the storage to us. It simulates log rotation by storing only the output from one execution. This script should have been installed when you installed your collector, and should be located here:
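As a rough illustration of the do-it-yourself approach, the sketch below runs a command and appends each line of its output to a file with a timestamp. It is not poll2log and does no rotation; the function name is hypothetical, and the demo uses the Python interpreter as a stand-in command so that it runs on a machine without IPMI tools.

```python
import os
import subprocess
import sys
import tempfile
from datetime import datetime, timezone


def append_command_output(argv, log_path):
    """Run argv and append each line of its stdout to log_path with a timestamp."""
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(log_path, "a", encoding="utf-8") as log:
        for line in result.stdout.splitlines():
            log.write(f"{stamp} {line}\n")
    return result.returncode


if __name__ == "__main__":
    # Stand-in command and temporary file, so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.log")
        append_command_output([sys.executable, "-c", "print('sensor line')"], path)
        with open(path, encoding="utf-8") as log:
            print(log.read(), end="")
```

On a real host, cron would invoke a script like this with the ipmi-sensors command and the log path you plan to register with the collector.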
But if not, you can find it here: https://github.com/appfirst/nagios-plugins/blob/master/poll2log
Add a line like this to your crontab (probably as root):
*/5 * * * * /usr/share/appfirst/plugins/libexec/poll2log "/usr/local/sbin/ipmi-sensors"
From your AppFirst dashboard, select Administration | Logs, and click the Add Log button (upper left). Find your server in the pull-down list, set the Type to “File”, and the File Path to /var/log/ipmi-sensors.log. Save the configuration, and give the collector a few minutes to receive the configuration. In the Correlate tool, you will soon find the new log files that can be displayed.
You can also monitor IPMI sensor data from another machine that has a collector installed. This lets you see the sensor data when the target machine itself does not have a collector. You simply need to add the host information to the command line:
command[sensor_fan]=/usr/share/appfirst/plugins/libexec/check_ipmi_sensors -h 10.7.7.7 -s "-u username -p password" -t fan
Alternatively, the username can come from an environment variable. Set it with:
export IPMI_USER=user
and replace "-u username" with "-u $IPMI_USER" in the command.
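If you manage several hosts, you may prefer to generate these config lines rather than type them. The sketch below builds one line in the same shape as the example above, reading the credentials from environment variables. IPMI_USER comes from this post; IPMI_PASSWORD and the function name are hypothetical and used only for illustration.

```python
import os

PLUGIN = "/usr/share/appfirst/plugins/libexec/check_ipmi_sensors"


def remote_check_command(name, host, sensor_type, env=os.environ):
    """Build a polled-data config line for a remote host."""
    user = env["IPMI_USER"]
    password = env["IPMI_PASSWORD"]  # hypothetical variable name
    return (
        f"command[{name}]={PLUGIN} -h {host} "
        f'-s "-u {user} -p {password}" -t {sensor_type}'
    )


# Example values only; pass a dict here or export the variables beforehand.
demo_env = {"IPMI_USER": "username", "IPMI_PASSWORD": "password"}
print(remote_check_command("sensor_fan", "10.7.7.7", "fan", env=demo_env))
```

Keep in mind that the generated line contains the password in plain text, just as the hand-written example does.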
Links that may be of interest:
Intel BMC Web Console Users Guide: http://download.intel.com/support/motherboards/server/sb/intel_rmm4_ibwc_userguide_r2_72.pdf
(1) Wikipedia: http://en.wikipedia.org/wiki/Power_good_signal