Commit 40115205 authored by Blerim Sheqa
Migrate 'Service Monitoring' part 1

parent e51c75d4
# General Monitoring
Everything that does not fit into a more specific category.
# check_mathlm
**Info:** If you have questions, get in touch with the author.
http://cf.lehigh.edu/~kbe2/check_mathlm
```
#!/usr/bin/perl
# plugin to check Mathematica's license manager. requires their 'monitorlm' binary...
# I have no idea where you can get this. Ask whoever set up the license server.
#
# If you improve on this, please send me a copy: keith@lehigh.edu
#
# Let's call this v0.1, 4/2012
use Nagios::Plugin;
$monlm = "/usr/local/bin/monitorlm";
$np = Nagios::Plugin->new(
    usage => "Usage: %s [-v|--verbose] -H|--host <host> [-t|--timeout <timeout>]",
);
$np->add_arg( spec => 'host|H=s', help => "--host|-H\n host name", required => 1);
$np->getopts;
alarm $np->opts->timeout;
$host = $np->opts->host;
$verbose = $np->opts->verbose;
open(MON, "$monlm $host 2>&1 |")
    or $np->nagios_exit(UNKNOWN, "Could not run $monlm: $!");
#hash reference keyed by product name: $lic->{product}->{auth|used}
$lic = {};
while (<MON>) {
    if ($verbose) {
        print $_;
    }
    #for "Could not contact MathLM server...", "Could not find a MathLM server", etc.
    if ($_ =~ /^Could not/) {
        #BAIL OUT, BAIL OUT
        $np->nagios_exit(CRITICAL, $_);
    }
    #disgusting regex to pluck out the license counts
    if ($_ =~ /^((?:Sub )?Math(?:ematica|Kernel))\s*\w\s*(\d+)\s*(\d+)/) {
        $lic->{$1}->{auth} = $3;
        $lic->{$1}->{used} = $2;
    }
}
close MON;
$status = OK;
#this loop could/should probably use the add_message() functionality, but it's marked "experimental"...
while( ($prod, $count) = each %$lic ) {
    if ($verbose) {
        print "found: $prod, licenses: $count->{'used'} / $count->{'auth'}\n";
    }
    if ($count->{'used'} >= $count->{'auth'}) {
        $status = WARNING;
        $statmsg .= "All $prod licenses in use. ";
    }
}
if ($status != OK) {
    $np->nagios_exit($status, $statmsg);
}
#only return OK if we actually found some licenses...
if ( keys( %$lic ) ) {
    $np->nagios_exit(OK, "License server okay.");
}
#otherwise, I have no idea what the status is
$np->nagios_exit(UNKNOWN, "License server responded, but I didn't find any usage numbers?");
```
# High Availability Monitoring
* [check_drbd](01_05_01_check_drbd.md)
* [check_heartbeat](01_05_02_check_heartbeat.md)
* [check_pacemaker](01_05_03_check_pacemaker.md)
* [check_master](01_05_04_check_master.md)
* [check_corosync](01_05_05_check_corosync.md)
# check_drbd
## Know How
### Guides
* http://blog.simon-meggle.de/tutorials/nagiosomd-cluster-mit-pacemakerdrbd-teil1/
* http://blog.simon-meggle.de/tutorials/nagiosomd-cluster-mit-pacemakerdrbd-teil2/
* http://blog.simon-meggle.de/tutorials/nagiosomd-cluster-mit-pacemakerdrbd-teil3/
* http://blog.simon-meggle.de/tutorials/nagiosomd-cluster-mit-pacemakerdrbd-teil4/
* http://blog.simon-meggle.de/tutorials/nagiosomd-cluster-mit-pacemakerdrbd-teil5/
* http://blog.simon-meggle.de/tutorials/nagiosomd-cluster-mit-pacemakerdrbd-teil6/
### Scripts
* Perl, nsca/nrpe:
    * http://www.monitoringexchange.org/inventory/Check-Plugins/Operating-Systems/Linux/check_drbd
* Perl, ssh:
    * http://www.monitoringexchange.org/inventory/Check-Plugins/Operating-Systems/Linux/check_drbd2
* Perl, rather old:
    * http://www.monitoringexchange.org/inventory/Check-Plugins/Software/Clusters/check_drbd-pl
## Installation
### check_drbd
```
# cd /usr/local/
# mkdir -p icinga/libexec ; cd icinga/libexec
# wget https://www.monitoringexchange.org/attachment/download/Check-Plugins/Operating-Systems/Linux/check_drbd/check_drbd
# chmod +x *
```
### check_drbd-pl
```
# cd /usr/local/
# mkdir -p icinga/libexec ; cd icinga/libexec
# wget http://doguet.com/pub/centreon/nagios_libs.pm
# wget https://www.monitoringexchange.org/attachment/download/Check-Plugins/Software/Clusters/check_drbd-pl/check_drbd-pl
# vim check_drbd-pl
#use lib "/app/scripts/nagios";
use lib "/usr/local/icinga/libexec";
# chmod +x *
```
## Tests
### check_drbd
```
pgbaer1:/usr/local/icinga/libexec# ./check_drbd
DRBD OK: Device 0 Primary Connected UpToDate
```
### check_drbd-pl
```
pgbaer1:/usr/local/icinga/libexec# ./check_drbd-pl --role Primary/Secondary
CRITICAL Can't open /tmp/drbdout
```
/tmp/drbdout does not exist?! => **broken**
# check_heartbeat
## Know How
* http://www.monitoring-portal.org/wbb/index.php?page=Thread&postID=88869#post88869
* http://www.monitoring-portal.org/wbb/index.php?page=Thread&postID=107708#post107708
* http://www.monitoring-portal.org/wbb/index.php?page=Thread&postID=90634#post90634
Also available in check_mk, but it cannot be queried via SNMP (and we do not want to install yet another agent when NRPE and SNMP are already in place).
```
# vim /usr/share/check_mk/checks/heartbeat_nodes
# vim /usr/share/check_mk/checks/drbd
```
* http://www.randombugs.com/linux/howto-monitor-linux-heartbeat-snmp.html
* http://www.freebsdcluster.org/~lasse/files/LINUX-HA-MIB.mib
* Perl, nsca/nrpe:
    * http://www.monitoringexchange.org/inventory/Check-Plugins/Operating-Systems/Linux/check_heartbeat_link
* Perl, hp-ux link:
    * http://www.monitoringexchange.org/inventory/Check-Plugins/Software/Clusters/Snmp-Heartbeat-check
* Bash, SNMP:
    * http://www.monitoringexchange.org/inventory/Check-Plugins/Software/Clusters/Snmp-Heartbeat-check
* Bash, SNMP:
    * http://www.monitoringexchange.org/inventory/Check-Plugins/Software/Clusters/Check_snmp_heartbeat_resources
* More and Alternatives:
    * http://exchange.nagios.org/index.php?option=com_mtree&task=search&Itemid=74&searchword=heartbeat
## Installation
Same procedure as for [DRBD](01_05_01_check_drbd.md):
```
# cd /usr/local/
# mkdir -p icinga/libexec ; cd icinga/libexec
# wget http://wiki.debuntu.org/w/images/b/b4/Check_heartbeat.sh -O check_heartbeat.sh
# wget https://www.monitoringexchange.org/attachment/download/Check-Plugins/Operating-Systems/Linux/check_heartbeat_link/check_heartbeat_link
# chmod +x check_heartbeat.sh check_heartbeat_link
```
### SNMP
```
# cd /opt/nagios/libexec/
# wget https://www.monitoringexchange.org/attachment/download/Check-Plugins/Software/Clusters/Snmp-Heartbeat-check/check_snmp_heartbeat.sh
# chmod +x check_snmp_heartbeat.sh
```
```
# vim /etc/default/snmpd
# snmpd options (use syslog, close stdin/out/err).
#SNMPDOPTS='-Lsd -Lf /dev/null -u snmp -I -smux -p /var/run/snmpd.pid 127.0.0.1'
SNMPDOPTS='-Lsd -Lf /dev/null -u snmp -I -smux -p /var/run/snmpd.pid'
```
For hbagent to connect as an AgentX subagent, snmpd must be configured as the AgentX master in `snmpd.conf`:
```
master agentx
trap2sink localhost
```
Restart snmpd:
```
# /etc/init.d/snmpd restart
```
Configure hbagent to respawn in /etc/heartbeat/ha.cfg:
```
# SNMP subagent: respawn hbagent
respawn root /usr/lib/heartbeat/hbagent -d
```
See also http://books.google.at/books?id=1PEqrGYcGBgC&pg=PT232&lpg=PT232&dq=hbagent+respawn&source=bl&ots=OWzOZTZY6X&sig=h_Rm9Tr3nB1C7BfZVSxVUgGzE6Y&hl=de&ei=xgXATb-GOoTvsgbku9zDBQ&sa=X&oi=book_result&ct=result&resnum=2&ved=0CCgQ6AEwAQ#v=onepage&q=hbagent%20respawn&f=false
## Tests
### SNMP
```
# snmpwalk hostname -v 2c -c xxxx .1.3.6.1.4.1.4682.1
SNMPv2-SMI::enterprises.4682.1.1.0 = Counter32: 3
SNMPv2-SMI::enterprises.4682.1.2.0 = Counter32: 2
SNMPv2-SMI::enterprises.4682.1.3.0 = INTEGER: 3
SNMPv2-SMI::enterprises.4682.1.4.0 = Counter32: 1
```
### check_snmp_heartbeat.sh
```
# ./check_snmp_heartbeat.sh -H host -C xxxxx
HEARTBEAT lost some nodes !
```
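The "lost some nodes" message fits the snmpwalk output from the SNMP test if `.4682.1.1.0` is the total node count (3) and `.4682.1.2.0` is the live node count (2). That reading of the OIDs is an assumption, not something this page confirms, so verify it against the LINUX-HA-MIB linked under Know How. Under that assumption, a minimal sketch of such a comparison could look like this. The function name `heartbeat_node_state` and the CRITICAL severity are made up for illustration; this is not the code of the downloaded plugin.

```shell
#!/bin/bash
# Sketch only: compare total vs. live node counts taken from snmpwalk output.
# ASSUMPTION: enterprises.4682.1.1.0 = total nodes, enterprises.4682.1.2.0 = live nodes.

# Reads snmpwalk output on stdin, prints a status line, returns a Nagios exit code.
heartbeat_node_state() {
    local line oid value total="" live=""
    while IFS= read -r line; do
        oid=${line%% = *}      # text before " = "
        value=${line##*: }     # text after the last ": "
        case "$oid" in
            *enterprises.4682.1.1.0) total=$value ;;
            *enterprises.4682.1.2.0) live=$value ;;
        esac
    done
    if [ -z "$total" ] || [ -z "$live" ]; then
        echo "UNKNOWN: node counters not found"
        return 3
    fi
    if [ "$live" -lt "$total" ]; then
        echo "CRITICAL: $live of $total nodes alive"
        return 2
    fi
    echo "OK: $live of $total nodes alive"
    return 0
}

# Usage (same snmpwalk call as in the SNMP test):
#   snmpwalk hostname -v 2c -c xxxx .1.3.6.1.4.1.4682.1 | heartbeat_node_state
```

The function only reads text on stdin, so it can be tried against saved snmpwalk output without touching the cluster.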
### check_snmp_heartbeat_resources
**Warning:** does not work properly
```
# ./check_snmp_heartbeat_resources -H host -C xxxx -R 2
```
### check_heartbeat_link
```
# ./check_heartbeat_link
Heartbeat Link OK: host:eth1:up 1.2.3.4:1.2.3.4:up
```
# check_pacemaker
## Know How
> crm configure show: dump current config to stdout
>
> crm configure save <file>: dump current config into file
## Checks
The output of `crm_mon` can be used for monitoring:
```
# crm_mon --help
```
### check_crm
Base Check.
http://exchange.nagios.org/directory/Plugins/Clustering-and-High-2DAvailability/Check-CRM/details
[check_crm_v0_5](attachments/check_crm_v0_5)
```
# visudo
nagios ALL=NOPASSWD: /usr/sbin/crm_mon -1 -r -f
```
### check_pcmkactions
Checks for failed actions.
http://www.s-reimann.com/2011/06/30/nagios-check-pacemaker-failed-actions/
```
#!/bin/bash
#
# title: check_pcmkactions - check for failed pacemaker actions
# date created: Tue Jan 25 2011
# last edit: Thu Jun 30 2011
# author: Sascha Reimann
# changelog: - crm_mon, awk & grep put into variables.
# nagios returncodes:
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3
crm_mon="/usr/sbin/crm_mon"
awk="/usr/bin/awk"
grep="/bin/grep"
# check for failed actions and set $STATUS
$crm_mon --one-shot | $grep --quiet "Failed"
STATUS=$?
# generate output:
if [ ${STATUS} -eq 0 ]
then
    DETAILS=$($crm_mon --one-shot | $awk '/Failed/ {f=1}f' | $grep --invert-match Failed)
    COUNT=$($crm_mon --one-shot | $awk '/Failed/ {f=1}f' | $grep --invert-match --count Failed)
    echo "CRITICAL: ${COUNT} failed action(s): ${DETAILS}"
    exit ${STATE_CRITICAL}
elif [ ${STATUS} -eq 1 ]
then
    echo "OK: no failed actions found"
    exit ${STATE_OK}
else
    echo "UNKNOWN: returncode ${STATUS}"
    exit ${STATE_UNKNOWN}
fi
```
### check-pacemaker-config
Passive check, run via cron job.
#### Prerequisites
#### Cachedir
`# mkdir /var/cache/pacemaker`
#### Template
A mandatory template serves as the reference that the current configuration is diffed against.
```
/var/cache/pacemaker/pacemaker-config.template
```
Save the current config as the template:
```
# crm configure save /var/cache/pacemaker/pacemaker-config.template
```
#### Setup
* dump the current pacemaker config
* diff it against the template
* if there is a difference, raise an alarm
* extract the changed lines from the diff and count them
* if no diff is possible (no dump), exit with UNKNOWN
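The compare step from the list above can be sketched on its own, without `crm` or `send_nsca`. This is only an illustration in Bash, not the Perl script used on this page; the function name `compare_config` is hypothetical, and the output strings simply mirror the ones the Perl script builds.

```shell
#!/bin/bash
# Sketch of the compare step only (no crm dump, no send_nsca).
# compare_config TEMPLATE CURRENT
#   prints "output|perfdata" and returns the matching Nagios exit code.
compare_config() {
    local template=$1 current=$2 changed
    # Without both files there is nothing to diff.
    if [ ! -r "$template" ] || [ ! -r "$current" ]; then
        echo "UNKNOWN - no diff possible|lines=0"
        return 3
    fi
    # Count only the changed lines ("<" or ">"), not diff's hunk headers.
    # "|| true" keeps the pipeline harmless when grep finds nothing.
    changed=$(diff "$template" "$current" | grep -c '^[<>]' || true)
    if [ "$changed" -gt 0 ]; then
        echo "CRITICAL - config differs ($changed)!|lines=$changed"
        return 2
    fi
    echo "OK|lines=0"
    return 0
}

# Usage, with the paths from this page:
#   compare_config /var/cache/pacemaker/pacemaker-config.template /var/cache/pacemaker/pacemaker-config
```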
#### Code
`# vim /usr/local/bin/check-pacemaker-config`
```
#!/usr/bin/perl
#working dir
my $cachedir = '/var/cache/pacemaker';
#save this once with # crm configure save /var/cache/pacemaker/pacemaker-config.template
my $templatecfgfile = $cachedir.'/'.'pacemaker-config.template';
#current dumped config
my $cfgfile = $cachedir.'/'.'pacemaker-config';
#nsca information
my $icinga_nscahost = 'icingahost';
my $icinga_nscacfg = '/etc/send_nsca.cfg';
#icinga information
my $icinga_host = 'host';
my $icinga_svc = 'db-pacemaker-config';
my $icinga_status = 0; # OK
my $icinga_output = "OK";
my $icinga_perfdata = "lines=0";
my $icinga_longoutput = "";
my $icinga_complete_output = "";
my $lines = 0;
#check if there was a previous leftover
if(-e $cfgfile) {
    unlink($cfgfile);
}
#dump it
`/usr/sbin/crm configure save $cfgfile`;
#check if the dump happened and then diff, bail out unknown if not
if(-e $cfgfile) {
    #diff it to the template, e.g. gotten from git
    my $diff = `diff $templatecfgfile $cfgfile`;
    #do something if the diff is there
    if ($diff ne "") {
        #print "diff is $diff";
        # FIXME do something with longoutput - current NSCA does not support multiline
        #keep only the changed lines ("<" / ">"); filtered in Perl so the diff text never passes through a shell
        $icinga_longoutput = join("", grep { /^[<>]/ } split(/^/, $diff));
        $lines = $icinga_longoutput =~ tr/\n//;
        $icinga_status = 2; # CRITICAL
        $icinga_output = "CRITICAL - config differs ($lines)!";
        $icinga_perfdata = "lines=$lines";
        # FIXME - escape for send_nsca
        #$icinga_longoutput =~ s/\R//g;
        $icinga_longoutput = "";
    }
} else {
    $icinga_status = 3; # UNKNOWN
    #print "could not diff this time";
    $icinga_output = "UNKNOWN - no diff possible";
    $icinga_perfdata = "lines=0";
    $icinga_longoutput = "";
}
$icinga_complete_output = "$icinga_output|$icinga_perfdata\n$icinga_longoutput";
`echo "$icinga_host\t$icinga_svc\t$icinga_status\t$icinga_complete_output\n" | /usr/sbin/send_nsca -H $icinga_nscahost -c $icinga_nscacfg`;
print "sending the following to send_nsca...\n";
print "$icinga_host\t$icinga_svc\t$icinga_status\t$icinga_complete_output\n";
```
`# chmod +x /usr/local/bin/check-pacemaker-config`
#### Crontab
```
# vim /etc/crontab
# db-pacemaker-config check
0,15,30,45 * * * * root perl /usr/local/bin/check-pacemaker-config >> /dev/null 2>&1
```
#### Icinga
```
define service {
use passive-reports-template
host_name host
service_description db-pacemaker-config
freshness_threshold 7200 ; 2h, the value is given in seconds
check_command db-pacemaker-config!No Report Received!
}
```
# check_master
**Warning:** A loose collection of failover redundancy scripts. Incomplete.
> Add the Icinga slave script. This script (to run as cron task) checks for Icinga master status.
>
> If it's not OK twice, then the failover instance is enabled.
>
> On the other hand, while failover is running, if master goes back to life, failover is disabled again.
>
> This allows auto recovery of Icinga monitoring.
http://svn.reactos.org/svn/project-tools/trunk/nagios/check_master.sh
```
#!/bin/bash
#
# File: check_master.sh
# Author: Pierre Schweitzer <pierre@reactos.org>
# Created: 18 Jun 2012
# Licence: GNU GPL v2 or any later version
# Purpose: Script to check for master status. In case it is down
# the slave is activated to take over.
#
MAIL=root
STATUS_FILE=/var/lib/icinga/rw/slave.status
CMD_FILE=/var/lib/icinga/rw/icinga.cmd
NRPE=/usr/lib/nagios/plugins/check_nrpe
MASTER=
CHECK=check_icinga
OLD_STATUS=`cat $STATUS_FILE`
# Status can be:
# 0: Not running
# 1: Running
# 2: About to run
# 3: About to stop
$NRPE -H $MASTER -c $CHECK 1>/dev/null 2>&1
STATUS=$?
if [ $STATUS -ge 2 ]; then
    if [ $OLD_STATUS -eq 2 ]; then
        NOW=`date +%s`
        # Reenable checks
        echo "[$NOW] START_EXECUTING_HOST_CHECKS;" >> $CMD_FILE
        echo "[$NOW] START_EXECUTING_SVC_CHECKS;" >> $CMD_FILE
        echo "[$NOW] ENABLE_NOTIFICATIONS;" >> $CMD_FILE
        echo "Slave taking over" | mail -s "Icinga: master down" $MAIL
        echo 1 > $STATUS_FILE
    elif [ $OLD_STATUS -eq 3 ]; then
        echo 1 > $STATUS_FILE
    elif [ $OLD_STATUS -eq 0 ]; then
        echo 2 > $STATUS_FILE
    fi
elif [ $STATUS -eq 0 ]; then
    if [ $OLD_STATUS -eq 3 ]; then
        NOW=`date +%s`
        # Disable checks
        echo "[$NOW] STOP_EXECUTING_HOST_CHECKS;" >> $CMD_FILE
        echo "[$NOW] STOP_EXECUTING_SVC_CHECKS;" >> $CMD_FILE
        echo "[$NOW] DISABLE_NOTIFICATIONS;" >> $CMD_FILE
        echo "Slave going back to sleep" | mail -s "Icinga: master up" $MAIL
        echo 0 > $STATUS_FILE
        # Reset Icinga state
        /etc/init.d/icinga reload
    elif [ $OLD_STATUS -eq 2 ]; then
        echo 0 > $STATUS_FILE
    elif [ $OLD_STATUS -eq 1 ]; then
        echo 3 > $STATUS_FILE
    fi
fi
```
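The "not OK twice" behaviour comes from the two intermediate states (2 and 3) in the status file. The transitions of the script above can be restated as a side-effect-free function, which makes them easier to read and to test. The function name `next_state` is hypothetical and not part of the original script; external commands, mail and the reload are deliberately left out.

```shell
#!/bin/bash
# Restates the status-file transitions of check_master.sh as a pure function.
# next_state OLD_STATUS MASTER_RC -> prints the new slave status
#   0: not running, 1: running, 2: about to run, 3: about to stop
next_state() {
    local old=$1 master_rc=$2
    if [ "$master_rc" -ge 2 ]; then        # master check failed (CRITICAL/UNKNOWN)
        case "$old" in
            0) echo 2 ;;        # first failure: about to run
            2) echo 1 ;;        # second failure: slave takes over
            3) echo 1 ;;        # master failed again while stopping: keep running
            *) echo "$old" ;;
        esac
    elif [ "$master_rc" -eq 0 ]; then      # master check OK
        case "$old" in
            1) echo 3 ;;        # first success: about to stop
            3) echo 0 ;;        # second success: slave goes back to sleep
            2) echo 0 ;;        # master recovered before the takeover
            *) echo "$old" ;;
        esac
    else
        echo "$old"             # WARNING (1): the script takes no action
    fi
}
```

For example, `next_state 0 2` prints `2` and a second failed run, `next_state 2 2`, prints `1`, which is the point where the original script enables checks and notifications.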
# check_corosync
## check_corosync_rings
### NRPE Script.
```
# wget "http://exchange.nagios.org/components/com_mtree/attachment.php?link_id=2397&cf_id=29" -O check_corosync_rings ; chmod +x check_corosync_rings
```
### sudo
```
# visudo
nagios ALL=NOPASSWD: /usr/sbin/corosync-cfgtool -s
```
### NRPE
```
command[check_corosync_rings]=/usr/lib/nagios/plugins/check_corosync_rings
```
### Icinga
```
##corosync
( ( "check_nrpe_command!check_corosync_rings", "corosync rings", True ) , [ "corosync" ], ALL_HOSTS ),
```
# Clustered Checks with Icinga and check_multi
## Setup
### NRPE
Set up NRPE as described in: Setting up NRPE with Icinga
### check_multi
Matthias Flacke created [check_multi](http://my-plugin.de/wiki/projects/check_multi/start), which supports multiple checks within a single check command.
We will use that capability to execute 1..n checks and calculate the overall returned state and output (and perfdata, where available).
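To make "calculate the overall returned state" concrete, here is a simplified illustration of worst-state aggregation over several plugin return codes. It is not check_multi's actual logic, whose state evaluation is configurable. The function name `worst_state` and the severity order CRITICAL > WARNING > UNKNOWN > OK are assumptions chosen for this sketch.

```shell
#!/bin/bash
# Sketch only: reduce several plugin return codes to one overall state.
# ASSUMED severity order: CRITICAL(2) > WARNING(1) > UNKNOWN(3) > OK(0).
# worst_state RC... -> prints the most severe return code (0 if none given).
worst_state() {
    local rc rank worst=0 worst_rank=0
    for rc in "$@"; do
        case "$rc" in
            0) rank=0 ;;
            3) rank=1 ;;
            1) rank=2 ;;
            2) rank=3 ;;
            *) rank=1; rc=3 ;;   # anything else is treated as UNKNOWN
        esac
        if [ "$rank" -gt "$worst_rank" ]; then
            worst_rank=$rank
            worst=$rc
        fi
    done
    echo "$worst"
}
```

With three child checks returning 0, 1 and 0, `worst_state 0 1 0` prints `1` (WARNING).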
#### Installation
Download check_multi and install it into the directory that `$USER1$` points to in resource.cfg:
```
$ git clone https://github.com/flackem/check_multi ; cd check_multi
$ ./configure --with-nagios-name=icinga --with-nagios-user=icinga --with-nagios-group=icinga
$ make all
$ sudo make install
```
**Info:** If you want to use a dedicated plugins path, add **--libexecdir=/usr/lib/nagios/plugins** to configure.
**Info:** With Debian packages it is easier, provided backports are enabled as described in [Setting up Icinga with IDOUtils on Debian](../installation-guides/01_01_setting_up_icinga_with_idoutils_on_debian.md):
`# apt-get install nagios-plugin-check-multi`
#### Configuration
**Note:** Paths may differ depending on your installation! The paths shown use the default prefix, i.e. configure was run without a prefix parameter.
##### Checks
On the remote nodes you will need:
* plugins, e.g. check_ping
* a command definition in nrpe.cfg:
    * `command[check_ping]=/usr/lib/nagios/plugins/check_ping -H $ARG1$ -w $ARG2$ -c $ARG3$ -p $ARG4$ -4 -t $ARG5$`
* a running NRPE server accepting connections from the Icinga master
* `dont_blame_nrpe=1` in nrpe.cfg (enables command arguments)

On the Icinga master you will need:
* NRPE client
* check_multi
* plugins for local checks
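As an illustration of how the five arguments of the `check_ping` command definition above get filled, the sketch below prints a matching `check_nrpe` call. Everything after `-a` is handed to the remote node as `$ARG1$` to `$ARG5$`, in order. The node name, target address, thresholds and the helper name `nrpe_ping_cmd` are placeholders invented for this example, and the `check_nrpe` path is an assumption that depends on your installation.

```shell
#!/bin/bash
# ASSUMPTION: location of check_nrpe; adjust to your installation.
NRPE=${NRPE:-/usr/lib/nagios/plugins/check_nrpe}

# nrpe_ping_cmd NODE TARGET WARN CRIT PACKETS TIMEOUT
# Prints (does not run) the check_nrpe command line; the five values
# after -a become $ARG1$..$ARG5$ of command[check_ping] on the remote node.
nrpe_ping_cmd() {
    local node=$1
    shift
    printf '%s -H %s -c check_ping -a' "$NRPE" "$node"
    printf ' %s' "$@"
    printf '\n'
}

# Placeholder values: ping 192.0.2.10 from node1.example.org,
# warn at 100 ms / 20% loss, critical at 500 ms / 60% loss, 5 packets, 10 s timeout.
nrpe_ping_cmd node1.example.org 192.0.2.10 100.0,20% 500.0,60% 5 10
```

Running the printed command by hand from the Icinga master is a quick way to confirm that argument passing works before wiring the check into check_multi.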
###### check_multi command definition
Define the check_multi check commands before using them in service definitions. check_multi must be told which macros to use; they are passed in with `-s`. The command also sets a default directory for the `*.cmd` files named in `$ARG1$`.
**Info:** The Debian package installs the plugin to `/usr/lib/nagios/plugins/check_multi`.
```
define command {
command_name check_multi
command_line /usr/local/icinga/libexec/check_multi -l /usr/local/icinga/libexec -s HOSTADDRESS=$HOSTADDRESS$ -s HOSTADDRESS6=$HOSTADDRESS6$ -f /usr/local/icinga/etc/$ARG1$ $ARG2$ $ARG3$ $ARG4$
}