Posted on 26/11/

I needed to render some RRD graphs from a Catalyst application. Before, I was using rrdcgi. Not that I couldn't use it together with WrapCGI, but I wanted to write the HTML templates in Template Toolkit (as always), because rrdcgi templating is not all that powerful.

So you get the RRDs Perl module on one side and Catalyst on the other, add a bit of glue, and there you have it: Catalyst::View::RRDGraph.

Just put the graph definition on the stash, and call the view. The view outputs images, so you can use them from an HTML page that you have templated in whatever language you want.

<Img src="[% c.uri_for('controller/that/uses/rrdgraph') %]">

As always, feedback is welcome

Working with Ton Voon

Posted on 26/11/

Ton Voon, CTO of Altinity, was at CAPSiDE last week. He was here doing joint development on Opsview and giving us an inside view of the bowels of the beast. I always say that to get involved with a project, having its source code is not enough. You have to have a "photo" of the project as a whole, and that is pretty hard to get, because most of the time it isn't documented anywhere. That's what Ton has given us. One day I'll blog about having that "photo" of a project...

I have to say I enjoyed Ton's stay here, and it was a great pleasure to work together. His technical skills and personal qualities (good communicator, ability to envision solutions that fit everyone's needs, willingness to work, etc.) made his stay here in Barcelona a very productive one.

Being able to see how other projects are managed gives you a view of how your own projects are managed too, and of the problems and considerations those projects have: ones you may not have today but could have someday, or lessons you can apply to yourself. On that side I'm very fond of how we are managing projects at CAPSiDE and think we are on the right path. Of course there is always room for improvement.

At CAPSiDE we are committed to contributing to Opsview, and will try to do our bit so that it can evolve into an even better monitoring solution than it already is.

Simple Cross-Domain Ajax Proxy

Posted on 26/11/

While developing a feature for one of our products we needed to retrieve pages from other domains via XMLHttpRequest from the browser. As you already know, browsers don't let you make cross-domain requests as a security measure, so you have to use a proxy on the same domain your application is running on.

There are a lot of ways of doing it, and I wanted one where I didn't have to install additional software and such. There are PHP proxies, Java proxies, etc. I didn't want to write a Perl proxy (just to not bloat the solution). There were people doing it with Apache (those I liked), but in an unmaintainable way (adding one configuration per domain to retrieve info from), and our application required data to be retrievable from any domain. So here is the recipe I whipped up:

<Proxy http>
    Order Deny,Allow
    Allow from all
</Proxy>

RewriteEngine on
RewriteRule ^/web-proxy/(.*)$ $1 [P]

Now you only have to make requests to http://webserver/web-proxy/DESIRED_URL. Please take into account that if you do not restrict the web-proxy location to authorized users, you have an open proxy (don't do that).

Opsview Single Sign-On coming soon

Posted on 26/11/

People have been asking on the list to be able to use the Single Sign-On feature implemented in Opsview to authenticate against an LDAP, for example.

I've been trying to get it working with the current codebase, but I'm sad to say that it's not ready yet. While looking through the code, I found a comment that resolved my doubts:

  # This setting of the user_exists means that Opsview is the central
  # login point, not the authticket
  # Maybe possible in future to allow a trust from the external source
  # so the user can be given from the auth ticket
I love code comments that really help you see the decisions that were made (those are good code comments), although this one was a bit of a show stopper :(

So... the current codebase can't trust a ticket generated by a 3rd party source. You CAN use the ticket generated by Opsview to authenticate on other sources, though, as it's fully valid.

I've contributed changes to Opsview that are awaiting review. These changes let the Catalyst framework (which Opsview uses) log in the user that is provided through a 3rd party ticket, so if everything goes well, I will be able to show you how to use the Single Sign-On to authenticate Opsview users in time for the next Opsview release (the article is half-written ;))


Posted on 26/11/

New blog. New blogging software. The last one was Movable Type. Not bad... It was written in Perl! ;) But it's not totally free, and I didn't want something that complicated (some features I didn't need, didn't want, or didn't want to spend time discovering). I had heard about a quite minimal blogging system, written in Perl, where almost all functionality is a plugin.

To create a blogging habit things have to be easy (for me). Now I can blog from almost anywhere. I can open an SSH session and blog from vi, or install a backend for mobile devices (I'm writing this article from my PDA... while sitting on the couch!).

The blog is going to start out minimal. You can imagine that from the styling... I don't even want comments for now (spam blog comments are horrible to cope with). If you have any comment or suggestion: drop me a mail at

Opsview custom SMS notifications

Posted on 26/11/

Opsview can now use custom SMS notification methods. I've prepared a mini-howto guide on how to use this feature. Please send in comments, corrections and suggestions. This article will be contributed to the Opsview docs, so all of us will benefit.

Configuring Opsview

Put your custom SMS notification script into


Remember to make it executable by the nagios user. See below for recommendations on how to develop the notification methods.

Sync the plugins to the slaves:


In Opsview interface

Go to: Advanced -> SMS Notifications -> Create new SMS Notification Methods

  • Name: give it an identifier (without spaces)
  • Run On:
    • Monitoring Server: The command will be run on the Master monitoring server. This is for scenarios where you have to notify from a special device, for example, that isn't available on the slaves. A cell phone attached via a serial cable, a server that is only accessible from the master, etc.
    • Slave: This means the command will be run on the slave that has detected the alert. This is for notification services that will not depend on the server that detected the alert, like an HTTP call to an SMS service.
  • Command: the name of the command in the /usr/local/nagios/libexec/notifications directory. Add extra parameters that are supposed to get to the script (parameters that nagios doesn't send you).

Go to: Advanced -> System Preferences, choose your new SMS method identifier, and submit the changes

Be sure to have a contact with the SMS number filled in (that will activate the SMS notifications for that contact). Note that the +CCNNNNNNNNN format is no longer enforced. In fact, no format is enforced, as it is the plugin's responsibility to verify that the number has the correct format for its use. Push the "send test SMS" link to try out your notification method.

Reload your Opsview configuration and you're running.

More help is available in Opsview docs.

Script guide

The script will receive the SMS number in the NAGIOS_CONTACTPAGER environment variable. In fact, it can play around with all the environment variables listed in the Nagios Macros Reference. Look in the Service Notifications and the Host Notifications columns.

Non-Nagios values can be passed as command parameters: things like --url_to_post_to, --serial-device-to-talk-to, --baud-rate, etc. You supply them when you define the "Command" of the new SMS method.

Do the notification magic, print a line of status to STDOUT to help out humans ;), and exit 0 on success, non-zero on failure.

Note: The Opsview 2.11 standard notification scripts relied on getting the SMS number via the command line with the -n parameter (if I remember correctly). In Opsview 2.12 they were changed to expect it through the environment variables.
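The contract described in this guide can be sketched in a few lines of core Perl. This is only an illustration, not one of the scripts shipped with Opsview: the notify function, the --logfile parameter and the log line format are made up for the example, and the check on the number is just one possible format rule. NAGIOS_CONTACTPAGER comes from the text above; the other NAGIOS_* names are the environment forms of standard Nagios macros.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical notification method: append the message to a log file
# instead of talking to a real SMS gateway.
# Returns (exit_code, status_line): 0 on success, non-zero on failure.
sub notify {
    my ($env, @args) = @_;

    # --logfile is an invented extra parameter, the kind of thing you
    # would add to the "Command" field of the SMS method.
    my $logfile;
    GetOptionsFromArray(\@args, 'logfile=s' => \$logfile)
        or return (2, 'could not parse parameters');
    return (2, 'no --logfile given') unless defined $logfile;

    # Opsview does not enforce a number format, so the script checks it.
    my $number = $env->{NAGIOS_CONTACTPAGER};
    return (2, 'no SMS number in NAGIOS_CONTACTPAGER')
        unless defined $number && length $number;
    return (2, "SMS number '$number' does not look like a phone number")
        unless $number =~ /^\+?\d+$/;

    # Build the text from Nagios macros; missing ones become "?".
    my $text = sprintf '%s: %s/%s is %s',
        map { defined $_ ? $_ : '?' }
        @{$env}{qw(NAGIOS_NOTIFICATIONTYPE NAGIOS_HOSTNAME
                   NAGIOS_SERVICEDESC NAGIOS_SERVICESTATE)};

    open my $fh, '>>', $logfile
        or return (1, "cannot open $logfile: $!");
    print {$fh} "to $number: $text\n";
    close $fh or return (1, "cannot write $logfile: $!");

    return (0, "notified $number");
}

# One line of status on STDOUT for the humans.
my ($code, $status) = notify(\%ENV, @ARGV);
print "$status\n";
# A real notification script would finish with: exit $code;
```

Keeping the work in a function that returns the exit code makes the script easy to try outside Nagios: set NAGIOS_CONTACTPAGER by hand, pass --logfile, and look at the status line.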

Power to the users

"Custom SMS notifications" really means "notify however you want... you have the control". That is, as long as you fill in the SMS number for a contact, the "SMS" notification will be called for it. You can write a log file instead of sending an SMS if you want... Opsview won't care }:)

Conference feedback

Posted on 26/11/

Sorry for the late post, but I've been quite busy after the Nagios Konferenz. I was preparing one macro-post with all the new things I learned, but I'll just split them so they get published quicker!

The conference was really good, and I met lots of people who use Nagios in some way, apart from the main developers and developers of 3rd party software based on Nagios. The conference was sold out, and it was a pleasure to attend. I hope to be there next time.

I attended:

  • Ethan Galstad: Nagios - Current State, Future Plans and Development Roadmap
  • Geert Vanderkelen: Monitoring MySQL
  • Stefan Kaltenbrunner: PostgreSQL Monitoring - Introduction, Internals And Monitoring Strategies for postgres
  • Ton Voon: An active check on the status of the Nagios Plugins
  • Satish Jonnavithula & Steven Neiman: Application Transaction Monitoring using Nagios
  • Malte Sussdorff: Integrating Nagios and ]project-open[
  • Tom De Cooman: Monitoring Tools Shootout
  • Julian Hein: FLEXible Realtime Graphing with the new NETWAYS Grapher v2

A big thank you to Netways for organizing this great event

Nagios::Plugin::DieNicely v0.02

Posted on 26/11/

Nagios::Plugin::DieNicely now lets you exit with the Nagios status that you like most. The feature was on the Todo list. Now that I'm confident that the tests pass on lots of different perls and platforms (thanks CPAN Testers!), that I have found out why there are some FAIL test results, and that there have been requests for it, I have decided to add the feature.

Compatibility should be assured (at least the test suite says so). If you use the module as in v0.01, the exit code will still be CRITICAL. But if you were not all that comfortable with CRITICAL, and you would like WARNINGs, now you can have them. Just:

use Nagios::Plugin::DieNicely qw/WARNING/;

You can pass in these identifiers:

  • CRITICAL: The default
  • WARNING: I suppose this one will be the most used...
  • OK: If you use this one, please comment why you would want to do so. I added it just in case someone would want it (I have no crystal ball to say that it isn't useful), and I have not been creative enough to find a real use.
  • UNKNOWN: The purpose of the module is to NOT get UNKNOWNs in Nagios. Why have you done this? Well... If you specify UNKNOWN, you will get the exception in the Nagios output (instead of lost in limbo).

Give it a ride!


Posted on 26/11/

I'm announcing the release of Test::SMTP. This module aims to provide a framework that makes SMTP server testing easy. We were doing SMTP testing with an instance of Net::SMTP plus Test::More methods to see if everything was as expected. All this logic has been encapsulated into Test::SMTP to make testing SMTP servers a little less of a pain.

Please note that this is a 0.01 version and is based on Net::SMTP as the client. Net::SMTP has its limitations as a client that should give the test full control. Don't get me wrong: as a "do the right thing for me when you can" client it's great. Try not to call Net::SMTP methods directly, as this class is a temporary bridge, there just so the testing framework can be evaluated by the community (release early, release often).

Things in Test::SMTP that need to be addressed in the future:

  • Test::SMTP can't simulate plain old (HELO) SMTP clients if the server supports ESMTP. The underlying Net::SMTP auto-negotiates EHLO/HELO when an instance is created.
  • The Net::SMTP supports method is called, although it is not documented in the Net::SMTP docs. Judging by its name it seems to be public :p
  • No STARTTLS support, because Net::SMTP doesn't support it
  • Auto selected AUTH. See Net::SMTP for supported AUTH methods and code for how it selects the auth


  • You can simulate multiple clients in the same test. Just call connect_ok more times and you obtain more clients.
  • Simulation of misbehaving clients is supported. Test::SMTP inherits from Net::SMTP, so you have access to the methods of IO::Socket::INET and Net::Cmd. Because of the automatic HELO/EHLO you can't issue commands before the HELO phase, though.
  • Mail addresses passed to the Net::SMTP methods to and mail are mangled by Net::SMTP to try to produce good commands for the server. This has been worked around by adding mail_from and rcpt_to methods, which issue the MAIL FROM and RCPT TO commands

Future plans are to implement a "don't do things automatically" client so you have all (or at least more) control over the client.

Introducing Catalyst::Authentication::Credential::Authen::Simple

Posted on 26/11/

Just got another module out!

This module isn't at all complicated. I'm even surprised that nobody has already written it! Authen::Simple is a great authentication framework (thanks to the excellent work of Christian Hansen). We've been using it at CAPSiDE for quite some time now, but we hadn't developed a Catalyst module for it because we normally use mod_auth_tkt, so our Catalyst apps aren't authenticating directly. I recall from the mailing lists the need for Catalyst apps to authenticate against external datastores, and a recent conversation with Ton Voon made me think that it was time to write the module so Catalyst can do fancy authentication.

Catalyst::Authentication::Credential::Authen::Simple is just glue between Authen::Simple and Catalyst. It reads the Catalyst app config, instantiates the appropriate Authen::Simple objects, and then just calls authenticate on the objects when you authenticate from within Catalyst.

It's that simple... Authen::Simple...

Opsview support for NagiosChecker

Posted on 26/11/

I've been using the Firefox NagiosChecker extension together with Opsview. I found this plugin because someone on the Opsview list asked if it was compatible with Opsview, and I tried it out. It worked well, aside from one little issue:

NagiosChecker authenticates with Basic HTTP Authentication, and Opsview doesn't like that. You configure NagiosChecker, and it doesn't work. Opsview needs a valid cookie to authenticate. If you log in to Opsview, you see NagiosChecker start to work. That's because Firefox stores the cookie needed to authenticate, and on the next request NagiosChecker makes, the cookie gets sent to Opsview!

So I only had to make NagiosChecker log in to Opsview the first time it requests the Nagios status screen. I added a checkbox to the NagiosChecker server setup screen so you can tell it that it's an Opsview server.

I've contributed the patch to the NagiosChecker project, but in the meantime, I've packaged NagiosChecker with my patch so you can try it out. Feedback is welcome ;)

Download the NagiosChecker with Opsview support. You can also patch your installation of NagiosChecker.

Be the ticket booth

Posted on 26/11/

Now that we know how mod_auth_tkt works, we are eager to implement our applications' authentication with it. The module never generates the ticket. Instead, ticket generation is delegated to the login URLs.

You can generate a ticket from your favourite language. The mod_auth_tkt distribution includes a Perl module, a Python module, and PHP helper functions. There is a login Perl CGI script that uses the Perl module and can do a lot of things just by configuring it and filling in the "sub validate", so user and password get verified against any database you want. Look at the example: require the class that will do the validation, and then return true if the supplied credentials are correct and false if they are not.

So... What does the application have to do to get into the single sign-on world? In many cases: nothing. If you have been relying on Apache basic authentication, you have probably been receiving the already authenticated user in the REMOTE_USER environment variable. When a valid ticket is detected, the module takes the user for which the ticket was generated (remember that if the ticket was issued, the supplied credentials were correct) and sets REMOTE_USER. So if your application was using basic authentication, you are in luck: set the Apache config and let it run!

If you were authenticating within your application, you are in less luck. There is a forest of possibilities for how your system works, but most probably you are just storing the logged-in user in the session once authenticated, or getting the logged-in user from one single point in your code. You can see where I'm going... Just start to rely on REMOTE_USER from that point.
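As a rough sketch of that "one single point", assuming a plain CGI-style environment: the logged_in_user function and the session fallback are invented for the example, and only REMOTE_USER comes from the setup described above.

```perl
use strict;
use warnings;

# The single point where the application asks "who is logged in?".
# With mod_auth_tkt in front, Apache has already validated the ticket and
# exported the user as REMOTE_USER.
sub logged_in_user {
    my ($env, $session) = @_;

    return $env->{REMOTE_USER}
        if defined $env->{REMOTE_USER} && length $env->{REMOTE_USER};

    # Hypothetical legacy path, kept only while migrating: the user the
    # application used to store in its own session.
    return $session->{user};
}

my $user = logged_in_user(\%ENV, {});
print defined $user ? "Hello, $user\n" : "Nobody is logged in\n";
```

Once every protected URL sits behind the module, the session fallback can be dropped and the function shrinks to a read of REMOTE_USER.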


Posted on 26/11/

I've just released another module to CPAN. This one is Net::Server::Mail::ESMTP::SIZE.

When I developed tests for Test::SMTP, I had to implement a mini-SMTP server. Instead of reinventing the wheel, I chose the Net::Server::Mail distribution to quickly get what I wanted. But to test supports_cmp_ok and supports_like, there was no pre-built extension that reported parameters. So I stubbed in the SIZE extension only for the tests to play around with.

On my list of "possible CPAN modules" appeared Net::Server::Mail::ESMTP::SIZE, but this time implementing actual functionality. As with the first attempt, I made great progress and everything went quite smoothly... It's done!

One problem I had was actually getting the module to plug-in to the Net::Server::Mail::ESMTP... The documentation is quite short, so I had to "inspire" myself off the code from the modules that were already written, and do lots of Dumper(@_) to see what was going on... I hope I got it all right :p.

Faster isn't always scalable

Posted on 26/11/

Sometimes when designing we go to great lengths optimizing for speed, and don't always think of scalability. To think scalable you have to let operations run in parallel, locking as few common resources as possible so that the work has a good chance of actually being done in parallel. And sometimes, to be fast, you hold a lock, so you can assume that you are alone (you can skip synchronization with others, and thus its overhead). But that means that you are the only one that can be working.

As an example: MyISAM tables are fast at reading and writing but scale badly for writes. As concurrent reads go up, one single write blocks ALL the reads on the table, because writes hold a lock on the entire table until they are done. InnoDB, on the other hand, is slower updating rows, but because writes only lock the rows that they are writing, reads can still be done concurrently if they are addressed to unlocked rows.

The confusion normally comes from equating faster with fewer CPU cycles: since a CPU is a locked resource, the faster you do things, the more you can do in parallel.

Think before holding a lock ;)

Opsview Asterisk notification

Posted on 26/11/

After a couple of weeks in internal testing, I can now contribute a helpful notification method for Opsview, and more generally for Nagios.

We wanted Asterisk to wake up an on-call engineer if an alert was detected. What seems like a pretty trivial thing has a couple of subtleties that have to be treated with care to not make a nightmare out of it.

First: how to make asterisk call a phone number. There are a couple of documented ways to do it:

  • create a call file on the server. This would have to be done via ssh from the nagios master server. I don't like this because it is touching the internals of asterisk, although it is standard (much like creating a mail in the sendmail queue).
  • via the Asterisk Manager protocol (astman). This seemed more suited for the task. The downside is that the Perl API for astman is quite spartan and not all that documented (Asterisk::Manager).

When someone asks me to choose an option from two that I don't like, I always ask them this question in return: "What do you prefer? Syphilis or Gonorrhea?". This time I was asking it to myself... So I chose the astman solution. I hope that this way the Asterisk::Manager module will get a bit more mature, and therefore be a better long term solution.

Second: The calling gets done in asterisk. The notification script only calls an asterisk extension. We wanted a "human hunter" that would not stop calling until someone acknowledged the alert. Maybe someone would want a different behavior, so that is customizable via asterisk programming.

Lessons Learned

  • Astman was the right way to go: Astman has access rights per user and per host in the manager.conf file.
  • Alarms are random and can happen in parallel: The first day in alpha there was a connectivity problem. A lot of alerts were spawned. The poor guy at the phone had to acknowledge a lot of calls :S. We noticed that a "don't call me if you have just called me" mechanism was needed.

    A configurable lock out mechanism was added so that some calls could be made in parallel in a customizable way. Maybe you want to call a number two times in a row if the alert is for different hosts or host groups, or maybe you just want one time alert per phone number called, or just notify to one phone whatever the notification is.
  • Nagios kills "lazy" notifications: Because we hunt down someone, the call can get long. Another time... someone had to acknowledge a lot of calls that day... :S The call is not registered as successful until Asterisk says that it's successful and then it's registered in the notified database. Other notifications get queued up while a lock on a notification db is held. When tested out of the Nagios environment everything was OK. Debugging revealed that when the notification script was taking more than x seconds, Nagios was killing it, the lock was released, and Asterisk was continuing with the call. The next notification was kicking in (because the lock was released), and Asterisk was dialing again to the same contact.

    This was resolved by forking, and detaching the child from the parent process (just like a daemon does). The detached process does the calling. The parent returns immediately.
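The fork-and-detach trick can be sketched with core Perl only. This is not the published notification script, just the general daemon-style pattern under assumed names: run_detached is invented for the example, and the sleep stands in for the long Asterisk call.

```perl
use strict;
use warnings;
use POSIX qw(setsid);

# Run $work in a detached grandchild and return immediately, so that a
# caller with a timeout (Nagios, here) never waits for the slow part.
sub run_detached {
    my ($work) = @_;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid) {
        # Parent: reap the short-lived first child and return at once.
        waitpid($pid, 0);
        return;
    }

    # First child: start a new session so it no longer belongs to the
    # parent's process group, then fork again and leave.
    defined setsid() or POSIX::_exit(1);
    my $grandchild = fork();
    POSIX::_exit(1) unless defined $grandchild;
    POSIX::_exit(0) if $grandchild;

    # Grandchild: drop the inherited standard handles and do the long job.
    open STDIN,  '<', '/dev/null';
    open STDOUT, '>', '/dev/null';
    open STDERR, '>', '/dev/null';
    my $ok = eval { $work->(); 1 };
    POSIX::_exit($ok ? 0 : 1);
}

run_detached(sub {
    sleep 1;    # stands in for the long "hunt a human" call
});
print "notification handed over\n";
```

The children leave through POSIX::_exit so they never run the parent's END blocks or flush its buffered output a second time. Any lock that must outlive the notification script has to be taken (or re-taken) by the detached process, since the parent is gone by then.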

So you can get the notification script here. In the end it got a little more complicated than it seemed at first :)

New design

Posted on 26/11/
This time it's a design! Thanks Pau.

Nagios Checker patch got through

Posted on 26/11/

I'm pleased to announce that my patch for Nagios Checker to support Opsview is now available in the official distribution of the plugin. You can get it here

I've been using Nagios Checker for quite some time now, and I like it very much, and now that it's patched for Opsview, I like it even more. Instead of having a browser window open, and going to look at it when an alert email gets in my mailbox, I get a nice warning sound, and an overview of the problems just hovering over the checker. Direct access to Opsview is granted just clicking on the alert. Need to add a host? Or curious to see the Opsview HH page? Just click on the 'go to Nagios' menu when you right click on the Nagios Checker status.

My office colleagues have been beta testing the patched plugin, and find it very useful, but they had a bit of trouble configuring NagiosChecker correctly to play with Opsview so here is how to do it:

When you're configuring your Opsview host in Nagios Checker:

  • URL of the Nagios Interface: http://my.opsview.server/
  • Type of server: Opsview
  • URL of status.cgi: Select manually, and fill in http://my.opsview.server/cgi-bin/status.cgi

I haz got Kwalitee

Posted on 26/11/

I've been trying to increase the kwalitee of my modules with every release. Looks like I got it right.

A couple of tips are:

  • use a recent Module::Build: It gives you kwalitee very easily, as it does tons of stuff for free. But use a current version. The first modules I contributed used the Debian Etch Module::Build, which didn't generate a known-spec META.yml. Got that fixed for free just by upgrading Module::Build.
  • make manifest before doing the make dist.
  • use Test::Pod and Test::Pod::Coverage: Test::Pod will alert you if you have typos in your POD, and Test::Pod::Coverage will bug you when you don't document a function

Of course there is no guarantee that if a module has kwalitee then it's good... It has to have proper tests (Test::SMTP had 100% code coverage, and even that won't guarantee bug-freeness), and those tests have to run on as many platforms as possible (that won't assure anything either), and a bunch of other things which I'll write about in future articles... I hope I maintain my kwalitee (I like being on the first page of the "Authors with less than five dists" ;))

Comments activated

Posted on 26/11/

I've activated comments for the posts. You can rate them too... (so now I can know if you like the posts, and if I should stop blabbing about some topic :p)

Please be polite, and try to add some extra content to the posts, without flaming, insulting and such. Those attitudes will not be tolerated, and such comments will be deleted without any explanation.

Machine naming schemes

Posted on 26/11/

When you industrialize your systems management (you are a hosting provider), or you simply have LOTS of machines for whatever reason, you have the need of a naming scheme. You have probably been naming machines by:

  • Planets: Sun, Mercury, Venus, Pluto, ...
  • Constellations: Andromeda, Orion, ...
  • Winds: Tramuntana, Xaloc, Garbí, ... (Here in Catalonia these are used a lot!)
  • LOTR: Mordor, Shire, ...

So you start naming machines with a scheme that helps you locate them: means rack 01 position 01 (positions starting from the rack bottom), for example. The downside is that once you have standardized the machine names, you lose that special "think of a name" moment, and the geekiness of the thing altogether (people who aren't in IT usually don't know that machines even have names!)

I personally name my machines (and electronic devices that have computer-like functionality) after robots that appear in Futurama. So I have Bender, Flexo, Roberto, Calculon, etc. It's funny when I get into my boss's car and the hands-free display says "Kwanzabot", and to see the HELO in SMTP headers display "SINCLAIR-2K".

Of course this is not a new thing, and RFC 1178 has some interesting situations in the "what NOT to do" guidelines XD. I'm pretty sure that most of us have fallen into one of the situations described in the RFC.

The bottom line is "try to have fun naming!" (when you can)

New style

Posted on 26/11/

New style for the blog! Contributed by Pau Puig, one of CAPSiDE's workers.

Thanks Pau!

The connotation of PIN numbers

Posted on 26/11/

I discovered some time ago a neat "trick" for mobile phones that not many people know, and that I'm sure the "security paranoid" bunch of people will appreciate.

When your mobile phone prompts for a pin, strangely, it lets you insert more than 4 numbers. That's because: pins can be longer than 4 digits on mobile phones!

I investigated a little further and it turns out that Wikipedia has it documented! It's curious how we have associated the term "PIN code" with only 4 digits. Maybe phone manufacturers should have called it Unlock Number to take the 4-digit connotation out...

So now you know it... you can have the worst type of PIN... One that is probably out of the mental scope of an attacker. And if you're a developer and need to ask for a numeric password, be careful of the connotation of "PIN". Maybe you'll find yourself with all passwords being 4 digits long, although you support more ;)

Writing great Nagios plugins

Posted on 26/11/

So you want to write a Nagios plugin, and you want it to be a great one! A great plugin, aside from having some great functionality is one that provides good documentation and fits nicely into the Nagios ecosystem, that is, that nagios users will be comfortable with it.

Right now you are thinking: "how can I do that? I have to look at other plugins, read guidelines, learn a lot about the nagios way to do things, and what the community expects from a plugin, etc. It's a quite big task, and I just wanted to write a quick and dirty plugin!"

If you program your plugins in Perl you are in luck, because smart people have already done that for you! Nagios::Plugin helps you fit into that ecosystem and gives you a lot of functionality for the best cost: FREE. You get your plugins done in less time, with more features and with fewer bugs.

First step: Instantiate a Nagios::Plugin object

use Nagios::Plugin;

# One constructor call sets up usage, version, help text and examples
my $np = Nagios::Plugin->new(
        usage => "Usage: %s [-v|--verbose] [-t|--timeout=seconds] -c|--critical=<threshold>",
        version => '1.0',
        blurb => qq{Count the xxx's in yyy},
        extra => qq{
 -c 10
   returns CRITICAL if xxx's are greater than 10
 -c 20 -t 60
   returns CRITICAL if xxx's are greater than 20. Timeout in 60 seconds if it takes too long.},
        url => ''
);

You get:

  • standard parameters
    -V version info
    -h autogenerated help
    -v verbose output flag
    -t timeout
    nice features that you don't have to worry about, and that Nagios users will be very happy to have. Programs like Opsview will show the help in their web interface (again... for free).

  • plugin versioning
    version and url get outputted for free (too) in help and -V
  • help text
    the help text consists of the version info, license (GPL if not overridden), blurb (text describing what the plugin does), parameter help list (autogenerated from the add_arg() info) and extra info. The extra info is the ideal place to give the user a couple of usage examples with a small description of what the invocation of the plugin with those parameters does.
That's a lot for one statement!

Second step: add your parameters

$np->add_arg(
    spec => 'warning|w=s',
    help => "-w, --warning=RANGE\n     Range for returning WARNING"
);
$np->add_arg(
    spec => 'number|n=i',
    help => "-n, --number=INTEGER\n     Number of yyy's to xxx",
    required => 1
);
$np->add_arg(
    spec => 'filter|f=s',
    help => "-f, --filter=aaa\n    Filter by aaa",
    default => 'aaa'
);

# Parse @ARGV and process standard arguments (e.g. usage, help, version)
$np->getopts;
You get free parameter type validation, so if you declare that a parameter is an integer and the user passes something else, the plugin will not go past the $np->getopts statement. You also specify a string for each parameter that will be displayed when the user calls the plugin with --help. If you are going to have a critical and a warning threshold, tell the user that they are RANGE items (you'll see why below). Some standard parameter names are:
-c critical range
-w warning range
-C for parameters that start with "c" other than critical
-H hostname: for names of machines
-p port: for port numbers
-4 for using IPv4
-6 for using IPv6

Third step: do what your plugin does

Now you have to work (hey! you haven't broken a sweat yet!). To get the value of the parameters passed to your script, you have handy $np->opts->paramname accessors.

Fourth step: return performance data (it's free)

You have almost surely collected a measurable quantity to compare against a threshold. Output the collected data as performance data. I'm sure you will want to see how your collected data evolves through time with a nice graphing tool. Is it going up? Down? Is it high at work hours? Is it low on weekends?

$np->add_perfdata(
    label => "size",
    value => $value,
    uom => "kB",
    warning => $np->opts->warning,
    critical => $np->opts->critical
);

Let UOM be:

  • no unit specified - assume a number (int or float) of things (eg, users, processes, load averages)
  • s - seconds (also us, ms)
  • % - percentage
  • B - bytes (also KB, MB, TB)
  • c - a continuous counter (such as bytes transmitted on an interface)
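
To see what ends up on the wire, here is a minimal sketch that builds one performance data item by hand, in the label=value[UOM];warn;crit layout that Nagios expects after the "|" in the plugin output. perfdata_item is a hypothetical helper for illustration; add_perfdata does this work for you.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Nagios::Plugin:
# build one item as label=value[UOM];[warn];[crit]
sub perfdata_item {
    my (%arg) = @_;
    my $uom  = defined $arg{uom} ? $arg{uom} : '';
    my $item = sprintf "%s=%s%s", $arg{label}, $arg{value}, $uom;

    # Thresholds are optional; drop empty trailing fields
    my @thresholds = map { defined $_ ? $_ : '' } @arg{qw(warning critical)};
    pop @thresholds while @thresholds && $thresholds[-1] eq '';
    $item .= ';' . join(';', @thresholds) if @thresholds;
    return $item;
}

print perfdata_item(label => 'size', value => 512, uom => 'kB',
                    warning => '800', critical => '1000'), "\n";
print perfdata_item(label => 'users', value => 7), "\n";
```

The first call prints size=512kB;800;1000 and the second, with no unit and no thresholds, prints users=7.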

Fifth step: return the status

Now you decide if the plugin has to return CRITICAL, WARNING or OK. This code quickly springs to mind:

if    ($value > $critical) { $status = CRITICAL }
elsif ($value > $warning)  { $status = WARNING }
else                       { $status = OK }

What if somebody wants OK between critical and warning? Again you can work less and get more: $np->check_threshold to the rescue! Nagios has a RANGE specification that check_threshold understands, so you can just pass the collected value, the critical parameter and the warning parameter. You get the status that has to be returned calculated for free!

my $status = $np->check_threshold(
    check => $value,
    warning => $np->opts->warning,
    critical => $np->opts->critical
);
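
To give an idea of what a RANGE buys you, here is a simplified sketch of the range rules from the Nagios plugin guidelines ("10" alerts outside 0..10, "10:" alerts below 10, "~:10" alerts above 10, "@10:20" alerts inside 10..20). The function names are made up and malformed ranges are not validated; in a real plugin, let check_threshold do this.

```perl
use strict;
use warnings;

# True when $value should raise an alert for the given RANGE.
# Simplified sketch: malformed ranges are not rejected.
sub range_alerts {
    my ($range, $value) = @_;
    my $alert_inside = ($range =~ s/^\@//) ? 1 : 0;    # leading @ inverts
    my ($start, $end) =
        $range =~ /^([^:]*):(.*)$/ ? ($1, $2) : (0, $range);
    $start = 0 if $start eq '';                         # ":10" means "0:10"
    my $too_low  = $start ne '~' && $value < $start;    # "~" = no lower bound
    my $too_high = $end ne ''    && $value > $end;      # "" = no upper bound
    my $outside  = ($too_low || $too_high) ? 1 : 0;
    return $alert_inside ? ($outside ? 0 : 1) : $outside;
}

# Critical wins over warning, as in the hand-written version
sub status_for {
    my ($value, $warning, $critical) = @_;
    return 'CRITICAL' if range_alerts($critical, $value);
    return 'WARNING'  if range_alerts($warning,  $value);
    return 'OK';
}

# Alert when the value gets too LOW: warning below 10, critical below 5
print status_for(50, '10:', '5:'), "\n";
print status_for(7,  '10:', '5:'), "\n";
print status_for(3,  '10:', '5:'), "\n";
```

The three calls print OK, WARNING and CRITICAL, something the naive "greater than" comparison cannot express without extra parameters.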

Now just return the calculated status and a short single-line message with the exit method. Don't be too verbose, though, because the output gets cut!

$np->nagios_exit( $status, "$value xxx's were found" );

More neat (and free) details

  • verbosity
    $np->opts->verbose will return the number of -v flags in the parameters. Use it if you want to give the users a little more info (-v), or a little more (-vv), or a lot more (-vvv) :p.
  • Read the docs
    The docs will reveal all sorts of extra info. Read the helper classes (Nagios::Plugin::Xxx) documentation too, because not everything is exposed in the Nagios::Plugin documentation ;)


Nagios::Plugin will save you time, and make your plugins better, with less effort.

Proud to see Opsview 2.12.1

Posted on 26/11/

The development work that got done when Ton Voon came to CAPSiDE has got through. I am proud of the add-ons that we have contributed, and hope to add more over time.

A lot of effort has gone into each feature by the CAPSiDE Team and by Altinity.

The features added by CAPSiDE are:

  • Single Sign-on
  • Event handlers
  • Customizable host check commands
  • Customizable SMS Notification methods

One thing from CAPSiDE that didn't make it into the 2.12.1 release (but we hope it will come soon) is Nagvis integration, so you can map out your servers and see them the way you want to.

We are looking forward to hearing whether these add-ons have been useful to the community, and if they are being used and how. Drop us a mail on the opsview users list ;)

Getting to the backends

Posted on 26/11/

As I already explained, simple web apps can start using mod_auth_tkt pretty fast if they were relying on HTTP basic authentication.

When you control the software being used (be it yours or open source) you can always take on parsing the ticket to get the info back, be it in a cookie, be it in a parameter via GET.

Let's examine a more complex scenario. Problems start arising when using application servers, or proxying to non auth_tkt aware servers or applications. The frontend can validate the ticket (authenticating the user), but, since mod_auth_tkt basically leaves the ticket info in the REMOTE_USER environment variable, and these variables don't get proxied, you don't receive the logged in user in the backend. So... let's try to find some ways of getting the info to the backends (thanks to the people on the mod_auth_tkt list for the pointers).

Using headers

Put the REMOTE_USER in an HTTP header. Use mod_headers.

ProxyPass /headertest/ http://backend/xxx/
ProxyPassReverse /headertest/ http://backend/xxx/

<Location /headertest/>
   AuthType Basic
   TKTAuthLoginURL /login
   TKTAuthTimeout 600s
   RequestHeader set X-AuthTkt-Remote-User "%{REMOTE_USER}e"
   RequestHeader set X-AuthTkt-Data        "%{REMOTE_USER_DATA}e"
   RequestHeader set X-AuthTkt-Tokens      "%{REMOTE_USER_TOKENS}e"
   require valid-user
</Location>

And in the backend, just pick up the results! (If you are running a CGI on the backend, just look up the environment variables HTTP_X_AUTHTKT_REMOTE_USER, HTTP_X_AUTHTKT_TOKENS and HTTP_X_AUTHTKT_DATA.) Of course, you'll say: I have to modify the backend software to read from HTTP_X_AUTHTKT_REMOTE_USER! If the backend server is another Apache, you still have an ace up your sleeve: mod_setenvif.

    SetEnvIf X-AuthTkt-Remote-User "(.*)" REMOTE_USER=$1
    SetEnvIf X-AuthTkt-Data        "(.*)" REMOTE_USER_DATA=$1
    SetEnvIf X-AuthTkt-Tokens      "(.*)" REMOTE_USER_TOKENS=$1
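
On the backend side, picking up the user from a Perl CGI could look like the sketch below. proxied_user is a hypothetical helper; it takes the environment as a hashref so it can be tried without a web server. Treating the literal "(null)" as "no user" is a defensive assumption, in case the frontend forwards an unset variable that way.

```perl
use strict;
use warnings;

# Hypothetical helper: return the authenticated user as seen by a backend CGI.
# Prefers REMOTE_USER (set natively or via SetEnvIf), then the proxied header.
sub proxied_user {
    my ($env) = @_;
    for my $key (qw(REMOTE_USER HTTP_X_AUTHTKT_REMOTE_USER)) {
        my $user = $env->{$key};
        # Assumption: empty values and "(null)" mean the frontend had no user
        return $user
            if defined $user && length $user && $user ne '(null)';
    }
    return;    # nobody logged in
}

# In a real CGI you would call proxied_user(\%ENV)
my $user = proxied_user({ HTTP_X_AUTHTKT_REMOTE_USER => 'alice' });
print defined $user ? "Hello, $user\n" : "Not logged in\n";
```

Keep in mind that anybody who can reach the backend directly can send this header too, so the backend should only accept requests coming from the frontend.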

Using URL GET parameters

You can rewrite the REMOTE_USER to a parameter in the URL, and fetch it in the backend. mod_rewrite can handle this with its eyes closed.

ProxyPass /headertest/ http://backend/xxx/
ProxyPassReverse /headertest/ http://backend/xxx/

<Location /headertest/>
   AuthType Basic
   TKTAuthLoginURL /login
   TKTAuthTimeout 600s

   RewriteEngine on
   RewriteRule  ^(.+)$   $1?remote_user=%{ENV:REMOTE_USER}    [QSA]

   require valid-user
</Location>

mod_rewrite can set environment variables too, so, if you do the inverse process in the backend (set the environment variable to the value of the GET parameter), you get the same result. I like the header solution best because mod_rewrite is a heavy module, while the header solution just adds one module that the frontend needs (mod_headers) and one that the backend needs (mod_setenvif).
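
If the backend is a Perl script rather than another Apache, fetching the rewritten parameter is a few lines. A minimal sketch, assuming the parameter is called remote_user as in the rule above; user_from_query is a made-up name, and a real application would more likely use its framework's parameter parsing.

```perl
use strict;
use warnings;

# Hypothetical helper: extract remote_user from a raw query string.
sub user_from_query {
    my ($query) = @_;
    for my $pair (split /[&;]/, defined $query ? $query : '') {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && $name eq 'remote_user';
        $value = '' unless defined $value;
        $value =~ tr/+/ /;                                # + encodes a space
        $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;    # undo %XX escapes
        return $value;
    }
    return;    # parameter not present
}

# In a CGI the query string arrives in $ENV{QUERY_STRING}
my $user = user_from_query('page=2&remote_user=jose%20luis');
print "user: $user\n" if defined $user;
```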

There was a comment on the list on getting username and password to the backends (for apps that need the two on every request), but for that you have to store the password encrypted in the cookie. I'll have a shot at that one in another post (and maybe use the technique in the real world in an OS application... we'll see).

I wish I never hit the send button

Posted on 26/11/

Every day we send out lots of mails. I normally read a mail two times before sending it to a customer. And despite that, there have been times where I wished that a message had not gone out. Maybe I pressed the shortcut to send out the mail when it was half finished, maybe I got an afterthought on how to express something, or on how to solve an issue in another way, or to include someone in the conversation...

The other day, talking with one of our customers' project managers, he told me he was going to send me a mail while we were on the phone. He told me that I would receive the mail in one minute, and the curious thing: "one minute" was not an expression. I got interested in the delay, and just had to ask why. Basically he has a rule that delays all outgoing mail for one minute before submitting it to the server. He gave me a nice and easy solution I hadn't ever seen. You can add a configurable delay to your outgoing messages with a very simple Outlook rule! Now you always have a second chance! After all, the customer probably won't notice the delay.

I'm not saying that this is the remedy to all mistakes, but I don't know why, when you press the send button, a background process kicks in and you realize your mistakes, and this is a nice way of getting to the message before it really gets sent. Of course your brain can adapt to kick that background process in after the delay... you never know brains! ;)

I liked the solution because it was a way to use Outlook rules that I had never thought of, although you can see that it isn't hidden at all (create an outgoing message rule).

Oh... wait... I wanted to blog about open source things! I tried to get the same functionality out of Thunderbird but it seems that rules only apply to incoming mail. Does anybody know of an Open Source mail client that can implement this sort of behaviour?

See you at the NETWAYS Nagios Konferenz

Posted on 26/11/

I'll be attending Nagios Konferenz 2008 on September 11 and 12, in Nuremberg. I'll be presenting the Review of notification methods talk. In other words: how people are notifying, and, if you want to develop your own notification methods, what we've learned the hard way about doing so.

Hope to see you there, and looking forward to hearing the very interesting presentations that are scheduled (It's a shame I don't know German!).

How we notify engineers on 24/7 [Feedback III]

Posted on 26/11/

After my talk, some people asked me if I could publish the Asterisk code that we use to notify our on-call engineers. We published asterisk_notify, which acts as the glue between Nagios and Asterisk, but didn't publish the logic that was in Asterisk to be sure that someone (human) picked up the phone, and got notified that Nagios detected a problem.

I am no Asterisk guru, and I have a hard time programming in Asterisk's "language" (kind of reminds me of BASIC). I cannot guarantee it will work for you in your environment. I can only say that it works for us. Suggestions and improvements are welcome (I said I was no Asterisk guru... I just went around the docs and searched and investigated a lot in VoIP Info).

; Custom Nagios Notify Extension
; (c) Jose Luis Martinez
; Use at your own risk.
[custom-nagiosnotify]
; s,1 defines where to get the number from. Select one of the s,1
; lines, and comment out the others...
; Just for hard-coding a list of numbers 
;exten => s,1,Set(SUPPORT_GROUP_NUMS=0666666666#0666666667#);
; The nagios notification script was setting SUPPORT_GROUP_NUMS, so
; the STUB=1 action was just to have an s,1 action (to not touch the rest
; of the extension)
;exten => s,1,Set(STUB=1);
; we use s,1 to setup a variable named SUPPORT_GROUP_NUMS that will contain
; the list of numbers that Asterisk will "hunt down"
; AGI looks up who to call in our on-call database and
; sets that variable via the AGI interface. 
exten => s,1,AGI(
exten => s,2,NoOp()
exten => s,3,SetLanguage(es)
exten => s,n,Set(RingGroupMethod=hunt)
; make macro "nagios-pickup" handle when the user answers
; the timeout waiting for someone to answer is 30 seconds. 
exten => s,n(DIALGRP),Macro(dial,30,M(nagiospickup)m,${SUPPORT_GROUP_NUMS})
exten => s,n,Set(RingGroupMethod=)
exten => s,n,Goto(custom-nagiosnotify,s,2)

[macro-nagiospickup]
exten => s,1,Wait(1)
; playback a sound that says: "There is an Opsview alert. Please press 1"
exten => s,2,Playback(custom/alertaDOpsview&custom/premi1)
exten => s,3,Read(OneKey||1||1|5) ; Store in 'OneKey' the pressed key. timeout in 5 secs
exten => s,4,GotoIf($[${OneKey} = 1]?s|5:s|8) ; GoTo prio 5 if "1" was pressed; else to prio 8 
exten => s,5,NoOp(Caller marked 1) ; Called person pressed number 1
; playback "you have an alert" in a loop
exten => s,6,Playback(custom/alertaDOpsview)
exten => s,7,Goto(s,6)
; We got tired of waiting for the user to press 1. We'll continue down the hunt list...
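
The AGI script called from s,1 is not shown here. As a rough idea of what such a script does, this is a minimal Perl sketch of the AGI conversation: read the agi_* lines Asterisk sends, answer with a SET VARIABLE command, and read the reply. It is not our production script; on_call_numbers is a hypothetical stand-in for the on-call database lookup, and the demo at the end uses in-memory filehandles instead of a real Asterisk.

```perl
use strict;
use warnings;

# Hypothetical on-call lookup; a real script would query a database here.
sub on_call_numbers { return ('0666666666', '0666666667') }

# Build the AGI command that sets a variable to "number#number#",
# the same format as the hard-coded example in the dialplan.
sub set_variable_command {
    my ($name, @numbers) = @_;
    my $value = join '', map { "$_#" } @numbers;
    return qq{SET VARIABLE $name "$value"};
}

# Speak AGI over the given filehandles (STDIN/STDOUT under Asterisk).
sub run_agi {
    my ($in, $out) = @_;
    # Asterisk first sends "agi_*: value" lines, ended by an empty line
    while (defined(my $line = <$in>)) {
        last if $line =~ /^\s*$/;
    }
    print {$out}
        set_variable_command('SUPPORT_GROUP_NUMS', on_call_numbers()), "\n";
    my $reply = <$in>;    # something like "200 result=1" on success
    return (defined $reply && $reply =~ /^200/) ? 1 : 0;
}

# Demo: in-memory filehandles stand in for Asterisk
my $input = "agi_request: lookup.agi\nagi_channel: SIP/demo\n\n200 result=1\n";
open my $in, '<', \$input or die "Can't open in-memory input: $!";
my $sent = '';
open my $out, '>', \$sent or die "Can't open in-memory output: $!";
my $ok = run_agi($in, $out);
close $out;
print $sent;
```

Under Asterisk you would call run_agi(\*STDIN, \*STDOUT) with output buffering turned off ($| = 1), since Asterisk waits for each command line.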

One known bug is that if you press number one BEFORE hearing "There is an Opsview alert. Please press 1", the call doesn't get acknowledged. Of course, it will call you again, and you'll have the chance to be less impatient this time ;)

If you use or adapt this script, make it do interesting new things, fix bugs, etc. please give me feedback.

Notifications stall Nagios [Feedback I]

Posted on 26/11/

Notifications are NOT asynchronous, that is, nothing goes on in Nagios while your notification script is running. So try to be fast when notifying. Or simply return something immediately (Nagios does nothing with the return code from notification scripts for the moment). Note also that your script will be killed if it takes too long.

If you still have a long-running notification script, you can opt to fork, detach the child process (like a daemon does), and do all the work in the child. Just return immediately in the parent process. If your notification script is in Perl, just do this:

use POSIX 'setsid';

open STDIN, '/dev/null'     or die "Can't read /dev/null: $!";
open(STDERR, "> /dev/null") or die "Can't write to /dev/null: $!";
defined(my $pid = fork)     or die "Can't fork: $!";
exit if $pid; # parent process just exits.
setsid or die "Can't start a new session: $!";

Nagios::Plugin::DieNicely Released

Posted on 26/11/

As your Nagios plugins get a bit more complicated, and depend on external CPAN modules you will find yourself with spontaneous UNKNOWN states on Nagios when the services that you monitor are faulty. This will probably come from the fact that different modules have different ways of notifying that something has gone wrong. Some return undef, and some call die or croak.

When they call die is when you have Nagios reporting UNKNOWN states, and "no output". Nagios will consider exit codes that it doesn't know as unknown states, and perl exits with 255 on die. And one more thing: the exception gets printed to STDERR, and Nagios will just discard it. So you never know what hit you.

Normally you program thinking that things go well, and if there is an unhandled exception the program is supposed to die. But we're monitoring... an unhandled exception can probably give some important info on what's going on. So you wrap the code you THINK will fail in an eval, and you exit with the appropriate Nagios exit code if there is an exception. But what will you do? Wrap everything in an eval? Ugly. And you have to remember... Fear not. Just use Nagios::Plugin::DieNicely and program as always.

Nagios::Plugin::DieNicely will trap Perl's die (and Carp's croak and confess) for you. Then it will output the exception to STDOUT in Nagios format and exit with a Nagios CRITICAL exit code. So now you have one less thing to worry about.
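
The general idea can be sketched with a global __DIE__ handler. This is not the module's actual code, just an illustration of the technique under my own assumptions; critical_line and install_die_handler are made-up names, and the exact output format of the module may differ.

```perl
use strict;
use warnings;

# Turn an exception into a single Nagios-style output line.
sub critical_line {
    my ($error) = @_;
    $error = 'unknown error' unless defined $error && length $error;
    $error =~ s/\s+/ /g;      # Nagios wants one line of output
    $error =~ s/^ | $//g;     # trim the leftover edge spaces
    return "CRITICAL - $error";
}

# Install a global handler: report on STDOUT, exit with 2 (CRITICAL).
sub install_die_handler {
    $SIG{__DIE__} = sub {
        die @_ if $^S;        # inside an eval: let the caller handle it
        print critical_line($_[0]), "\n";
        exit 2;
    };
}

# What Nagios would get to see for a multi-line exception
print critical_line("Can't connect to host\n at check.pl line 12.\n"), "\n";
```

A plugin would call install_die_handler() once at the top; after that an uncaught die shows up in Nagios as a CRITICAL with the message, instead of an UNKNOWN with no output.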

This module was motivated by a real case. We were (and actually are) monitoring web services with the CPAN SOAP::Lite module. These web services fail very often due to causes we can't control. So I have had the opportunity to see the Nagios check that hits them in a variety of cases when the web service / server is failing. I've gone through 4 (or so) revisions of the code that returned UNKNOWN states in corner cases where the client module would behave in unexpected ways, and a couple of them were "die cases" that I wrapped an eval around. But I finally thought that this could maybe be done in a better way.

Command Line arguments vs Environment Variables [Feedback II]

Posted on 26/11/

Use command line arguments for your service checks and notification scripts, as Nagios will be able to optimize them. Nagios 2 used to calculate all the values for all the macros before executing a command. The number of Macros is quite big (see v2 vs v3), so there's a lot of time wasted calculating values that will never be used.

Nagios 3 will just look at the Macros it has to resolve and only calculate those. Of course, it cannot look at or interpret what environment variables a check or notification needs. If you were relying on getting info from the Environment Variables, then you won't find any data, unless you tell Nagios to revert to the old behaviour (and pay the penalty of calculating everything every time). See for more info
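
In practice this means passing macros on the command line, for example with a command_line like notify.pl --host "$HOSTNAME$" --state "$HOSTSTATE$" (the script name and option names are made up for this example). A minimal sketch of a script that prefers its arguments and only falls back to the NAGIOS_* environment variables if they happen to be there:

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: collect what a notification script needs.
# Takes the argument list and the environment as references.
sub notification_details {
    my ($argv, $env) = @_;
    my @args = @$argv;
    my %opt;
    GetOptionsFromArray(\@args, \%opt, 'host=s', 'state=s') or return;

    # Environment macros are only a fallback; they may not be set at all
    $opt{host}  = $env->{NAGIOS_HOSTNAME}  unless defined $opt{host};
    $opt{state} = $env->{NAGIOS_HOSTSTATE} unless defined $opt{state};
    return \%opt;
}

# A real script would call notification_details(\@ARGV, \%ENV)
my $details = notification_details(['--host', 'web01', '--state', 'DOWN'], {});
print "$details->{host} is $details->{state}\n";
```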

Being the guy at the door

Posted on 26/11/

Sometimes you just can't rely on the Apache module for what you are doing. Maybe because you are not using Apache... or maybe because you find yourself in a situation where your application runs on a separate server, and Apache is just proxying requests back to you. The backend server doesn't receive the famous REMOTE_USER environment variable because the environment isn't passed when requests are proxied. If you know a way of getting the ENV to the backend server, drop me a mail (pplu at

So you are on your own! Just very recently the Apache::AuthTkt module got updated with a method to get info from a ticket back. That is: it used to be a one way ride: you could generate tickets, but from a ticket you couldn't get anything back, so you couldn't validate the tickets you generated (the module is supposed to do that).

Getting the application to properly handle the tickets is not that straightforward, so I'll detail what you have to do to get it (hopefully) right:

  • no ticket: show the login screen. Verify the credentials supplied on the login screen against the credential db of your choice and issue a ticket if they are correct. Redirect to the original (protected) URL. This time you'll have a valid ticket and get past. Of course, instead of showing the login screen you can show contents for anonymous users, if you like.
  • ticket expiry: when you parse the ticket you get the timestamp of the time it was generated. You have to check whether it has expired (ticket.ts + seconds for which the ticket will be considered valid < now). If the ticket has expired: show the login screen.
  • ticket renewal: when the ticket is close to expiring your application should renew it (generate another one with a new timestamp), so the user doesn't suddenly get logged out. If you don't, your logins will only last for a maximum of expiry seconds...
  • cross domain authentication: take into account that the ticket can be sent via GET instead of via cookie.
  • ticket tampering: the logged in userid, timestamp, and tokens (if any... see docs for more details) are being transmitted in almost clear text. So what if someone changes the data in the ticket and submits that? Luckily there is a digest field in the cookie that gets formed with: MD5(clear text info + ip address + the secret). The real implementation does more things, but this serves to make my point clear. The server can validate whether any of the clear info or the IP was changed by just regenerating the digest, and comparing it to the one that was received in the cookie. If the received digest doesn't match the regenerated one: show the login screen.
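
The checks in the list above can be sketched in a few lines of Perl. This uses the simplified digest from the description, NOT the real mod_auth_tkt digest, and every name in it (simple_digest, ticket_state, renew_window) is made up for the example.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Simplified digest: MD5(clear text info + ip address + the secret).
# The real mod_auth_tkt digest is computed differently.
sub simple_digest {
    my (%t) = @_;
    my $tokens = defined $t{tokens} ? $t{tokens} : '';
    return md5_hex(join '|', $t{uid}, $t{ts}, $tokens, $t{ip}, $t{secret});
}

# Decide what to do with a parsed ticket: 'login', 'renew' or 'valid'.
sub ticket_state {
    my (%arg) = @_;
    my $t = $arg{ticket};    # hashref: uid, ts, tokens, digest
    return 'login' unless $t;                            # no ticket
    my $expected =
        simple_digest(%$t, ip => $arg{ip}, secret => $arg{secret});
    return 'login' unless $expected eq $t->{digest};     # tampered
    my $age = $arg{now} - $t->{ts};
    return 'login' if $age > $arg{timeout};              # ts + timeout < now
    return 'renew' if $age > $arg{timeout} - $arg{renew_window};
    return 'valid';
}

my %ticket = (uid => 'pplu', ts => 1000, tokens => '');
$ticket{digest} =
    simple_digest(%ticket, ip => '10.0.0.1', secret => 's3cret');
my %check = (ip => '10.0.0.1', secret => 's3cret',
             timeout => 600, renew_window => 120);

print ticket_state(ticket => \%ticket, now => 1100, %check), "\n";
print ticket_state(ticket => \%ticket, now => 1500, %check), "\n";
print ticket_state(ticket => \%ticket, now => 1700, %check), "\n";
```

With a 600 second timeout and a 120 second renewal window, the three calls print valid, renew and login. On 'renew' the application issues a fresh ticket with a new timestamp; on 'login' it shows the login screen.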

On this last point we had a bit of a surprise. In Apache::AuthTkt you could call the new method parse_ticket, but it didn't return the digest, and it didn't do the validation. So if you were relying on that method to see if the ticket was valid, you would be accepting tampered tickets. So Ton Voon and I updated the module so it would have a new method: valid_ticket, which verifies the digest and only returns data if the ticket has not been tampered with. Hopefully the patch to the module will get to the CPAN Apache::AuthTkt module soon. Ticket expiry and renewal are still the application's responsibility.

The contributed PHP and Python APIs cannot parse and validate the cookie. So if you are using those languages, consider extending those contributed modules to do ticket validation.

PHP is almost always running under Apache, and if it runs under FastCGI there is no problem: it will inherit the REMOTE_USER environment variable, and you won't even notice. I suppose that python boys can rapidly implement the parse_ticket & valid_ticket methods with their eyes closed.

Apache::AuthTkt changes got through

Posted on 26/11/

The changes that Ton Voon and I made to the Apache::AuthTkt module got to CPAN! Now there is a validate_ticket method that can return the data in the ticket, after verifying that it has not been tampered with. See Apache::AuthTkt for more details.

Peter Karman, owner of the Catalyst AuthTkt module, is going to update his module to use the new features, and hopefully implement some of the functionality discussed on the mailing list.

Lightning fast debian installs

Posted on 26/11/

At work we are constantly installing Debian virtual machines to test the software we produce, test installation procedures, or just to get our hands on a machine to do a couple of tests.

We normally start out with a netinstall CD, or a VM image of a recently installed netinstall CD. At one point we got bored of waiting for all the extra packages to download.

So we decided to host a Debian package mirror in our office to speed things up, to be polite with the Debian servers, and to not affect the people in the office, who see their connections slow down.

We chose apt-mirror. Its name speaks for itself. Curiously it's made in Perl :). It can mirror parts or all of Debian package repositories. Now we just change the sources.list on the VM and we get LAN speeds when downloading.

A couple of downsides are:

  • You will download packages that you never need (although you never know... never say never)
  • You will probably add apt-mirror to your crontab. apt-mirror is pretty aggressive by default. It launches 20 "threads" (they aren't threads... they are wgets). This can affect your connectivity quite badly.
  • Even a single wget can hog all the bandwidth. Wget has a rate limiting parameter, but apt-mirror doesn't use it. I was about to patch apt-mirror myself... but, thinking that someone could have already had that problem, I found that on the project's SourceForge page there is already a contributed patch! We've tried it and obtained positive results. I wonder when it will become standard.

The next time you install a Debian think that it could be faster ;)

Catalyst::View::RRDGraph v0.02

Posted on 26/11/

Got out a new version of Catalyst::View::RRDGraph.

New features are:

  • Detect 0 byte files: There are situations where the generated RRD image file is 0 bytes, but RRDs::error isn't defined, so you don't get a valid image served. This condition is treated as an error from now on.
  • ON_ERROR_SERVE: This config key lets you control the image that you want served on an error. You can also generate custom content by setting it to a CODE reference.

Thanks to Ton Voon for sending in a patch and adding tests for the module.

AMF from Perl

Posted on 26/11/

The other day I was at the MIF Onsite II, in Barcelona. MIF Onsite was presenting the upcoming Flex 3 and AIR development environments. The presentations were nice, and Adobe had a little lunch prepared for us.

Flex can, of course, talk to the servers making bare HTTP requests. It also supports web services (SOAP). But one of the interesting things was AMF (a protocol for "Remoting"... that is, using objects and calling methods on a remote server). You'll say: "big deal! another one!". The interesting bit is that it is more efficient on the wire (it isn't as heavy as XML, which you have to construct, parse, is pretty redundant, etc). On a couple of slides they referenced languages with support for AMF. Those were PHP, .NET, and Java. Like a devoted Perl programmer I just had to ask if there was a Perl implementation. I asked if there was something for the Python boys too! The presenters were quite sure that there was a Python implementation, and it seems like a pretty nice one: PyAMF, but for the Perl boys... they hadn't heard of anything.

A quick search got me to a Perl implementation: AMF::Perl. But one thing doesn't convince me: "AMF::Perl - Flash Remoting in Perl Translated from PHP Remoting v. 0.5b from the -PHP project.". AMFPHP is in version 2.0 beta now, and AMF::Perl hasn't been touched since 19/Sep/2004. Since I can't find AMFPHP's history, I don't know which features were added to AMFPHP that the Perl implementation is lacking.

I'll try to contact the author to see if he can update me on the status of the project.

mod_auth_tkt: simple single sign-on

Posted on 26/11/

mod_auth_tkt is a handy Apache module for single sign on user authentication.
I like to explain what mod_auth_tkt does with a trip to the movies.

So... You go to your nearest movie theater. If you try to pass the doors without your ticket, the guy at the entrance stops you, and tells you to buy a ticket. You then go to the ticket booth and buy a ticket. You return to the door with the ticket, the guy validates it, and lets you go in. Then you see that a friend is waiting outdoors... so you walk out, talk to your friend, and then walk in again. The guy sees your ticket... And you're in again.

That is exactly what mod-auth-tkt does...

When a browser makes a request to a protected URL and doesn't have a ticket, the module (guy at the door) redirects to the ticket booth (login page) that gives him a ticket (cookie) if his credentials are valid, and redirects him happily on his way to the original URL. Every protected URL that is visited with the cookie is let past.

Back to the movies... If you approach the guy at the door with a little rectangular piece of paper, he won't let you in. If you approach with a ticket from another movie theater he'll give you a strange look... and... not let you in. If your ticket is for a movie session that already passed you can't get in either.

So the module does that too... It has a secret that only it knows (like the guy has a method of knowing if a ticket is for his movie theater). So all tickets for other sites can be rejected... Basically it takes a hash of some important parameters and the secret so that the data in the ticket is ensured to not be tampered with, and the ticket, too, has an expiry timestamp. So past its expiry timestamp it's not valid anymore (like the movie ticket).

But does the guy at the door reject all tickets that are not from the movie theater ticket booth? Maybe a third party can generate a ticket if it knows the secret formula the guy is using to know if tickets are for his theater.

If we share our secret with another mod-auth-tkt enabled system, it too can generate and validate our tickets. One web site can generate a ticket for another!

So now comes a technical limitation. The ticket is travelling in a cookie. Cookies cannot "cross" domains. So browsers won't ever send a cookie generated for one domain to another. Don't get nervous... You can always send the ticket in the URL... and the module will accept it from there.

So now you know how to go to the movies ;). Have fun.

mod-auth-tkt home

Mod-auth-tkt has more interesting functionality, so I recommend reading the docs. The home page doesn't have as much info as I would like. See the README in the downloadable tar.gz.

Finding your lost eth

Posted on 26/11/

When we started cloning and playing around with VM images something strange started happening. The other day one of our hosting providers did a Bare Metal Recovery of a server, and the same issue appeared. So I guess I'll lessen the agony for others to come.

When you restore an image of a Debian (this applies to other distros too...) on another machine, net interfaces stop working. You will observe that if you had eth0 and eth1, these no longer appear in ifconfig output. Instead you have eth2 and eth3, which are not configured (the ifup command refuses to bring them up), because /etc/network/interfaces only refers to the nonexistent eth0 and eth1.

You can always edit the entries in the interfaces file and you're running again... but clone again... and you have the same problem...

If you run ifconfig -a, you get a surprise. eth0 and eth1 still exist!

This is because of the udev package, which Debian uses, and which doesn't reassign hardware names on every boot (so one device gets one identifier forever). Since we are switching the hardware when we create a new VM, and the ethernet interfaces are identified by their MAC address, we get a new ethX interface on the new machine.

The solution is quite easy:

rm /etc/udev/rules.d/z25_persistent-net.rules
Now you get your eth0 back again


Posted on 26/11/

Back to The blogosphere! (Just as if my last incursion had been all that good). What to expect from me? Just a little ranting about... the things that I do.

First I'll explain. I work in a small consultancy: CAPSIDE, based in Barcelona, Spain. We specialize in internet projects and software development. We have a hosting division (with shared and dedicated hosting), professional services, etc. And I'm in the middle of everything as CTO.

Now that I have decided to restart my blog, I am convinced that to create a certain blogging habit, it has to be very easy for me to write. So I'll be writing about things that are close to me and my job, but, of course, that I can disclose. So the blog will be about:

- Opensource programs that I primarily use or have to use, or have seen at work, etc.
- Bugs. Software has them, and I see a lot of them.
- Perl. My primary programming language. I love it. I know there are other programming languages out there... But it's the one that I have the most contact with, so it's the one I can write about most.

Hope you like my blog