Getting ahead of the ball with service monitoring

In the earliest phases of development of our HERON clinical research repository, the only users were us developers and a handful of friendly alpha testers, so it was fine to discover problems as we used the system.

But one of the features included in milestone:EpicBetai2b2 is more proactive monitoring (#150), using the popular open-source Opsview toolset, which is built on Nagios.

Once it was in place for HERON, I showed it to a colleague who supports CRIS, and he figured out how to get Nagios working on Windows servers, among other things.

CRIS is a long-standing production service. Its user community consists of clinical researchers. When CRIS acts up, they don't see it as an interesting technical puzzle to solve; they just see it as the darned computer getting between them and their research goals again.

The database under the CRIS service acted up over the weekend, but this time, instead of a call from a frustrated researcher on Monday morning, the CRIS support team got an automated notice right when the problem started and had it cleaned up before any users noticed.
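For readers who haven't seen what sits behind a notice like that, here is a minimal sketch of a Nagios-style check, not the actual CRIS or HERON check. It assumes only the standard Nagios plugin convention: a check reports a status code of 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN) along with one line of status text. The function names, the "DATABASE" label, and the thresholds are hypothetical, chosen for illustration.

```python
from enum import IntEnum


class Status(IntEnum):
    """Status codes from the Nagios plugin convention."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def classify(response_seconds, warn=2.0, crit=5.0):
    """Map a measured response time to a status.

    response_seconds is None when the probe got no answer at all.
    The warn/crit thresholds are illustrative assumptions.
    """
    if response_seconds is None:
        return Status.CRITICAL
    if response_seconds >= crit:
        return Status.CRITICAL
    if response_seconds >= warn:
        return Status.WARNING
    return Status.OK


def report(service, response_seconds, warn=2.0, crit=5.0):
    """Return the status and the one-line text a check would print."""
    status = classify(response_seconds, warn, crit)
    if response_seconds is None:
        detail = "no response"
    else:
        detail = f"responded in {response_seconds:.2f}s"
    return status, f"{service} {status.name} - {detail}"
```

A real plugin would time an actual query against the database, print the text line, and exit with `int(status)`; the monitoring server decides whom to notify based on that exit code.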

It's so much nicer to be ahead of the ball in the customer support game.