CIT services are monitored 24x7x365
(Thank goodness for robots!)
Many of us depend on an ever-growing collection of computer-based services. A
person can reasonably expect that their three email services, online class and
chat, real-time streaming news, music, and market statistics will be working
all the time.
CIT provides a large number of computer-based services. We typically deploy
these services with many "fault tolerant" features to avoid service
interruptions, but things do go wrong. When a service is disrupted, we need to
react quickly to bring it back online and minimize the impact on clients. To
react to service interruptions, we need to know about them, ideally before our
clients do. To that end, we run a whole array of programs that monitor
services and, in many cases, automatically recover and restart them.
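The check-then-restart idea can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not the code CIT runs: the
function name check_and_recover and its parameters are hypothetical, and both
check and restart are callables supplied by whoever writes the monitor.

```python
import time


def check_and_recover(check, restart, retries=1, settle_seconds=5.0):
    """Run check(); if it fails, call restart() and check again.

    check   -- hypothetical callable returning True when the service is healthy
    restart -- hypothetical callable that attempts to restart the service
    Returns "ok", "recovered", or "failed".
    """
    if check():
        return "ok"
    for _ in range(retries):
        restart()
        # Give the service a moment to come back before re-testing it.
        time.sleep(settle_seconds)
        if check():
            return "recovered"
    # Still down after every restart attempt: a person needs to be paged.
    return "failed"
```

A result of "recovered" could be logged as informational, while "failed" is
the case that would warrant paging someone.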
For most services, we use an open source project called Big Brother
(www.bb4.org). It is very extensible and lets us easily monitor our systems.
If there's a problem, Big Brother pages us, sending detailed messages that
tell us which service has been affected. Big Brother also presents a web page
that gives an immediate overview of monitored systems. To view UVM's BB site,
go to https://www.uvm.edu/cit and click on Current Server Status. Big Brother
maintains a historical view of events as well, so we can calculate system
uptime measurements and have a record of exactly when system events occurred.
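To show how an uptime figure can be derived from a history of events, here is
a minimal Python sketch. It does not use Big Brother's own history format; the
function name uptime_percent and the list of (down, up) time pairs are
assumptions made for the example.

```python
from datetime import datetime, timedelta


def uptime_percent(outages, window_start, window_end):
    """Return the percentage of the window during which the service was up.

    outages -- list of (down_at, up_at) datetime pairs, assumed not to overlap
    """
    window = (window_end - window_start).total_seconds()
    if window <= 0:
        raise ValueError("window_end must be after window_start")
    down = 0.0
    for down_at, up_at in outages:
        # Count only the part of each outage that falls inside the window.
        start = max(down_at, window_start)
        end = min(up_at, window_end)
        if end > start:
            down += (end - start).total_seconds()
    return 100.0 * (window - down) / window
```

For instance, a single three-hour outage in a 30-day window leaves 717 of 720
hours up, which works out to roughly 99.58 percent.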
Occasionally a service failure goes undetected. There seems to be an infinite
number of pieces that can fail, and they can fail in an infinite number of new
and exciting combinations. Undetected failures point out deficiencies in our
monitoring regimen. When this occurs, we refine our monitors and add new
pieces. We don't want to miss the same type of failure more than once. In
general, we know about problems before end users do, and very often we correct
them before anyone is affected. A recent network-related problem affected some
off-campus users of certain UVM-based services. Our existing monitors failed
to detect this event. We will be adding new monitoring features to make sure a
similar failure in the future does not go undetected.
UVM's Big Brother environment currently monitors about 120 servers for
general connectivity to the network, and about 150 specific services. For
example, we monitor whether the server www.uvm.edu is actually serving web
pages and whether the IMAP servers are responding.
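The two kinds of checks described above can be approximated with Python's
standard library. This is a sketch of the general technique rather than Big
Brother's implementation; the function names port_is_open and
serves_web_pages, and the five-second timeout, are assumptions for the
example.

```python
import socket
import urllib.error
import urllib.request


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False


def serves_web_pages(url, timeout=5.0):
    """Return True if an HTTP GET of url comes back with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP errors, network errors, and malformed URLs all count as down.
        return False
```

A monitor might call serves_web_pages("https://www.uvm.edu/") for the web
server and port_is_open(host, 143) for an IMAP server, where host is whatever
name the IMAP service answers on.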
We also have programs that more extensively test certain critical services.
For example, we have monitors that actually log in to an IMAP service and
retrieve a test email message.
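A deeper test of this kind might look like the following Python sketch, which
uses the standard imaplib module. It is an illustration only: the function
name imap_roundtrip_ok, the "monitor-test" subject line, and the dedicated
test account it implies are all assumptions, not details of CIT's monitors.

```python
import imaplib


def imap_roundtrip_ok(host, user, password, subject="monitor-test",
                      connect=imaplib.IMAP4_SSL):
    """Log in, find a known test message by subject, and fetch it.

    Returns True only if every step succeeds. The connect argument is a
    hypothetical hook so the check can be run against a stub.
    """
    try:
        conn = connect(host)
    except (OSError, imaplib.IMAP4.error):
        return False
    try:
        conn.login(user, password)
        status, _ = conn.select("INBOX", readonly=True)
        if status != "OK":
            return False
        status, data = conn.search(None, "SUBJECT", f'"{subject}"')
        if status != "OK" or not data or not data[0].split():
            return False  # the test message is missing
        first_id = data[0].split()[0]
        status, msg = conn.fetch(first_id, "(RFC822)")
        return status == "OK" and bool(msg) and msg[0] is not None
    except (OSError, imaplib.IMAP4.error):
        return False
    finally:
        try:
            conn.logout()
        except (OSError, imaplib.IMAP4.error):
            pass
```

Opening the mailbox read-only keeps the test message unmodified, so the same
message can be fetched on every run.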
Some of our monitoring uncovers problems before they affect services. For
instance, if the current usage of a server filesystem exceeds a threshold, a
page is generated, so we can move files before the filesystem actually becomes
full.
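A threshold check like this is simple to express. The sketch below uses
Python's shutil.disk_usage; the function name filesystem_over_threshold and
the 90 percent default are assumptions chosen for the example, not our actual
settings.

```python
import shutil


def filesystem_over_threshold(path, threshold_percent=90.0):
    """Return (over, used_percent) for the filesystem that contains path.

    over is True when used space is at or above threshold_percent.
    """
    usage = shutil.disk_usage(path)
    # Guard against a zero-sized filesystem to avoid dividing by zero.
    used_percent = 100.0 * usage.used / usage.total if usage.total else 0.0
    return used_percent >= threshold_percent, used_percent
```

The percentage computed here can differ slightly from what df reports on some
systems, because df may account for space reserved for the superuser.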
On average, we generate about 200 pages per month. Most of those are
informational and non-critical. About 20 of the pages are critical and
require immediate action. The pages that we receive contain enough information
that we can immediately assess the problem impact and respond appropriately.
Fortunately, most problems can be resolved from remote locations.