CIT services are monitored 24x7x365

(Thank goodness for robots!)

Many of us depend on an ever-growing collection of computer-based services. A person can reasonably expect that their three email services, online class and chat, real-time streaming news, music, and market statistics will be working all the time.

CIT provides a large number of computer-based services. We typically deploy these services with many "fault tolerant" features to avoid service interruptions, but things do go wrong. When a service suffers a disruption, we need to react quickly to bring it back online and minimize the impact on clients. To react to service interruptions, we need to know about them, ideally before our clients do. To that end, we run a whole array of programs that monitor services and, in many cases, automatically recover and restart them.

For most services, we use an open source project called Big Brother (www.bb4.org). It is very extensible and lets us easily monitor our systems. If there's a problem, Big Brother pages us, sending detailed messages to let us know which service has been affected. Big Brother presents a web page that gives an immediate overview of monitored systems. To view UVM's Big Brother site, go to https://www.uvm.edu/cit and click on Current Server Status. Big Brother also maintains a historical view of events, so we can calculate system uptime measurements and have a record of exactly when system events occurred.
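As a rough illustration of how an uptime figure could be derived from such an event history, here is a minimal Python sketch. The function name `uptime_percent` and the representation of outages as (start, end) pairs are assumptions made for this example, not Big Brother's own data format, and the sample dates are made up.

```python
from datetime import datetime, timedelta


def uptime_percent(window_start, window_end, outages):
    """Return the percentage of the window during which the service was up.

    `outages` is a list of (start, end) datetime pairs. This hypothetical
    format assumes the outages do not overlap one another.
    """
    total = (window_end - window_start).total_seconds()
    if total <= 0:
        raise ValueError("window_end must be later than window_start")
    down = 0.0
    for start, end in outages:
        # Count only the part of each outage that falls inside the window.
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()
    return 100.0 * (total - down) / total


if __name__ == "__main__":
    # Made-up example: one outage of 2.4 hours in a 10-day window.
    start = datetime(2024, 1, 1)
    end = start + timedelta(days=10)
    outages = [(start + timedelta(days=3), start + timedelta(days=3, hours=2.4))]
    print(f"{uptime_percent(start, end, outages):.2f}% uptime")
```

Clipping each outage to the reporting window keeps an event that started before the window, or ended after it, from being counted in full.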

Occasionally a service failure goes undetected. There seems to be an infinite number of pieces that can fail, and they can fail in an infinite number of new and exciting combinations. Undetected failures point out deficiencies in our monitoring regimen. When this occurs, we refine our monitors, adding new pieces. We don't want to miss the same type of failure more than once. In general, we know about problems before end users do, and very often correct them before anyone is affected by the failure. A recent network-related problem affected some off-campus users who were using certain UVM-based services. Our existing monitors failed to detect this event. We will be adding new monitoring features to make sure a similar failure in the future does not go undetected.

UVM's Big Brother environment is currently monitoring about 120 servers for general connectivity to the network, and about 150 specific services. We monitor, for example, whether the server www.uvm.edu is actually serving web pages, and whether the IMAP servers are responding. We also have programs that test certain critical services more extensively. For example, we have monitors that actually log in to an IMAP service and retrieve a test email message.
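The simplest kind of service check described above can be sketched in a few lines of Python. This is only an illustration, not the Big Brother implementation: the function names, the `SERVICES` table, and the IMAP host name are assumptions for the example. It tests only whether a TCP port accepts connections, which is a weaker test than fetching a web page or logging in to retrieve a test message.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_services(services, timeout=5.0):
    """Map each service label to "ok" or "down".

    `services` is a dict of label -> (host, port).
    """
    return {
        label: "ok" if port_is_open(host, port, timeout) else "down"
        for label, (host, port) in services.items()
    }


# Ports 80 (HTTP) and 143 (IMAP) are the standard ones. The IMAP host name
# is a placeholder, not a real UVM server name.
SERVICES = {
    "web": ("www.uvm.edu", 80),
    "imap": ("imap.example.org", 143),
}
```

Calling `check_services(SERVICES)` returns a dictionary with one status per label, which a scheduler could run every few minutes and compare against the previous result to decide whether to send a page.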

Some of our monitoring uncovers problems before they affect services. For instance, if the current usage of a server filesystem exceeds a threshold, a page is generated, so we can move files before the filesystem actually becomes full.
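A threshold check of this kind might look like the following Python sketch. The 90 percent default and the function names are assumptions chosen for illustration; the actual thresholds and paging mechanism are not described here.

```python
import shutil


def usage_percent(path):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def needs_page(percent_used, threshold=90.0):
    """Return True when usage exceeds the threshold and a page should be sent."""
    return percent_used > threshold


if __name__ == "__main__":
    percent = usage_percent("/")
    if needs_page(percent):
        print(f"WARNING: filesystem is {percent:.1f}% full")
    else:
        print(f"OK: filesystem is {percent:.1f}% full")
```

Setting the threshold well below 100 percent is what leaves time to move files before the filesystem actually fills.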

On average, we generate about 200 pages per month. Most of those are informational and non-critical. About 20 of the pages are critical and require immediate action. The pages that we receive contain enough information that we can immediately assess the impact of the problem and respond appropriately. Fortunately, most problems can be resolved from remote locations.