
Over the past decade, steady progress has been made on bottom-up, low-level management instrumentation of network and computing environments. More recently, new service management systems have come on the market that work from the top down to develop an objective fact base on the service levels that users experience. What’s been missing is the middle ground: systems-level management, which encompasses fault and performance management.
Device-level management is still moving ahead, and the latest improvements involve embedding Web interfaces directly in devices. These interfaces allow secure browser access using HTTP to the functionality of device-level management agents. This certainly represents an improvement over the Telnet-based command line access to these management agents that used to be the norm. Over time, SNMP-based monitoring and measurement collection will be replaced by data exchanges between the element management agent and higher-level management systems, based on the WBEM CIM interface standard (Web-Based Enterprise Management Common Information Model – see this column in BCR, October 1998, pp. 30-32).
New directions for monitoring, measurement and trending also have emerged. Top-down, service level-oriented approaches use infrequent, application-level test transactions to determine the actual service levels being delivered to end users (see BCR, February 1999, pp. 20-21). These systems usually combine such “synthetic” transactions with direct analysis of real end-user transactions. They compute availability, response time and throughput measures for network-based services, and create reports that compare real versus target service levels. The beauty of these systems is that they probe the network using the same transactions that users (or programs) generate; no special protocols need to be developed. By setting up agents at strategic points around the network that test critical services and applications, it is possible to get an accurate picture of delivered service levels and an early warning when they degrade.
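As a rough illustration of the reporting step described above, the following Python sketch computes availability and average response time from a batch of test-transaction results and compares them with target service levels. It is a minimal sketch, not a description of any particular product: the names `ProbeResult` and `summarize`, the result fields and the target values are all hypothetical, chosen only for this example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ProbeResult:
    """Outcome of one synthetic test transaction (hypothetical record layout)."""
    succeeded: bool
    response_seconds: float  # elapsed time of the transaction


def summarize(results, target_availability, target_response_seconds):
    """Compare measured service levels against targets.

    Availability is the fraction of test transactions that succeeded;
    average response time is taken over successful transactions only.
    """
    if not results:
        raise ValueError("no probe results to summarize")

    successful = [r for r in results if r.succeeded]
    availability = len(successful) / len(results)
    avg_response = (
        mean(r.response_seconds for r in successful) if successful else None
    )

    return {
        "availability": availability,
        "avg_response_seconds": avg_response,
        "meets_availability": availability >= target_availability,
        "meets_response": (
            avg_response is not None
            and avg_response <= target_response_seconds
        ),
    }


# Example: four test transactions, one of which failed (illustrative data only).
sample = [
    ProbeResult(True, 0.5),
    ProbeResult(True, 1.5),
    ProbeResult(False, 10.0),
    ProbeResult(True, 1.0),
]
report = summarize(sample, target_availability=0.99, target_response_seconds=2.0)
```

In this made-up sample, three of the four transactions succeed, so the measured availability falls short of the assumed 99 percent target even though the average response time is within the assumed two-second target. A real system would also track throughput and would aggregate results per service and per agent location.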
Bottom-up and top-down meet in systems-level management, where the difficult problems of fault and performance management require the ability to integrate information from many sources and to understand relationships between devices, traffic flows, protocol layers, applications and services. Users consistently complain that they are not getting what they want from management software providers and systems vendors, and their complaints are understandable.
We’re pretty good at finding and fixing a failure of a single network element. But it’s much more difficult to identify, isolate and investigate problems that are the result of interactions among multiple elements or less-than-catastrophic deviations from normal operation. Troubleshooting these kinds of problems often resembles untangling a ball of string or peeling an onion.
The reality of multivendor environments only exacerbates the problem. It’s not easy to separate elements that are working correctly from those that aren’t; we can’t easily partition our networks to test whether different parts work correctly on an independent basis. Lack of fault isolation test points makes it difficult to isolate which protocol and service layer is having problems. Good systems-level troubleshooting requires knowing how different elements function and the usual kinds of degradation and failure modes. It is much more difficult for a vendor to build that level of expertise into a management product than it is to develop good device-level measurement or configuration support. But that is exactly what users want. They need to substitute expert tools for experts, at least in some cases. …

