Service Level Agreements

by Bob Walder

Every few years, the computing world witnesses a wonderful new technology that promises to change our lives and solve all the problems inherent in the outdated computing model that preceded it. In the mid-1980s we saw a transition to the client-server model, where distributed computing was supposed to herald the demise of the hugely inefficient and costly mainframe. In reality, of course, the mainframe never died, and client-server technologies, although they can and do exist in their own right, frequently overlay the legacy mainframe systems they were intended to replace.

Similarly, beginning in the early 90s with the introduction of Web-based technology, the industry has witnessed yet another "breakthrough", vaunted by some pundits as the solution to all the problems that client-server created. Realistically, Web-enabled computing can provide substantial value to the market, just as client-server has in many instances, but it too is destined to fall short of completely replacing the mix of legacy and client-server systems that now exist. In fact, one of the great selling points of Web technology is its ability to put a friendly front end on the previously user-hostile applications of yesteryear. The net result is that IT enterprises find themselves with these technologies layered upon one another, each performing the functions aligned with its unique strengths, yet together yielding a dizzying array of critical platforms, networks and devices that greatly increases management complexity.

Capacity Planning And SLAs

One of the areas where the administrator still waxes lyrical about the good old days is capacity planning. One of the benefits of the centralised mainframe architecture was the ability to define Service Level Agreements (SLAs) with the various end-user departments. Capacity was a given, usage was known, and end users could be charged for bandwidth and usage; it was all so easy because everything was centralised. Capacity planning was taken care of by advanced modelling tools which allowed you to create precise software representations of your system and play "what if" with various combinations of hardware and application loads.

This is no longer the case. The widespread adoption of the PC LAN and the concept of distributed computing have led to a complete "mainframe meltdown", as the previously self-contained mainframe environment is spread across numerous servers and applications. The benefits of a distributed architecture are many, but its nature makes the modelling approach haphazard at best and in some cases impossible.

Setting SLAs on today's distributed networks presents a whole new set of challenges, but that is where the current range of device and application monitoring tools, from the likes of Cisco and Network General, can help. Most of the current crop of products compile performance trends based on RMON (remote monitoring) and SNMP statistics in order to report on one or more of three basic service-level metrics: availability, reliability and response time.

Although availability may appear to be the key metric, reliability is preferable when it comes to setting SLAs. For instance, a server may be available for nine hours out of every ten, which gives it an availability of 90 per cent. That doesn't sound too bad, does it? Yet because it goes down so frequently, it is consistently unreliable and will receive a very low reliability rating.
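To make the distinction concrete, here is a minimal sketch (the outage figures and names are invented purely for illustration) showing how the same outage log can yield a respectable availability percentage and a poor reliability figure:

```python
# Illustrative only: the same week of outages scored two different ways.

# One week of operation, with short but frequent outages (hours).
outages = [1.0, 0.5, 1.5, 1.0, 0.5, 1.0, 0.5]     # seven separate failures
period_hours = 7 * 24

downtime = sum(outages)
availability = 1 - downtime / period_hours         # fraction of time the server was up
mtbf = (period_hours - downtime) / len(outages)    # mean time between failures, in hours

print(f"Availability: {availability:.1%}")         # ~96.4% - looks fine on paper
print(f"Failures this week: {len(outages)}")       # but it fell over every single day
print(f"MTBF: {mtbf:.1f} hours")                   # roughly one failure per working day
```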
The third metric is response time. While availability and reliability offer high-level views of performance, response time is the best way to gauge how the end user is affected, and it is the metric of most interest to the user. It is no good, for instance, being able to prove 99 per cent availability and reliability if the response time for every user request is in excess of ten minutes; without reasonable response time the system is basically unusable. Administrators need to consider how, or even whether, these parameters will fit into the SLA terms they set. What they really need to do is tie the technology model to the business model. The whole point of the SLA, as far as end users are concerned, is to determine what kind of application response times they are getting on their business-critical applications. From the IT point of view, the administrator has to be able to show management how, at the infrastructure level, the department is helping to get more product out the door.

Creating An SLA

So what is the best way to integrate business and technical drivers in an enterprise management solution? Start by preparing a table with two columns, the left-hand column labelled "business goals" and the right-hand column "technical design goals". It is imperative to complete the left-hand column first, since this helps to put network management into a business perspective. Say the business goal is to reduce downtime: the corresponding design goals should be to make the network management system proactive and to automate key processes. Another business goal might be to enhance the company's competitive advantage; the corresponding technical objective would be to simplify customer access to the company's extranet. From a network management perspective, the challenge here will be to ensure the extranet is always available. Does the network management system allow for sufficient monitoring of the extranet? Is it capable of detecting conditions that might lead to downtime? Once this association between business and technology is developed, the rest of the implementation exercise deals with developing an architecture, selecting appropriate applications and deploying the network management system.

SLAs are an important step in managing expectations between IT and the business units. Many organisations already use SLAs with their outsourcers and carriers, and since IT fundamentally provides a service to its "customers", the end users, SLAs are increasingly being used within a corporation between IT and the business units. Although it takes effort to implement and perhaps even more effort to live by, an SLA is in the best interest of both IT and business users. By developing a set of mutually agreed service characteristics, users know which services and response times are provided at what baseline costs; IT can show corporate management and department users that it is providing timely services, in language they understand; and an SLA provides a framework for obtaining additional IT resources when adding applications or improving existing services.

IT and business units must develop SLAs in partnership. An SLA should outline what business users can expect in terms of system response, quantities of work processed, system availability and system reliability. It should also spell out the measurement procedures used to collect the service-level data and any limitations to the agreed service provisions. It is critical to describe the services in terms that business users understand. It is no longer acceptable to hide behind technical metrics such as CPU utilisation, packets per second or WAN throughput; instead, we should use business-oriented metrics such as message load handled per hour or call-centre queue time.
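As a purely hypothetical sketch of what that might look like in practice (the service, metrics and targets are all invented), an SLA for an imaginary order-entry system can be expressed in business terms and checked against one period's measurements:

```python
# Hypothetical example only - expressing agreed service levels in the
# business unit's own language rather than raw technical counters.

from dataclasses import dataclass

@dataclass
class ServiceLevel:
    metric: str          # described in terms the business user understands
    target: float
    unit: str

order_entry_sla = [
    ServiceLevel("Orders processed per hour", 500, "orders/hour"),
    ServiceLevel("Average screen response", 2.0, "seconds"),
    ServiceLevel("System availability, 8am-6pm", 99.5, "percent"),
]

def check(measured: dict) -> None:
    """Compare this period's measurements against the agreed targets."""
    for level in order_entry_sla:
        actual = measured[level.metric]
        # Response time must stay below target; the other metrics must stay above it.
        ok = actual <= level.target if level.unit == "seconds" else actual >= level.target
        print(f"{level.metric}: {actual} {level.unit} - {'met' if ok else 'MISSED'}")

check({"Orders processed per hour": 540,
       "Average screen response": 2.4,
       "System availability, 8am-6pm": 99.7})
```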
Desired State Management

Only by accurately tracking the quality of service of an application (or service) across the enterprise can a company fully satisfy its business needs. To achieve this, a management system must provide far more than monitoring and control of individual hardware resources; it must provide the same for the applications which are the lifeblood of the company.

"Back in the 80s, when tools vendors such as ourselves started to fill the space for systems management tools, the solutions were primarily focused on helping operators deal with console-centred operational issues and activities, and were often restricted to a single platform," says Mark Rivington, European Product Marketing Director at Boole & Babbage. "In more recent times these tools have developed towards a multi-platform orientation, combined with more well-rounded functionality in the three operational sub-groups: event management, performance management and automation. This is what has been viewed as state of the art for the last couple of years, and it is an area which Boole & Babbage, for instance, has successfully captured with its COMMAND/Post and MAX/Enterprise management tools."

Boole & Babbage is now looking to push the envelope even further with something it calls Desired State Management (DSM). The concept describes the ability to set policies and manage applications proactively in order to meet the desired service-level objectives. Instead of concentrating on monitoring the status of discrete infrastructure elements, DSM attempts to look at the overall state of business applications and the fulfilment of user-dictated SLAs for those applications. Whereas traditional event monitoring is largely a reactive process (something goes wrong, you get an alert, you try to fix it), DSM takes event management and layers on top of it the ability to define SLAs to the system so that it can make continuous adjustments to keep the application performing as required.

Event monitoring continues to play an important part in this approach, since the detection and handling of error messages is one significant source of information for determining an application's health. The key difference with DSM is the move away from handling isolated messages towards looking at the state of various objects, such as servers or databases. Multiple alerts and error indications are then collated, aggregated and interpreted in an intelligent manner to gain an overall picture of the system state across multiple objects and platforms. Once an "informed" decision has been made about the system state, appropriate corrective action can be taken automatically. Similar functionality is offered by Seagate's NerveCentre product in an NT environment.

The DSM architecture is being extended across all Boole & Babbage product lines (COMMAND/Post, MainView, SpaceView and Command MQ) at client, server and agent level. The client products will share the same object-oriented architecture and have a common Web-enabled user interface, while the new State Server will be based on Windows NT.
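The general idea can be illustrated with a deliberately simplified sketch (this is not Boole & Babbage's actual interface; the objects, events and actions are all invented): individual events are rolled up into an overall state per managed object, which is compared with the desired state, and a corrective action is triggered when the two diverge.

```python
# Simplified illustration of the Desired State Management idea: events are
# aggregated into per-object states, then reconciled against desired states.

desired_state = {"order_db": "running", "mq_broker": "running"}

# Raw events as they might arrive from agents on different platforms.
events = [
    {"object": "order_db",  "severity": "warning",  "msg": "log volume 85% full"},
    {"object": "order_db",  "severity": "critical", "msg": "transaction log full"},
    {"object": "mq_broker", "severity": "info",     "msg": "channel restarted"},
]

def aggregate(events):
    """Collapse a stream of isolated events into one state per managed object."""
    state = {obj: "running" for obj in desired_state}
    for e in events:
        if e["severity"] == "critical":
            state[e["object"]] = "degraded"
    return state

corrective_actions = {"order_db": "extend transaction log and resume workload"}

for obj, actual in aggregate(events).items():
    if actual != desired_state[obj]:
        print(f"{obj}: {actual} - action: {corrective_actions.get(obj, 'raise alert')}")
```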
The State Server provides a common object repository and a service-level policy definition interface, and will "snap connect" to existing Unix servers in order to improve on their exception-based reporting capabilities. Agent extensions called "Power Modules" are intended to provide out-of-the-box functionality for all leading hardware platforms, databases, middleware and distributed applications. This may seem like a tall order, and Mark Rivington is the first to admit that it is, but it is the direction in which the network management market needs to move in order to provide the quality of service required by today's business systems.

Capacity Planning In A Distributed Environment

Another company with an eye on the role of proactive management in today's mission-critical networks is California-based Bluecurve. As already mentioned, capacity planning in the NT world has become something of a black art, if you can call guesswork and intuition an art. Yet the IT manager in a distributed environment based on Microsoft's BackOffice product family, for instance, still needs to answer a number of critical questions on a day-to-day basis. How many users can a server support? Can SQL Server and Exchange Server run on the same machine? How many mail users can a single Exchange Server support simultaneously? Determining if and when hardware needs upgrading is often a matter of pure guesswork, and it is not always possible to be sure exactly which components require upgrading: perhaps it is more memory you need, or a faster disk, or a more powerful processor, or perhaps you simply need a whole new server.

Bluecurve's Dynameasure product provides the means to apply a controlled and repeatable stress to a network infrastructure in order to determine its overall capacity. More than a mere benchmarking tool, Dynameasure stresses the entire infrastructure (clients, servers, network and applications) using real clients performing real transactions against real applications. Traffic for multiple clients can be generated from a single PC, allowing organisations with limited resources to run tests for hundreds of clients. All the tests are created and managed from a single console, and can be run in a fully automated and unattended manner. The results are collated and presented at the same console, providing both graphical and text reports of three key metrics: data throughput (speed), average response time (the "user experience"), and disk, CPU and network utilisation (at both client and server).

Although a huge number of standard tests are included as part of the package, one of the key features is the ability to define your own schemas, data sets and transactions. This is particularly useful when you need to do some capacity planning for an existing system: how many users can you add to your sales order processing system before it falls over, for instance? Using tools like Dynameasure is the only way to get a reasonably accurate indication of the capacity of a distributed system.

In addition to pure capacity planning tasks, it can also be used to perform regular system "health checks". For instance, you could stress your existing system in a "normal" state to provide a baseline, and then test again on a regular basis to provide comparisons which indicate whether everything is fine or whether the situation is gradually deteriorating. Should a user complain of poor performance (often a subjective opinion), you can then run off a new set of results as proof that system performance is as expected, or not, as the case may be, thus providing validation for your SLA.
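As a rough sketch of that baseline idea (nothing to do with Dynameasure's own interface; the figures, names and tolerance are invented), each controlled test run can simply be compared against the stored baseline and any significant drift flagged:

```python
# Illustrative "health check": compare this run's results against a baseline
# captured when the system was known to be performing acceptably.

baseline = {"throughput_tps": 120.0, "avg_response_s": 1.8}   # measured when all was well

def health_check(current: dict, tolerance: float = 0.15) -> list:
    """Flag metrics that have drifted more than `tolerance` from the baseline."""
    problems = []
    if current["throughput_tps"] < baseline["throughput_tps"] * (1 - tolerance):
        problems.append("throughput has fallen below the agreed baseline")
    if current["avg_response_s"] > baseline["avg_response_s"] * (1 + tolerance):
        problems.append("average response time has risen above the agreed baseline")
    return problems

# Example: this month's controlled test run.
print(health_check({"throughput_tps": 98.0, "avg_response_s": 2.3}) or "within baseline")
```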
SLAs On The WAN

In addition to drawing up SLAs to cover the internal LAN and its applications, the network manager needs to decide whether or not to use them to keep tabs on service providers as well. This can be a tricky area, since many carriers are unwilling to enter into agreements bound by SLAs. Political issues aside, however, there are a number of tools available which provide the means to monitor WAN link activity. The resulting data can be used both to verify contracts between a corporation and its carrier, be it frame relay, leased line or ISP (Internet Service Provider), and to validate internal SLAs that are based in part on WAN performance.

It is important to determine what sort of delays a network application can live with before specifying a WAN SLA. Protocols such as SNA, for example, cannot tolerate delays of more than 150 milliseconds before they lose a session over frame relay. It should also be recognised that some problems fall outside the carrier's control: accessing an ISP's network, for example, involves traversing the local loop too, and problems there can hardly be laid at the door of the ISP.
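A trivial illustration of the point (the sample figures are invented) is to check sampled round-trip delays on a WAN link against that 150-millisecond budget:

```python
# Check sampled round-trip delays against the delay budget that a
# delay-sensitive protocol such as SNA over frame relay can tolerate.

SNA_DELAY_BUDGET_MS = 150

samples_ms = [92, 110, 134, 162, 98, 171, 105]   # round-trip times over one interval

violations = [s for s in samples_ms if s > SNA_DELAY_BUDGET_MS]
print(f"Worst delay: {max(samples_ms)} ms")
print(f"Samples over budget: {len(violations)} of {len(samples_ms)}")
if violations:
    print("Delay budget breached - sessions may be dropped; raise it with the carrier")
```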
Summary

Despite the increasing availability of tools designed to help in the creation of an SLA, companies still have to set their own business goals and pick relevant performance statistics against which to measure their service levels. This often requires a level of customisation beyond the capabilities of in-house staff, making the option of farming out the service-level monitoring chores quite attractive. While ongoing fees can make these services pricier than off-the-shelf packages, the money saved by not having to staff up can often offset the cost.

The network administrator probably spends a great deal of his or her day managing user expectations. The business manager complaining of poor response time does not care about server utilisation; he only cares about being able to use his application unhindered in order to do his job. Maybe you cannot make him any happier when the application is slow, but at least with an SLA in place you have limited his expectations of the quality of service that can be provided to a reasonable level.