The Responsibility for Increasing Uptime - Where Does the Buck Stop in IT Operations?
Posted July 30th, 2008 by Jonah Paransky
Unplanned downtime has been, and continues to be, at the top of the list of problems facing IT operations organizations. Considering the focus and importance placed on the cost of downtime, one natural question is who the internal champion for increased uptime is inside the IT organization.
To date, we have found that the job of increasing uptime is often highly distributed within the IT organization. For your organization, is the job of increasing uptime owned by:
- Architects (including Operations Architects or Infrastructure and Systems Management Architects) – who are typically concerned about designing an operational infrastructure that is scalable and robust?
- Engineering and Support Groups in IT Operations – who are often responsible for the day-to-day management of the IT operations infrastructure?
- Cross Functional Process Owners (such as Change Managers or Release Managers) – who manage changes, the process most often associated with the cause of unplanned downtime?
- Problem Managers – who are responsible for root cause analysis of repetitive failures?
- Disaster Recovery groups – who are often focused on planning for recovery after a failure?
- Availability Managers – who, in organizations where they exist, often own defining uptime requirements and putting measurement programs in place?
- Infrastructure Outsourcers – who often deliver critical infrastructure components that become key parts of the software infrastructure stack?
- Networking groups – who own a critical part of the overall infrastructure?
- IT Operations Management – does the buck stop with the VP of Operations?
As in other business areas, spreading responsibility across many groups is a recipe for a problem that persists with no end in sight. Improving uptime through a continuous improvement methodology can help, but a clear lead with authority and budget can go a long way toward bringing focus and discipline to this critical problem area.
So who has responsibility in your organization for increasing uptime? Are you measuring end-to-end availability of your IT services and applications, or does the question only come up when a big incident happens? Where does the buck stop for downtime in your IT Operations group? And lastly, who are you holding accountable to improve the situation?
Filed Under: Downtime, IT Operations

August 1st, 2008 at 1:18 am
In the context of infrastructure, the responsible person in my opinion is the infrastructure manager, but as Steve Ballmer said, it’s about developers, developers, developers!
A system should be designed to cater for errors and outages without any resultant downtime to customers. As an example, a bank branch should have a branch server that can process transactions offline for a limited time period should there be a complete outage on the connection to the head office.
In this case the downtime would only occur when the offline window is exceeded.
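The offline window described above might be sketched roughly as follows. This is only an illustration under stated assumptions: the `BranchServer` class, its method names, and the one-hour window are hypothetical and not taken from any real banking system. While the link to head office is down, transactions are queued locally, and they are only rejected (that is, customers only see downtime) once the outage has lasted longer than the window.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BranchServer:
    """Hypothetical branch server that queues transactions during a link outage."""
    offline_window_s: float                    # assumed limit on offline operation
    outage_started_at: Optional[float] = None  # None while the link is up
    pending: List[str] = field(default_factory=list)

    def link_down(self, now: float) -> None:
        # Keep the timestamp of the first failure; repeated calls do not reset it.
        if self.outage_started_at is None:
            self.outage_started_at = now

    def link_up(self) -> List[str]:
        # Return queued transactions for replay to head office and resume normal mode.
        replay, self.pending = self.pending, []
        self.outage_started_at = None
        return replay

    def process(self, txn: str, now: float) -> str:
        if self.outage_started_at is None:
            return "forwarded"                 # link is up: send straight to head office
        if now - self.outage_started_at <= self.offline_window_s:
            self.pending.append(txn)           # inside the window: accept and queue
            return "queued"
        return "rejected"                      # window exceeded: downtime is now visible


server = BranchServer(offline_window_s=3600)
print(server.process("t1", now=0))       # forwarded
server.link_down(now=10)
print(server.process("t2", now=20))      # queued
print(server.process("t3", now=5000))    # rejected
print(server.link_up())                  # ['t2']
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the sketch easy to test. A real design would also need to decide how queued transactions are reconciled if head office rejects them on replay.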
August 1st, 2008 at 4:11 am