Posted August 5th, 2008 by Dennis Powell
Luke Franci and Matt Heuser think that TESTING IS OVERRATED (gasp!), at least according to recent blog posts from both under that title. Matt’s August 1st post at Creative Chaos provided an intro to Luke’s July 11 post at Rail Spikes.
Upon further review it’s clear that developer/QA code testing is their focus, and the “overrated” assessment points to overconfidence among developers who test their own code during development (TDD) at the expense of separate QA efforts. Although both posts focus on application software testing and the dev-to-QA cycle, they also provide a segue to the broader IT challenge of establishing the readiness of the production environment to absorb change.
Matt notes that “testing” means “checking the software to find out if it works or not, right?” Yep, particularly if you’re the developer or QA professional at whom Matt’s blog is aimed. But what about IT operators and administrators, who have to ensure that all of this “tested” software works with the reality that is the data center today, meaning the vendor’s “tested” software, the partner’s “tested” software, the “tested” legacy software, the “tested” database software, the operating system configurations, the hardware configurations, the network permutations…not to mention the growth in hybrid physical and virtual environments? Oh by the way, not only does all this “tested” stuff need to work, IT needs to make sure that it runs 24×7, is always at peak, and meets all SLAs.
When there is a “we don’t need no stinkin’ QA people” mindset at the software development and unit test level, what happens to availability testing, capacity testing, performance testing, security testing, system testing, UAT…? What happens when this code makes its way “over the wall” to IT to be released into a production environment that’s about much more than just the code? This is why 25% of all changes to production cause problems, and why 10% of changes must be rolled back because they can’t be fixed.
Developers who “don’t need QA” might not test everything or test as deeply. QA groups that aren’t prepared to validate software readiness to run in the organization’s infrastructure in turn won’t be able to test permutations. Most importantly, pre-production IT personnel who rely on “component testing” or “patch-and-pray” tend to introduce unnecessary risk and potential impact to production.
This is why, regardless of how briefly or thoroughly an application is tested, regardless of how broadly an application environment is QA’d, there needs to be that final line of pre-production defense based on sound change and release management principles. Build and maintain the representative staging environment, test every new component and every change (software, configuration, patch, upgrade, etc), test all changes end-to-end, schedule change testing and deployment, and follow change and release best practice guidelines.
Even the most effectively-tested software will invariably cause some type of problem when it finally gets deployed into the production infrastructure. Although it requires more time, resources and budget to make the proper service transition from code to production, treat testing at every level as an underrated service.
Filed Under: Change Impact Analysis, Change Management, Downtime, IT Operations, Testing
Posted August 4th, 2008 by Joe Pendry
Founder of Pingdom, Sam Nurmi, shares his reasons for starting the uptime monitoring service and his views on downtime. Prior to Pingdom, Sam was the CEO of Sweden’s biggest web hosting company, Loopia, which he sold in 2005.
Pingdom oversees uptime monitoring needs for 90% of the companies in the world, promising to maintain the best uptime monitoring service available. The technology behind Pingdom is developed in house, which gives them an unparalleled ability to satisfy both the current and future needs of the market. The company blog, Royal Pingdom, provides results from their research on downtime/uptime.
StackSafe: What made you want to get into uptime monitoring?
Sam Nurmi: In 1999 I founded the Swedish web hosting company Loopia, which within a few years was the largest web host in Sweden. Running Loopia gave me a lot of insight into the entire hosting industry and a lot of connections within the industry.
As a hosting industry insider I quickly learned how common outages actually were for various hosting companies, ISPs and for other services running on the Internet. The thing that struck me was that most companies depending on this infrastructure simply didn’t have a clue about the extent of the problems. And after all, who visits their own homepage (or other online services) every minute 24/7? People just assume that things work. The only way to know was to set up some type of automatic monitoring.
When I initially looked into the uptime monitoring industry a few years ago I realized that the existing services on the market were really not built for the mass market. They were often difficult to use, expensive, and didn’t appear to be very scalable. Our basic business idea here at Pingdom is therefore to deliver an easy-to-use, reliable monitoring service at a reasonable price. Since we are aiming at the mass market (relatively speaking, we’re not Facebook), high volumes make it possible to offer low prices, and our system is designed to be able to handle the volume. Thanks to this approach, Pingdom is one of the fastest-growing uptime monitoring companies in the world today.
Another major point is that the Internet is still very young, and the Web is basically still in its bud. Even though the Internet is so young, a lot of the world’s economy already depends directly or indirectly on this global network and the services running on it. This means that the demands on its reliability will become even more important over time. Effective monitoring of this infrastructure is a huge, growing market.
StackSafe: Based on the case studies/examples you have seen, what seems to be the most significant source of downtime?
Sam Nurmi: The most common reason for downtime is really hard to pin down since there are so many factors involved. Application updates or changes can cause problems, network malfunctions are relatively common, and so are issues related to scaling, for example a traffic spike overloading a web server (the Slashdot/Digg effect for a blog, to name one example).
Shared hosting environments come with their own set of problems, where the “neighbors” can affect the performance or uptime of the whole hosting environment either through heavy usage or misconfigured applications. These problems can be made worse by so-called overselling.
Then there should also be a distinction between short outages and outages that last for a long period of time. A common reason for longer outages is when people managing a service, for example a website, are not aware that there is a problem. Of course they can’t fix it if they don’t know about it. They will find out somehow after some time has passed, often through their own users, but that’s not a good solution, obviously.
StackSafe: What kinds of organizations see the most downtime? Ecommerce, media/news, applications, etc.?
Sam Nurmi: There is no clear pattern that any one industry or market suffers from more downtime than others. There are big individual differences in service quality within companies in the same sector, something we can clearly see among our customers and surveys we have done in the past. However, social media services tend to have more visible outages for the simple reason that they have so many users, and those users visit the site frequently.
An interesting thing you see quite often is that a service can run perfectly during a long period of time, and then all of a sudden have a significant performance degradation. It could be a response time increase or more downtime, or both. Common reasons for this are surges in user numbers, software updates or other modifications to the service that have unexpected side effects. Another reason can be that the hosting provider or ISP has started serving beyond its capacity, for example too many customers on the same server (or cluster) or having too many customers sharing the same Internet connection.
StackSafe: What are some measures companies can take to prevent downtime? Is it all about monitoring, or about testing changes?
Sam Nurmi: Well, it’s not ALL about monitoring, but it’s a very important part. The good thing about monitoring is that not only will you always know that things are working, you will often find problems you never knew were there, or new problems you introduced by accident while modifying the service.
Set up your own internal monitoring as well as external monitoring. Both provide their own perspectives and will help you get the full picture when you want to track down performance bottlenecks or reasons for outages, etc.
Some other things to keep in mind to minimize downtime:
- Never sign long-term deals with hosting providers, ISPs or other infrastructure providers. My philosophy, one that we apply to all the services that we buy for Pingdom, is that contracts should be short, and that we should be able to cancel a contract almost immediately if we wish to do so. Service providers keep our business by delivering a good service, not by boxing us in with long-term contracts.
- Make your service and infrastructure as portable as possible. By that I mean that you should be able to move it somewhere else (switch providers, ISPs, data centers, etc) if you need to. This makes sure you cannot be put in a position where you would rather stay with a poor provider than suffer the overhead of moving.
- Try to always keep high availability in mind when building your applications, so you can upgrade or switch software and hardware without causing downtime. You don’t need to build the HA solutions yourself. There are ready-made solutions for a majority of the things that you might need, so use those if possible and let your dev team focus on building and maintaining your main applications.
- When you do make changes, test as many aspects as you can of that change before it goes live. Even then, unexpected things can happen, so be vigilant.
- It may not be for everyone, but virtualization is something that you at least should look into. Used right, it can be very helpful, cost efficient and improve your uptime.
- And this may seem obvious, but always have someone on call. If not on site, then at least available to log in and fix things remotely as soon as there is a problem. These days it’s a simple matter for someone to carry a small laptop with a 3G modem with them, and they will have an excellent platform to use even if they happen to be in a park or at the beach when something happens.
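To make the monitoring advice above a little more concrete, here is a minimal sketch of an automated check in Python. It is not how Pingdom works; the function names, the 10-second timeout and the "two failures in a row" rule are all assumptions chosen for illustration.

```python
import urllib.request
from typing import Sequence


def http_check(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError (4xx/5xx) and socket timeouts;
        # ValueError covers malformed URLs.
        return False


def should_alert(results: Sequence[bool], threshold: int = 2) -> bool:
    """Alert only when the most recent `threshold` checks all failed.

    Requiring consecutive failures is an assumed rule that avoids paging
    someone for a single dropped request.
    """
    if len(results) < threshold:
        return False
    return not any(results[-threshold:])
```

Run on a schedule (every minute, say) from both inside and outside your own network, a check like this gives the two perspectives Sam mentions; the history of results is what feeds `should_alert`.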
StackSafe: How do companies measure the impact of downtime incidents?
Sam Nurmi: That depends entirely on the nature of the service they are providing.
For a commercial blog, downtime can mean lost subscribers, lost ad exposure and so on. Any service that has an income from advertising will suffer the same consequences.
An e-store will obviously have closed the shop while it’s down, losing sales and potential future customers as well. They don’t just lose the sales that would have taken place, but the potential of every visitor they turn away.
Quantifying these values can be done by estimating the money the service or website pulls in per hour, and you can have fairly complex models for this. I don’t know how common it is for companies to actually do this, though.
Then there are the more intangible results, such as how it affects your image and user trust. Again, it really depends on what kind of service that is provided. If a bank website has problems, it will have a more negative effect on their image than it would for a fun news-aggregator service somewhere.
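The per-hour estimate Sam describes can be sketched in a few lines of Python. This is the simplest possible model, and the revenue figure in the usage note is hypothetical; it ignores the intangible effects on image and trust, and any sales that are merely delayed rather than lost.

```python
def downtime_cost(revenue_per_hour: float, outage_minutes: float) -> float:
    """Estimate direct revenue lost during an outage.

    Assumes revenue arrives evenly across the hour, which is a
    simplification: real traffic has peaks and quiet periods.
    """
    return revenue_per_hour * outage_minutes / 60.0
```

For example, a site that is assumed to bring in $1,200 per hour would lose about `downtime_cost(1200.0, 90)`, or $1,800, in a 90-minute outage. A more complex model might weight the estimate by time of day or add a factor for visitors who never return.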
StackSafe: How does Pingdom work to prevent its OWN downtime?
Sam Nurmi: We pray.
Kidding aside, we have designed our systems to be highly reliable. Our customers need to be able to trust that we really are monitoring their websites and servers 24/7.
Our main (backend) servers are located in a first-class data center with power redundancy, diesel generators and redundant Internet connections via seven different ISPs.
The backend servers run on VMware and all data storage is RAID 10, backed up, and replicated in real time.
Our probe servers (the servers performing the monitoring tests) are distributed in different data centers across Europe and North America, and can operate on their own for days if they have any problems accessing the backend servers.
For alerts, aside from email, we have three different SMS providers, so if one has a problem, we automatically use another one.
In other words, we have done a lot. I should add that we have a plan on how to develop the service to become even more reliable in the future, which goes hand in hand with our continued growth.
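The idea of falling back from one SMS provider to the next can be illustrated with a short Python sketch. Pingdom's actual implementation is not described in the interview; the function, the exception class and the two stand-in providers below are hypothetical.

```python
from typing import Callable, List, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider in the list has failed."""


def send_with_failover(message: str,
                       providers: Sequence[Callable[[str], None]]) -> int:
    """Try each provider in order; return the index of the one that worked."""
    errors: List[Exception] = []
    for index, send in enumerate(providers):
        try:
            send(message)
            return index
        except Exception as exc:  # any provider error triggers the fallback
            errors.append(exc)
    raise AllProvidersFailed(errors)


# Hypothetical stand-ins for real SMS gateways, for illustration only.
sent: List[str] = []


def broken_provider(message: str) -> None:
    raise ConnectionError("provider unavailable")


def working_provider(message: str) -> None:
    sent.append(message)
```

Calling `send_with_failover("site down", [broken_provider, working_provider])` skips the failing provider and delivers through the second one. Collecting the errors rather than discarding them makes it possible to report which providers were unavailable.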
Filed Under: Downtime, IT Operations, Interviews, Interviews-Bloggers
Posted August 1st, 2008 by Joe Pendry
Downtime continues to be a hot issue, not only for servers and websites, but for e-mail. 52 percent of companies have experienced an email failure in the past 12 months, according to backup and archiving supplier, Iron Mountain Digital. Of those companies, one third had outages of two hours or longer and 17 percent were without email for more than eleven hours. Even though email outages are still a common occurrence, one fifth of companies stated they have zero tolerance for an outage. According to the research, 55 percent of companies are trying to reduce the volume of stored mail in Microsoft Exchange to reduce downtime.
Tracker Pirate Bay has experienced heavy downtime this week due to increased traffic. This traffic increase apparently caused quite a bit of stress on their server park, and they are now at the point where the current setup has trouble keeping up with the ever-growing demand.
HP, Yahoo and Intel announced this week their new cloud computing research initiative called the Cloud Computing Test Bed. This initiative will allow pre-selected researchers to build and launch new applications on the platform through the companies’ six cloud-computer research data centers, which are strategically located around the world. Researchers will examine how to make cloud computing more secure and reliable.
Yankee Group’s 2008-2009 Global Virtualization Deployment and Usage Survey validates virtualization as an enterprise solution of choice. According to the survey, about 72 percent of the businesses they surveyed affirmed that they have already deployed or plan to deploy virtualization solutions. Other key findings show that 40 percent of the 750 IT administrators and C-level executives from 20 countries are deploying virtualization solutions from two or three different vendors.
Microsoft dishes about its latest project, Midori, a cloud computing operating system. Midori will focus solely on cloud computing due to the widespread availability of high-speed internet and because a server-style hardware system is more cost-effective. The planned release date for Midori is post-2010.
Filed Under: Cloud Computing, Downtime, Virtualization
Posted July 31st, 2008 by Jonah Paransky
What Happened
On July 20th, Amazon’s S3 service offerings experienced a wide-scale service outage. Because S3 is a primary cloud-based infrastructure, the outage disrupted a wide variety of websites, users and providers. The outage was heavily publicized in the mainstream media and in the blogosphere. Unlike the February S3 outage, Amazon provided significant detail about the status of S3 service availability.
Reactions
In general, the reaction to the outage seemed to coalesce around a common theme – that cloud computing is not yet ready for primetime. Some examples of prominent coverage from this perspective included:
Alternative voices were also heard. Michael Krigsman at IT Project Failures took the position that,
Customers hate outages, but accurate and responsive status reporting does help ease the pain. Kudos to Amazon for learning from past mistakes.
Web Worker Daily takes a nuanced view, pointing out that if you require a high level of uptime, you may need a backup to S3. Mike also points out though that the rates are hard to beat and that S3 will continue to be attractive to many providers on the Internet.
Our Perspective
Communication after a downtime incident occurs:
Amazon deserves significant credit for the communication approach they took during and after the outage. After their downtime incident back in February, Amazon began providing detailed transparent information about service delivery. They also performed a significant public post mortem, rightly praised by Michael Krigsman and others as demonstrating significant maturity. This was a good example of the application of the Seven Key Lessons to Keep in Mind When Communicating an IT Failure.
Cost of Downtime:
As with downtime incidents of infrastructures used by many, the cost of the outage was significant, both to Amazon and to the myriad vendors who depend on the service. SLA payouts are likely due and organizations concerned about downtime are likely looking at backup options to S3 dependence.
Lessons Learned:
The Amazon S3 outage provides a number of good lessons for IT operations professionals.
Filed Under: Business Continuity, Cloud Computing, Downtime
Posted July 30th, 2008 by Jonah Paransky
Unplanned downtime has been and continues to be at the top of lists of problems facing IT operations organizations. Considering the amount of focus and importance placed on the cost of downtime, one natural question is who the internal champion for increased uptime is inside the IT organization.
To date, we have found that the job of increasing uptime is often highly distributed within the IT organization. For your organization, is the job of increasing uptime owned by:
- Architects (including Operations Architects or Infrastructure and Systems Management Architects) – who are typically concerned about designing an operational infrastructure that is scalable and robust?
- Engineering and Support Groups in IT Operations – who are often responsible for the day to day management of the IT operations infrastructure?
- Cross Functional Process Owners (such as Change Managers or Release Managers) – who manage changes, the process most often associated with the cause of unplanned downtime?
- Problem Managers - who are responsible for root cause analysis of repetitive failures?
- Disaster Recovery groups – who are often focused on planning for recovery after a failure?
- Availability Managers – who, in organizations where they exist, often own defining uptime requirements and putting measurement programs in place?
- Infrastructure Outsourcers - who often deliver critical infrastructure components that become key parts of the software infrastructure stack?
- Networking groups – who own a critical part of the overall infrastructure?
- IT Operations Management – does the buck stop with the VP of Operations?
As in other business areas, spreading out responsibility is a recipe for a continuing problem area with no end in sight. Improving uptime through a continuous improvement methodology can help, but a clear lead with authority and budget can go a long way to bringing the focus and discipline to this critical problem area.
So who has responsibility in your organization for increasing uptime? Are you measuring end-to-end availability of your IT services and applications, or does the question only come up when a big incident happens? Where does the buck stop for downtime in your IT Operations group? And lastly, who are you holding accountable to improve the situation?
Filed Under: Downtime, IT Operations
Posted July 25th, 2008 by Joe Pendry
Amazon has experienced more downtime this week and had to reboot S3 on Sunday, leaving many wondering if cloud computing is really all it’s made out to be. Questions were raised about cloud computing and reliability, SLAs, and how much downtime is too much. On a positive note, it seems Amazon has learned from their past mistakes, and made a great effort to keep users informed of the outage.
Speaking of downtime, a disaster recovery plan is often overlooked by IT operations teams. A recent article in PC World, Seven Things IT Should Be Doing (but Isn’t) points to preventing downtime and having a plan as an important rule of thumb for IT professionals. Many organizations think they have a disaster recovery plan in place, only to find out too late it’s inadequate. “You’d be surprised how much downtime happens — as well as lost goodwill from clients and vendors — when you lose your data.”
Cloud computing may be rapidly replacing virtualization as the buzz word for the year, but in reality, the job market begs to differ. A study by Infoworld states that jobs are set to drop in 2009, with almost no investment in cloud computing. Server virtualization and server consolidation are the No. 1 and No. 2 priorities. Following these two are cost cutting, application integration, and datacenter consolidation. At the bottom of the list of IT priorities are grid computing, open source software, content management, and cloud computing (called on-demand/utility computing in the survey) — less than 2 percent of the respondents said cloud computing was a priority.
Green technology remains a hot topic, with venture capital firms investing in green technologies. In fact, the recent race to “go green” has created a need for experts in carbon information and those who understand how to reduce and monitor an organization’s carbon footprint.
Martin MacLeod asks, “could the data center in a container be the ultimate virtualization platform for the enterprise or consultancy?”
Filed Under: Business Continuity, Cloud Computing, Downtime, IT Operations, Virtualization
Posted July 24th, 2008 by Dennis Powell
Change management maturity – meaning the measure of success in making and releasing changes to a production environment – is a multi-dimensional challenge. Not only do IT groups achieve different levels of change management maturity according to the practices and guidelines that they follow, but change management maturity is also determined by the type of applications for which IT is responsible.
These statements formed the basis for a webinar presentation that StackSafe delivered in conjunction with Ecora Software, the webinar host. StackSafe’s presentation, titled “The Influence of Application Selection on Testing and Change Management”, covered research gathered by Research Edge for StackSafe.
The Research Edge study interviewed over 400 IT professionals across the United States regarding change testing and management practices. The companies represented by the professionals included large organizations (1,000 to 50,000+ employees, generating $100M+ in annual revenue).
This research indicated some unique differences between IT organizations in regard to the change management maturity level related to the type of application that their organization deemed most critical. So, not only is an IT organization’s change management maturity measured by its practices and methods, this maturity is also defined by the application/s the organization manages.
To set the stage, see below the types of applications that study participants (respondents) found to be ‘most’ mission-critical (acknowledging that lots of applications are being considered to be mission-critical these days):

Respondents identified unique differences in change management practice for six application types, including ERP, Transaction Processing, E-commerce, Web Hosting, Production Systems/Supply Chain Management and Customer Relationship Management.

After evaluating how respondents performed change management for these applications, we noticed some similar characteristics:
Prudent Planners - companies relying on ERP, Production/SCM, and/or Transaction Processing systems that suffered painful impacts when these systems experienced downtime. These companies performed the most rigorous testing of all groups.
ERP companies are likely to test the entire infrastructure stack when testing the impact of a change because they are the most concerned about the complexity of multi-tiered applications. They also tend to have the lowest ratio of downtime per change, certainly driven by the fact that their cost of downtime is the highest.
Transaction Processing companies task more of their IT personnel to participate in development as well as testing, likely because the longevity and maturity of TP technology means that IT has a deep base of experience with the system logic and process. TP companies also expressed the most satisfaction with the results of pre-production testing.
SCM companies were the least mature of the Prudent Planners in that they tended to perform component rather than end-to-end testing, and found pre-production testing to be cumbersome. However, they also had fewer production problems due to change than others.
High Stress Environment - companies that support CRM and E-Commerce applications. They operate in a more volatile environment with high numbers of emergency changes, and have less confidence in the stability and reliability of changes. There was a strong expressed desire from management to reduce the cost of testing and correcting changes.
Laissez-faire Planners - companies that view website hosting as their most important applications. Over 90% have invested in a staging platform, but only 32% use automated change management tools. More than 50% test the entire infrastructure stack when they test the impact of a change, but only 20% of Laissez-faire Planners test all changes that they must make to production.
We will post a link to the webinar shortly so you can watch it directly, and feel free to contact StackSafe to learn more about our research. Meanwhile, pay close attention to your application type – it might explain why you achieve the testing results that you do.
Filed Under: Change Impact Analysis, Change Management, Downtime, Testing
Posted July 22nd, 2008 by Joe Pendry
IT’s About Uptime blogger and StackSafe Sr. Product Manager Dennis Powell will be presenting on a webinar titled “The Influence of Application Selection on Testing and Change Management.” The webinar, hosted by Ecora Software, will discuss findings from the latest study conducted by StackSafe and Research Edge about testing maturity and complexity of various applications. The webinar will also cover:
- The three main approaches to change management and testing
- Why change management adoption can be more difficult in high stress environments
- The positive role downtime expenses play in the adoption of testing and change management processes
- The types of organizations that tend to have the most “laissez-faire” attitude towards change management
You can register now by clicking here.
Filed Under: Change Management, Downtime, StackSafe Corporate, Testing
Posted July 18th, 2008 by Joe Pendry
During the first half of 2008, Royal Pingdom surveyed 13 of the top news websites in the world and found that five of them had more downtime than the other eight. Those who had 99.9% uptime were Forbes, New York Times, CNN, Voice of America, Washington Post, Bloomberg, BBC News and Guardian Unlimited. However, out of the five ranked with the longer downtime, International Herald Tribune, Times Online and ABC News had the longest continuous outages. Overall, the news sites had excellent uptime.
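To put the 99.9% figure in perspective, a quick Python calculation converts an uptime percentage into the downtime it permits. This is a back-of-the-envelope sketch; it assumes the percentage applies to the full first half of 2008 (182 days, since 2008 was a leap year), which the survey summary above does not spell out.

```python
def allowed_downtime_hours(uptime_percent: float, period_days: float) -> float:
    """Hours of downtime permitted by an uptime percentage over a period."""
    return period_days * 24 * (1 - uptime_percent / 100)
```

Under that assumption, `allowed_downtime_hours(99.9, 182)` comes to roughly 4.4 hours, so a site could be unreachable for most of a morning over six months and still report 99.9% uptime.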
VMware’s ThinApp 4.0 debuted this week. Known as the company’s first application virtualization project, the software allows users to run ‘any version of any application on a single operating system without any conflict.’ ThinApp offers new features such as Application Sync and Application Link. With this release, Microsoft and Citrix can expect more competition.
Citrix unveiled its new weapon in the virtualization sector: Project Kensho, which acts like the middleware between applications and microkernel-based hypervisors. The global leader in application delivery infrastructure noted that as virtualization continues to become mainstream, ‘users need ways to automate and secure the lifecycle of their application workloads without being tied to a single hypervisor platform or virtual hard disk format.’
Twitter is testing a new design for its users. Apparently, there have been ‘sightings’ of this new design upon login and once the page is refreshed – poof – the design is gone. The reason for the new look might be in conjunction with the acquiring of Summize. We wonder what kind of staging and testing processes Twitter is following during this phase to prevent (more) Twitter downtime.
Filed Under: Downtime, IT Operations, Virtualization
Posted July 17th, 2008 by Dennis Powell
I wanted to share some concepts from an interesting keynote address from the America’s SAP User Group (ASUG) meeting being held in Toronto this week, which StackSafe helped sponsor (thank-you-very-much).
The ‘theme’ of this UG was all about upgrading SAP, primarily from several earlier versions to ECC 6.0 and EP7. If you’re a SAP upgrade manager, you’re familiar with the planning, preparation, communication, negotiation, and scheduling needed to prepare your organization to perform an upgrade of something as mission-critical and integral to the business as an ERP. And that’s before you even begin to actually upgrade.
Mr. Don Whittington, CIO of Florida Crystals, provided both an entertaining and thought-provoking keynote regarding his company’s experience with upgrading to SAP v6.0. Florida Crystals is a ‘field-to-shelf’ sugar producer, a 24/7 operation that works with a number of well-known companies. I’m not going to discuss the details of his upgrade, other than to say “Wow!” that Florida Crystals experienced zero downtime and zero missed shipments after completing the upgrade. Don was obviously proud of the IT group for delivering this accomplishment, and for an operation of this size and scope, this result is truly impressive.
Throughout Don’s presentation, he liberally mixed advice about the overall IT business approach within the discussion of the practical upgrade process. Consider a few points from his talk:
- IT is at its best when it goes unnoticed – Like any technology, service, process, or activity taken for granted, IT is doing the best possible job when the rest of the organization doesn’t notice them – because everything is working as expected. Don used the analogy of a phone service: users press the keys and expect the call to go through.
- Set your customer’s expectations – IT gets noticed when others set its priorities. If allowed to set the business priorities associated with an ERP (or any key application) upgrade, a business owner will expect no disruption, including no downtime, no loss of performance, no change to the way of doing business, etc. These objectives are important and need to be taken seriously; but as the provider, IT needs to have the final say. In reality, there will be disruptions. IT should host the ‘ERP upgrade party’ and set the expectations up front, both in terms of the risks, the potential disruptions, as well as the value gained by the upgrade. Noting ‘value’ brings up an amusing aside – Don ran through a tongue-in-cheek “Typical Technical Upgrade” checklist (I paraphrased):
- Postpone the value-add IT services for a significant period of time
- Assign significant human capital for upgrading
- Procure significant capital for contract services
- Procure additional capital for additional infrastructure
- Aggressively manage the process
- Communicate progress to all
- Work through the conflicting priorities…
And the successful end-result after all of this expense and effort, is that everything works exactly as it did before. As Don said, “Try selling THAT expectation”.
- IT must speak the language of business – “Revenue”, “margins”, and related business value terms should be a regular part of the IT management lingo – not just to make IT sound more like the well-rounded employee, but because the business managers and users don’t understand and won’t relate to Java upgrades, 32b to 64b conversions, or other eye-glazing technology verbiage. When your user doesn’t understand how you deliver value to the business, they will set the expectations for how you should deliver value to the business – likely in ways that you can’t deliver.
Don also presented some statistics showing:
One final thought: Leading business innovation is the one place where IT always wants to be noticed and the person who should always notice is the CIO.
Filed Under: IT Operations