The Dirty Half-Dozen Explained: The Implications of Taking Shortcuts When Testing Changes

Posted May 5th, 2008 by Joe Pendry

A couple of weeks ago, we wrote about the shortcuts that IT Operations teams take when they need to make changes to their environments. In that post, we explained why taking such shortcuts is so appealing: testing the right way tends to require considerable resources and budget. Without a mature, comprehensive testing and analysis toolset, IT Operations must make sacrifices in the area of software infrastructure testing.

Nonetheless, taking shortcuts has real implications that must be understood. Below is a list of the “Dirty Half-Dozen” testing shortcuts, with a fuller description of the problems they cause and why they so often lead to downtime for IT Operations.

Shortcut #1 – Patch and Pray

  • Implication: Obviously, patch-and-pray approaches increase the risk that the change will result in production downtime. Further, skipping testing entirely makes it much harder to find the cause of a problem when one appears. If you are implementing the patch on a Friday, don’t make any weekend plans.

Shortcut #2 – “Non-Customer-Facing” Tests

  • What it is: Deploying new configurations and patches on live systems that run low-priority or non-customer-facing applications.
  • Rationale: “I need to test this change quickly, but I don’t have a test environment that looks like full production. I’ll use the group that is least likely to complain as my guinea pigs and roll the change out to full production if my BlackBerry doesn’t explode with angry users.”
  • Implication: Two problems here. First, the change is still getting into the live environment, so there is some level of risk being assumed. Second, a single group might not be representative of the entire environment. Naturally, incomplete testing methods like this limit the ability to understand the full impact of the changes.

Shortcut #3 – Scheduled Downtime

  • What it is: Testing the change in production during scheduled downtime.
  • Implication: Do you have the luxury of scheduled downtime? It is increasingly hard to come by in this era of high availability. Also, any change made to the production environment introduces a risk of data and process corruption, which may prove unacceptable for more sensitive systems.

Shortcut #4 – Borrowing Redundant Resources

  • What it is: Taking redundant disaster recovery (DR) systems offline for testing purposes.
  • Rationale: “My DR systems look a lot like the live environment. I’ll use them for a quick test.”
  • Implication: If you are using disaster recovery systems for testing of this sort, you’d better not have an actual disaster…or there could be trouble.

Shortcut #5 – Component Testing

  • What it is: Limiting testing to an individual software component or configuration change rather than testing the change as part of a full end-to-end distributed stack test.
  • Rationale: “I have a portion of my IT service that I think is fairly representative of the entire environment. I’ll test the change on this component first and roll the change out to full production if nothing goes wrong.”
  • Implication: The downsides are similar to those of the “Non-Customer-Facing” tests. It is difficult to gauge how representative any segment is of the entire end-to-end application. Also, putting a change into any live environment, no matter how well segmented, assumes some level of risk.

Shortcut #6 – Tabletop Testing

  • What it is: Evaluating changes using documents, spreadsheets and diagrams on a conference table.
  • Rationale: “My team has seen changes like this before. They’ll be able to give me a general sense of what I can expect if we hold a meeting along the lines of a miniature change advisory board.”
  • Implication: Tabletop testing becomes the only practical method when the changes or the environment to be tested are too large or complex to duplicate, or when resources and time are lacking. However, this approach provides no actual insight into the true impact of the change on production environments.

All of the above approaches have significant limitations that make them impractical or inefficient for understanding the business impact of changes to the software infrastructure. As a result, in far too many cases, IT Operations is forced to make changes and adjustments in non-representative staging environments, and to live production software infrastructure, without any testing whatsoever. This practice leads to the serious failures and downtime that organizations try so desperately to avoid.

Filed Under: Change Impact Analysis, Downtime, Testing

