Sam Nurmi of Pingdom Talks About Uptime and Downtime
Posted August 4th, 2008 by Joe Pendry
Founder of Pingdom, Sam Nurmi, shares his reasons for starting the uptime monitoring service and his views on downtime. Prior to Pingdom, Sam was the CEO of Sweden’s biggest web hosting company, Loopia, which he sold in 2005.
Pingdom aims to cover the uptime monitoring needs of 90% of the companies in the world and promises to maintain the best uptime monitoring service available. The technology behind Pingdom is developed in-house, which the company says lets it meet both the current and future needs of the market. The company blog, Royal Pingdom, publishes results from its research on uptime and downtime.
StackSafe: What made you want to get into uptime monitoring?
Sam Nurmi: In 1999 I founded the Swedish web hosting company Loopia, which within a few years was the largest web host in Sweden. Running Loopia gave me a lot of insight into the entire hosting industry and a lot of connections within the industry.
As a hosting industry insider I quickly learned how common outages actually were for various hosting companies, ISPs and other services running on the Internet. The thing that struck me was that most companies depending on this infrastructure simply didn’t have a clue about the extent of the problems. And after all, who visits their own homepage (or other online services) every minute 24/7? People just assume that things work. The only way to know was to set up some type of automatic monitoring.
When I initially looked into the uptime monitoring industry a few years ago I realized that the existing services on the market were really not built for the mass market. They were often difficult to use, expensive, and didn’t appear to be very scalable. Our basic business idea here at Pingdom is therefore to deliver an easy-to-use, reliable monitoring service at a reasonable price. Since we are aiming at the mass market (relatively speaking, we’re not Facebook), high volumes make it possible to offer low prices, and our system is designed to be able to handle the volume. Thanks to this approach, Pingdom is one of the fastest-growing uptime monitoring companies in the world today.
Another major point is that the Internet is still very young, and the Web is basically still in its bud. Even though the Internet is so young, a lot of the world’s economy already depends directly or indirectly on this global network and the services running on it. This means that the demands on its reliability will become even more important over time. Effective monitoring of this infrastructure is a huge, growing market.
StackSafe: Based on the case studies/examples you have seen, what seems to be the most significant source of downtime?
Sam Nurmi: The most common reason for downtime is really hard to pin down since there are so many factors involved. Application updates or changes can cause problems, network malfunctions are relatively common, and so are issues related to scaling, for example a traffic spike overloading a web server (the Slashdot/Digg effect on a blog, to name one example).
Shared hosting environments come with their own set of problems, where the “neighbors” can affect the performance or uptime of the whole hosting environment either through heavy usage or misconfigured applications. These problems can be made worse by so-called overselling.
Then there should also be a distinction between short outages and outages that last for a long period of time. A common reason for longer outages is when people managing a service, for example a website, are not aware that there is a problem. Of course they can’t fix it if they don’t know about it. They will find out somehow after some time has passed, often through their own users, but that’s not a good solution, obviously.
StackSafe: What kinds of organizations see the most downtime? Ecommerce, media/news, applications, etc.?
Sam Nurmi: There is no clear pattern showing that any one industry or market suffers from more downtime than others. There are big individual differences in service quality among companies in the same sector, something we can clearly see among our customers and in surveys we have done in the past. However, social media services tend to have more visible outages for the simple reason that they have so many users, and those users visit the site frequently.
An interesting thing you see quite often is that a service can run perfectly for a long period of time, and then all of a sudden suffer a significant performance degradation. It could be a response time increase or more downtime, or both. Common reasons for this are surges in user numbers, software updates or other modifications to the service that have unexpected side effects. Another reason can be that the hosting provider or ISP has started serving beyond its capacity, for example by putting too many customers on the same server (or cluster) or having too many customers share the same Internet connection.
StackSafe: What are some measures companies can take to prevent downtime? Is it all about monitoring, or about testing changes?
Sam Nurmi: Well, it’s not ALL about monitoring, but it’s a very important part. The good thing about monitoring is that not only will you always know that things are working, you will often find problems you never knew were there, or new problems you introduced by accident while modifying the service.
Set up your own internal monitoring as well as external monitoring. Both provide their own perspectives and will help you get the full picture when you want to track down performance bottlenecks or reasons for outages, etc.
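To make the idea of an automatic check concrete, here is a minimal Python sketch of a single HTTP probe plus an uptime calculation. It is an illustration only, not how Pingdom's probes work; the function names `check_once` and `uptime_percent` and the 10-second timeout are assumptions made for this example.

```python
import time
import urllib.error
import urllib.request


def check_once(url, timeout=10.0, opener=urllib.request.urlopen):
    """Probe a URL once and return (is_up, response_time_seconds).

    `opener` is injectable so the function can be exercised without
    network access; by default it is the standard library's urlopen.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            is_up = 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # DNS failures, refused connections, timeouts and HTTP error
        # responses (4xx/5xx) all count as "down" in this sketch.
        is_up = False
    return is_up, time.monotonic() - start


def uptime_percent(results):
    """Share of successful checks, as a percentage of all checks made."""
    if not results:
        raise ValueError("no check results recorded")
    return 100.0 * sum(1 for is_up in results if is_up) / len(results)
```

Running the same probe on a schedule from inside your own network and from a machine outside it gives the two perspectives described above: a check that fails only from outside suggests a problem between your service and its users rather than in the service itself.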
Some other things to keep in mind to minimize downtime:
- Never sign long-term deals with hosting providers, ISPs or other infrastructure providers. My philosophy, one that we apply to all the services that we buy for Pingdom, is that contracts should be short, and that we should be able to cancel a contract almost immediately if we wish to do so. Service providers keep our business by delivering a good service, not by boxing us in with long-term contracts.
- Make your service and infrastructure as portable as possible. By that I mean that you should be able to move it somewhere else (switch providers, ISPs, data centers, etc) if you need to. This makes sure you cannot be put in a position where you would rather stay with a poor provider than suffer the overhead of moving.
- Try to always keep high availability in mind when building your applications, so you can upgrade or switch software and hardware without causing downtime. You don’t need to build the HA solutions yourself. There are ready-made solutions for a majority of the things that you might need, so use those if possible and let your dev team focus on building and maintaining your main applications.
- When you do make changes, test as many aspects as you can of that change before it goes live. Even then, unexpected things can happen, so be vigilant.
- It may not be for everyone, but virtualization is something that you at least should look into. Used right, it can be very helpful, cost efficient and improve your uptime.
- And this may seem obvious, but always have someone on call. If not on site, then at least available to log in and fix things remotely as soon as there is a problem. These days it’s a simple matter for someone to carry a small laptop with a 3G modem with them, and they will have an excellent platform to use even if they happen to be in a park or at the beach when something happens.
StackSafe: How do companies measure the impact of downtime incidents?
Sam Nurmi: That depends entirely on the nature of the service they are providing.
For a commercial blog, downtime can mean lost subscribers, lost ad exposure and so on. Any service that has an income from advertising will suffer the same consequences.
An e-store will obviously have closed the shop while it’s down, losing sales and potential future customers as well. They don’t just lose the sales that would have taken place, but the potential of every visitor they turn away.
Quantifying these values can be done by estimating the money the service or website pulls in per hour, and you can have fairly complex models for this. I don’t know how common it is for companies to actually do this, though.
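The simplest version of such a model is revenue per hour multiplied by hours of downtime. The Python sketch below shows that arithmetic; the function name and the `traffic_weight` parameter (a rough adjustment for outages that hit busier or quieter hours than average) are assumptions for illustration, not something described in the interview.

```python
def estimated_lost_revenue(revenue_per_hour, downtime_minutes, traffic_weight=1.0):
    """Rough direct revenue lost during an outage.

    revenue_per_hour: average income of the site or service per hour.
    downtime_minutes: length of the outage in minutes.
    traffic_weight:   hypothetical multiplier, e.g. 1.5 for a peak-hour outage.
    """
    if revenue_per_hour < 0 or downtime_minutes < 0 or traffic_weight < 0:
        raise ValueError("inputs must be non-negative")
    return revenue_per_hour * (downtime_minutes / 60.0) * traffic_weight


# A store earning 1200 per hour that is down for 45 minutes:
print(estimated_lost_revenue(1200, 45))  # 900.0
```

This captures only the direct loss. It says nothing about visitors who never come back, which is why more complex models exist.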
Then there are the more intangible results, such as how it affects your image and user trust. Again, it really depends on what kind of service is provided. If a bank website has problems, it will have a more negative effect on their image than it would for a fun news-aggregator service somewhere.
StackSafe: How does Pingdom work to prevent its OWN downtime?
Sam Nurmi: We pray.
Kidding aside, we have designed our systems to be highly reliable. Our customers need to be able to trust that we really are monitoring their websites and servers 24/7.
Our main (backend) servers are located in a first-class data center with power redundancy, diesel generators and redundant Internet connections via seven different ISPs.
The backend servers run on VMware and all data storage is RAID 10, backed up, and replicated in real time.
Our probe servers (the servers performing the monitoring tests) are distributed in different data centers across Europe and North America, and can operate on their own for days if they have any problems reaching the backend servers.
For alerts, aside from email, we have three different SMS providers, so if one has a problem, we automatically use another one.
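The failover pattern Sam describes can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not Pingdom's code: `send_with_failover` is a hypothetical name, and each provider is represented by a plain callable that raises an exception when delivery fails.

```python
def send_with_failover(message, providers):
    """Try each provider in order and return the name of the one that worked.

    providers: list of (name, send) pairs, where send(message) raises an
    exception if the alert could not be delivered.
    """
    errors = []
    for name, send in providers:
        try:
            send(message)
            return name
        except Exception as exc:
            # Remember the failure and fall through to the next provider.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

The order of the list sets the preference: the first provider handles every alert while it is healthy, and the others are used only when an earlier one fails.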
In other words, we have done a lot. I should add that we have a plan on how to develop the service to become even more reliable in the future, which goes hand in hand with our continued growth.
Filed Under: Downtime, IT Operations, Interviews, Interviews-Bloggers

August 4th, 2008 at 3:50 pm
Hi Joe, great interview. Sam has some good insights regarding downtime. At Marathon Technologies we are monitoring some of the same market trends as the Pingdom team, but rather than identifying which industries see the most downtime, we have noticed that the early adopters of HA and uptime monitoring solutions tend to fall in the financial, broadcasting, healthcare and pharmaceutical industries, the reason being that the costs and risks of downtime are so much greater as a result of increased regulatory requirements.
As for things to keep in mind to minimize downtime, a tip that our CTO Jerry Melnick has discussed is spending the time to plan. As any small business owner knows, implementing a new system requires dedicated resources, budget and time. Industry experts have estimated that the planning stage constitutes 90% of an implementation project, whether it’s server virtualization, HA or something else. The actual migration is relatively simple to undertake. A thorough implementation plan will help businesses minimize any hiccups that might arise. Since Sam discusses virtualization as an option to minimize downtime, take a look at Jerry’s top tips for ways to get started with server virtualization.
Thanks again for the insights!