The Data Center Divulged: Grids and Blades of Glory
Posted March 31st, 2008 by Joe Pendry
We had a chance to chat with Blade Watch editor Martin MacLeod this week, who shares his views on blades, grids and more. Martin discusses trends impacting data centers, the state of virtualization in data centers, and thoughts on risks and trends within the financial services industry and beyond. Martin is experienced in a variety of IT projects, from deploying new servers and decommissioning old ones to participating in server and data center migrations, blade deployments, VMware and grid projects.
This is part one of a two part series.
StackSafe: How did you get interested in “tracking blades & grids”?
Martin MacLeod: I’ve always been interested in blade servers, what their benefits were as well as what issues people have had with them. There were those who said blades were too power hungry, or used laptop hard drives, or lacked expansion, and what about resilience? At the same time, there were those who had tremendous success (including myself), who by deploying another 300 blades in their grid solution found their overnight batch times reduced from days to hours.
I wanted to see what issues people were having, how they overcame them, and how we could help people see the benefits of the technology - to know the things to look out for - what questions to ask.
In terms of grid, my main interest is in seeing the business benefits people have with the technology as well as watching how people fit this technology within their business. Is there a grid team? Does each application have its own grid infrastructure? Is it a shared grid - if so do we charge for it? If so how do we charge for it? It’s a technology like virtualization that doesn’t really fit within the traditional infrastructure/application mindset. In essence it brings the concept of IT as a service, which is terribly exciting.
The blog therefore started off reporting what was in the news to keep me informed with what’s been going on - what I should know - what cool things are coming along, as well as any hints/best practice or even basic commands that I’ve found as I’ve spoken to people, deployed or supported the technology.
StackSafe: What is the most important trend you see impacting the data center today?
Martin MacLeod: One of the main trends has been the concept of aligning IT to business need: that IT should be more responsive and proactive to my business needs, and as dynamic as my business when it comes to change. We want to bring online a new grid application to help with our risk calculations; how can we do this quickly, effectively and within budget? Underpinning this, though, is your data center capacity.
When we think of capacity this is in terms of physical space: how many servers we can fit in our racks, how much ‘u’ space we have left. There’s also the energy and cooling capacity: every time we deploy another server, plug in more network kit or SAN storage, we impact how much power and cooling we have left within the current data center constraints. We only have so many watts to play with (before we upgrade capacity or even move to another data center). Therefore deploying that 8u rack server might meet the needs of Mike in HR, but reduce the capacity available for the application team to deploy that blade solution to reduce the calculation times for the traders – who decides which unit has priority? What is the business value against the watts/cooling consumed? Does the application need to be rated in terms of business criticality or revenue generated?
As more companies see their data center as a finite resource, data center capacity becomes a business risk, a source of real focus. Finite in terms of cost, I can deploy another data center, or move to a bigger facility but there is a significant cost in doing so and a potential for business risk/outage. Therefore what we are doing with this resource, where the inefficiencies exist, what ratio of IT to application servers we have – how much of the capacity is revenue generating or revenue neutral becomes increasingly important. Are IT server heavy? Do they have lots of servers idling away when they could be replaced with trading systems, servers earning me money or acting as a business benefit?
Linked to capacity is reporting and management, not just in terms of how many 3u servers do we have, but also disclosure in terms of the carbon footprint of the data center? As more companies announce that they’re going carbon neutral or announce the steps they are taking to limit their environmental impact, what does this mean for the data center? For IT teams and service delivery?
Capacity changes from the usual kind of report you might have generated in the past:
- How many servers you have
- How many servers per platform – x86/Alpha/Risc
- How many servers per operating system – Windows/Linux/all the types of UNIX
- How many servers per business line – IT have 55, Trading have 130, HR have 3, administration have 22
To a more business-aligned view, one in which I can see the business- or application-centric view of my data center (we might still have the same reporting requirements, but we also need to provide the following):
- How many servers are front office/back office/administration
- What is the watt rating of an application or business line
- What systems are using the most power/cooling and to which business lines do they belong
- What systems are over three years old etc – and to which business lines/application teams do they belong
- If we outsourced the grid solution or moved it to another data center, what’s the power/cooling reclaimed?
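As a rough illustration of the business-aligned report described above, the following Python sketch totals power draw per business line, lists servers over three years old, and estimates the watts reclaimed if one business line's systems moved elsewhere. The `Server` record, its field names and the sample inventory are all hypothetical; a real report would draw on an asset database and measured power figures.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Server:
    # Hypothetical inventory record; field names are illustrative only.
    name: str
    business_line: str
    tier: str          # e.g. "front office", "back office", "administration"
    watts: float       # assumed rated or measured power draw
    age_years: float


def watts_by_business_line(servers):
    """Sum the power draw of all servers, grouped by business line."""
    totals = defaultdict(float)
    for s in servers:
        totals[s.business_line] += s.watts
    return dict(totals)


def servers_older_than(servers, years=3.0):
    """Return the servers whose age exceeds the given threshold."""
    return [s for s in servers if s.age_years > years]


def power_reclaimed(servers, business_line):
    """Watts freed if every server of one business line left the data center."""
    return sum(s.watts for s in servers if s.business_line == business_line)


# Made-up sample data, not taken from the interview.
inventory = [
    Server("grid-01", "Trading", "front office", 350.0, 1.5),
    Server("grid-02", "Trading", "front office", 350.0, 4.0),
    Server("hr-01", "HR", "administration", 600.0, 5.0),
    Server("mon-01", "IT", "back office", 400.0, 2.0),
]

if __name__ == "__main__":
    print(watts_by_business_line(inventory))
    print([s.name for s in servers_older_than(inventory)])
    print(power_reclaimed(inventory, "Trading"))
```

The same grouping could be keyed on `tier` instead of `business_line` to answer the front office/back office/administration question.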
StackSafe: Our blog focuses on maximizing uptime of critical applications. What do you see as the biggest threat to this goal that IT operations teams face?
Martin MacLeod: A mixture of things: coping with rapid application development, managing system capacity and handling failure. Maintaining the integrity of the production (customer facing) systems for an industrial strength infrastructure.
Firstly, ensuring that our development and UAT systems match the production systems, so that we can accurately test application code or releases prior to production, is important in preventing unexpected behaviour and in limiting system downtime or issues with the user experience. That the application doesn’t close a handle when it’s no longer required might be fine if we have 10 users, but if we scale that up to 100, to 1000, how does that affect the server reliability? Managing system capacity is important in terms of simple things like network throughput, storage or cpu/memory, as well as user experience/application capacity. Being able to highlight our current system capacity, what it will be once we bring on for example Hong Kong, and when we might need to purchase more infrastructure, more grid engines or web servers.
Handling failure means being able to have an infrastructure with a degree of resilience: the concept that I understand failure might occur, but the system needs to have a degree of fault tolerance built in so that the user experience isn’t affected. I’ll cope with having to refresh the screen, but I won’t wait 45 minutes because your cluster has failed over and we’re waiting for data replication. This requires an effective working relationship between the application teams and the IT teams: that there is sufficient investment in the infrastructure and evolution of the application code; that we continue to innovate to stay in line with the vendor support matrix; that the firmware, the software updates and patches are applied to prevent known issues and vulnerabilities, to keep the infrastructure up to date.
StackSafe: What advice would you offer to companies that are just now preparing to deploy virtualization in the data center?
Martin MacLeod: A good question. Understand what is in scope: what platforms and operating systems are in scope, are we virtualizing the database applications, do we have an expected ratio of virtual machines to physical blades or rack servers, can production (customer facing) and development/infrastructure systems reside on the same infrastructure?
Related to the scope is the objective - are we seeking to consolidate more servers on to fewer physical assets? To abstract the application team from the physical asset? With these questions in mind you need to think what issues might arise within your business when deploying a virtual infrastructure, the technical (if you need new infrastructure) and the non-technical - the process/service delivery issues such as accountability, ownership, billing and provisioning.
Ownership is the main one, who owns the ESX servers? If the business lines own the ESX servers does that not indirectly mean they own the infrastructure - that they can dictate the ratio of virtual machines per server? We only need one application team to fill up their ESX server, for the users to find the system slow and to find that suddenly, we move away from virtual machines back to physical ones - the ‘virtual machines are rubbish’. Accountability is very important, the concept of one unified voice to respond to production issues, to issues of service delivery as well as standards for the infrastructure - that when issues arise, there is one team responding and answering questions, one team dictating best practice.
Related to this - who owns the hypervisor and therefore supports it? Unix/Windows teams? Billing and provisioning are kind of related. I mention provisioning in terms of building the virtual machine, of making a change to its configuration - how long does it take to upgrade a virtual machine’s disk, or increase the RAM? Billing in terms of who pays for the infrastructure, how we pay for a virtual machine and at what cost: is there a set configuration that we get for that cost, is it per cpu hour or per instance? Do I pay more for a virtual machine with more memory or more storage? How do we manage and report on this virtual infrastructure, do we need to change the management or reporting mechanisms – instead of how many servers do we have, is it the capacity of the virtual estate or which virtual machines run on which physical asset?
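To make the billing question concrete, here is a minimal Python sketch comparing the two charging models mentioned above: a flat charge per instance (with an optional uplift for extra memory) against a charge per cpu hour. The function names, rates and the 720-hour month are assumptions made purely for illustration; amounts are kept in whole cost units to avoid rounding issues.

```python
def charge_per_instance(flat_rate, extra_gb_ram=0, ram_rate_per_gb=0):
    """Flat monthly charge for a standard VM, plus an uplift per extra GB of RAM."""
    return flat_rate + extra_gb_ram * ram_rate_per_gb


def charge_per_cpu_hour(vcpus, hours, rate_per_cpu_hour):
    """Usage-based charge: number of vCPUs x hours run x rate per cpu hour."""
    return vcpus * hours * rate_per_cpu_hour


# Hypothetical figures: a 2-vCPU VM running all month (assumed 720 hours).
instance_cost = charge_per_instance(flat_rate=10000, extra_gb_ram=4, ram_rate_per_gb=500)
usage_cost = charge_per_cpu_hour(vcpus=2, hours=720, rate_per_cpu_hour=5)

if __name__ == "__main__":
    print("per instance:", instance_cost)
    print("per cpu hour:", usage_cost)
```

Which model comes out cheaper depends entirely on the rates chosen and on how many hours the virtual machine actually runs, which is why the charging model needs agreeing before the infrastructure is deployed.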
Filed Under: Downtime, Interviews, Interviews-Bloggers, Virtualization