Monthly Archives: May 2011

Cloud Multi-tenancy Includes Network Bandwidth

There’s been a lot of discussion about multi-tenancy since the arrival of cloud computing. Multi-tenancy is the practice of hosting multiple unrelated resource allocations on the same hardware. With virtual machine instances, for example, multi-tenancy means multiple instances running on the same servers and sharing the CPUs, memory, and network adapters. With Software-as-a-Service (SaaS), multi-tenancy is represented by multiple clients sharing the same application while ensuring their data is properly partitioned.

One consequence of poorly-behaved tenants is that they can starve well-behaved tenants of resources if controls are not put in place to limit consumption, which they often are not. Limits frequently go unimplemented or unenforced because the implementors believe they can scale up and out to meet demand. After all, this is the promise of cloud computing.

However, not all resources scale equally, and every re-allocation out of the pool can negatively impact other tenants’ ability to scale. One area where this is often forgotten is network bandwidth. While it’s possible to control how much CPU time or memory a given machine instance can use, it is more difficult to control how much network bandwidth a particular tenant’s processes will consume. Hence, one client could accidentally cause a denial of service for co-resident tenants if they all share the same network adapter or run on the same switch.
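In practice, per-tenant bandwidth limiting is enforced at the hypervisor or switch layer, but the underlying idea is often a token bucket: each tenant gets a refill rate and a burst allowance, and traffic beyond that is dropped or queued. A minimal illustrative sketch (the class and tenant names are hypothetical, not any provider's API):

```python
import time

class TokenBucket:
    """Token-bucket limiter: refills at 'rate' bytes/sec, bursts up to 'capacity' bytes."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, nbytes):
        now = time.monotonic()
        # Refill tokens for the elapsed interval, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # over budget: drop or queue this transmission

# One bucket per tenant, so a noisy tenant exhausts its own budget
# instead of draining its neighbors' share of the adapter.
buckets = {
    "tenant-a": TokenBucket(rate=1_000_000, capacity=100_000),
    "tenant-b": TokenBucket(rate=1_000_000, capacity=100_000),
}
```

The key design point is isolation: because each tenant draws against its own bucket, a burst from one tenant is throttled without touching another tenant's allowance.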

What cloud service providers often forget is that everything in the cloud uses network bandwidth. In addition to satisfying client requests, those same requests often generate additional traffic: communications with the storage network, database and application servers, or even the hypervisor substrate reallocating and re-balancing loads. At the end of the day, the network’s physical capacity is finite even for fiber-based connections, which are great for internal communication but rarely extended end-to-end. In the end, rarely does anyone measure or understand the aggregate network load caused by a single tenant in a multi-tenant architecture, and in some cases it may not even be possible to attribute a particular network load to a single tenant’s use.

Beyond the obvious possible performance impact, why does this matter? When designing solutions for the cloud, it is imperative that you test your production environment’s real operating performance rather than relying on the specifications published by the cloud provider. Your machine instance with two cores operating at 2.5GHz, 2,048 megabytes of RAM, 500 gigabytes of disk space, and a 1-gigabit network adapter does not equate 1:1 in performance with equivalent dedicated hardware. You will need to instrument your application running in this environment rather than assume it will operate as it did in the test environment or, if migrating, the way it ran on your prior infrastructure. This means obtaining average IOPS for storage, average bandwidth over a reasonable usage period, and response times around key processes.
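Instrumenting response times around key processes can be as simple as wrapping the operations you care about in a timer and aggregating the samples. A minimal sketch, with a `time.sleep` standing in for a real database or storage call:

```python
import time
from contextlib import contextmanager
from statistics import mean

timings = {}

@contextmanager
def timed(label):
    """Record the wall-clock latency of one named operation."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.setdefault(label, []).append(time.perf_counter() - start)

# Wrap the key processes you care about: storage reads, database
# queries, calls to downstream services, and so on.
with timed("db_query"):
    time.sleep(0.01)  # stand-in for a real database call

for label, samples in timings.items():
    print(f"{label}: n={len(samples)} avg={mean(samples) * 1000:.1f} ms")
```

Run the same measurements in your prior environment and in the cloud instance, and compare the distributions rather than trusting the spec sheet.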


Dependency Creep Can Impact Your Cloud Migration Strategy

With Cloud Computing emerging on the scene as a solution to a number of computing use cases, it will drive modernization of your existing systems. Perhaps it’s just a new interface for driving mobile access to corporate data, or consolidating standalone servers into a Cloud to achieve greater utilization from fewer resources. In either case, the number of dependencies between system tiers is increasing, shrinking the distance between them.

While we have been trained to believe siloed systems are inefficient, they are much easier to develop business continuity plans around. In a recent discussion, a Director of IT from an insurance company relayed to me that the disaster-management process for his mainframe has been well-tested every year for the past ten years. While his process still takes between four and five days to complete, restoring the mainframe systems is straightforward: turn on the disaster recovery site hardware, load applications, load the last known data, run missed jobs, etc.

The problem for this individual is that users have progressively stopped using these applications directly over the past five years. Instead, these users now access the mainframe data through a hierarchy of applications that have been built over the years. Moreover, these applications have not been developed as part of a cohesive strategy that incorporates them into the business continuity planning in the face of a disaster. So now there’s a host of applications all feeding each other, with no roadmap detailing the connectivity and flows between them, nor a plan for the order in which they need to be restored in the event of a disaster. We call this dependency creep.

As difficult as dependency creep is as described above, if each of these applications is deployed on siloed hardware in a single data center, there’s an opportunity to catalog them after the fact. When these applications move into the Cloud, where they may be distributed across public and private nodes and integrated with other public services, the dependency creep becomes unwieldy and unmanageable. Moreover, without appropriate levels of communication between engineering and operations, these dependencies can become recursive. Recursive dependencies work fine when all services are up and available, but can be extremely problematic to restore if a very specific ordering is not followed.
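Once the dependencies are cataloged, a valid restore order is a topological sort of the dependency graph, and a recursive dependency shows up as a cycle that the sort cannot resolve. A sketch using Python's standard-library `graphlib` (3.9+), with an entirely hypothetical service catalog:

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical catalog: each service maps to the services it depends on.
catalog = {
    "web-portal":   {"app-server"},
    "app-server":   {"mainframe-gw", "auth"},
    "mainframe-gw": {"mainframe"},
    "auth":         {"directory"},
    "mainframe":    set(),
    "directory":    set(),
}

try:
    # static_order() yields each service only after all of its
    # dependencies, i.e., a valid disaster-recovery restore sequence.
    restore_order = list(TopologicalSorter(catalog).static_order())
    print("restore in this order:", restore_order)
except CycleError as exc:
    # A recursive dependency: no valid restore order exists until it is broken.
    print("circular dependency detected:", exc.args[1])
```

This is also a cheap standing check: run it against the catalog whenever a new integration is added, and a newly introduced recursive dependency is caught long before a disaster forces the question.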

So, what can you do today to avoid these issues later? First, initiate a DevOps effort within your enterprise. DevOps fosters communication between engineering and operations so that applications are developed with an understanding of how they will be deployed and operated. When operations and engineering work in isolation, applications run fine in a pristine test environment but tend to fall over when deployed in production. Engineers must understand the production environment their application will run in, and operations must understand how the application works and how it is designed. Building a DevOps group may require individuals to learn new skills to support this effort.

Second, develop your IT services catalog. Your service catalog will provide you with the means of identifying dependencies. Unfortunately, the organic nature of IT means that we have built systems with the spaghetti-like interconnections we typically associate with bad software development. Untangling those dependencies is going to take a concerted effort, but it is critical not only to ensuring that you can survive a disaster, but also to responding to less critical outages as they occur without the “fire drills” that typify many operations environments.

Migrating to the Cloud offers multiple benefits and an opportunity to solve problems that were previously cost-prohibitive. However, any time you open the doors to the data, the line rapidly forms to consume it; as the saying goes, “build it and they will come!” Furthermore, once the business taps that source, ecosystems will build around it that include business processes and applications. These dependencies must be cataloged, managed, and incorporated into your business continuity planning, or it’s very likely your business will be significantly impacted by service outages.