Geographically dispersed cluster design

Nowadays more and more companies have, or are considering, a 2nd datacenter at another site. The main reason for this 2nd datacenter is usually disaster recovery. There are 2 ways to utilize this 2nd datacenter. First, you could choose an active-passive setup, where bare-metal servers in the 2nd datacenter sit waiting to be utilized when disaster strikes. Alternatively, you could choose an active-active setup and spread all active servers across the datacenters, which decreases the impact when disaster strikes in one of them.

Implementing an active-active datacenter setup is also another step forward in the direction of true cloud computing, because IMHO a single datacenter can’t be a cloud by itself. When it comes to VMware in an active-active datacenter setup, there are several important things to keep in mind.

On June 29th VMware and Cisco posted a proof of concept document for VMotion between datacenters. As VMware and Cisco mention in their article, you have to stretch the L2 networking domain between the sites. This is one of the most important requirements if you want to stretch your active footprint across datacenters. VMware and Cisco also mention that after the VMotion the VM has to access its disk remotely in the other site until a Storage VMotion occurs. This storage part is one of the major challenges you encounter when designing your virtual infrastructure across sites.

Running an active-active datacenter in a VMware environment got me thinking: what are the possibilities, and what is sensible with today’s technology? I already mentioned the first requirement, which is stretching the L2 networking domain across the datacenters, so let’s assume that this requirement is already in place.

Figure 1

Let’s take a step back and have a look at an active-passive setup. These setups have some sort of storage replication in place. The most common design I encounter is shown in figure 1. In the main datacenter there’s an ESX cluster with some sort of SAN-based replication/mirroring to a second datacenter. In the second datacenter there is a passive ESX cluster available to start up the virtual servers in case of disaster. Let’s use this setup as a starting point and turn this active-passive setup into an active-active one.

The solid blue lines represent active storage connections and the dashed brown lines represent passive storage connections.

Scenario 1: Divide ESX cluster between datacenters

Figure 2

This design mirrors the active-passive setup example and simply divides complete ESX clusters between the datacenters. It of course requires you to have multiple ESX clusters. Because the active ESX clusters are divided between the 2 datacenters, the impact when disaster strikes in one of them is lower.

Advantage:

  • Quite a simple design, because it mirrors the active-passive design.
  • There is no need to change existing cluster setups. Just move a complete cluster to the 2nd datacenter.
  • All active disk-IO stays local to the datacenter.

Disadvantage:

  • Requires an extra passive ESX cluster in the main datacenter for disaster recovery.

Scenario 2: Stretched ESX cluster 1

This design stretches the ESX cluster across 2 datacenters by placing half of the ESX hosts in datacenter 2. All active storage stays in datacenter 1, which remains a single point of failure: when datacenter 1 goes down, the VMs running in datacenter 2 go down with it. Because of the SAN-based replication you could assign the mirrored storage to the hosts in datacenter 2 and start up the VMs there as a disaster recovery solution. If datacenter 2 goes down, the hosts in datacenter 1 can take over the crashed VMs, provided your resources in datacenter 1 allow it.

Another point of concern is the placement of your HA primaries. If you have a large cluster with more than 4 ESX hosts in one datacenter, it is theoretically possible that all 5 HA primaries reside in that datacenter. If that datacenter goes down, VMware HA will not work. The maximum number of HA primaries is 5 and you cannot control their placement, although there are some undocumented and unsupported possibilities.
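
To get a feel for how likely this worst case is, here is a minimal back-of-the-envelope sketch in plain Python. It assumes, purely for illustration, that the 5 primaries end up as a random subset of the cluster’s hosts, and the host counts are made-up examples.

```python
from math import comb

def prob_all_primaries_one_site(hosts_dc1, hosts_dc2, primaries=5):
    """Chance that all HA primaries land in a single datacenter,
    assuming the primaries are effectively a random subset of the
    cluster's hosts (a simplification of how HA actually elects them).
    math.comb() returns 0 when a site has fewer hosts than primaries."""
    total = hosts_dc1 + hosts_dc2
    favourable = comb(hosts_dc1, primaries) + comb(hosts_dc2, primaries)
    return favourable / comb(total, primaries)

# A 12-host cluster stretched evenly across two datacenters.
print(prob_all_primaries_one_site(6, 6))   # ~0.015, so roughly 1.5%
# With at most 4 hosts per datacenter the worst case can't occur.
print(prob_all_primaries_one_site(4, 4))   # 0.0
```

Small as that chance may look, you cannot steer it; with at most 4 hosts per datacenter it disappears entirely, which is exactly the constraint listed in the disadvantages below.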

Figure 3

Because the active storage resides in datacenter 1, VMs running in datacenter 2 have to access their storage across the WAN link. This introduces additional latency and degrades performance. If you use synchronous SAN replication or mirroring, this latency is doubled for all writes, because you also have to wait for the synchronous copy back to datacenter 2 to be acknowledged. This means that every write IO operation suffers 4 times the one-way latency of the WAN link. With asynchronous replication you don’t have to wait for the copy to complete, so every write IO operation suffers only 2 times the one-way latency of the WAN link.

The path a write IO operation follows is illustrated by the numbered red lines in figure 3 (a rough latency calculation follows the list):

  1. Write IO operation from the VM to the disk (primary storage box)
  2. Remote copy write operation from the primary storage box to the remote disk
  3. Write IO acknowledgement from remote storage box to primary storage box
  4. Write IO acknowledgement from primary storage box to the VM
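
As a rough illustration, here is a minimal sketch in plain Python that simply adds up the WAN traversals from the list above. The one-way WAN latency and the local array service time are made-up example figures, not measurements.

```python
def write_latency_ms(wan_one_way_ms, local_write_ms=0.5, synchronous=True):
    """Rough effective write latency for a VM in datacenter 2 whose
    active storage lives in datacenter 1 (figure 3).
    Steps 1 and 4 always cross the WAN; steps 2 and 3 only add WAN
    latency when the remote copy is synchronous."""
    wan_traversals = 4 if synchronous else 2
    return wan_traversals * wan_one_way_ms + local_write_ms

# Example: 5 ms one-way latency between the datacenters.
print(write_latency_ms(5, synchronous=True))    # 20.5 ms per write
print(write_latency_ms(5, synchronous=False))   # 10.5 ms per write
```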

Advantage:

  • No passive ESX hosts needed in either datacenter, but you might need extra capacity depending on your disaster recovery requirements. Although the extra capacity is shown as passive in figure 3, it can of course be fully utilized as active in the cluster.

Disadvantage:

  • Shared storage is active in only one location, which is a single point of failure for the running VMs. If this location goes down, all VMs go down.
  • There is no control over your HA primaries, which could result in VMware HA not working. To ensure that HA primaries reside in both datacenters, your cluster can’t exceed 4 hosts per datacenter if you stretch across 2 datacenters.
  • All VMs in datacenter 2 have to access their storage in datacenter 1, which will decrease performance. If you use synchronous SAN mirroring this latency is multiplied by 2 for all writes.
  • If VMware DRS is enabled, VMs can be automatically moved between datacenters, which impacts performance if the VM is moved to a host that is not local to the storage.  

Scenario 3: Stretched ESX cluster 2

This design is similar to scenario 2, but now we also divide the active storage between the two datacenters. This way every VM accesses storage that is local to its own datacenter.

Figure 4

DRS in this setup can be a killer. Because DRS is unaware of the stretched design, it could VMotion a VM to the other datacenter.
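
Until DRS gets some notion of site affinity, you would have to audit this yourself. Below is a minimal sketch in plain Python that flags VMs running on a host in one datacenter while their datastore lives in the other. The inventory data is hypothetical; in practice you would feed it from whatever reporting or scripting you already have.

```python
# Hypothetical inventory data; fill this from your own reporting tools.
HOST_SITE = {
    "esx01": "DC1", "esx02": "DC1",
    "esx03": "DC2", "esx04": "DC2",
}
DATASTORE_SITE = {
    "lun_dc1_01": "DC1",
    "lun_dc2_01": "DC2",
}
VMS = [  # (vm name, current host, datastore)
    ("web01", "esx01", "lun_dc1_01"),
    ("db01",  "esx03", "lun_dc1_01"),   # runs in DC2, disks in DC1
]

def cross_site_vms(vms, host_site, datastore_site):
    """Return the VMs whose active disk-IO crosses the WAN link."""
    return [(name, host, ds) for name, host, ds in vms
            if host_site[host] != datastore_site[ds]]

for name, host, ds in cross_site_vms(VMS, HOST_SITE, DATASTORE_SITE):
    print(f"{name}: runs on {host} ({HOST_SITE[host]}) but its "
          f"storage {ds} is in {DATASTORE_SITE[ds]}")
```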

Advantage:

  • No passive ESX hosts needed in either datacenter, but you might need extra capacity depending on your disaster recovery requirements.
  • Active storage in both datacenters, so active disk-IO stays local to the datacenter.

Disadvantage:

  • As in scenario 2, there is no control over your HA primaries, which could result in VMware HA not working. To ensure that HA primaries reside in both datacenters, your cluster can’t exceed 4 hosts per datacenter if you stretch across 2 datacenters.
  • You can’t use DRS, as DRS has no notion of site affinity. If you enable DRS, it might VMotion a VM to a host in the other datacenter, which results in all active disk-IO for that VM traveling across the WAN link and consequently impacts VM performance.

Scenario 4: Split ESX cluster 1

Figure 5

This design simply splits an ESX cluster into 2 separate ESX clusters, which are divided across the datacenters. If you look closer, this design is in fact a variant of scenario 1. Besides splitting an existing ESX cluster, you can also pair two separate clusters as each other’s failover if business policy allows it, but you need to make sure that storage and networking for both clusters are configured identically, or else they can’t take over each other’s VMs in case of disaster.

Advantage:

  • No passive ESX hosts needed in either datacenter, but you might need extra capacity depending on your disaster recovery requirements.
  • Active storage in both datacenters, so active disk-IO stays local to the datacenter.

Disadvantage:

  • You can’t split an ESX cluster with fewer than 4 hosts, because that would result in a cluster that is not redundant. To resolve this, you need to add extra ESX hosts so that you have at least 2 hosts in each datacenter (see the rough sizing sketch below).
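
For the extra-capacity and minimum-host-count points above, here is a very rough sizing sketch in plain Python. The “VMs per host” capacity figure is a made-up simplification of real CPU and memory sizing.

```python
def can_fail_over(hosts_per_site, vms_per_site, vms_per_host):
    """Very rough check: can each datacenter run its own VMs plus the
    other datacenter's VMs after a complete site failure? Capacity is
    expressed as a simple 'VMs per host' figure."""
    total_vms = sum(vms_per_site)
    return all(hosts * vms_per_host >= total_vms for hosts in hosts_per_site)

# Example: 2 hosts and 20 VMs per datacenter, each host good for ~25 VMs.
print(can_fail_over([2, 2], [20, 20], 25))   # True: 2 * 25 >= 40
print(can_fail_over([2, 2], [30, 30], 25))   # False: extra hosts needed
```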

Scenario 5: Split ESX cluster 2

This design is simply a combination of scenario 3 and scenario 4, which at first sight adds the ability to VMotion/Storage VMotion VMs between the clusters and datacenters. But beware! If you configure this, both the active storage and its replicated counterpart are assigned to the same cluster. As far as I know, this will generate errors on your ESX hosts like: Clash between snapshot (vml.xxx…x:1) and non-snapshot (vml.xxx…x:1) device. So to take advantage of this setup, you have to un-assign the replicated storage from both clusters. When disaster strikes, you have to re-assign the replicated storage to the surviving cluster before you can start recovering VMs.
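
A simple guard against this, sketched below in plain Python with hypothetical LUN names, is to check that no cluster is presented with both an active LUN and its replicated counterpart.

```python
# Hypothetical LUN presentation per cluster and the active->replica mapping.
PRESENTED = {
    "cluster_dc1": {"lun_a_active", "lun_b_active"},
    "cluster_dc2": {"lun_c_active", "lun_c_replica"},   # mis-zoned!
}
REPLICA_OF = {
    "lun_a_active": "lun_a_replica",
    "lun_b_active": "lun_b_replica",
    "lun_c_active": "lun_c_replica",
}

def clash_candidates(presented, replica_of):
    """Flag clusters that see both an active LUN and its replica,
    the situation that triggers the snapshot/non-snapshot clash."""
    problems = {}
    for cluster, luns in presented.items():
        replicas = {replica_of[lun] for lun in luns if lun in replica_of}
        overlap = luns & replicas
        if overlap:
            problems[cluster] = overlap
    return problems

print(clash_candidates(PRESENTED, REPLICA_OF))
# {'cluster_dc2': {'lun_c_replica'}} -- an empty dict would mean all clear
```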

Figure 6

Advantage:

  • No passive ESX hosts needed in either datacenter, but you might need extra capacity depending on your disaster recovery requirements.
  • Active storage in both datacenters, so active disk-IO stays local to the datacenter.
  • Possibility to VMotion/Storage VMotion between sites manually.
  • You can utilize both VMware HA and VMware DRS without the drawbacks from scenario 3, because you have 2 separate clusters and both technologies operate at the cluster level.

Disadvantage:

  • Complex and possibly error-prone storage configuration, because 2 different clusters share a common set of storage LUNs.
  • You can’t assign both the active storage and its replicated counterpart to the same cluster as this would generate errors.
  • Disaster recovery becomes more complex as you have to re-assign the replicated storage to the surviving cluster before you can start recovering VMs.
  • VMotioning a VM to the other datacenter results in all active disk-IO for that VM traveling across the WAN link, impacting VM performance because of the extra WAN latency. This action should always be followed by a Storage VMotion to correct this, but one might forget.
  • You can’t split an ESX cluster with fewer than 4 hosts, because that would result in a cluster that is not redundant. To resolve this, you need to add extra ESX hosts so that you have at least 2 hosts in each datacenter.

Conclusion

Stretching an ESX cluster across datacenters (scenario 2 and scenario 3) is a bad idea, because you can’t use VMware HA or VMware DRS effectively. VMware DRS currently doesn’t have any functionality that takes different datacenters/sites into account, which results in VMs running on the “wrong” side of the cluster compared to their storage. The same goes for VMware HA, which might put all primaries on one side of the cluster. I even doubt whether scenario 2 and scenario 3 are supported by VMware, so let’s just assume they’re not!

If you want to design your virtual infrastructure across datacenters, I recommend choosing scenario 1, scenario 4 or scenario 5. I would go for scenario 4, as it is a rather simple setup and doesn’t require any passive ESX hosts. Things might be different in the future, as VMware is continuously developing new features and maybe one day will provide something like site affinity/awareness for VMware HA and VMware DRS. Time will tell…

If you have any other opinions, insights, ideas, options or comments, please share! I’m very interested in hearing from you.

Additional readings

Long-distance VMotion

http://blogs.vmware.com/networking/2009/06/vmotion-between-data-centersa-vmware-and-cisco-proof-of-concept.html
http://virtualgeek.typepad.com/virtual_geek/2009/09/vmworld-2009-long-distance-vmotion-ta3105.html
http://www.virtuallifestyle.nl/2009/09/vmworld-09-long-distance-vmotion-ta3105/
http://vinf.net/2009/06/30/long-distance-vmotion-heading-to-the-vcloud/
http://www.yellow-bricks.com/2009/09/21/long-distance-vmotion/
http://www.simonlong.co.uk/blog/2009/06/30/wan-vmotion-a-step-closer-to-a-private-cloud/ 

Stretched clusters / cloud

http://virtualgeek.typepad.com/virtual_geek/2008/06/the-case-for-an.html
http://rodos.haywood.org/2009/01/moving-workloads-split-cluster-or-cloud.html
http://thevirtualdc.com/?p=135
http://blogs.cisco.com/datacenter/comments/what_is_not_networking_for_the_cloud
http://www.cisco.com/en/US/solutions/collateral/ns340/ns517/ns224/ns836/white_paper_c11-557822.pdf


12 Comments on “Geographically dispersed cluster design”

  1. #1 PiroNet
    on Nov 17th, 2009 at 12:07 pm

    Hi Arnim, great post. Internally we came to the same conclusions. We will go for scenario #5 along with SRM. We also have dark fiber between both datacenters and are hoping soon to be able to VMotion across DCs.

    One thought: it is not clear from some of your diagrams whether or not you use a stretched VLAN (combined with a stretched VMware cluster); it would be great if you added that info…

    Cheers,
    Didier

  2. #2 Arnim van Lieshout
    on Nov 19th, 2009 at 12:48 am

    Hi Didier,

    I mentioned that the first requirement is stretched VLANs and that we assume this requirement is in place, so it applies to all design scenarios.
    I’ll see if I can put it in the diagrams.

    -Arnim

  3. #3 Mario Lenz
    on Nov 27th, 2009 at 10:02 pm

    Hi Arnim!

    As far as I understand, you’re talking about active-passive solutions like Continuous Data Access or MirrorView in order to mirror storage between datacenters. What about some kind of storage virtualization? Our VMware consultant recommended SANsymphony from DataCore; a technician from HP recommended another product, although I can’t remember which one. FalconStor?

    Anyway: this would be an active-active mirror without the problems you mentioned above. However, this might be more expensive and would probably result in increased traffic between datacenters. Additionally, the datacenters shouldn’t be too far apart.

    Nevertheless, I think this is an interesting option. Did you have any special reasons not to include this scenario?

    cu

    Mario

  4. #4 Alexander Thoma
    on Dec 13th, 2009 at 9:25 pm

    Hi Arnim,

    I must disagree with your conclusion regarding stretched clusters! :-)

    If you add a storage virtualisation technology like NetApp MetroCluster, DataCore or FalconStor into the picture and also assume you have sufficient network and storage bandwidth available, then you can clearly build very powerful and simple stretched clusters. Of course this will only work if the latency between the two sites is acceptably low.

    I agree that this is a long list of “must haves”, but at least I have some big customers running this sort of setup very successfully.

    Your point about “VMware official support” is a nice one ;-).

    Alexander

    P.S.: In this post I only express my personal opinions and NOT an official comment by VMware.

  5. #5 Arnim van Lieshout
    on Dec 14th, 2009 at 5:43 pm

    Mario, Alexander,

    First of all thank you for your comments.

    When I was asked to look into stretched clusters, there wasn’t any storage virtualization solution in place at that site, and I couldn’t introduce new technologies either.

    I wanted to share my thoughts about the options I came up with without using a storage virtualization solution. Also, those storage virtualization solutions aren’t in my field of expertise (yet), so that’s one thing I have to look into in the near future.

    Luckily the beauty of blogging is you guys commenting on my article. :-)

    -Arnim

  6. #6 Virtualization Short Take #33 - blog.scottlowe.org - The weblog of an IT pro specializing in virtualization, storage, and servers
    on Jan 7th, 2010 at 9:42 am

    [...] van Lieshout has a great post on geographically dispersed VMware clusters. One thought that occurred to me as I was reading this post was that while Arnim’s post was [...]

  7. #7 Chad Sakac
    on Jan 9th, 2010 at 4:03 am

    Disclosure – I’m an EMC employee.

    I’m of the same mind – that isolated clusters (options 4 and 5) are the best (there are a number of VM HA and DRS conditions as you noted). The other variations are valid of course, but the litmus test for me is:

    1) “do you really not care which side the VMs are on (ergo it’s a MAN or you have existing dark fiber), or are you willing to setup very specific, and relatively complex DRS exclusion rules”

    2) “are you willing to do the homework on the VM HA primaries” (as you noted)

    At VMworld in session SS5240, we covered what works and is supported today, and we did a technology preview of what’s coming in this area.

    On the VMware side, there’s a set of DRS and VM HA changes coming that will broaden the use cases; on the Cisco side, a series of technologies that make the networking constraints looser; and on the EMC side, the ability to have an active-active storage model across long distances (maintaining the “writeable on both sides simultaneously” requirement for these shared datastore use cases – both for NFS and VMFS).

    The design target is to enable the action to also not require all the data replicating all the time (optionally) – so the impact only occurs at the point of vmotion.

    Stay tuned – lots coming on this front soon!

  8. #8 Arnim van Lieshout
    on Jan 10th, 2010 at 4:52 pm

    Thanks for your reply Chad.

    Unfortunately I can’t view this VMworld session as I don’t have an account.

    Good to know that there’s more coming on this front from EMC, Cisco and VMware.
    Can’t wait to take these new technologies for a testdrive!

  9. #9 Joep Leurs
    on Mar 21st, 2010 at 8:24 pm

    Hi Arnim,

    very good post.

    What about an IBM metro cluster? IMHO the storage is transparent in the case of a campus solution. If the first datacenter on a campus fails, the storage fails over directly to the second datacenter. In this scenario you would have to make a cluster spanning the two datacenters. The disadvantage is indeed that you cannot exceed 8 nodes in a cluster. Since it is a campus design, DRS would be fine while using the campus infrastructure.

  10. #10 Chad Sakac
    on Mar 22nd, 2010 at 2:54 pm

    Disclosure – EMC employee here.

    Joep – NetApp Metrocluster (IBM resells Netapp as N Series) along with other approaches (some listed earlier in the thread) can make a datastore be presented at two geographically dispersed sites.

    There are good whitepapers on this topic, and there is a specific VMware/NetApp KB article on it.

    As we’ve been drafting the EMC doc for this use case we got into a furious internal debate. Some want to point out what’s technically possible, others want to recommend specific choices. I fall into the “recommend specific choices” category.

    While POSSIBLE to create a single stretched VMware cluster, I agree with Arnim’s conclusion.

    Until:
    1) VM HA has a more transparent way to define primaries/secondaries
    2) VM HA has a more “SRM like” ability to control restart conditions/sequencing
    3) VM DRS has “sub-cluster affinity zones”
    …I personally wouldn’t recommend a single stretched cluster across geo distances, as operational complexity outweighs the advantages (IMO).

    BTW – all these current gaps are actively being worked on.

    Generally what I’ve found is that people do this (geographically dispersed VMware) for two reasons:

    1) Disaster Avoidance
    2) Disaster Recovery

    vMotion between different VMware clusters is possible, so you can get Disaster Avoidance with two separate clusters and don’t run into all of the caveats.

    Using a VM HA response for disaster recovery is really a bad idea IMO. It lacks sequencing control and the ability to test and report (DR testing is critical, as anyone doing DR knows), and of course it’s designed for recovery from a small number of hosts failing, not half of them.

    In the end, this is usually because someone doesn’t want to spend the $ to license Site Recovery Manager, or because a storage vendor is jockeying for position (this makes for a very cool demo, and customers dig cool demos), or to take the customer revenue rather than partnering with VMware on a Site Recovery Manager solution.

    Just because something is POSSIBLE, doesn’t mean it should be done, and often infrastructure-level (alone) solutions aren’t enough.

  11. #11 VMware HA Agent – Lack and Limitation « DeinosCloud
    on Mar 23rd, 2010 at 5:59 pm

    [...] Data Centers will be common designs sooner that we think and as Chad Sakac posted in response to a blog from Arnim Van Lieshout, HA [...]

  12. #12 Lenny Burns
    on Nov 9th, 2010 at 7:10 pm

    @Chad,

    I get why you say “In the end, this is usually because someone doesn’t want to spend the $ to license Site Recovery Manager”, but in my client case, $ isn’t a factor.

    It’s how to quickly re-IP the infrastructure and bring it online.

    How else, other than spanning the L2 across both DC’s is this quickly accomplished?
