


A Disaster Recovery Plan for a Private Cloud Based Data Center


We present the design of a Disaster Recovery solution for a private-cloud based data center. The production site is at EPFL - Middle East in RAK. The DR site is hosted by Ankabut in Abu Dhabi, 250 km away from the production site. During a disaster simulation, we achieved a failover in less than 10 minutes, well within our Recovery Time Objective.




Alaeddine EL FAWAL


Introduction

In this paper, we describe the Disaster Recovery (DR) solution we implemented at Ecole Polytechnique Fédérale de Lausanne - Middle East (EPFL-ME). EPFL-ME is a graduate research center of EPFL established in RAK-UAE (Ras Al Khaimah – United Arab Emirates). The DR site of EPFL-ME is hosted at the Ankabut (UAE Advanced Network for Research and Education) data center in Abu Dhabi, 250 km away from the production site in RAK. Both sites are connected through a 1 Gb/s leased line over the Ankabut backbone.
Our plan aims to prepare for recovering data and IT services after a natural or human-induced disaster. We set a Recovery Time Objective (RTO) [1] of at most half an hour. To meet this objective, our design must have the following characteristics:

Simple design:
The Disaster Recovery plan should have a simple design and architecture, and complexity at the hardware level should be avoided as much as possible.
Standardized solution:
The plan should provide a standardized solution for all of our services, because piecing together a comprehensive solution from separate components is complex and laborious: it involves an enormous research and testing effort to deploy the components and integrate them, and it usually requires working with multiple vendors.
Centralized management:
The design should offer centralized management that makes it easy to detect changes on the production site, to control and monitor replication jobs, and to execute Failover [2]/Failback [3] processes.
Testing:
The solution should offer an easy testing process, which should be carried out regularly to ensure that changes in the production infrastructure are reflected in the Disaster Recovery plan.
Remote control and monitoring:
As permanent staff at the recovery site would be prohibitively expensive, remote control and monitoring of the hardware must be provided. In particular, it should be possible to boot the system remotely.
Transparent failover:
During a failover, the IT services are provided from different machines, on different networks and at a different location. If this switch is not transparent, configuration changes are needed for every single user, which is impractical with hundreds of users.
Efficient security measures:
Given that the Disaster Recovery site is hosted by a third party, an internal policy requires encrypted storage. Further, the Disaster Recovery design should secure data transfer and all operations between the two sites. These measures should be as efficient and as transparent to the remaining components as possible, to avoid increasing the overhead and the RTO.

A Disaster Recovery solution should take into consideration the system on the production site. In our case, the infrastructure is a private cloud, and we are standardizing our services on the Microsoft platform. Therefore, an appropriate replication technology - one that replicates data to the DR site - should understand the virtual environment architecture and be able to deal with its components. Moreover, it should take advantage of a homogeneous pool of operating systems. As we will see later, combining a private cloud with an appropriate replication technology brings important advantages to the disaster recovery plan and helps reduce the RTO. We chose Veeam Backup & Replication (hereafter Veeam) as our replication technology, for the reasons explained in the section Technology Specifications and System Architecture.
Selecting the appropriate replication technology is not enough to deliver a low RTO; the other components must be chosen carefully as well. If the network design does not provide quick switching to the Disaster Recovery site, the service downtime would be hours, even days. If remote access to the Disaster Recovery site is not well provisioned, launching a failover requires traveling 250 km to reach the site. Therefore, besides the replication technology, we need to carefully identify all other components and incorporate them into an integrated design that allows smooth interaction among them.

Advantages of a Private Cloud for an Efficient DR Plan

Our production IT infrastructure is a private cloud, where all our servers and IT services are virtualized. The private cloud is based on VMware vSphere 5. The advantage of such an infrastructure is that it removes all dependencies of the applications on the hardware. It offers efficient use of the hardware, where several Virtual Machines (VMs [4]) can be hosted on the same physical machine to maximize its utilization. It offers high flexibility, where adding or deleting a VM is a matter of a few clicks. Moreover, it provides centralized management of all deployed servers through vCenter [5]. With our private cloud, using a traditional replication technology designed originally for physical servers would result in suboptimal performance. Therefore, we restricted our search for a replication technology to those that understand the VMware virtual environment and are designed to deal with its components, such as VMs and vCenter. This combination of a private cloud with an appropriate replication technology offers several advantages, including:

Bare-metal backup:
A traditional approach to Disaster Recovery is to replicate user data to the remote site. In case of a disaster, all operating systems have to be installed first, then the applications are installed and configured, and finally the replicated user data are imported into the applications. Given the complexity of this process, it is very time-consuming and results in an RTO of hours or even days. With a technology that deals with VMs, the whole VM is replicated, including the operating system, the applications and the user data. Note that replicating the whole VM is possible because, contrary to a physical server, a VM is hardware-independent. All it takes to recover an application in case of a disaster is to boot the corresponding VM replica, which is a matter of a few minutes – the time it takes to boot (see the sketch after this list).
Hardware independence:
Contrary to storage-based replication, which requires the same storage hardware on both sites, this solution is completely hardware-independent. In our case, we deploy a Storage Area Network (SAN) on the production site, whereas internal storage (directly attached to the server) is deployed on the DR site.
Flexibility:
We can select at any time the VMs that we want to replicate.
Centralized replication and management:
With traditional replication solutions, there is one replication process per application; these processes are independent and use different tools and technologies. The more complex and numerous the tools the organization needs to recover after a disaster, the lengthier the plan will be, the more difficult it will be to update, maintain and test, and the less likely it will be followed correctly when needed. A technology dedicated to virtual environments, in contrast, connects to vCenter and provides a standardized solution for the replication of the whole environment, including all applications and services running within it. This results in a single process and a single replication standard for all services (see Failover/Failback Steps below). Additionally, it provides centralized management and administration of replication jobs and failover.
Testing:
With bare-metal based replication, we replicate whole VMs, which include the operating systems, the applications and the user data. Testing the replication is very easy: it consists of simply booting the VM replicas. Further, this does not interrupt running services, as the VM replicas are in an isolated environment on the remote site (see below).
Low RTO:
Using traditional Disaster Recovery solutions, such as storage-based ones, results in a complex failover: first, the storage at the remote site has to be switched to Read/Write mode, then Logical Unit Numbers (LUNs) have to be mounted on the physical servers, and the VMs have to be added one by one to the inventory of the virtual environment. This requires highly specialized storage expertise. With VM-based replication, in contrast, all applications running on a VM are mirrored with the same configuration to the VM replica; to recover an application, all it takes is to boot the corresponding replica.
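To make the boot-the-replica recovery step concrete, here is a minimal sketch in Python using pyVmomi, the open-source Python bindings for the VMware vSphere API. The vCenter host, credentials and replica name are hypothetical placeholders, and a bare power-on is of course only the core of what Veeam's failover actually performs.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def boot_replica(vcenter_host, user, pwd, replica_name):
    # Lab-style connection; a production script should verify certificates.
    ctx = ssl._create_unverified_context()
    si = SmartConnect(host=vcenter_host, user=user, pwd=pwd, sslContext=ctx)
    try:
        content = si.RetrieveContent()
        # Walk the DR inventory looking for a VM with the given name.
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        for vm in view.view:
            if vm.name == replica_name:
                # Power-on is asynchronous; the service is back once the
                # guest OS has booted, typically within a few minutes.
                return vm.PowerOnVM_Task()
        raise LookupError("replica %s not found" % replica_name)
    finally:
        Disconnect(si)

# boot_replica("dr-vcenter.example.org", "administrator", "secret",
#              "fileserver_replica")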

These advantages alone do not guarantee a small RTO: the other DR components must be well chosen to take full advantage of them. For instance, if the network design does not allow quick and smooth switching to the recovery services, if remote access to the recovery site is not provided, or if the failover requires configuration changes for all users, the RTO will be a matter of hours or even days.

Technology Specifications and System Architecture



Figure 1 – System architecture. Veeam replicates VMs from one virtual environment to another. We installed a backup server on the DR site and a proxy on the production site; the two establish a stable connection between them to handle the replication traffic. The proxy processes VM data, applying deduplication and compression, and sends it to the server, which decompresses the data and sends it to the target ESXi host.

In this section we describe the key features of Veeam, a two-in-one solution for backup and replication. Veeam is intended only for virtual environments; in particular, it offers full support for VMware vSphere environments. One of its main features is the ability to create transactionally consistent backups: it supports the Windows Volume Shadow Copy Service (VSS [7]), enabling backup and replication of live systems running Windows applications or working with databases (for example, Domain Controller, Exchange Server, SQL Server and MySQL server). This ensures a successful recovery of these applications. Veeam also offers a Wide Area Network (WAN) optimization option that decreases the size of VM backups using deduplication and compression. It implements the vPower technology, with which a VM can be started directly from a compressed backup file; this process is called instant recovery.
Finally, Veeam implements mechanisms for real-time statistics and analysis. The two main components of the Veeam solution are the Veeam backup server and the backup proxy.

  • The Veeam backup server is a Windows-based physical or virtual machine on which Veeam is installed. It is the core component of the backup/replication infrastructure and plays the role of the configuration and control center. The Veeam backup server coordinates all replication job activities and can also handle the data traffic itself.
  • The backup proxy is an architecture component that assists the Veeam backup server and allows a distributed architecture for job processing and replication traffic delivery. When the data handling task is assigned to a backup proxy, the Veeam backup server becomes a pure point of control, dispatching jobs to the proxy servers.

Replication Process

Figure 1 shows the system architecture we adopted for replication. Veeam requires another virtual environment at the DR site. It connects to the vCenters or ESXi hosts of the production and DR sites, and replicates VMs from one site to the other. In our scenario, we dedicated one VM to the backup proxy on the production site and another VM to the Veeam backup server on the DR site. Replication jobs are defined on the backup server; they consist of selecting VMs from the inventory of the production vCenter/ESXi hosts for replication to the inventory of the DR vCenter/ESXi hosts.
While executing a job, the backup server and the proxy establish a stable connection over the WAN to handle the replication traffic. The proxy retrieves VM data from the production storage, compresses it and sends it to the backup server. The backup server receives the data, decompresses it and sends it to the target ESXi hosts on the DR site.
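As a toy illustration of this data path - not Veeam's actual pipeline, which adds block-level deduplication and its own storage formats - the following Python snippet, using only the standard library, mimics the compress-on-the-proxy, decompress-on-the-server round trip:

import zlib

def proxy_send(vm_block):
    # Production side: compress the block before it crosses the WAN.
    return zlib.compress(vm_block, 6)

def server_receive(wire_bytes):
    # DR side: restore the block before writing it to the target ESXi host.
    return zlib.decompress(wire_bytes)

block = b"disk sector contents " * 10000  # redundant data, like many VM disks
wire = proxy_send(block)
assert server_receive(wire) == block
print("sent %d bytes instead of %d (%.1f%% of the original)"
      % (len(wire), len(block), 100.0 * len(wire) / len(block)))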
When creating a replication job, we have to indicate its schedule, which defines the Recovery Point Objective (RPO [8]). For VMs whose content changes slowly, we set daily schedules. For the file server, we set a half-hourly schedule, which provides an RPO of 30 minutes. This small RPO is made possible by our fast connection to the DR site.
To minimize the amount of traffic going from the production site to the DR site over the WAN, we used the replica seeding option offered by Veeam. Before installing the storage on the DR site, we brought it to the production site, where we created backup copies of the required VMs on this storage. Then we took the storage to the DR site and used these copies for replica seeding. Hence, only incremental backups travel over the WAN during a replication job.
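A back-of-the-envelope calculation shows why seeding matters even on a fast link. The total VM size and the daily change volume below are illustrative assumptions; only the 1 Gb/s link speed comes from our setup:

LINK_BYTES_PER_S = 1e9 / 8   # 1 Gb/s leased line, in bytes per second
FULL_COPY = 500e9            # assumed: 500 GB of VM images, seeded locally
DAILY_DELTA = 5e9            # assumed: 5 GB of changed blocks per job

print("full copy over the WAN: %.1f hours"
      % (FULL_COPY / LINK_BYTES_PER_S / 3600))
print("incremental pass only : %.1f minutes"
      % (DAILY_DELTA / LINK_BYTES_PER_S / 60))

Under these assumptions, a full copy would tie up the link for more than an hour, longer than the file server's half-hourly replication window, whereas an incremental pass completes in under a minute.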

Failover Process

For failover in a disaster situation, we need to boot the replica VMs on the DR site, and the network architecture should be designed to give access to the services running on these replica VMs instead of on the original VMs.

Network Design



Figure 2 – Network architecture for the DR plan. The network is organized into server subnets and user subnets; only server subnets are recreated on the DR site. In the normal situation, the server subnets on the DR site are not accessible from the production site. In a disaster situation, routes are added to the router on the production site so that users access these subnets rather than those at the production site. The VLAN [9] 200 subnet is dedicated to the DR site and is accessible at all times from the production site; the Veeam backup server is on this subnet. Both sites communicate over an IPSec VPN tunnel [10]. The DR site is also accessible from anywhere on the Internet through an SSL VPN tunnel [11], which constitutes a redundant secure access path in case the IPSec VPN tunnel fails. VM replicas within the same VLAN on the DR site can communicate with each other, as no routing between them is required; this is essential for testing. The IP addresses shown do not reflect the real addresses used at EPFL-ME.

Both sites are connected through an IPSec VPN tunnel (see Figure 2). This allows us to access private IP addresses on both sites. Moreover, this VPN tunnel ensures a secure replication and flexible communication between both sites.
One of the main issues we faced in this design was how to assign IP addresses to the server replicas on the DR site in a way that ensures a smooth failover/failback and remains transparent to the end-user. Veeam offers the ability to automatically map subnets from the production site to other subnets at the DR site, which would allow the servers and their corresponding replicas to have different IP addresses. However, opting for this option would require maintaining two Domain Name Servers and would pose problems for services that interact with each other using their IP addresses directly. Moreover, the failover would not be transparent to the end-user.
Therefore, we decided to keep the same IP addresses for the servers and their replicas and to change the routing tables in case of failover. To do so, we organized the network into server subnets and user subnets; only server subnets are recreated on the DR site. In the normal situation, the server subnets on the DR site are not accessible from the production site. In a DR situation, we move the server subnets to the DR site, and packets sent to these subnets are routed through the IPSec VPN tunnel (see Figure 2). In practice, we maintain a patch that makes the routing configuration change for a failover/failback. This patch is applied to the access routers that terminate the IPSec tunnel, and applying it is a matter of only a few seconds; a sketch of such a patch follows.
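As a minimal sketch of such a patch, the following Python snippet generates IOS-style static-route changes that repoint the server subnets at one site or the other. The subnets and tunnel next hops are illustrative only; like the addresses in Figure 2, they do not reflect the real EPFL-ME addressing.

# Hypothetical server subnets and IPSec tunnel endpoints.
SERVER_SUBNETS = ["10.10.10.0 255.255.255.0", "10.10.20.0 255.255.255.0"]
NEXT_HOP = {"production": "10.0.0.1", "dr": "10.0.0.2"}

def routing_patch(target_site):
    # Commands that move the server subnets towards `target_site`.
    other = "dr" if target_site == "production" else "production"
    commands = []
    for subnet in SERVER_SUBNETS:
        commands.append("no ip route %s %s" % (subnet, NEXT_HOP[other]))
        commands.append("ip route %s %s" % (subnet, NEXT_HOP[target_site]))
    return commands

for line in routing_patch("dr"):  # failover: point the subnets at the DR site
    print(line)

Pushing these few commands to the two access routers, by hand or over an SSH session, is what makes the switch a matter of seconds.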
The advantages of our design are:

  • Complete transparency for the end-user
  • Complete transparency for all services
  • Smooth and quick failover/failback
  • Less maintenance overhead.

Note that our design does not include a Demilitarized Zone (DMZ [12]), as our website is hosted by a third party. This issue could be resolved by moving the DMZ VMs to the DR site in case of a disaster and adding an application-layer proxy on the production site that forwards packets to the DR DMZ. This needs more investigation to assess its performance.

Secure Design

Data security is an integral part of this design. We consider both network security and storage security. Regarding network security, we implemented the following mechanisms:

  • Replication and communication between the two sites occur over an IPSec tunnel, which encrypts all exchanged packets.
  • If the IPSec tunnel fails for any reason, remote access to the DR site is provided through a user-based SSL tunnel, where communication is also encrypted.
  • All access to network equipment uses the SSH [13] protocol: it ensures encrypted communication and authenticates users with challenges instead of sending clear-text passwords over the network.

Regarding storage security, an internal policy requires that all data stored at a third party with physical access to the storage drives be encrypted. Therefore, we opted for self-encrypting disks (SED), a hardware-based solution. It is superior to software-based solutions mainly for the following reasons:

Transparency:
it is transparent to users and applications.
Management:
it is easy to manage.
Performance:
there is no degradation in SED performance, as the encryption is hardware-based.

With SEDs, the encryption method and key are defined at the BIOS level.

Remote Control and Monitoring

Due to the long distance between the production site and the DR site, remote control and monitoring are a must. They should cover access to the physical servers, access to the network equipment, and access to the site including all its virtual servers and services. Hence, we consider:

Remote monitoring and management of the physical server:
Our server is a Dell PowerEdge server with iDRAC7 [14] Express. It supports access through a web browser-based graphical user interface, which allows power monitoring and provides diagnostic features such as crash screen capture, as well as local and remote configuration and update. It even allows switching the physical server on remotely, in case it is off.
Site access:
Remote access to the DR site from the production site is provided by the IPSec site-to-site VPN tunnel between the two sites. If this tunnel fails for some reason, or if we want to access the DR site from anywhere else, an SSL user-based VPN tunnel is available. This allows us to work as if we were on the DR site.

Failover / Failback Steps

With our design, the failover is very simple. It consists of three steps:

  1. Apply a patch on the access router of the production site (less than one minute). This patch changes the route to the server subnets from the production site to the DR one.
  2. Apply the patch to the access router of the DR site (less than one minute). Again, this patch changes the route to the server subnets from the production site to the DR one.
  3. Connect to the Veeam backup server on the DR site and launch the failover process for the required VMs. The Veeam server boots the required VM replicas; within a few minutes (the boot time), the services are accessible again from the production site.

For failback, the process is just the opposite (a sketch tying both sequences together follows these steps):

  1. Connect to the Veeam server and launch the failback process for the required VM replicas. This process replicates the changes that occurred on the VM replicas back to the original VMs on the production site and then boots them.
  2. Apply a patch on the access router of the production site (less than one minute). This patch changes the route to the server subnets from the DR site back to the production one.
  3. Apply a patch to the access router of the DR site (less than one minute). Again, this patch changes the route to the server subnets from the DR site back to the production one.
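The following self-contained Python sketch strings the steps together. Every external action (pushing a routing patch, triggering Veeam) is reduced to a hypothetical placeholder, since Veeam is normally driven from its console, and all names and hosts are illustrative.

def apply_routing_patch(router, target_site):
    # Placeholder: push the static-route change for target_site to router,
    # e.g. over SSH (see the routing_patch sketch above).
    print("[%s] server subnets now routed towards %s" % (router, target_site))

def run_veeam_job(veeam_server, job, replicas):
    # Placeholder: trigger the failover or failback job on the Veeam server.
    for name in replicas:
        print("[%s] %s for %s" % (veeam_server, job, name))

def failover(replicas):
    apply_routing_patch("prod-access-router", "dr")   # step 1, < 1 minute
    apply_routing_patch("dr-access-router", "dr")     # step 2, < 1 minute
    run_veeam_job("veeam-dr.example.org", "failover", replicas)  # step 3

def failback(replicas):
    run_veeam_job("veeam-dr.example.org", "failback", replicas)  # step 1
    apply_routing_patch("prod-access-router", "production")      # step 2
    apply_routing_patch("dr-access-router", "production")        # step 3

failover(["dc01_replica", "fileserver_replica"])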

Conclusions

The DR plan deployed at EPFL-ME in RAK takes advantage of the production site's particularities: a private-cloud based site with a homogeneous server pool. These characteristics helped us deliver an RTO shorter than 10 minutes with a generic and complete solution, and allowed us to host the recovery site 250 km away from the production site. This distance required particular attention to the network design and to remote access and control of the DR site. Our solution is centrally managed and can be extended to meet the needs of larger setups.

Acknowledgment

I would like to thank Fabien Figueras for the nice discussions that we had regarding the replication technology. Also, I would like to thank Alain Gremaud and Eric Krejci for their advice regarding the replication of a Domain Controller. Finally, I would like to express my gratitude to Jan Overney and Vittoria Rezzonico for editing this article.

[1] (Recovery Time Objective): the time required to recover IT services after a disaster.

[2] the process of switching to the disaster recovery site in case of failure of the production site.

[3] the process of restoring a system in the production site, after a failover, back to its original state (before failure).

[4] (Virtual Machine): a software implementation of a machine (i.e. a computer) that executes programs like a physical machine. It runs a complete operating system with applications installed within it.

[5] architecture that provides a centralized and extensible platform for managing virtual infrastructure.


[7] (Windows Volume Shadow Copy Service): allows the creation of consistent backups of a volume by creating a read-only copy of the volume, and enabling backup programs to access every file without interfering with other programs writing to those same files.

[8] (Recovery Point Objective): the maximum tolerable period in which data might be lost from an IT service due to a major incident.

[9] (Virtual Local Area Network): A single layer-2 network may be partitioned to create multiple distinct broadcast domains, which are mutually isolated so that packets can only pass between them via one or more routers; such a domain is referred to as a VLAN.

[10] extends a private network across the Internet. It enables two sites to exchange data across shared or public networks as if they were directly connected to each other, while benefitting from the IPSec security scheme protecting data flows between both sites.

[11] similar to an IPSec VPN tunnel, but it uses the Secure Sockets Layer instead of IPSec and is user-to-site instead of site-to-site.

[12] (Demilitarized Zone): a subnet that contains and exposes an organization’s external-facing services to the Internet.

[13] (Secure Shell): a cryptographic network protocol for secure data communication.

[14] (integrated Dell Remote Access Controller): an extension card installed in the server that supports access through a web browser-based graphical user interface, which allows power monitoring and provides diagnostic features such as crash screen capture, as well as local and remote configuration and update. It even allows switching the physical server on remotely, in case it is off.


