Organisations whose functions are classified as “system relevant” by regulators (hospitals, banks etc.), their suppliers, and providers of important essential services need to meet special criteria for business continuity: they need to be up and running again very quickly. In this blog post, we answer the most important questions related to business continuity for cloud systems and services.
Organisations have to think about what it would mean if their cloud services went down and how quickly they would need to be up and running again. These scenarios often involve unlikely events, such as a plane crashing into a data centre and causing a significant outage in cloud services. Depending on an organisation’s business and the service involved, it might be perfectly acceptable to recover a service within a week in some cases, while in others it must be back within an hour.
In many cases, organisations cannot afford downtimes as long as a week, even after a disaster, and need to plan to be able to continue their services within hours. In these cases, a dual data centre solution provides the organisation with a quick recovery path.
Here are some of the most frequently asked questions relating to business continuity services for enterprise clouds:
- How do we make sure that the computing solution is available, for example, greater than 99.9% of the time?
- How do we guard against data loss, e.g. when a storage medium or server has a defect?
- How can unintentionally deleted data be restored?
- How can a complete server be restored?
- How can the applicable KPIs (RPO / RTO) be met?
- How can business continuity be ensured within a reasonable amount of time, e.g. if a server fails, or there is an electricity outage?
- How can business continuity be ensured in case of a major disaster affecting a whole data centre (e.g. flooding, fires, plane crashes etc.)?
The sections below answer each of these questions in turn.
How do we make sure that the computing solution is available greater than 99.9% of the time?
Technically, redundancy at every possible level is the answer. The goal is to ensure that hardware defects do not lead to an outage. For example, if servers have dual independent power supply units fed by independent power lines, the system will survive a defect in one power supply unit or a failure of one of the power lines. The same applies to storage (see the next section), networking (switches), internet connections, cooling and so on.
With a solution hosted by a provider like Safe Swiss Cloud, a Service Level Agreement (SLA) guaranteeing availability for the required percentage of time (for example, greater than 99.9%) can be agreed.
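To make an availability percentage concrete, it helps to convert it into the downtime it permits. The small sketch below does exactly that; the function name `downtime_budget` is our own illustration, not part of any SLA or API.

```python
# Illustrative sketch: convert an SLA availability percentage into the
# maximum downtime it allows over a given period. Even "three nines"
# (99.9%) still permits almost nine hours of outage per year.

def downtime_budget(availability_pct: float, period_hours: float) -> float:
    """Maximum allowed downtime (in hours) for a given availability."""
    return period_hours * (1 - availability_pct / 100)

HOURS_PER_YEAR = 365 * 24    # 8760
HOURS_PER_MONTH = 30 * 24    # 720

print(f"99.9%  per year : {downtime_budget(99.9, HOURS_PER_YEAR):.2f} h")
print(f"99.99% per year : {downtime_budget(99.99, HOURS_PER_YEAR):.3f} h")
print(f"99.9%  per month: {downtime_budget(99.9, HOURS_PER_MONTH):.2f} h")
```

Running the numbers this way makes it easy to decide which SLA percentage actually matches the business requirement.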
How do we guard against data loss, e.g. when a storage medium or server has a defect?
If there is a defect in a storage component like an SSD or a storage server, can the system continue to operate without interruption? The answer is a redundant storage cluster, with redundancy at the level of individual SSDs as well as at the level of storage servers. If some SSDs have a defect, processing continues without interruption, because the data is retrieved from elsewhere in the cluster. The same applies if a whole storage server with multiple SSDs becomes unavailable.
A redundant, clustered storage system ensures that:
- There is no data loss
- There is no interruption in computing
Redundant, clustered storage systems achieve this by storing data multiple times, so that if one part of the storage system has a defect, the data is still available from another part of the system.
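The principle can be sketched in a few lines of code. This is a toy model, not Safe Swiss Cloud's actual storage software; the class `ReplicatedStore` and its replica-placement scheme are our own illustration of how replicated reads survive a node failure.

```python
# Toy sketch of a replicated storage cluster: every object is written to
# several distinct nodes, and a read succeeds as long as at least one
# replica is on a healthy node.

class ReplicatedStore:
    def __init__(self, nodes: int, replicas: int = 3):
        self.nodes = [dict() for _ in range(nodes)]  # one dict per storage node
        self.replicas = replicas
        self.failed: set[int] = set()                # indices of failed nodes

    def put(self, key: str, value: bytes) -> None:
        # Place replicas on consecutive distinct nodes chosen by hashing the key.
        start = hash(key) % len(self.nodes)
        for i in range(self.replicas):
            self.nodes[(start + i) % len(self.nodes)][key] = value

    def get(self, key: str) -> bytes:
        start = hash(key) % len(self.nodes)
        for i in range(self.replicas):
            node = (start + i) % len(self.nodes)
            if node not in self.failed and key in self.nodes[node]:
                return self.nodes[node][key]         # served from a healthy replica
        raise IOError("all replicas unavailable")

store = ReplicatedStore(nodes=5, replicas=3)
store.put("invoice-42", b"data")
store.failed.add(hash("invoice-42") % 5)             # the primary node fails
assert store.get("invoice-42") == b"data"            # still readable
```

Real clusters add rebalancing and consistency protocols on top, but the core idea is the same: no single device holds the only copy.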
How can unintentionally deleted data be restored?
A dual data centre system alone will not protect against deleted data: the dual data centre solution replicates everything from one data centre to the other, including deletions and other data problems arising, for example, from software bugs.
Regular backups of the data need to be made in order to be able to recover from data problems. Anything deleted unintentionally by users can be restored from these backups, which need to be made at least daily to an independent storage medium.
An additional goal of a backup is to enable a complete restore of a server, if the server stops functioning as expected, for example, as the result of a software upgrade.
Safe Swiss Cloud’s backup systems by default write backup data to a different data centre from the one where the compute workload is running.
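The distinction between replication and backup can be shown with a toy example. The file names below are invented for illustration; the point is that the backup copy lives on independent storage and is therefore untouched by the deletion.

```python
# Toy illustration: replication would propagate a deletion, but a backup
# on independent storage still holds yesterday's copy and can restore it.

live = {"report.xlsx": b"v1", "notes.txt": b"hello"}
backup = dict(live)                # yesterday's daily backup, separate storage

del live["report.xlsx"]            # user deletes a file by mistake

# Restore the deleted file from the backup copy:
live["report.xlsx"] = backup["report.xlsx"]
assert live["report.xlsx"] == b"v1"
```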
How can a complete server be restored?
To guard against the danger of a server not functioning as expected, for example, after a software upgrade, Safe Swiss Cloud allows customers to make a “snapshot” of the server. If the results of the server upgrade are not as expected, the server can be restored very quickly from the snapshot with a couple of clicks.
It is worth noting that snapshots are a time- and data-consistent “snap” of all the storage on a server – something that not all backups are able to deliver.
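The rollback mechanism can be sketched as follows. This is a simplified model of the concept, not Safe Swiss Cloud's snapshot implementation; the `Server` class is our own illustration.

```python
# Simplified sketch of snapshot and rollback: a snapshot captures all of a
# server's disks at the same instant, so rolling back discards every change
# made afterwards (for example, a misbehaving software upgrade).

import copy

class Server:
    def __init__(self):
        self.disks = {"root": {"app_version": "1.0"}, "data": {}}
        self._snapshot = None

    def snapshot(self):
        # All disks captured together -> time- and data-consistent.
        self._snapshot = copy.deepcopy(self.disks)

    def rollback(self):
        self.disks = copy.deepcopy(self._snapshot)

srv = Server()
srv.snapshot()
srv.disks["root"]["app_version"] = "2.0"   # upgrade does not work as expected
srv.rollback()                             # back to the pre-upgrade state
assert srv.disks["root"]["app_version"] == "1.0"
```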
How can the applicable KPIs (RPO / RTO) be met?
What is the RPO – Recovery Point Objective? The RPO is a number expressed in time units: after an incident, you must be able to restore the data to a state no older than the RPO. Put differently, it specifies that no more than the data collected during the RPO period may be lost in case of an unfavourable event.
What is the RTO – Recovery Time Objective? The RTO is a number which expresses the maximum amount of time allowed to restore services after an incident.
Generally, a combination of regular backups and, in some cases, snapshots will take care of this. Specific applications may need specific procedures, for example for restoring databases. Restore tests, especially of the biggest systems, are needed to obtain realistic estimates of the RTO.
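A simple way to reason about these KPIs: with backups taken every N hours, the worst-case RPO is one full interval (data written just after the last backup is lost), and the RTO must be validated against measured restore times. The helper functions below are our own illustrative names, not a standard API.

```python
# Hypothetical KPI check: worst-case RPO for interval-based backups is the
# backup interval itself; the RTO target is compared against the restore
# time actually measured in a restore test.

def worst_case_rpo(backup_interval_h: float) -> float:
    """Worst-case data loss window when backing up every backup_interval_h hours."""
    return backup_interval_h

def meets_targets(backup_interval_h: float, measured_restore_h: float,
                  rpo_h: float, rto_h: float) -> bool:
    return worst_case_rpo(backup_interval_h) <= rpo_h and measured_restore_h <= rto_h

# Daily backups, a restore test that took 3 hours, targets RPO=24h / RTO=4h:
assert meets_targets(24, 3.0, rpo_h=24, rto_h=4)
# The same setup fails a stricter RPO of 1 hour:
assert not meets_targets(24, 3.0, rpo_h=1, rto_h=4)
```

A failed check of this kind is typically what pushes an organisation from backups alone towards synchronous replication across data centres.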
How can business continuity be ensured within a reasonable amount of time, e.g. if a server fails, or there is an electricity outage?
The answer to this is to have sufficient redundancy, at least within the same data centre. If a server fails, its workloads must be transferred to a different server.
In case of an electricity outage, the server should have a redundant electric power supply which takes over automatically.
In addition, data centres should have Uninterruptible Power Supplies (UPS), which compensate for voltage fluctuations or, in case of a major incident, give the data centre time to start its diesel generators, which should be able to provide power independently for weeks.
How can business continuity / disaster recovery be ensured in case of a major disaster affecting the main data centre (e.g. flooding, plane crash etc.)?
In this case, a dual data centre solution is what is needed. Typically, this is a requirement for “system relevant” infrastructure like hospitals or banks. Such customers are required by their regulators to take all the steps needed to ensure that their computing capabilities are up and running within a short time (the RTO) after an outage caused by a natural disaster (e.g. floods), a man-made disaster (a plane crashing into a data centre) or something unforeseen.
As a first step, the disaster event – usually manifested as an outage of one of the data centres – must be detected. Once it has been detected, steps have to be taken to ensure that the computing can continue in a different data centre.
The important considerations for a customer designing a dual data centre solution are:
- How is data synchronised to the standby data centre? Is this performed by the cloud system, or does the customer have to set up synchronisation themselves? Does it happen synchronously or asynchronously?
- Is the detection of the event manual or automatic?
- Does the switch over to the standby data centre happen automatically?
A solution with automatic failover offers clear advantages. Safe Swiss Cloud operates a dual data centre solution which:
- Synchronises data between the data centres in real time: whenever a program writes something to storage, it is synchronously written to both data centres.
- Automatically detects an outage in the primary data centre and brings up all the customer’s servers in the standby data centre.
The beauty of the solution is that the servers retain exactly the same internal and external IP addresses in the new data centre, so no manual interventions such as changing DNS records or other configurations are needed.
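The automatic detection step can be sketched as a monitoring loop. This is a simplified illustration, not Safe Swiss Cloud's actual failover logic; the function `monitor` and its parameters are our own, and real systems use quorum-based detection to avoid split-brain situations.

```python
# Simplified sketch of automatic disaster detection: the standby site
# repeatedly probes the primary data centre and triggers failover only
# after several consecutive failed health checks, so that a brief network
# glitch does not cause an unnecessary switch-over.

import time

def monitor(check_primary, start_standby, threshold: int = 3,
            interval_s: float = 10.0) -> str:
    """Fail over after `threshold` consecutive failed health checks."""
    failures = 0
    while True:
        failures = 0 if check_primary() else failures + 1
        if failures >= threshold:
            start_standby()     # bring up the customer's servers in the standby DC
            return "failed over"
        time.sleep(interval_s)
```

For example, `monitor(ping_primary, boot_standby_vms)` would tolerate two failed probes but fail over on the third.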
A high degree of redundancy in a single data centre is sufficient to ensure business continuity for most computing workloads. It provides protection against defects in compute servers, storage servers, storage media, power supplies, network switches, routers and internet connections.
For system relevant companies and organisations that need to guarantee short recovery times, a dual data centre solution is needed to ensure disaster recovery within a very short time and to comply with the requirements of their regulators.
Dual Data Center Cloud
Maximum availability in Dual Data Center cloud for mission critical computing