Automating Disaster Recovery Process for Large Retail Chain


SITUATION AND BUSINESS CHALLENGE

An American retail chain’s site reliability engineering team was struggling to automate the disaster recovery (DR) process for its mission-critical point-of-sale (POS) application serving thousands of stores in North America. The POS system supported in-store ordering, payment, and customer loyalty program functionality, and it communicated the related data between stores and the company’s primary datacenter on the West Coast.

The effort was an essential component of the company’s “bulletproof” business continuity initiative to strengthen its core business applications to the point of zero downtime. The site reliability engineering team working on the project, however, faced several obstacles:

  • Configuration mismatches between the primary datacenter and a backup datacenter on the East Coast, whose application servers and databases were not patched or updated regularly.
  • Lack of clear ownership to drive the process end-to-end.
  • Excessive staffing on the project, at times involving more than 20 people from different teams.
  • Additional complexity of performing DR processes with company-owned datacenters versus cloud datacenters.

The team had attempted to fail over between the two datacenters twice in the past year, reverting to the pre-failover state each time. Each attempt involved as many as 108 steps over the course of 12–14 hours.

Needing fresh insight and deeper expertise to address the problem, the company searched for a consulting partner to lead the effort. Based on the significant trust AIM Consulting had earned over many engagements, including one to replatform the POS application, the company turned to AIM’s Cloud & Operations practice to lead the project.

SOLUTION

An AIM senior project manager and two site reliability engineers successfully developed and tested a robust DR process that dramatically reduced both the number of steps and the time required to complete a failover.

Working closely with business and IT stakeholders throughout, AIM guided the site reliability engineering team through the engagement in several key phases:

Architecture review and gap analysis

AIM began with a thorough platform architecture assessment to determine how fault tolerance and redundancy were already configured across the network. The AIM consultants performed updates and recommended best practices to fine-tune web services, databases, firewalls, load balancers, and other system components, then reviewed the site reliability engineering team’s failover process in full. The assessment and gap analysis revealed the configuration mismatches and the backup datacenter’s lack of readiness to handle a failover, and it provided the foundation for the technical reconciliation between the two datacenters.
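Reconciliation work of this kind lends itself to lightweight automation. The sketch below is illustrative only, not a representation of AIM’s actual tooling: it assumes each datacenter exports a patch/package inventory to a JSON file (the file names and format are hypothetical) and reports any drift between the two sites.

```python
# Illustrative sketch: compare patch/package inventories exported from the
# primary (West Coast) and backup (East Coast) datacenters to surface
# configuration drift. File names and inventory format are hypothetical.
import json


def load_inventory(path):
    """Load a {hostname: {package: version}} inventory exported per datacenter."""
    with open(path) as f:
        return json.load(f)


def find_drift(primary, backup):
    """Return packages whose versions differ or are missing in the backup site."""
    drift = []
    for host, packages in primary.items():
        backup_packages = backup.get(host, {})
        for package, version in packages.items():
            backup_version = backup_packages.get(package)
            if backup_version != version:
                drift.append((host, package, version, backup_version))
    return drift


if __name__ == "__main__":
    primary = load_inventory("west_coast_inventory.json")
    backup = load_inventory("east_coast_inventory.json")
    for host, package, west, east in find_drift(primary, backup):
        print(f"{host}: {package} west={west} east={east or 'MISSING'}")
```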

Testing and validation

AIM created a new pre-production/staging performance environment that mirrored the hardware and configuration of production, something the site reliability engineering team had not had before. The environment was used to load-test the platform and validate the failover scenarios for interchanging database copies between the two datacenters.

The extensive testing involved every internal team that could be affected during a failover, including the network team, financial teams with systems connected to the POS, and end users such as baristas who confirmed the expected POS behavior at the store level. This wide involvement raised awareness and educated teams on the additional work they needed to perform before the official failover test.

Enriching the failover process with scripting and automation wherever possible in the testing phase, AIM reduced the number of steps in the procedure from 108 to 64.
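Much of that reduction comes from collapsing clusters of manual checks into scripted steps. The minimal sketch below illustrates the pattern under stated assumptions: each formerly manual runbook step becomes a command with an explicit pass/fail check that halts the procedure on error. The step names and commands are hypothetical, not the commands used on the engagement.

```python
# Minimal sketch of a runbook step runner: each formerly manual step becomes a
# scripted command with explicit verification, so the procedure stops at the
# first failure. Step names and commands are hypothetical.
import subprocess
import sys

FAILOVER_STEPS = [
    ("Stop POS app servers in primary DC", ["systemctl", "stop", "pos-app"]),
    ("Verify database replication is caught up", ["./check_replication_lag.sh"]),
    ("Promote East Coast database to primary", ["./promote_replica.sh"]),
    ("Start POS app servers in backup DC", ["systemctl", "start", "pos-app"]),
]


def run_step(name, command):
    """Run a single runbook step and stop the procedure on failure."""
    print(f"--> {name}")
    result = subprocess.run(command)
    if result.returncode != 0:
        sys.exit(f"Step failed: {name} (exit code {result.returncode})")


if __name__ == "__main__":
    for name, command in FAILOVER_STEPS:
        run_step(name, command)
    print("Failover runbook completed.")
```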

Successful failover test

With the DR procedure solidified and confidence high, AIM and the site reliability team performed a successful test of the failover process. Business and IT executive management watched closely as the failover was executed in the span of just four hours, a two-thirds reduction in time from the previous benchmark. No helpdesk calls were received, no financial data was lost during the test, and no stores encountered errors in the POS application.

The DR process flow

If the West Coast datacenter suffers a catastrophic event, an automated process is triggered that brings up all application servers and systems in the East Coast datacenter, and the network team then routes all inbound traffic to the East Coast. The database in the East Coast datacenter goes live as the primary database, and a copy is made simultaneously for redundancy.
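At a high level, that flow can be expressed as an orchestration sequence. The Python sketch below simply mirrors the phases described above; every helper function is a hypothetical stand-in that only logs its phase, not a real API from the engagement.

```python
# High-level sketch of the described failover flow. The helper functions are
# hypothetical stand-ins that only log the phase; they are not real APIs.

def enable_backup_datacenter(site: str) -> None:
    print(f"Enabling application servers and systems in {site}")


def reroute_inbound_traffic(target: str) -> None:
    print(f"Routing all inbound store traffic to {target}")


def promote_database(site: str) -> None:
    print(f"Promoting {site} database to primary")


def create_redundant_copy(site: str) -> None:
    print(f"Creating redundant database copy in {site}")


def run_health_checks(site: str) -> bool:
    print(f"Running post-failover health checks in {site}")
    return True


def execute_failover(backup_site: str = "east-coast") -> bool:
    # Sequence mirrors the DR process flow: bring up the backup site, shift
    # traffic, promote the database, copy it for redundancy, then verify.
    enable_backup_datacenter(backup_site)
    reroute_inbound_traffic(target=backup_site)
    promote_database(site=backup_site)
    create_redundant_copy(site=backup_site)
    return run_health_checks(site=backup_site)


if __name__ == "__main__":
    print("Failover healthy:", execute_failover())
```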

After a system health check, the team determines how much data was lost during the cutover. Some data loss is expected in a catastrophe, but in this case the amount would be minimal: roughly 400,000 transactions were queued during the four-hour failover test, and in a real-life event only about one to five minutes’ worth of those transactions might be lost.
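A rough back-of-the-envelope estimate, assuming the queue filled at a steady rate, shows why that exposure is small: 400,000 transactions over four hours is about 1,700 transactions per minute, so one to five minutes of loss corresponds to only about 0.4 to 2 percent of the queued volume. The snippet below just restates that arithmetic; the steady-rate assumption is mine, not a measured figure.

```python
# Back-of-the-envelope estimate of transactions at risk, assuming the queue
# filled at a steady rate over the failover window (an approximation).
queued_transactions = 400_000        # transactions queued during the test
failover_minutes = 4 * 60            # four-hour failover window

rate_per_minute = queued_transactions / failover_minutes   # ~1,667 per minute

for minutes_lost in (1, 5):
    at_risk = rate_per_minute * minutes_lost
    share = at_risk / queued_transactions
    print(f"{minutes_lost} min of loss ~ {at_risk:,.0f} transactions ({share:.1%} of queue)")
```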

RESULTS

Stakeholders were thrilled with the engagement and the accompanying gains in business continuity. With the operationalized DR solution in place, the platform can handle any significant disaster in a given region with little or no impact to POS business operations. The site reliability engineering team has subsequently tested the DR procedure twice, encountering no errors.

With the POS failover process in place, stakeholders are moving forward with DR projects for more core business applications, leveraging the success and template from the AIM engagement.