Disaster Recovery Automation Project Using Ansible and AWS

Task Overview

Disaster recovery (DR) is crucial for ensuring business continuity in case of failures, cyberattacks, or natural disasters. This project aims to automate backup and recovery of critical infrastructure and databases using Ansible, AWS, and Terraform. The solution will periodically back up infrastructure and databases and provide an automated mechanism for restoring services in case of failure.

Project Architecture

Key Components

Infrastructure as Code (IaC) - Terraform to provision AWS resources.
Configuration Management - Ansible playbooks for backup and restoration.
Database Backup and Restore - Automate RDS and on-premises MySQL/PostgreSQL backups.
Storage & Archiving - Store backups in Amazon S3 with lifecycle policies.
Monitoring & Alerting - Use CloudWatch and SNS for alerts.
Testing Disaster Recovery - Simulate failure scenarios and validate recovery.

Step-by-Step Implementation

Step 1: Infrastructure Provisioning with Terraform

Provision critical AWS resources:

EC2 instances for application servers.
RDS Database (MySQL/PostgreSQL).
S3 Buckets for storing backups.
IAM Roles & Policies for backup automation.
CloudWatch Logs for monitoring.

Step 2: Automating Backups Using Ansible

Create Ansible playbooks to:

Take snapshots of EC2 instances.
Back up RDS databases by taking snapshots.
Archive configuration files from servers.
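
The third item, archiving configuration files, could look roughly like the sketch below. The inventory group `app_servers`, the paths in `config_paths`, and the local `backups/configs/` directory are all assumptions made for illustration; it also assumes the `community.general` collection is installed.

```yaml
---
- name: Archive configuration files (illustrative sketch)
  hosts: app_servers                  # assumed inventory group
  become: true
  vars:
    config_paths:                     # assumed paths; adjust per server role
      - /etc/nginx
      - /etc/myapp
    archive_path: "/tmp/config-{{ inventory_hostname }}.tar.gz"
  tasks:
    - name: Bundle the configuration directories into one tarball
      community.general.archive:
        path: "{{ config_paths }}"
        dest: "{{ archive_path }}"
        format: gz
        mode: "0600"

    - name: Pull the tarball back to the control node
      ansible.builtin.fetch:
        src: "{{ archive_path }}"
        dest: "backups/configs/"      # one sub-directory per host is created
```

From the control node the fetched archives can then be uploaded to the backup bucket alongside the application data.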

Step 3: Disaster Recovery Simulation

Simulate an outage by terminating instances.
Use Ansible to restore from the latest backup.
Validate service restoration.
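
The validation step can itself be automated. The following is a minimal sketch, assuming the recovered server exposes HTTP on port 80 and a health endpoint; `recovered_host` and `health_path` are hypothetical values, not part of the original project.

```yaml
---
- name: Validate service restoration (illustrative sketch)
  hosts: localhost
  gather_facts: false
  vars:
    recovered_host: "203.0.113.10"    # assumed address of the recovered server
    health_path: "/health"            # assumed health-check endpoint
  tasks:
    - name: Wait until the web port accepts connections
      ansible.builtin.wait_for:
        host: "{{ recovered_host }}"
        port: 80
        timeout: 300

    - name: Confirm the application answers with HTTP 200
      ansible.builtin.uri:
        url: "http://{{ recovered_host }}{{ health_path }}"
        status_code: 200
      register: health
      retries: 5
      delay: 10
      until: health.status == 200
```

Separating the port check from the HTTP check makes it easier to tell a server that never booted from one that booted with a broken application.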

Implementation Details

1. Terraform Code to Deploy AWS Infrastructure


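resource "aws_s3_bucket" "backup_bucket" {
  bucket = "my-disaster-recovery-bucket"
}

# AWS provider v4 and later manage lifecycle rules in a separate resource.
resource "aws_s3_bucket_lifecycle_configuration" "backup_lifecycle" {
  bucket = aws_s3_bucket.backup_bucket.id

  rule {
    id     = "auto-expire"
    status = "Enabled"

    filter {} # apply the rule to every object in the bucket

    expiration {
      days = 30
    }
  }
}

# The RDS resource type is aws_db_instance.
resource "aws_db_instance" "database" {
  identifier              = "dr-db-instance"
  engine                  = "mysql"
  instance_class          = "db.t3.micro"
  allocated_storage       = 20
  backup_retention_period = 7

  # Assumed input variables; declare them and keep the password out of version control.
  username            = var.db_username
  password            = var.db_password
  skip_final_snapshot = true # convenient for test teardown only
}

resource "aws_instance" "app_server" {
  ami           = "ami-12345678" # placeholder AMI ID
  instance_type = "t2.micro"

  tags = {
    Name = "App-Server"
  }
}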

2. Ansible Playbook for Backup Automation


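---
# Module names follow recent amazon.aws releases; older installations
# shipped some of these modules under community.aws.
- name: Backup EC2 and RDS
  hosts: localhost
  # Fact gathering stays enabled because ansible_date_time is used below.
  tasks:

    - name: Take EC2 snapshot
      amazon.aws.ec2_snapshot:
        instance_id: "{{ ec2_instance_id }}"
        # instance_id must be paired with the device to snapshot.
        # ec2_device_name is an assumed variable, e.g. /dev/xvda.
        device_name: "{{ ec2_device_name }}"
        region: "{{ aws_region }}"
        wait: true
      register: ec2_snapshot

    - name: Backup RDS database
      amazon.aws.rds_instance_snapshot:
        db_instance_identifier: "{{ rds_instance }}"
        db_snapshot_identifier: "rds-backup-{{ ansible_date_time.epoch }}"
        region: "{{ aws_region }}"
        wait: true
      register: rds_backup

    # The S3 module uploads a single file, so archive the directory first.
    - name: Archive application data
      community.general.archive:
        path: /var/www/html
        dest: "/tmp/app-data-{{ ansible_date_time.epoch }}.tar.gz"
        format: gz

    - name: Copy application data to S3
      amazon.aws.s3_object:
        bucket: "my-disaster-recovery-bucket"
        object: "backups/app-data-{{ ansible_date_time.epoch }}.tar.gz"
        src: "/tmp/app-data-{{ ansible_date_time.epoch }}.tar.gz"
        mode: put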
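
The playbook above runs once per invocation. To get the periodic backups described in the overview, one option is a cron entry on the control node. The sketch below assumes the playbook is saved as `backup.yml` in `/opt/dr-automation` and logs to `/var/log/dr-backup.log`; all three names are hypothetical.

```yaml
---
- name: Schedule the backup playbook (illustrative sketch)
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Run the backup every night at 02:00
      ansible.builtin.cron:
        name: "dr-nightly-backup"
        minute: "0"
        hour: "2"
        # Directory, playbook name, and log file are assumptions.
        job: "cd /opt/dr-automation && ansible-playbook backup.yml >> /var/log/dr-backup.log 2>&1"
```

The `name` field lets Ansible find and update the same crontab entry on later runs instead of adding a duplicate.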

3. Ansible Playbook for Disaster Recovery


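---
- name: Restore EC2 and RDS
  hosts: localhost
  tasks:

    - name: Look up the EBS snapshot to restore
      amazon.aws.ec2_snapshot_info:
        region: "{{ aws_region }}"
        snapshot_ids:
          - "{{ latest_snapshot_id }}"
      register: ec2_snapshot_info

    # An EBS snapshot carries no image ID, so register an AMI from it first.
    # Architecture and virtualization type are assumptions; match them to the
    # original instance. ec2_device_name is an assumed variable, e.g. /dev/xvda.
    - name: Register an AMI from the snapshot
      amazon.aws.ec2_ami:
        name: "recovered-app-server-{{ latest_snapshot_id }}"
        region: "{{ aws_region }}"
        state: present
        architecture: x86_64
        virtualization_type: hvm
        root_device_name: "{{ ec2_device_name }}"
        device_mapping:
          - device_name: "{{ ec2_device_name }}"
            snapshot_id: "{{ ec2_snapshot_info.snapshots[0].snapshot_id }}"
            delete_on_termination: true
        wait: true
      register: recovered_ami

    - name: Create new EC2 from latest snapshot
      amazon.aws.ec2_instance:
        name: "Recovered-App-Server"
        region: "{{ aws_region }}"
        image_id: "{{ recovered_ami.image_id }}"
        instance_type: "t2.micro"
        wait: true

    - name: Restore RDS from latest snapshot
      amazon.aws.rds_instance:
        db_instance_identifier: "recovered-db-instance"
        creation_source: snapshot
        db_snapshot_identifier: "{{ latest_rds_snapshot_id }}"
        engine: mysql
        db_instance_class: "db.t3.micro"
        region: "{{ aws_region }}"
        wait: true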
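
The restore playbook expects `latest_rds_snapshot_id` to be supplied from outside. One possible way to derive it is sketched below, assuming the snapshots were taken manually against the instance named in `rds_instance` and that the installed collection provides `amazon.aws.rds_snapshot_info`; the returned field names should be checked against the installed version.

```yaml
---
- name: Find the newest RDS snapshot (illustrative sketch)
  hosts: localhost
  gather_facts: false
  tasks:
    - name: List manual snapshots of the source instance
      amazon.aws.rds_snapshot_info:
        db_instance_identifier: "{{ rds_instance }}"
        snapshot_type: manual
        region: "{{ aws_region }}"
      register: rds_snaps

    - name: Keep only snapshots that are ready to restore
      ansible.builtin.set_fact:
        available_snaps: "{{ rds_snaps.snapshots | selectattr('status', 'equalto', 'available') | list }}"

    - name: Stop early when nothing can be restored
      ansible.builtin.assert:
        that: available_snaps | length > 0
        fail_msg: "No available snapshot found for {{ rds_instance }}"

    - name: Remember the identifier of the newest snapshot
      ansible.builtin.set_fact:
        latest_rds_snapshot_id: "{{ (available_snaps | sort(attribute='snapshot_create_time') | last).db_snapshot_identifier }}"
```

If these tasks run in the same `ansible-playbook` invocation before the restore play, the fact set on `localhost` is available to the restore tasks.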

Step 4: Monitoring & Alerts

1. Configure CloudWatch Alarms:

Trigger SNS notifications if EC2 or RDS fails.

2. Set up an Ansible monitoring playbook:

Automatically restore services in case of failure.

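---
- name: Monitor EC2 and Trigger Recovery
  hosts: localhost
  gather_facts: false
  tasks:

    # describe-instance-status lists only running instances by default,
    # so a stopped instance produces output without "running".
    - name: Check if EC2 instance is running
      ansible.builtin.command: >-
        aws ec2 describe-instance-status
        --instance-ids {{ ec2_instance_id }}
        --region {{ aws_region }}
      register: ec2_status
      changed_when: false

    - name: Trigger recovery if EC2 is down
      ansible.builtin.command: ansible-playbook restore.yml
      when: "'running' not in ec2_status.stdout"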
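
The CloudWatch alarm and SNS notification described in this step could also be created from Ansible. This is a sketch under assumptions: the topic name `dr-alerts`, the address `ops@example.com`, and the alarm name are hypothetical, and the parameter names follow recent `amazon.aws` and `community.aws` releases, so they should be verified against the installed collections.

```yaml
---
- name: Alert on EC2 status-check failure (illustrative sketch)
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Ensure an SNS topic with an email subscriber exists
      community.aws.sns_topic:
        name: "dr-alerts"                   # assumed topic name
        region: "{{ aws_region }}"
        state: present
        subscriptions:
          - endpoint: "ops@example.com"     # assumed recipient
            protocol: email
      register: dr_topic

    - name: Alarm when the instance fails its status checks
      amazon.aws.cloudwatch_metric_alarm:
        name: "dr-ec2-status-check-failed"  # assumed alarm name
        region: "{{ aws_region }}"
        state: present
        namespace: "AWS/EC2"
        metric_name: "StatusCheckFailed"
        dimensions:
          InstanceId: "{{ ec2_instance_id }}"
        statistic: Maximum
        comparison: GreaterThanOrEqualToThreshold
        threshold: 1.0
        period: 60
        evaluation_periods: 2
        alarm_actions:
          - "{{ dr_topic.sns_arn }}"
```

Requiring two consecutive failing periods reduces the chance of alerting on a single transient check failure.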

Step 5: Testing the Disaster Recovery Plan

Scenario 1: Application Server Failure

Terminate EC2 manually.
Run Ansible recovery playbook.
Validate application accessibility.

Scenario 2: Database Failure

Delete RDS instance.
Run Ansible recovery playbook.
Validate database integrity.
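
Database integrity can be spot-checked after the restore. The sketch below assumes a MySQL engine, the `community.mysql` collection with a MySQL client library on the control node, and a table whose row count is known to be above a threshold; `recovered_db_host`, `db_user`, `db_password`, `db_name`, the `orders` table, and `expected_min_rows` are all hypothetical.

```yaml
---
- name: Validate the restored database (illustrative sketch)
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Count rows in a table that should never be empty
      community.mysql.mysql_query:
        login_host: "{{ recovered_db_host }}"   # assumed endpoint of the restored instance
        login_user: "{{ db_user }}"
        login_password: "{{ db_password }}"
        login_db: "{{ db_name }}"
        query: "SELECT COUNT(*) AS n FROM orders"   # assumed table
      register: row_check

    - name: Fail when the restored data looks incomplete
      ansible.builtin.assert:
        that:
          - row_check.query_result[0][0].n | int >= expected_min_rows | int
        fail_msg: "Restored database has fewer rows than expected"
```

A row count is only a coarse signal; comparing checksums or running application-level checks gives stronger assurance.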

Final Outcome

✅ Automated EC2 and RDS Backups
✅ Scheduled Backups to S3
✅ Automated Recovery Playbooks
✅ CloudWatch Monitoring & Alerting
✅ Tested DR Scenarios

Next Steps

Integrate backups with AWS Backup for centralized management.
Automate failover to Multi-AZ RDS and Auto Scaling.
Implement Cross-Region Replication for high availability.