Introduction
We are committed to delivering uninterrupted service that allows our clients and users to maintain the continuity of their operations when using Sitepass (now part of INX Software Pty Ltd). Our business continuity plan outlines the:
- Measures we take to mitigate risks to our business
- Technical measures we take to protect our data, and
- Technical and administrative measures we take to recover from a disaster
Policy statement
Corporate management has approved the following policy statement:
- The company shall develop a comprehensive disaster recovery plan.
- A formal risk assessment shall be undertaken to determine the requirements for the disaster recovery plan.
- The disaster recovery plan should cover all essential and critical infrastructure elements, systems and networks, in accordance with key business activities.
- The disaster recovery plan should be periodically tested to ensure that it can be implemented in emergency situations and that the management and staff understand how it is to be executed.
- All staff must be made aware of the disaster recovery plan and their own respective roles.
- The disaster recovery plan is to be kept up to date to take into account changing circumstances.
Objectives
The principal objective of the disaster recovery program is to develop, test and document a well-structured and easily understood plan which will help Sitepass recover as quickly and effectively as possible from an unforeseen disaster or emergency which interrupts information systems and business operations. Additional objectives include the following:
- The need to ensure that duties to implement such a plan are understood.
- The need to ensure that operational policies are adhered to within all planned activities.
- The need to ensure that proposed contingency arrangements are cost-effective.
- The need to consider implications on other company sites.
- The need to consider disaster recovery capabilities as applicable to key customers, vendors and others.
The hosting infrastructure
Sitepass has partnered with Amazon Web Services (AWS) to host and deploy its application. Amazon Elastic Compute Cloud (Amazon EC2) provides scalable computing capacity in the AWS cloud.
Amazon EC2 enables Sitepass to launch as many or as few virtual servers as required, configure security and networking, manage storage, and scale up or down to handle changes in forecast traffic.
Importantly, Amazon meets and exceeds security and compliance requirements, ensuring Sitepass maintains data protection for its clients. You can read more about this on Amazon's compliance page.
Regions and availability zones
Amazon EC2 is hosted in multiple locations worldwide, known as regions. Sitepass is deployed in the AWS Sydney region. All data collected and processed by the Sitepass application is retained in Australia.
Each Amazon EC2 region is completely isolated from the other Amazon EC2 regions, and within each Amazon region there are multiple, isolated locations known as Availability Zones. Amazon EC2 provides Sitepass with the ability to place resources, such as instances (servers) and data, in multiple locations within that region, ensuring the highest level of failover and redundancy in the event of a disaster.
The Availability Zones in each region are connected through low-latency links. Sitepass utilises both the separation of Regions and Availability Zones within AWS to maintain data protection and implement both disaster recovery and backup.
EC2 Instances
Amazon enables Sitepass to deploy multiple instances (servers) to meet its business requirements and platform needs. The flexibility of Amazon EC2 enables both horizontal and vertical scaling, and allows Sitepass to implement solutions that support its performance, disaster recovery and backup requirements.
Horizontal scaling
Horizontal scaling is the process of adding more nodes to a system, such as adding a new server to a distributed software application. Sitepass uses horizontal scaling to extend its platform by adding instances to meet its business requirements and, importantly, to meet usage demands.
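As a rough illustration of horizontal scaling, the sketch below estimates how many identical instances are needed for a given load. The function name, the per-instance capacity figure and the minimum of three instances (one per Availability Zone) are assumptions made for the example, not Sitepass's actual sizing rules.

```python
import math


def instances_needed(requests_per_second: float,
                     capacity_per_instance: float,
                     min_instances: int = 3) -> int:
    """Return how many identical instances are needed for the given load.

    min_instances is an assumed floor (for example one instance per
    Availability Zone) so that redundancy is kept even when load is low.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    # Round up: a fraction of an instance still requires a whole instance.
    required = math.ceil(requests_per_second / capacity_per_instance)
    return max(min_instances, required)
```

For example, with an assumed capacity of 100 requests per second per instance, a load of 450 requests per second would call for five instances.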
Vertical scaling
Vertical scaling means adding more resources to a single node within a system, such as increasing the CPU or memory of a single server. Sitepass uses vertical scaling to increase performance in line with forecast user traffic and to manage the performance of the application under load.
Vertical scaling is the process of increasing the instance type (hardware performance) for each instance on Amazon. Sitepass uses a variety of instance types depending on the purpose and requirements for the application or software that is run on the instance.
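Vertical scaling can be pictured as moving an instance up a ladder of instance types. The sketch below picks the smallest tier that satisfies a CPU and memory requirement. The tier names and sizes are hypothetical and do not correspond to the instance types Sitepass actually uses.

```python
# Hypothetical instance tiers, ordered from smallest to largest:
# (name, vCPUs, memory in GiB). These are illustrative values only.
INSTANCE_TIERS = [
    ("small", 2, 4),
    ("medium", 4, 8),
    ("large", 8, 16),
    ("xlarge", 16, 32),
]


def smallest_fitting_tier(required_vcpus: int, required_memory_gib: int) -> str:
    """Return the name of the smallest tier that meets both requirements."""
    for name, vcpus, memory_gib in INSTANCE_TIERS:
        if vcpus >= required_vcpus and memory_gib >= required_memory_gib:
            return name
    # Beyond the largest tier, a single node cannot be scaled any further.
    raise ValueError("no single tier is large enough; consider horizontal scaling")
```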
The platform
The platform consists of the following virtualised servers:
Production (Primary)
This is the cluster of primary production servers which host the application and database. All user traffic is directed to the platform hosted on these servers. Traffic is evenly spread across the primary production servers, which are located across all 3 availability zones, to ensure redundancy and stability.
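One simple way to spread traffic evenly is round-robin routing, sketched below. This is an illustration only: the text above does not state which algorithm the load balancer uses, and the server names are hypothetical.

```python
from collections import Counter
from itertools import cycle

# Hypothetical server names, one per Availability Zone.
SERVERS = ["web-az-a", "web-az-b", "web-az-c"]


def route_requests(request_ids, servers):
    """Assign each request to the next server in turn (round robin)."""
    rotation = cycle(servers)
    return {request_id: next(rotation) for request_id in request_ids}


assignments = route_requests(range(9), SERVERS)
# Count how many requests each server received.
load = Counter(assignments.values())
```

With nine requests and three servers, each server receives three requests, so the loss of one Availability Zone removes only a third of the serving capacity.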
Web application firewall (WAF)
The WAF helps secure the application by blocking common web application threats.
Content delivery
The platform uses Amazon’s S3 and CloudFront services to deliver static content such as video, audio and images to users.
Front end web servers
Front end web servers deliver the Sitepass application to the user.
Database
Sitepass uses Amazon’s Aurora managed database platform to host the database, which provides scalability, redundancy and to-the-minute backups. All database data is encrypted at rest.
The three storage options provided by Amazon that are used by Sitepass to store and collect data are:
- Amazon EBS
- Amazon S3
- Amazon Aurora
Beta
This instance is accessible to Sitepass clients to conduct functional and integration development and testing, which is separate from their production (live) environment.
Backup
This instance is responsible for coordinating and storing all backups for the application and database. The Backup instance is hosted in a separate Availability Zone to both the Primary and Failover servers to ensure that the backups of customer data are logically and physically separate from production data and can be accessed in the case of a disaster in another Availability Zone.
Website
This instance hosts the Sitepass website and the product updates site.
Monitoring
This instance is used to host our internal monitoring and logging systems. Metrics logged include:
- Application access logs
- Application uptime, and
- Platform performance and utilisation statistics.
Backup
Backup is an important aspect of the Sitepass hosted platform, and a multi-faceted backup solution has been implemented. Backups are implemented within the Amazon platform in the Sydney region. Daily incremental backups of the following have been implemented:
- The application and its database
- Course content
- User uploaded files and assets
- Sitepass corporate website
- All site logs
- Uploaded SCORM courses
- Infrastructure and instance configuration
Backup instance
Amazon's horizontal scaling enables Sitepass to run daily backups on a separate instance:
- A dedicated Amazon instance has been set up to manage incremental backups.
- The backup frequency is daily (24 hours)
- All backups are stored within Amazon AWS.
- Backup integrity is verified daily.
- All backups are encrypted at rest.
- Daily backups are retained for 90 days, and monthly backups are retained for up to 7 years.
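The retention rule in the list above can be expressed as a small check. This is a minimal sketch: it assumes that the monthly backup is the one taken on the first day of each month, which the policy does not specify, and the function name is hypothetical.

```python
from datetime import date, timedelta

DAILY_RETENTION = timedelta(days=90)   # from the policy: 90 days
MONTHLY_RETENTION_YEARS = 7            # from the policy: up to 7 years


def is_retained(backup_date: date, today: date) -> bool:
    """Return True if a backup taken on backup_date should still be kept."""
    if backup_date > today:
        raise ValueError("backup_date is in the future")
    # Every daily backup is kept for 90 days.
    if today - backup_date <= DAILY_RETENTION:
        return True
    # Assumption: only backups taken on the 1st count as monthly backups.
    if backup_date.day != 1:
        return False
    # Day 1 exists in every month, so this expiry date is always valid.
    expiry = date(backup_date.year + MONTHLY_RETENTION_YEARS,
                  backup_date.month, 1)
    return today <= expiry
```

A pruning job could call this check for each stored backup and delete those for which it returns False.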
Disaster recovery
Sitepass has implemented a multi-stage disaster recovery process for use in the case of a disaster.
Stage 1
The Sitepass application is designed to be resilient in the face of an outage of one or more servers. If a disaster occurs in one Availability Zone hosting the Production servers, end-user traffic is automatically directed to redundant instances in the other Availability Zones. This provides a real-time failover solution in the case of a disaster within one of the Availability Zones within Amazon.
In the event of failure of one or more web servers, the Sitepass application can automatically provision additional redundant instances to handle end-user traffic.
The Sitepass application also runs multiple primary and secondary database nodes that are synchronized using Amazon Aurora’s replication technology. In the event of failure of one of the primary database nodes, a secondary database node will automatically be promoted to primary with no loss of data.
- The expected recovery time objective (RTO) is less than 1 hour.
- The expected recovery point objective (RPO) is less than 1 hour.
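The database failover described above can be modelled with a short sketch. Amazon Aurora performs the promotion itself, so this is only a toy model of the behaviour, not Aurora's API. The class, the function and the node names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DatabaseNode:
    name: str
    zone: str
    role: str            # "primary", "secondary" or "failed"
    healthy: bool = True


def promote_on_failure(nodes):
    """Return the active primary, promoting a healthy secondary if needed."""
    for node in nodes:
        if node.role == "primary" and node.healthy:
            return node
    # No healthy primary: mark the failed one(s) so they stop taking writes.
    for node in nodes:
        if node.role == "primary":
            node.role = "failed"
    # Promote the first healthy secondary.
    for node in nodes:
        if node.role == "secondary" and node.healthy:
            node.role = "primary"
            return node
    raise RuntimeError("no healthy database node available")


cluster = [
    DatabaseNode("db-a", "zone-a", "primary"),
    DatabaseNode("db-b", "zone-b", "secondary"),
    DatabaseNode("db-c", "zone-c", "secondary"),
]
cluster[0].healthy = False           # simulate an outage in zone-a
new_primary = promote_on_failure(cluster)
```

After the simulated outage, the node in the second zone becomes the primary and the original primary is marked as failed.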
Stage 2
In the unlikely case that a disaster has impacted both Availability Zones A and B in the Sydney region, backups will be used to restore to a non-impacted Availability Zone within AWS.
- The expected recovery time objective (RTO) is 48 hours.
- The expected recovery point objective (RPO) is up to 24 hours.
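The choice between the two stages, and the targets attached to each, can be summarised in a small sketch. It assumes that Stage 2 applies only when no Availability Zone hosting Production survives. The names are hypothetical, and the Stage 1 values are upper bounds, since the stated objectives are "less than 1 hour".

```python
from datetime import timedelta
from typing import NamedTuple


class RecoveryTargets(NamedTuple):
    stage: int
    rto: timedelta   # recovery time objective
    rpo: timedelta   # recovery point objective


# Values taken from the objectives listed for each stage.
STAGE_1 = RecoveryTargets(1, timedelta(hours=1), timedelta(hours=1))
STAGE_2 = RecoveryTargets(2, timedelta(hours=48), timedelta(hours=24))


def select_stage(production_zones, impacted_zones) -> RecoveryTargets:
    """Pick the recovery stage based on which production zones survive."""
    surviving = set(production_zones) - set(impacted_zones)
    # Assumption: automatic failover works while any production zone is up.
    return STAGE_1 if surviving else STAGE_2
```

For example, `select_stage({"a", "b"}, {"a"})` returns the Stage 1 targets, while `select_stage({"a", "b"}, {"a", "b"})` returns the Stage 2 targets.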
Disaster recovery process
In the event of a significant server failure, the following steps will be followed to address the issue:
- conduct an immediate assessment of the problem
- identify an action plan and develop initial estimate for the rectification
- send communication and an estimated time of rectification to client contacts
- log the issue to the live server status page
- implement the action plan. Depending on the circumstance, this can include:
  - setup of an alternate server
  - installation of the application
  - transfer of files from backup
  - system testing – random account and operational procedures
  - re-delegation of domains
When there is a change to the circumstances, Sitepass will provide an update to the client contacts detailing the change.
Where domain re-delegation is required, there may be a 12 to 24-hour delay between restoration of system functionality and the system becoming available via the internet, due to the delay in DNS propagation.
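When communicating an estimated time of rectification, the DNS propagation delay can be added to the restoration time. The sketch below does this arithmetic using the 12 to 24-hour range stated above. The function name and the example timestamp are illustrative.

```python
from datetime import datetime, timedelta

# Delay bounds taken from the text above: 12 to 24 hours for DNS propagation.
MIN_DNS_DELAY = timedelta(hours=12)
MAX_DNS_DELAY = timedelta(hours=24)


def availability_window(restored_at: datetime):
    """Return (earliest, latest) times the system may be reachable online."""
    return restored_at + MIN_DNS_DELAY, restored_at + MAX_DNS_DELAY


# Example: functionality restored at 09:00 on an arbitrary date.
earliest, latest = availability_window(datetime(2024, 6, 15, 9, 0))
```

In this example the system could become reachable between 21:00 on the same day and 09:00 on the following day.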
Disaster recovery testing and validation
The following activities are performed in order to test the disaster recovery process and validate the integrity of backups:
- Daily – Backups are restored to a simulated environment to ensure integrity of data.
- Monthly – Backups are restored to ensure availability of all data from backup and for verification.
- Quarterly – Stage 1 disaster recovery is tested. The production failover capability is tested by running the application on the Production (Failover) servers during a maintenance period.
- Yearly – Stage 2 disaster recovery is tested. A new simulated Production instance of the application is built from backup in order to test the processes and procedures required to rebuild the application and infrastructure in case of disaster.