vPostgres Database Failover Scenarios
Updated: Feb 7, 2021
Master Database constantly streams transactions to Replica servers
Once Master fails, the active and working Replica is ready to proceed with Read-Only requests right away.When the new Master is promoted ( manually or automatically ), all upcoming requests are moved to it.
Up to vRA 7.2 release , manual failover was supported which means that the MASTER is replaced with a Replica selected by Administrator
Administrator needs to trigger the promotion operation and wait for its completion
With vRA 7.3, PostgreSQL database failover is automatic, when three appliances are deployed and synchronous replication is configured
Failover scenarios does not differ from the manual failover scenarios except parts of the scenarios with the steps needed to recover the environment are performed automatically, when automatic failover is on.
It is the ability of the vRA PostgreSQL database component to automatically sustain a Master node alive at any time without further intervention of Administrator.
There are some cases that vRA is unable to perform an automatic failover, for example if more than half of the nodes are off at the same time.Monitoring the https://vra-fqdn:5434/api/status when in sync mode ( automatic failover turned on ) as specified in documentation will notify the client to perform manual failover when the manualFailoverNeeded status flag is true
3 Node ( vRA Appliances ) Failover Scenarios
As you can see in the screenshot we do see three nodes
One being MASTER , other being SYNC REPLICA and the last one is POTENTIAL REPLICA
For systems that use synchronous replication, there is one replica node completely in sync with the MASTER known as SYNC REPLICA. In the above screenshot it's vra3.vcloud.local which is SYNC REPLICA
The SYNC REPLICA which is also called as HOT-SYNC node will be the one who would be promoted as MASTER as soon as current MASTER state would be changed to DOWN
Only SYNC REPLICA will be the one in complete SYNC with MASTER node , other's status would be set to POTENTIAL.
In ASYNCHRONOUS replication mode all node's have Sync State set to ASYNC. MASTER does not have any Sync State value.
The Valid column in the screenshot shows that the database instance in the node is healthy
The Priority column shows the position of Replica in relation to the Master node. The Master node has no priority.
SYNC REPLICA's priority would be set to 0 as it's the one next in line to become MASTER. Subsequent POTENTIAL REPLICA's would be having priority set to 1, 2.... and so on
Three Node ( vRA Appliances ) Auto Failover Scenarios
When one of three nodes is down , the AutoFailover will be performed. No more AutoFailover capability till all three nodes are alive
Let's explore few scenarios
Scenario 1: MASTER is DOWN
SYNC REPLICA becomes MASTER and picks up the database function automatically ( performed by automatic failover agent )
POTENTIAL REPLICA is now a new SYNC REPLICA
vRA appliances get into READ ONLY mode till automatic failover completes.
Once former MASTER is back
It would be set to REPLICA automatically by failover agent. No manual action is needed.
If former MASTER cannot be fixed, set the PostgreSQL database to asynchronous mode
Scenario 2: SYNC REPLICA is DOWN
There would be no change to MASTER, hence no downtime.
You may experience some delay in the database requests for a couple of seconds until Potential Replica becomes the new SYNC REPLICA
Promotion of nodes is automatically done by the PostgreSQL service
Once former SYNC REPLICA is back online
It would be automatically set to POTENTIAL REPLICA
Scenario 3: POTENTIAL REPLICA is DOWN
Nothing. There is no downtime
When former POTENTIAL REPLICA comes back online
It will automatically be set to POTENTIAL REPLICA
If former POTENTIAL REPLICA does not come online then set the PostgreSQL database to asynchronous mode.
When two out of three nodes are DOWN at the same time, the system starts working in a read only mode until a manual repair is performed. When just one Database on node is available in the system, it should be switched to asynchronous mode.
Scenario 4: MASTER AND POTENTIAL REPLICA are DOWN
SYNC REPLICA will not be prompted as MASTER automatically
System starts working in a read only mode until the manual promotion is performed
Once Former MASTER and POTENTIAL REPLICA are back online
They should be reset manually to synchronize against MASTER
Then the replication should be turned back to SYNCHRONOUS mode
Scenario 5: SYNC REPLICA and POTENTIAL REPLICA are DOWN
MASTER is not able to process R/W transactions
System starts working in READ ONLY mode until a manual repair is performed
When the SYNC REPLICA and POTENTIAL REPLICA are back
They must be reset manually to synchronize against the MASTER
The system should be turned back to SYNCHRONOUS mode
When link failure happens between NODES
Scenario 6: Link Failure between SITE A and SITE B, but all 3 nodes are UP and ONLINE
SITE A has MASTER and POTENTIAL REPLICA
SITE B has SYNC REPLICA
POTENTIAL REPLICA on SITE A becomes SYNC REPLICA
There is no downtime
When the link is re-established no manual action is needed
Scenario 7: Link failure between SITE A and SITE B and all three nodes are UP and ONLINE
SITE A has MASTER
SITE B has POTENTIAL REPLICA
SYNC REPLICA becomes MASTER and picks up the DB function automatically, and POTENTIAL REPLICA becomes the new SYNC REPLICA (performed by the automatic failover agent).
System is working in a read only mode until this promotion finishes.
When the link is re-established , former MASTER is reset as REPLICA.
There is no MANUAL action needed
vRA Database Configuration Files and it's log locations
vPostgres configuration file: /var/vmware/vpostgres/current/pgdata/pg_hba.conf
vPostgreslog: /storage/db/pgdata/pg_log/postgresql.csv, /storage/db/pgdata/serverlog
Automatic Failover Agent log: /var/log/vcac/vcac-config.log