vRA 8.x PostgreSQL pods showing down?
I was analyzing a problem where PostgreSQL pods were showing as down, and vRA 8.x deployments were stuck and not working as expected.
In an ideal scenario, there should always be one master and two standby nodes in a vRA 8.x cluster.
Ideal Status
Node "postgres-0.postgres.prelude.svc.cluster.local":
PostgreSQL version: 10.10
Total data size: 26 GB
Conninfo: host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10
Role: standby
WAL archiving: disabled (on standbys "archive_mode" must be set to "always" to be effective)
Archive command: /bin/true
WALs pending archiving: 0 pending files
Replication connections: 0 (of maximal 10)
Replication slots: 0 physical (of maximal 10; 0 missing)
Upstream node: postgres-1.postgres.prelude.svc.cluster.local (ID: 101)
Replication lag: 0 seconds
Last received LSN: 307/595EBA48
Last replayed LSN: 307/595EBA48
Node "postgres-1.postgres.prelude.svc.cluster.local":
PostgreSQL version: 10.10
Total data size: 26 GB
Conninfo: host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10
Role: primary
WAL archiving: enabled
Archive command: /bin/true
WALs pending archiving: 0 pending files
Replication connections: 1 (of maximal 10)
Replication slots: 0 physical (of maximal 10; 0 missing)
Replication lag: n/a
Node "postgres-2.postgres.prelude.svc.cluster.local":
PostgreSQL version: 10.10
Total data size: 26 GB
Conninfo: host=postgres-2.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10
Role: standby
WAL archiving: disabled (on standbys "archive_mode" must be set to "always" to be effective)
Archive command: /bin/true
WALs pending archiving: 0 pending files
Replication connections: 0 (of maximal 10)
Replication slots: 0 physical (of maximal 10; 0 missing)
Upstream node: postgres-1.postgres.prelude.svc.cluster.local (ID: 101)
Replication lag: 0 seconds
Last received LSN: 304/E1A1B458
Last replayed LSN: 304/E1A1B458
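The per-node output above comes from repmgr. A minimal sketch of how it could be collected, assuming repmgr is on the PATH inside the postgres containers, that its configuration file sits at /etc/repmgr.conf, and that the repmgr credentials are readable by the user kubectl exec drops you into (all of these may differ in your build):

# One-line summary of the whole cluster (role, upstream and status per node)
kubectl exec -n prelude postgres-0 -- repmgr -f /etc/repmgr.conf cluster show

# Detailed per-node status, the format shown above
for pod in postgres-0 postgres-1 postgres-2; do
  kubectl exec -n prelude "$pod" -- repmgr -f /etc/repmgr.conf node status
done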
When the issue was seen, one of the postgres pods was reporting down:
NAME         READY   STATUS    RESTARTS   AGE
postgres-0   1/1     Running   1          3d11h
postgres-1   1/1     Running   0          3d11h
postgres-2   0/1     Running   1          3d11h
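The 0/1 under READY means the container is running but not passing its readiness probe. A quick way to dig into why (the prelude namespace is taken from the service names in the status output above):

# List the postgres pods and their readiness
kubectl get pods -n prelude | grep postgres

# Events and the readiness probe failure message for the affected pod
kubectl describe pod postgres-2 -n prelude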
By the time we realized what was going on, the issue had fixed itself.

I was curious to know what happened and how it fixed itself, so let's start our deep dive.
I started by inspecting postgres-1, which is the current master node.
Until 2021-01-29 06:10:19 +0000 UTC, it was reporting postgres-0 as its master, and then all of a sudden at 2021-01-29 06:10:35 +0000 UTC it was unable to ping postgres-0:
2021-01-29 06:09:19 +0000 UTC [repmgrd] node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) monitoring upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
2021-01-29 06:10:19 +0000 UTC [repmgrd] node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) monitoring upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
2021-01-29 06:10:35 +0000 UTC [repmgrd] unable to ping "host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10"
2021-01-29 06:10:35 +0000 UTC [repmgrd] unable to connect to upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100)
2021-01-29 06:10:35 +0000 UTC [repmgrd] checking state of node 100, 1 of 1 attempts
2021-01-29 06:10:37 +0000 UTC [socat] W waitpid(-1, {}, WNOHANG): no child has exited
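These entries come from the pod logs; the [repmgrd] and [socat] prefixes suggest the components share one log stream. A sketch of how the same logs could be pulled live (the timestamp formatting may differ depending on how the logs were collected):

# repmgrd activity on the node being inspected
kubectl logs postgres-1 -n prelude --timestamps | grep repmgrd

# Or limit to the failover window
kubectl logs postgres-1 -n prelude --since-time=2021-01-29T06:09:00Z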
postgres-1 was promoted to MASTER at 2021-01-29 06:11:06.17188+00, where the previous MASTER was postgres-0. The snippets that follow show how that happened.
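A promotion timestamp like that can be cross-checked against repmgr's event history. A sketch of two ways to do it, assuming the repmgr metadata lives in the repmgr-db database from the conninfo above and that repmgr and psql are usable inside the container:

# Event history as repmgr reports it
kubectl exec -n prelude postgres-1 -- repmgr -f /etc/repmgr.conf cluster event

# Or straight from the repmgr.events table
kubectl exec -n prelude postgres-1 -- psql -U repmgr-db -d repmgr-db -c 'SELECT node_id, event, successful, event_timestamp FROM repmgr.events ORDER BY event_timestamp DESC LIMIT 10;'

The repmgrd log on postgres-1 shows the full failover sequence leading up to that promotion: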
2021-01-29 06:09:19 +0000 UTC [repmgrd] node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) monitoring upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
2021-01-29 06:10:19 +0000 UTC [repmgrd] node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) monitoring upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
2021-01-29 06:10:35 +0000 UTC [repmgrd] unable to ping "host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10"
2021-01-29 06:10:35 +0000 UTC [repmgrd] unable to connect to upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100)
2021-01-29 06:10:35 +0000 UTC [repmgrd] checking state of node 100, 1 of 1 attempts
2021-01-29 06:10:37 +0000 UTC [socat] W waitpid(-1, {}, WNOHANG): no child has exited
2021-01-29 06:10:45 +0000 UTC [repmgrd] unable to ping "user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10 dbname=repmgr-db host=postgres-0.postgres.prelude.svc.cluster.local fallback_application_name=repmgr"
2021-01-29 06:10:45 +0000 UTC [repmgrd] unable to reconnect to node 100 after 1 attempts
2021-01-29 06:10:45 +0000 UTC [repmgrd] 1 active sibling nodes registered
2021-01-29 06:10:45 +0000 UTC [repmgrd] primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) and this node have the same location ("default")
2021-01-29 06:10:45 +0000 UTC [repmgrd] local node's last receive lsn: 418/C2E8E700
2021-01-29 06:10:45 +0000 UTC [repmgrd] checking state of sibling node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102)
2021-01-29 06:10:45 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) reports its upstream is node 100, last seen 20 second(s) ago
2021-01-29 06:10:45 +0000 UTC [repmgrd] standby node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) last saw primary node 20 second(s) ago
2021-01-29 06:10:45 +0000 UTC [repmgrd] last receive LSN for sibling node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) is: 418/C2E8E700
2021-01-29 06:10:45 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has same LSN as current candidate "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101)
2021-01-29 06:10:45 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has lower priority (98) than current candidate "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) (99)
2021-01-29 06:10:45 +0000 UTC [repmgrd] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 2 seconds
2021-01-29 06:10:45 +0000 UTC [repmgrd] promotion candidate is "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101)
2021-01-29 06:10:45 +0000 UTC [repmgrd] executing "failover_validation_command"
2021-01-29 06:10:56 +0000 UTC [repmgrd] no output returned from command
2021-01-29 06:10:56 +0000 UTC [repmgrd] failover validation command returned zero
2021-01-29 06:10:56 +0000 UTC [repmgrd] this node is the winner, will now promote itself and inform other nodes
2021-01-29 06:10:56 +0000 UTC [repmgrd] promote_command is:
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:58 +0000 UTC [socat] W waitpid(-1, {}, WNOHANG): no child has exited
*
*
*
2021-01-29 06:11:04 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:11:04 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:11:06 +0000 UTC [repmgr] 1 sibling nodes found, but option "--siblings-follow" not specified
2021-01-29 06:11:06 +0000 UTC [repmgr] promoting standby to primary
2021-01-29 06:11:06 +0000 UTC [postgres] [9-1] 2021-01-29 06:11:06.065 UTC [171] LOG: received promote request
2021-01-29 06:11:06 +0000 UTC [postgres] [7-1] 2021-01-29 06:11:06.065 UTC [175] FATAL: terminating walreceiver process due to administrator command
2021-01-29 06:11:06 +0000 UTC [postgres] [10-1] 2021-01-29 06:11:06.066 UTC [171] LOG: invalid resource manager ID 47 at 418/C2E8E700
2021-01-29 06:11:06 +0000 UTC [postgres] [11-1] 2021-01-29 06:11:06.066 UTC [171] LOG: redo done at 418/C2E8E6D8
2021-01-29 06:11:06 +0000 UTC [postgres] [12-1] 2021-01-29 06:11:06.066 UTC [171] LOG: last completed transaction was at log time 2021-01-29 06:10:25.633443+00
2021-01-29 06:11:06 +0000 UTC [postgres] [13-1] 2021-01-29 06:11:06.076 UTC [171] LOG: selected new timeline ID: 6
2021-01